It should be impossible for an iPhone 17 Pro to run a gigantic 400B AI model. It should be.

The iPhone 17 Pro has 12 GB of unified memory. That is a very decent figure for a phone, but in theory it is nowhere near enough to run large AI models locally. And therein lies the surprise: a new project has made it possible for this phone to run a model with 400 billion parameters (400B) locally. That opens the door to a promising horizon.

Giant AI model, dwarf memory. A developer named Daniel Woods (@dandeveloper) has created, with the help of AI, a new inference engine called Flash-MoE, whose code has been published as open source on GitHub alongside a study of its behavior. Woods managed to run the Qwen 3.5 397B model (the full version, with no distillation or quantization) locally on his MacBook Pro with 48 GB of RAM. He downloaded the model (209 GB on disk) and built the inference engine to achieve something that seemed almost impossible. Other developers have gone even further, running models such as DeepSeek-V3 (671B) or even Kimi K2.5 (1,026B!) on their MacBooks. They are slow, no doubt, but they work. It's remarkable.

The iPhone 17 Pro can run a 400B model. Another developer, known as Anemll, wanted to go a little further and tried to run this model of almost 400 billion parameters on an iPhone 17 Pro with 12 GB of RAM... and succeeded. It is true that the model responds very slowly (0.6 tokens per second, essentially unusable), but pulling this off points to a future in which video (or unified) memory is no longer so critical for running huge AI models locally. A few hours later he doubled the speed to 1.1 tokens per second by reducing the number of active experts to four (at a cost of about 2.5% in response quality). It is still not really usable, but the technical demonstration is plain to see. Another user preferred a somewhat smaller model (Qwen 3.5 35B), still huge for an iPhone, and has already managed to run it locally at a more than acceptable 13.1 tokens per second.

Why it matters. The AI models we use in the cloud (ChatGPT, Gemini, Claude) are gigantic and run in data centers with thousands of chips and enormous amounts of memory and storage. They are the most powerful because they run on the most powerful machines. Although it is possible to use AI models locally, the ones we can run are much smaller, and that makes it hard for them to match cloud models in response quality, speed, or precision. This method opens the door to a future in which even "modest" machines can run giant AI models that give better answers and let us avoid relying on cloud models.

Apple already warned us. Three years ago a group of Apple researchers published the study 'LLM in a flash', which pointed to exactly this: to run AI models locally, it would be possible to take advantage not only of the unified memory of Macs, but also of their storage drives. The speed would be low, yes, but it would open up the possibility of running gigantic models locally on machines with far smaller amounts of unified memory. Woods used Claude Code with Claude Opus 4.6 and applied Andrej Karpathy's new "autoresearch" methodology to implement Flash-MoE based on that research. The result is genuinely promising.
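To make the idea more concrete, here is a minimal Python sketch of the general technique these projects rely on (it is not Woods' actual Flash-MoE code, which is on GitHub): in a mixture-of-experts model each token only activates a few experts, so an engine can keep the router in RAM and memory-map the expert weights on the SSD, reading only the experts the router selects for the current token. Every size, file name and the toy router below are illustrative assumptions.

```python
# Minimal sketch of streaming mixture-of-experts weights from disk.
# This is NOT Flash-MoE itself; every size, name and number is illustrative.
import numpy as np

D_MODEL = 64      # toy hidden size
N_EXPERTS = 16    # experts stored on the SSD
TOP_K = 4         # experts actually read per token (fewer experts = less I/O)
EXPERTS_FILE = "experts.bin"   # hypothetical weights file

# Create a dummy weights file once so the sketch runs end to end.
init = np.memmap(EXPERTS_FILE, dtype=np.float16, mode="w+",
                 shape=(N_EXPERTS, D_MODEL, D_MODEL))
init[:] = 0.01
init.flush()

# Re-open read-only: this is a *view* into the file, nothing is loaded yet.
experts = np.memmap(EXPERTS_FILE, dtype=np.float16, mode="r",
                    shape=(N_EXPERTS, D_MODEL, D_MODEL))

def moe_layer(x: np.ndarray, router_w: np.ndarray) -> np.ndarray:
    """Route one token and read only its top-k experts from the SSD."""
    scores = x @ router_w                       # (N_EXPERTS,) router logits
    top = np.argsort(scores)[-TOP_K:]           # indices of the chosen experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                        # softmax over the chosen experts
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        w = np.array(experts[idx], dtype=np.float32)  # this copy is the SSD read
        out += gate * (x @ w)
    return out

# Toy usage: one token flowing through the layer.
x = np.random.rand(D_MODEL).astype(np.float32)
router_w = np.random.rand(D_MODEL, N_EXPERTS).astype(np.float32)
print(moe_layer(x, router_w).shape)             # (64,)
```

Real engines add prefetching, caching of frequently used experts and platform-specific I/O paths, but the basic idea the article describes is the same: per token, only a small fraction of a 400B model's weights ever has to travel from storage to memory.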
Video memory was everything. On my Mac mini M4, for example, I have 16 GB of unified memory. This means that with tools like Ollama I can install and run models like Qwen 3.5 4B locally with some fluidity, but 7B models or others like gpt-oss 20B respond much more slowly (or get stuck altogether). Video memory (unified memory on Apple devices) is the most important parameter when running local models, both in capacity and in bandwidth. If you want to use them fluidly, that is the limiting factor. It is possible to fall back on regular RAM, but speeds drop so drastically that it is often better not to bother.

If you have a fast SSD, you have a treasure. With this approach, the limiting factor becomes the SSD, since the model uses it as a kind of substitute for video memory, and the faster the drive in our computer, the better. There is good news here, because lately PCIe 5.0 drives are reaching about 15 GB/s without much trouble, and that speed already gives enough headroom to use far larger AI models locally than we could before (the back-of-envelope sketch at the end of this article shows why bandwidth sets the ceiling).

A promising future for local (and more private) AI. This development is genuinely striking for anyone who wants to use AI locally, because it allows huge models to be used without a huge investment in the latest generation of graphics cards or, for example, in a Mac with a lot of unified memory: a Mac Studio M3 Ultra with 512 GB of memory costs more than 10,000 euros. With this new method we could opt for a much cheaper machine that, with a good SSD, would let us use giant models in a fairly decent way. Not as fast as those other options, sure, but still very decent. It is a notable step forward for enjoying the benefits of running AI models locally, including the biggest of them all: privacy. With this type of local execution, our conversations and everything we tell the chatbot stay on our machine; they do not end up on the servers of companies like Google, OpenAI, Meta or Anthropic.
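As a closing back-of-envelope check on those bandwidth figures, here is a tiny sketch of why the SSD sets the ceiling on generation speed. All the numbers are illustrative assumptions (the real per-token traffic depends on the model, the quantization and how much stays cached in RAM), not measurements of the projects mentioned above.

```python
# Rough ceiling on tokens/second when expert weights must be streamed from the SSD.
# Every figure here is an assumption for illustration, not a measured spec.

def tokens_per_second_ceiling(ssd_gb_per_s: float, gb_read_per_token: float) -> float:
    """Best case: generation can't be faster than bandwidth / bytes needed per token."""
    return ssd_gb_per_s / gb_read_per_token

# Hypothetical per-token reads: 3 GB with eight active experts, half that with four.
print(tokens_per_second_ceiling(3.0, 3.0))    # phone-class NVMe ~3 GB/s  -> 1.0 tok/s
print(tokens_per_second_ceiling(3.0, 1.5))    # half the experts          -> 2.0 tok/s
print(tokens_per_second_ceiling(15.0, 3.0))   # PCIe 5.0 SSD ~15 GB/s     -> 5.0 tok/s
```

The pattern matches what the article reports: halving the number of active experts roughly doubled the iPhone's speed, and a drive several times faster raises the ceiling by the same factor, which is why a cheap machine with a very fast SSD suddenly becomes interesting for giant models.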
