Nvidia now has an omni model that reads, sees and listens. All at once
Eight years ago, when Nvidia was still known mainly as a maker of graphics cards for video games, the company pointed to something that is only now entering the conversation: physical robotics, that is, robots with integrated artificial intelligence that behave autonomously. Like a ChatGPT with arms, ears and eyes. A lot has happened since then, and that future is only now beginning to arrive. Nvidia has kept experimenting with ways of making the physical and digital worlds converge, and its latest product is Nemotron 3 Nano Omni: an AI model that sees, hears and reads the physical world.

Omni models. These models are multimodal, but in a much stricter sense. While the models we use every day require separate channels to process and generate audio, text, images and video, an omni model is designed to be inherently multimodal. This means it uses a single neural network architecture trained end to end, so that the interaction between the model and its inputs is more natural, faster and better at picking up nuance. An example: an AI that can "see" what a camera captures, analyze the whole situation and give the user feedback faster than one that can do the same, but whose text model has to ask the video model what it saw before generating a response. In even fewer words: it better imitates the way humans perceive and respond to stimuli from the world.

Integration. That is what Nvidia says Nemotron 3 Nano Omni can do. It is a model that integrates vision, audio and language capabilities in a single architecture, eliminating the fragmented workflows of current AI agents. According to the company, it is built on a hybrid mixture-of-experts architecture (specialized sub-networks, each trained on different kinds of input) with 30 billion parameters, of which about 3 billion are active during inference.
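To get a feel for the mixture-of-experts idea, here is a minimal, hypothetical sketch in Python. It is not Nvidia's code and every name in it is invented for illustration: a router scores the available experts for each input and only the top-k highest-scoring ones actually run, which is how a model with 30 billion total parameters can do the work of only about 3 billion on any given input.

```python
# Toy mixture-of-experts routing (hypothetical illustration, not Nvidia's code).
# Each "expert" is a simple function; a router scores experts per input and
# only the top-k experts run, so most parameters stay idle on each token.

NUM_EXPERTS = 10   # total experts (stand-in for the full 30B parameters)
TOP_K = 1          # experts activated per input (stand-in for the ~3B active)

def router_scores(x):
    """Score each expert for input x (here: a toy deterministic scoring)."""
    return [abs(hash((e, round(x, 3)))) % 100 + 1 for e in range(NUM_EXPERTS)]

def expert(e, x):
    """Toy expert: each one applies a slightly different transformation."""
    return (e + 1) * x

def moe_forward(x):
    scores = router_scores(x)
    # Pick the top-k highest-scoring experts for this input.
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]
    # Only the chosen experts compute; the rest contribute nothing,
    # and their outputs are mixed in proportion to their router scores.
    total = sum(scores[e] for e in chosen)
    return sum(scores[e] / total * expert(e, x) for e in chosen)

active_fraction = TOP_K / NUM_EXPERTS
print(f"active parameters per input: {active_fraction:.0%}")  # 10%, echoing 3B of 30B
print(moe_forward(0.5))
```

The point of the sketch is the ratio: the router decides which small slice of the network wakes up per input, so inference cost tracks the active parameters (here 1 of 10 experts), not the total parameter count.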
It has been designed to be nine times faster than chaining separate models and to deliver three times the performance of other open omni models, while consuming 2.75 times less compute in tasks such as reasoning over a video.

Okay, but why? That is the key question, beyond the numbers and the raw capabilities of this technology. The use cases detailed by the company are the following:

Agents: powering agents that navigate graphical user interfaces, reasoning over the content on screen and understanding what they are seeing in real time and persistently. The native input resolution is 1920 x 1080, for HD visual understanding.

Documents: interpreting charts, tables, documents, screenshots and mixed-media inputs.

Audio and video comprehension: understanding what it sees and hears together, maintaining a consistent interpretation instead of reasoning over disconnected models.

For professionals. What is clear is that Nemotron 3 Nano Omni is not being launched to reach the masses like the AI models we see every day. Nvidia is aiming it at business: a tool that can be accessed through platforms like Hugging Face and deployed on local systems such as DGX Spark or Jetson. In other words, it is not something available to everyone. The interesting thing is that this technology strongly pushes the narrative of agents as omnipotent entities, and it fits with the latest speech from Jensen Huang, the company's CEO, who says AI will not come to take our jobs away, but to 'micromanage' us.

Image | Nvidia

In Xataka | There is a company that has grown 3,000% in the stock market, even beating the performance of Nvidia: Sandisk