As expected, Google has rolled out Gemini 2.0, its latest AI model, which takes multimodal capabilities to the next level and adds agent-like features.
“We’re super excited to kick off this new era with Gemini 2.0, our most advanced model yet. With fresh updates in multimodality—like being able to handle images and audio natively—and using tools right out of the box, we’re getting closer to our dream of a universal assistant,” Google shared in a blog post.
“This is just the start. We believe 2025 is going to be the year of AI agents, and Gemini 2.0 is the model that will drive our work in that direction,” said Demis Hassabis, CEO of Google DeepMind.
Gemini 2.0 Flash can handle all sorts of inputs, including images, video, and audio, and it can produce multimodal outputs such as natively generated images mixed with text and steerable, natural-sounding multilingual speech. Plus, it can natively use tools like Google Search, run code, and call third-party functions that developers define.
The new model outperforms Gemini 1.5 Pro on key benchmarks while running at twice the speed. Developers can get their hands on Gemini 2.0 Flash today as an experimental model via Google AI Studio and Vertex AI, with general availability expected in January 2025.
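If you want to poke at it yourself, here's a minimal sketch assuming the google-genai Python SDK (pip install google-genai) and an API key from Google AI Studio; the model name used here, gemini-2.0-flash-exp, is the launch-time experimental name and may change as the rollout progresses.

```python
# Minimal sketch: text generation plus native Google Search grounding
# with Gemini 2.0 Flash. Assumes the google-genai SDK and an API key
# from Google AI Studio; "gemini-2.0-flash-exp" was the launch name.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Plain text generation with the experimental Flash model.
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Summarize Gemini 2.0's new features in two sentences.",
)
print(response.text)

# Native tool use: let the model ground its answer with Google Search.
grounded = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Who won the most recent Nobel Prize in Physics?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(grounded.text)
```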
Google also launched the Multimodal Live API, which lets developers create interactive apps with real-time audio and video input.
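In rough terms (same assumptions as the sketch above, and note the SDK's streaming surface may evolve), a text-only Live API session looks something like this; a real app would stream microphone audio and camera or screen frames over the same connection:

```python
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

async def main():
    # Open a bidirectional streaming session with the Multimodal Live API.
    # Text-only here for brevity; audio and video frames can be streamed
    # over the same session in a real application.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Hello, Gemini. Can you hear me?",
                           end_of_turn=True)
        # Partial responses stream back as the model generates them.
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```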
Project Astra and AI Agents
At Google I/O 2024, Google introduced Project Astra, a prototype universal AI assistant that has been getting some nice upgrades since. It now supports conversations in multiple languages and is better at understanding different accents and uncommon words.
Thanks to Gemini 2.0, Project Astra can tap into Google Search, Lens, and Maps, making it way more useful for everyday tasks. It also gets a memory boost: it can remember up to 10 minutes of a session and personalize interactions based on past chats. Plus, with improved streaming and native audio understanding, it can respond at something close to human conversational latency.
Google also teased an early-stage effort called Project Mariner, which explores how an agent can help users navigate the web by understanding and reasoning across the information on their browser screen.
The agent can read text, code, images, and forms, and even follow voice commands. “Book a flight from SF to Berlin, leaving on March 5 and coming back on the 12th. We’re getting to a point where you can give a computer a complex task and it’ll handle a lot of the work for you,” said Jeff Dean, Google DeepMind’s chief scientist.
They also introduced Jules, a developer-focused agent that plugs into GitHub workflows to handle coding tasks under a developer’s supervision.
Google DeepMind is also exploring AI agents that can enhance video games and navigate 3D environments. They’ve teamed up with game developers like Supercell to see what the future holds for AI-powered gaming buddies. Plus, Gemini 2.0’s spatial reasoning is being tested in robotics for real-world applications.
And let’s not forget Genie 2, a large-scale foundation world model that can generate an endless variety of playable 3D environments.