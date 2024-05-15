You’re on a phone call (on your Android phone, let’s be specific). Some dodgy “bank executives” call you saying some fraudulent transactions have been attempted via your account and suggest transferring money to another account to keep it safe. Is the miracle of humanity at work, or simply malicious intent? Your Android phone will alert you in real time, as the call is happening, about a “likely scam” attempt with a specific detail – “banks never ask you to move your money to keep it safe”. End call is an option on that notification. That is a world Google envisions, powered by artificial intelligence (AI). Behind it is the on-device Gemini Nano, the smallest among all Gemini AI models.

If Google’s determination not to be seen as falling behind in the larger AI stakes was on full show at this week’s I/O 2024 keynote, we can credit fast-moving competition for that. OpenAI’s GPT can be credited with kickstarting the AI age a bit more than a year ago, and the company hasn’t looked back despite some management upheaval last year.

Everyone’s played catch-up, so much so that Microsoft smartly decided to align with OpenAI, investing over $10 billion and getting GPT as the underlying model for its Copilot assistant for Microsoft 365.

A day before Google was all set to talk about Gemini’s updates and its vision for AI on Android, OpenAI’s curveball was the new, and iterative, GPT-4o model. Make no mistake, this is a significant step forward in accepting video and audio inputs, well beyond the text we already know it is proficient with.

It is now an assistant that can read your facial expressions and the tone of your voice, can hold a conversation with context as a smart-enough human would, do real-time translations of what it reads or hears, and can even see the world through your phone’s camera as well as talk about it. Google Lens and Microsoft Translator, better watch out?

Talk to AI, discuss questions that are burning inside, and even rehearse a performance you have to give. AI shall be your companion. At this point, my mind wandered over to an OpenAI partnership with Be My Eyes, which will now upgrade from GPT-4 to GPT-4o as the foundation for guidance for users with visual impairments. That’s a specific use case, which will definitely bring good to the world.

This is what I’ve been working to get at. Generative AI, whether on-device or relying on the cloud, has now become incredibly powerful and more capable than we may imagine. Models are becoming smarter, with an ability to follow complex and nuanced instructions, including format and style. No wonder they can now plan travel itineraries for human users. At least on paper, until real-world experience shows that they can’t and they end up tripping over their own feet.

For now, be amazed and terrified, in equal measure.

The fact that Google responded to GPT-4o’s ability to see the world through the phone camera with Gemini Live, within 24 hours, shows just how little margin there is to get any of this wrong. Announcements are done; the pressure only increases now for teams to get this right. It’ll be akin to humans chatting, with natural-sounding voices, conversational flow and mid-sentence interruptions. In Google and OpenAI’s demos, AI was able to correctly identify the world as seen through a phone camera. Did I mention amazing and terrifying, as two underlying sentiments?

My attention was drawn to certain announcements underlining Google’s advantage: the ecosystem of services it has, available to millions of users. At some point, you’ll begin to wonder too – how much does AI know about us as individuals already, and how much more will it know a few months later?

What could the future hold?

Here’s where we are headed. Google intends to integrate its Gemini models in a sidebar in Gmail, Docs, Drive and more, not just for Workspace users, but for all of us who pay for Gemini Advanced (I don’t yet, should I? Unlikely.). It’ll happen, at some point. You can then have an AI agent working to organise all receipts (shopping, travel etc) in your Gmail into a spreadsheet. Or a tool that’ll help locate order details for a product you’d like to return and help with that process (global implementation will be difficult, and shopping sites will have their restrictive processes). Or an AI assistant planning a trip for you, based on what you told it about the places you’d like to visit in a city or the food you’d want to have.

The key for all this smartness to work is your data. The key to Gemini in Gmail, Docs, Drive and other Google services is having the model at work on your device and account specifically tuned for your data. The responses to queries will likely have better context; it’ll be able to search for answers in your documents or chats and know what you search for over time (sports scores, for instance). That means fewer chances of AI hallucinating or getting the context wrong.

Google Assistant is being replaced with Gemini for a singular reason – to be able to anticipate what a user is trying to do, and so build context and relevance into its suggestions. That’ll include tasks on the phone or contextualising documents. AI in Messages will have access to conversations for context, so that it is ready with suggestions or help specific to what you’re doing at the time.

The thing is – didn’t we lose the battle to hold on to our data with a layer of privacy a long time ago? Google executives did insist time and again, when I asked them, that no user data will be used to train AI models. And for AI features, be it in Google services or in Android phones, there will be clear options to turn those off too. OpenAI, for its part, is limiting the voice options available for GPT-4o’s audio modalities till it has more training and safety measures in place.

Competition is why we are where we are, with AI that is rapidly becoming more realistic. No discussion is complete without a mention of Google’s Veo text-to-video tool, which is for now available to select creators. The level of realism, at least with what’s demoed, has to be seen to be believed. It reminds me of a time earlier this year, when OpenAI detailed its own text-to-video generative AI tool called Sora. It is so realistic that they have refused to release it for public use, at least for the time being.

It was an interesting move by OpenAI to release a ChatGPT app for Apple’s Mac computing devices. Not Windows, though Microsoft has been a pillar of support over the years. Is that a way for Mira Murati, OpenAI’s CTO, to signal that a speculated deal with Apple is ready to be confirmed? Is that really where most ChatGPT users are, or do we read between the lines – that’s where most ChatGPT users will be? I’d not be surprised if Apple does rely on OpenAI to sort out AI for iPhone, iPad and Mac, at least for the next few years. If that is to happen, we’ll know next month.

Vishal Mathur is the technology editor for Hindustan Times. Tech Tonic is a weekly column that looks at the impact of personal technology on the way we live, and vice-versa. The views expressed are personal.