Despite their name, large language models (LLMs) do more than just read and generate text. They're also a key component in AI image generators: not only are they essential for interpreting user prompts, but they can also pull useful information from visuals, generate detailed descriptions, and combine text and images to solve more complex problems.
A good example of a platform that leverages this technique is Google's Ask Photos, which Sundar Pichai announced at Google I/O this past May and which began rolling out last month. As people try out the new feature, we're excited to chat about the future of visual search with Ananda Rao Handadi, a senior software engineer at Google and one of the lead developers of Ask Photos. He has extensive experience applying state-of-the-art LLMs in both consumer and enterprise products, particularly in photo search and image understanding.
Read on for his insights into AI-powered visual search, the future of image recognition, and the challenges of applying LLMs to visual data.
What is the role of LLMs in image and text understanding?
Traditionally, LLMs have been used for text-based tasks, like answering questions, summarizing documents, or generating content. This is what you see in platforms like ChatGPT and Gemini, where it feels like you're talking to an actual person. But now, LLMs are expanding beyond just handling language—they're starting to understand images too. That is, they're improving AI's ability to extract meaningful and contextual information from images. As a result, they're making AI more versatile across fields like journalism, forensics, and virtual reality.
What does image understanding look like in practice?
Take Ask Photos, for example—a new feature in Google Photos that I helped develop. Ask Photos uses Gemini AI to answer questions about your photos, extracting details from images based on what you ask, like "What themes have we had for my daughter's birthday parties?" By analyzing details like what decorations are in the background or on the birthday cake, it can give you an accurate answer.
Of course, tools like these can't understand a picture the way humans do. They rely on convolutional neural networks (CNNs) to break the image down into numbers and data points, scanning it for patterns like faces, objects, and textures and distilling the result into something the LLM can interpret and reason about.
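To make that pipeline concrete, here's a minimal sketch of the feature-extraction step, using a pretrained ResNet from torchvision as a stand-in image encoder. The model choice and the filename are illustrative assumptions; this is not the actual Ask Photos architecture.

```python
# Minimal sketch: a pretrained CNN turns an image into a numeric
# feature vector that a language model can then reason over.
# Illustrative only; not the Ask Photos pipeline.
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained CNN and drop its classification head so it
# outputs a 2048-dimensional feature vector instead of class labels.
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("birthday_party.jpg").convert("RGB")  # hypothetical photo
with torch.no_grad():
    embedding = cnn(preprocess(image).unsqueeze(0))

print(embedding.shape)  # torch.Size([1, 2048])
```

In a full multimodal system, a learned projection layer would then map this vector into the LLM's token space so the model can treat the image as part of its input.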
With this tech, LLMs are setting the stage for more conversational, context-aware image search. Instead of relying on static tags and keywords, as traditional methods do, they can recognize complex patterns and context within images, which makes searches more intuitive and accurate.
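As a rough illustration of how that differs from tag matching, the sketch below embeds photos and a free-text query into a shared vector space with an open CLIP model (via sentence-transformers) and ranks photos by cosine similarity. The filenames are hypothetical, and this is not Google's retrieval stack.

```python
# Sketch of context-aware image search: embed photos and a query in a
# shared vector space and rank by similarity, rather than matching tags.
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")  # open CLIP checkpoint

photo_paths = ["beach.jpg", "cake.jpg", "hike.jpg"]  # hypothetical library
photo_embeddings = model.encode(
    [Image.open(p) for p in photo_paths], convert_to_tensor=True
)

# A natural-language query, embedded into the same space as the images.
query_embedding = model.encode(
    "a sunset over the ocean", convert_to_tensor=True
)

scores = util.cos_sim(query_embedding, photo_embeddings)[0]
best = scores.argmax().item()
print(f"Best match: {photo_paths[best]} (score {scores[best].item():.2f})")
```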
What do you think about the future advancements in AI-powered visual search?
Right now, AI is already advanced enough that it can search for a specific event or theme by gathering relevant details from backgrounds, objects, and even emotions. So, instead of saying, "Show me my vacation photos," you'd say, "Show me the best sunset from my trip to Greece."
But AI-powered visual search is getting even smarter, more customizable, and more user-friendly. Soon, you might say, "Create an anniversary card for my wife," and AI will find the best photos of the two of you from past anniversaries and special moments, then turn them into a ready-to-give card.
How can LLMs be better used for applications at the intersection of image and text understanding?
Well, think about a real estate agent taking a few pictures of a property and asking AI to instantly draft a listing. In seconds, they'd have a detailed description of the house ready for potential buyers, saving hours of manual work and freeing up time to engage with clients and drive more revenue. This is just one example of how LLMs can make work more efficient.
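As a hedged sketch, here's how that workflow might be wired up with Google's google-generativeai Python client; the model name, filenames, and prompt are illustrative placeholders, not a production real estate tool.

```python
# Sketch: send property photos plus a prompt to a multimodal model and
# get back a draft listing. Filenames and prompt are illustrative.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

photos = [Image.open(p) for p in ["kitchen.jpg", "yard.jpg", "facade.jpg"]]
prompt = (
    "Write a real estate listing for this property. Describe the rooms, "
    "finishes, and outdoor space visible in these photos."
)

response = model.generate_content([prompt, *photos])
print(response.text)
```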
What are some challenges you've faced in applying LLMs to image search, and how have you overcome them?
One major challenge was building safety measures into the tool. If left unchecked, LLMs can offer potentially offensive or harmful information in the interest of being helpful, and we needed to ensure that every answer the platform provides is safe and appropriate for all audiences.
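In the simplest terms, that kind of guardrail is a gate between the model and the user. Here's a toy sketch, where is_safe is a hypothetical stand-in for a trained safety classifier; real safeguards are far more layered.

```python
# Toy sketch of output gating: every model answer passes through a
# checker before reaching the user. `is_safe` is a hypothetical
# stand-in for a real moderation model.
def is_safe(text: str) -> bool:
    blocked_terms = {"example_harmful_phrase"}  # stand-in blocklist
    return not any(term in text.lower() for term in blocked_terms)

def answer_with_guardrails(generate, question: str) -> str:
    draft = generate(question)
    return draft if is_safe(draft) else "Sorry, I can't help with that."
```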
Another was giving the AI the ability to learn from corrections and incrementally improve itself. This way, if a user corrects an answer or provides extra information, Ask Photos can remember it and improve future responses.
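Here's a toy sketch of that feedback loop: corrections are stored per question and fed back into future prompts as context. The in-memory dict and helper names are hypothetical and stand in for a real per-user feedback store.

```python
# Toy sketch: remember user corrections and prepend them to future
# prompts so the model can use them as context.
corrections: dict[str, str] = {}

def remember_correction(question: str, corrected_answer: str) -> None:
    corrections[question] = corrected_answer

def build_prompt(question: str) -> str:
    context = "\n".join(
        f"The user previously corrected '{q}' to: {a}"
        for q, a in corrections.items()
    )
    return f"{context}\n\nQuestion: {question}" if context else question

remember_correction("Who is in this photo?", "That's my cousin Priya.")
print(build_prompt("Who is in this photo?"))
```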
Finally, a significant challenge was scaling the infrastructure to handle a massive volume of user uploads, about 6 billion photos per day. That meant both improving existing systems and building new ones, and doing it efficiently, since compute costs are only going up. It's a challenge the entire tech industry is facing, which makes optimization a top priority.
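For a sense of scale, a quick back-of-the-envelope calculation:

```python
# 6 billion uploads per day works out to roughly 70,000 photos per
# second, sustained around the clock.
uploads_per_day = 6_000_000_000
seconds_per_day = 24 * 60 * 60            # 86,400
print(uploads_per_day / seconds_per_day)  # ≈ 69,444 photos/second
```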
What other efforts related to AI and visual understanding have you worked on?
In addition to Ask Photos, I led the development of Photos Stacks—an organizational feature in Google Photos that uses machine learning to tidy up your photo library. It automatically groups similar photos into stacks, selecting the best one to represent the group and hiding the rest to reduce clutter. This intuitive grouping enhances the user experience and makes it easier to find what you need without endless scrolling.
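A rough sketch of the core idea, assuming photos are already embedded as vectors: cluster near-duplicates by similarity and surface the highest-quality photo per cluster. The distance threshold and quality scores here are illustrative stand-ins; the actual Photos Stacks models aren't public.

```python
# Sketch: group similar photo embeddings and pick one representative
# per stack. Threshold and quality scores are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def stack_photos(embeddings: np.ndarray, quality: np.ndarray) -> dict:
    """Group similar photos and pick the highest-quality one per stack."""
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.5,  # assumed similarity cutoff
        metric="cosine",
        linkage="average",
    ).fit(embeddings)
    stacks: dict[int, list[int]] = {}
    for idx, label in enumerate(clustering.labels_):
        stacks.setdefault(label, []).append(idx)
    # Represent each stack by its highest-scoring photo.
    return {label: max(ids, key=lambda i: quality[i])
            for label, ids in stacks.items()}

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 64))  # stand-in photo embeddings
q = rng.random(10)               # stand-in quality scores
print(stack_photos(emb, q))
```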
What excites you most about the future of LLMs and AI in image processing?
What really excites me is the potential to spot details or patterns in our photo libraries that we might otherwise miss—even though we took the pictures ourselves. Being able to rapidly identify hidden connections or details from a mountain of photos has the potential to revolutionize so many industries—journalism, research, medicine, the list goes on—and assist users in ways we can scarcely imagine.
LLMs' Potential Impact on Image Understanding
Ananda Rao Handadi's work is already transforming the way we interact with visual data. And as the technology advances further, users will be able to find, analyze, and organize visual content with greater precision and ease.
Curious to learn more? Check out Google's Ask Photos feature and follow Ananda's journey on LinkedIn.