It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol.
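For the curious, the kind of script described might look something like this. This is a minimal sketch with a synthetic image; the thresholds and blob-counting approach are my own illustration, not GPT-5's actual code:

```python
import numpy as np
from collections import deque

def count_gold_blobs(img):
    """Count connected 'golden' regions (stand-ins for dog feet) in an RGB array."""
    r, g, b = img[..., 0].astype(int), img[..., 1].astype(int), img[..., 2].astype(int)
    gold = (r > 150) & (g > 100) & (b < 100)   # crude "golden fur" threshold
    seen = np.zeros_like(gold, dtype=bool)
    h, w = gold.shape
    blobs = 0
    for y in range(h):
        for x in range(w):
            if gold[y, x] and not seen[y, x]:
                blobs += 1                      # new connected component
                q = deque([(y, x)])
                seen[y, x] = True
                while q:                        # flood fill (4-connectivity)
                    cy, cx = q.popleft()
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and gold[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
    return blobs

# Synthetic test: bright green "grass" with five golden patches ("feet")
img = np.zeros((60, 100, 3), dtype=np.uint8)
img[..., 1] = 200                               # green background
for cx in (10, 30, 50, 70, 90):
    img[45:55, cx-3:cx+3] = (200, 150, 40)      # one golden patch per "foot"
print(count_gold_blobs(img))                    # prints 5
```

The point of the anecdote, of course, is that when a script like this correctly found 5, the model re-tuned the threshold until it found 4.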
Anyway, Gemini 3, while still unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you'd expect a well-endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.
reed1234 15 minutes ago [-]
Could you link the image? Interesting stuff.
cancan 17 minutes ago [-]
this is hilarious and incredibly interesting at the same time! thanks for writing it up.
TheAceOfHearts 6 minutes ago [-]
Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
[0] https://imgur.com/ekwfHrN
[1] https://imgur.com/1nybezU
[2] https://imgur.com/18mK5i5
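Those two gotchas (case normalization and space removal) are easy to show in code. A toy solver sketch, using a made-up grid rather than the actual puzzle from the thread:

```python
def find_words(grid, words):
    """Locate each word in a letter grid, scanning all 8 directions.
    Uppercases both sides and strips spaces ('soup mix' -> 'SOUPMIX'),
    the two normalization steps mentioned above."""
    rows = [row.upper() for row in grid]
    h, w = len(rows), len(rows[0])
    dirs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    found = {}
    for word in words:
        target = word.upper().replace(" ", "")
        for y in range(h):
            for x in range(w):
                for dy, dx in dirs:
                    # End cell of a candidate run starting at (y, x)
                    ey, ex = y + dy * (len(target) - 1), x + dx * (len(target) - 1)
                    if not (0 <= ey < h and 0 <= ex < w):
                        continue
                    if all(rows[y + i*dy][x + i*dx] == target[i] for i in range(len(target))):
                        found[word] = ((y, x), (dy, dx))  # start cell and direction
    return found

# Toy grid (not the puzzle from the thread)
grid = ["XSOUPMIX",
        "ACATQQQQ",
        "ZZZZZZZZ"]
print(find_words(grid, ["soup mix", "cat"]))
```

Trivial for code, but apparently still hard to do purely "in the pixels" when highlighting.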
I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
jasonjmcghee 37 minutes ago [-]
That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?
agentifysh 1 hours ago [-]
impressive.....most impressive
it's going to reach the low 90s very soon if trends continue
knollimar 46 minutes ago [-]
I do some electrical drafting work for construction and throw basic tasks at LLMs.
I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
amorzor 28 minutes ago [-]
Can you give an example of the sort of harness you used for that? Would love to play around with it
fngjdflmdflg 2 hours ago [-]
These OCR improvements will almost certainly be brought to Google Books, which is great. Long term it could enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run and via a paid API, to do that.
[0] https://annas-archive.org/blog/critical-window.html
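For scale, a hedged back-of-envelope. Every number below is my own illustrative assumption, not a figure from the linked post:

```python
# All figures are illustrative assumptions, not from the linked post.
books = 10_000_000        # assumed count of non-digital rare books
mb_per_book = 0.5         # assumed size of OCR'd plain text per book, in MB
usd_per_tb = 15           # assumed bulk HDD price per terabyte

total_tb = books * mb_per_book / 1_000_000
cost_usd = total_tb * usd_per_tb
print(total_tb, cost_usd)  # 5.0 TB, $75.0 of raw disk
```

Even with generous multipliers for redundancy and per-book overhead, text-only storage lands comfortably under a $5,000 budget; the real cost is the OCR compute, not the disk.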
Love how employee portals for many companies essentially never get updated design-wise over the decades, lol. That page styling and the balls certainly take me back.
inerte 10 minutes ago [-]
I used to work for a company where the SSO screen had a nice corporate happy-people-at-the-office type of image. 25 MB. I was in Brazil on crappy roaming 2G service and couldn't log in at all. I know most of the work happens on desktop but geee.....
Oh speaking on mobile, I remember when I tried to use Jira mobile web to move a few tickets up on priority by drag and dropping and ended up closing the Sprint. That stuff was horrible.
jamiek88 1 hours ago [-]
Wow yeah. Flashbacks to when Gmail invites were cool! Google too.
ed 46 minutes ago [-]
Same with "See prompt in Google AI Studio" which links to an unpublished prompt in AI Studio.
caseyf 13 minutes ago [-]
I'm playing with this and wondering if it's actually a good way to identify dominant colors and other features of a garment/product when using a photo where the item is styled rather than isolated from the model or other garments.
ed 56 minutes ago [-]
What's new here? I believe this is just Gemini 3, which was released last month (the model ID hasn't changed AFAICT).
minimaxir 52 minutes ago [-]
Nothing new, it's just highlighting practical vision use cases.
devinprater 45 minutes ago [-]
Audio described Youtube please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.
SXX 8 minutes ago [-]
Hey, I just ran a simple test: downloaded a 5-minute YouTube video and uploaded it to the Gemini app.
Source video title: Zelda: Breath of the Wild - Opening five minutes of gameplay
https://www.youtube.com/watch?v=xbt7ZYdUXn8
Prompt:
Please describe what is happening in each scene of this video.
List scenes with timestamps, then describe separately:
- Setup and background, colors
- What is moving, what appears
- What objects are in this scene and what is happening
Basically, make a description of the 5-minute video for a person who can't watch it.
Result on a GitHub gist since there's too much text: https://gist.github.com/ArseniyShestakov/43fe8b8c1dca
I'd say this is quite accurate.
I'm really fascinated by the opportunities to analyze videos. The amount of tokens it compresses down to, and what you can reason across those tokens, is incredible.
minimaxir 35 minutes ago [-]
The actual token calculations for input videos with Gemini 3 Pro are... confusing.
https://ai.google.dev/gemini-api/docs/media-resolution
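For a rough sense of the arithmetic involved: the sampling rate and per-frame token figures below are my assumptions based on the media-resolution docs for earlier Gemini models, and may well differ for Gemini 3 Pro:

```python
# Rough illustration only; fps and tokens-per-frame are assumptions based on
# docs for earlier Gemini models, not confirmed figures for Gemini 3 Pro.
fps_sampled = 1            # assumed frames sampled per second of video
tokens_per_frame = 258     # assumed cost per frame at "default" media resolution
seconds = 5 * 60           # e.g. a five-minute clip

video_tokens = seconds * fps_sampled * tokens_per_frame
print(video_tokens)  # 77400
```

So a five-minute clip compresses to roughly the token budget of a long document, which is why reasoning across it works at all, and also why the exact per-frame accounting matters so much.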
"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
minimaxir 50 minutes ago [-]
Gemini 3 Pro is not Nano Banana Pro, and the image generation/model that decodes the generated image tokens may not be as robust.
The thinking step of Nano Banana Pro can refine some intermediate steps (i.e. the errors in the homework correction and where they sit spatially in the image), but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.
hodder 44 minutes ago [-]
As a consumer I typed this into "Gemini". The behind the scenes model selection just adds confusion.
If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).
iknowstuff 7 minutes ago [-]
Nano Banana generates images.
This article is about understanding images.
Your task is unrelated to the article.
spchampion2 40 minutes ago [-]
I actually did this prompt and found that it worked with a single nudge on a followup prompt. My first shot got me a wine glass that was almost full but not quite. I told it I wanted it full to the top - another drop would overflow. The second shot was perfectly full.
RyJones 37 minutes ago [-]
That's the correction I'd expect to give to an intern, not a junior person.
ugh123 15 minutes ago [-]
did it return the exact same glass and surrounding imagery, just with more wine?
iamjackg 1 hours ago [-]
Curious how this will fare when playing Pokemon Red.
Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.
euvin 1 hours ago [-]
Yeah the "High frame rate understanding" feature caught my eye, actual real time analysis of live video feeds seems really cool. Also wondering what they mean by "video reasoning/thinking"?
skybrian 1 hours ago [-]
I don’t think it’s real time? The videos were likely taken previously.
causal 2 hours ago [-]
Okay, maybe this one isn't an exaggeration when they say "leap forward".
siva7 1 hours ago [-]
Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background; then it generated something that mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.
minimaxir 1 hours ago [-]
That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.
siva7 1 hours ago [-]
What's the difference? I thought the vision AI component of Gemini 3 is called Nano Banana?
IanCal 58 minutes ago [-]
That’s about generating images, the other side is about understanding images.
brokensegue 49 minutes ago [-]
I assumed Nano Banana was just a tool that Gemini 3 used, though I don't know.
minimaxir 46 minutes ago [-]
Gemini 3 Pro's text encoder powers Nano Banana Pro, but it has its own image decoding model that decodes the generated image tokens into an actual image, which appears to be the more pertinent issue in this case.
jonplackett 1 hours ago [-]
Google really is a fully awakened sleeping giant. More code reds being issued today, I expect.
ch2026 2 hours ago [-]
what framework is being utilized for computer use here?
dmarzio 38 minutes ago [-]
So we're going to use this to finally make the maid from the Jetsons. Right?
empressplay 1 hours ago [-]
Yes, but can it play PacMan yet?
agentifysh 1 hours ago [-]
I'm realizing how much of a bottleneck vision models are.
I'm just a glorified speedreadin' promptin' QA at this point with Codex.
Once it replaces the QA layer, it's truly over for software dev jobs.
The future would be a software genie where on AI Studio you type: "go make a Counter-Strike 1.6 clone, here is $500, you have two hours"
Edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11% to 71%, even beating Opus 4.5's 50%... ChatGPT is at 3.5%, and it matches my experience with Codex.
alex1138 1 hours ago [-]
> once it replaces the QA layer its truly over for software dev jobs
Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed: I-promise-not-to-sue-company-if-AI-deletes-the-entire-database.
It won't happen for that reason alone. Oh, who am I kidding, of course it will.
hklrekeclhkle 42 minutes ago [-]
[dead]
stego-tech 1 hours ago [-]
The document paints a super impressive picture, but the core constraint of "network connection to Google required so we can harvest your data" is still a big showstopper for me (and for all cloud-based AI tooling, really).
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).
oklahomasports 43 minutes ago [-]
People with your concerns probably make up 1% of the market, if that. Also, I don't upload stuff I'm worried about Google seeing. I wonder if they will allow special plans for corporations.
stego-tech 29 minutes ago [-]
I’m very curious where you get that number from, because I thought the same thing until I got a job inside that market and realized how much more vast it actually is. The revenue numbers might not be as big as Big Tech, but the product market is shockingly vast. My advice is not to confuse Big Tech revenues for total market size, because they bring in such revenue by catering to everyone, rather than specific segments or niches; a McDonald’s will always do more volume than a steakhouse, but it doesn’t mean the market for steakhouses is small enough to ignore.
As for this throwaway line:
> Also I don’t upload stuff I’m worried about Google seeing.
You do realize that these companies harvest even private data, right? Like, even in places you think you own, or that you pay for, they’re mining for revenue opportunities and using you as the product even when you’re a customer, right?
> I wonder if they will allow special plans for corporations
They do, but no matter how much redlining Legal does to protect IP interests, the consensus I keep hearing is “don’t put private or sensitive corporate data into third-parties because no legal agreement will sufficiently protect us from harm if they steal our IP or data”. Just look at the glut of lawsuits against Apple, Google, Microsoft, etc from smaller companies that trusted them to act in good faith but got burned for evidence that you cannot trust these entities.
bgwalter 48 minutes ago [-]
ARPANET was supposed to be decentralized. Now everyone wants to centralize everything, so in a war it is sufficient to strike 100 data centers and the whole tethered economy collapses.
That is called progress.
stego-tech 46 minutes ago [-]
Ah, the fond memories of telnetting to NCSA to upload the raw HTML of my first website, written on an OG Macintosh computer and ported via floppy to a PowerMac for network connectivity.
Simple, elegant. I do miss those days.
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.
If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).
This article is about understanding images.
Your task is unrelated to the article.
Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.
im just a glorified speedreadin' promptin' QA at this point with codex
once it replaces the QA layer its truly over for software dev jobs
future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"
edit: saw the Screenspot benchmark and holy ** this is an insane jump!!! 11% to 71% even beating Opus 4.5's 50%...chatgpt is at 3.5% and it matches my experience with codex
Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed. I-promise-not-to-sue-company-if-AI-deletes-the-entire-database
It won't happen for that reason alone. Oh who am I kidding of course it will
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).
As for this throwaway line:
> Also I don’t upload stuff I’m worried about Google seeing.
You do realize that these companies harvest even private data, right? Like, even in places you think you own, or that you pay for, they’re mining for revenue opportunities and using you as the product even when you’re a customer, right?
> I wonder if they will allows special plans for corporations
They do, but no matter how much redlining Legal does to protect IP interests, the consensus I keep hearing is “don’t put private or sensitive corporate data into third-parties because no legal agreement will sufficiently protect us from harm if they steal our IP or data”. Just look at the glut of lawsuits against Apple, Google, Microsoft, etc from smaller companies that trusted them to act in good faith but got burned for evidence that you cannot trust these entities.
That is called progress.
Simple, elegant. I do miss those days.