
60% Fable cost cut by converting code to images and having the model OCR it (github.com)

41 points by dimitropoulos 2 hours ago | 13 comments

aabhay 24 minutes ago [-]

In Gemini at least, if you look at how they process PDFs, they do an OCR and then feed the text + image to the model, without charging you for the text tokens (I believe).

So my guess is that Claude’s backend is doing the same — so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does

lpellis 38 minutes ago [-]

I tried the same thing last year (with OpenAI models). Back then it did reduce prompt tokens, but it needed far more completion tokens, so it ended up more expensive (and slower): https://pagewatch.ai/blog/post/llm-text-as-image-tokens/
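
To make that tradeoff concrete, here is a minimal sketch of the cost arithmetic. Every number in it is hypothetical (token counts, prices, and the assumption that completion tokens cost more per token than prompt tokens); none of them come from the linked post.

```python
def request_cost(prompt_tokens, completion_tokens, prompt_price, completion_price):
    """Total cost of one request; prices are per single token."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price

# All values below are hypothetical, chosen only to illustrate the tradeoff.
PROMPT_PRICE = 1.0      # arbitrary cost units per prompt token
COMPLETION_PRICE = 4.0  # assumption: completion tokens cost more per token

# Code sent as plain text: many prompt tokens, short answer.
text_cost = request_cost(10_000, 500, PROMPT_PRICE, COMPLETION_PRICE)

# Code sent as images: fewer prompt tokens, but a much longer completion.
image_cost = request_cost(4_000, 2_500, PROMPT_PRICE, COMPLETION_PRICE)

print(f"text: {text_cost}, image: {image_cost}")
```

With these made-up inputs the image variant costs more overall despite the smaller prompt, which is the effect described above. Different prices or token counts could flip the result.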

aabhay 1 hour ago [-]

Ahhh my eyes the vibe coded readme

mpalmer 25 minutes ago [-]

What, you don't like your caveats to be honest?

genxy 1 hour ago [-]

This seems like a pricing hack that burns resources. When the loophole gets closed, won't the price of OCR have to rise?

ricardobeat 1 hour ago [-]

It’s not a loophole, it just happens that encoding information as optical tokens is much more efficient than text.

guardiangod 30 minutes ago [-]

Truly a picture is worth a thousand words.

TZubiri 22 minutes ago [-]

Of course it isn't

A text encoding uses 8 bits per character on average, and tokenization compresses that further.

A bitmap glyph would take 25 bits at 5x5 pixels, and most fonts are 12 pixels high.

Of course it isn't efficient. This is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit).
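
A rough sketch of that arithmetic, assuming 1 bit per pixel and a hypothetical 7-pixel glyph width for the 12-pixel-high case (neither is stated above). It only counts raw storage bits, not how a model actually tokenizes either input.

```python
def text_bits(num_chars, bits_per_char=8):
    """Raw size of a string at a fixed number of bits per character."""
    return num_chars * bits_per_char

def bitmap_bits(num_chars, glyph_width, glyph_height, bits_per_pixel=1):
    """Raw size of the same string rendered as fixed-size bitmap glyphs."""
    return num_chars * glyph_width * glyph_height * bits_per_pixel

line = "for i in range(10): print(i)"
n = len(line)

as_text = text_bits(n)
as_5x5 = bitmap_bits(n, 5, 5)    # 25 bits per glyph
as_7x12 = bitmap_bits(n, 7, 12)  # width of 7 is an assumption

print(f"text: {as_text} bits, 5x5: {as_5x5} bits, 7x12: {as_7x12} bits")
```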

legel 22 seconds ago [-]

You are wrong.

Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.

DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]
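
As a back-of-the-envelope sketch of that point: the 1024 dimensions and the 10x ratio are the figures quoted above, while 32-bit floats and equal embedding width for text and vision tokens are assumptions made here for illustration.

```python
def embedding_bits(num_tokens, dims=1024, bits_per_value=32):
    """Size of the embedded sequence the model actually processes."""
    return num_tokens * dims * bits_per_value

text_tokens = 1000
vision_tokens = text_tokens // 10  # the 10x compression figure quoted above

text_size = embedding_bits(text_tokens)
vision_size = embedding_bits(vision_tokens)  # assumes the same embedding width

print(f"one token: {embedding_bits(1)} bits")
print(f"text sequence: {text_size} bits, vision sequence: {vision_size} bits")
```

Under these assumptions the model's input shrinks with the token count, regardless of how many raw bits the source text or image occupied.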

Very cool to see OP's project hacking on this principle. It's still not lossless, as noted on GitHub, but it is a promising research direction.

[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...

samrus 38 minutes ago [-]

Not really. They aren't actually using more resources this way either. This might be a fundamental inefficiency that's being removed.

It kinda makes sense too. While people do read code word by word, we often "glance over" it and do rough pattern recognition to know what it does, only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway.

puppycodes 16 minutes ago [-]

That is hilarious and an amazing find.

dimitropoulos 2 hours ago [-]

There's also a DeepSeek whitepaper on this technique: https://www.seangoedecke.com/text-tokens-as-image-tokens

dippogriff 23 minutes ago [-]

I want to see more text-free foundation models
