> Me: What was the sushi place near latitude 41 in Columbus? Did it go out of business I don't see it on Google maps anymore
> Sonnet 5 (medium): None of these past chats mention a sushi place — I don't have anything on record about that. Do you remember the name, or roughly which part of Columbus (neighborhood/street) it was near? That'll help me search and check its current status.
Not impressed. It got the name right on high effort one shot but hallucinated the date relativity (Jan 2026 is not last month...). Worked okay on extra. Sonnet 4.6 worked fine on medium, high, and extra one shot.
sejje 17 minutes ago [-]
Try it 25 more times and let us know how it averages out. It's non-deterministic, remember?
nijave 14 minutes ago [-]
I tried 3 more times. Two were nearly identical and 1 recognized Latitude 41 as a restaurant but had a similar useless reply
ben_w 8 minutes ago [-]
"Latitude 41" is a business name?
If you gave me this prompt, I'd say "Which Columbus? None of the ones I know about are at a latitude of 41 degrees north or south."
Mtinie 15 minutes ago [-]
What was your expectation? That your prompt would trigger a web search, first, before the introspection of past conversations and a training set recall?
How did Sonnet 4.6 respond that was objectively better for your use case?
Jcampuzano2 17 hours ago [-]
I'm struggling to understand why I'd ever use this instead of just using a lower effort level for opus given on many of the benchmarks listed the cost per task rises above opus at anything higher than medium effort.
Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.
itopaloglu83 17 hours ago [-]
More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things.
I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.
I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
scrollop 7 hours ago [-]
"I think the models are being optimized for wealth extraction from users and companies, instead of solving problems."
YES! They introduced the new tokenizer to increase token generation by up to 33%.
On top of this, Anthropic are generating almost twice as much revenue per paid user than openai - whilst their subscriptions have lower usage limits than openai's:
This slot machine has access to your bank account and can decide how much to play on its own!
ngruhn 3 hours ago [-]
> I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.
I don't think so. I'd expect that in a market with high vendor lock-in, but that's not the case here. The market is extremely competitive and switching costs are near zero. Anthropic can't afford to pull shit like this and sacrifice quality.
notnaut 60 minutes ago [-]
The disconnect between the reality of and the consumer sentiment of this particular realm of products seems to be one of the most dramatic and widespread I’ve personally ever seen.
__natty__ 15 hours ago [-]
> More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things
Yeah, those are my thoughts as well. I feel it’s great for benchmarks and some tasks, while in others it tries to spend as many tokens as possible, overcomplicates the task, and needs a second or third round of steering, which costs money. At the scale Anthropic operates, I bet it’s a huge amount of extra money just to make sure their model works.
Aeolun 14 hours ago [-]
It’s really weird when you go to one of the open models and suddenly the same context window stretches nearly 3-4 times as long.
indoordin0saur 12 hours ago [-]
Yeah. Mine really likes to read excess code. I'll ask it questions like "If I move all these three ETL jobs into a subfolder will it break anything?" It'll start with giving me the simple answer but then continue on to consider another question and realize it requires reading my entire other repo that handles all of my cloud's infrastructure. And it'll proceed to read through tens of thousands of lines of terraform.
post-it 16 hours ago [-]
> I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes on to plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.
Traubenfuchs 3 hours ago [-]
It's really bad when you let opus do investigations on broken java or infrastructure stuff. It starts decompiling .jar, sometimes multiple versions of the same dependency, reading every single kubernetes/terraform file and loading all the logs and info kubectl offers.
3ffs 12 hours ago [-]
There were many of us who predicted and saw this months ago.
Should I refer to those who are only realising this now as stupid? I believe so.
It's not wealth extraction btw - the correct economic term is capturing/extracting surplus. They have a wide range of schemes - quality discrimination being one very obvious one.
Swear most of you on here pretend to be soooo smart when you def are not.
licjon 37 minutes ago [-]
Is there a router or wrapper that provides a real-time cost estimation for alternative settings? Obviously, you can't predict exact output tokens without running the inference, but a tool that calculates the exact input cost across models and applies a historical average for the output tokens could be useful. Like, you run a task on Sonnet, and it estimates: "Based on your input tokens and a 1:1 output ratio, this would have cost $X on Opus at a low effort level."
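A minimal sketch of the estimator described above, not an existing tool. The model names and per-million-token prices are hypothetical placeholders, and a fixed `output_ratio` stands in for the "historical average" of output tokens per input token:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Dollars per million tokens (placeholder values, not real pricing)."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical price table; replace with the provider's published rates.
PRICES = {
    "model-small": Price(input_per_mtok=1.0, output_per_mtok=5.0),
    "model-large": Price(input_per_mtok=5.0, output_per_mtok=25.0),
}


def estimate_cost(model: str, input_tokens: int, output_ratio: float = 1.0) -> float:
    """Exact input cost plus an estimated output cost, in dollars.

    The output length is unknown before inference, so it is approximated
    as input_tokens * output_ratio.
    """
    price = PRICES[model]
    est_output_tokens = input_tokens * output_ratio
    total = (
        input_tokens * price.input_per_mtok
        + est_output_tokens * price.output_per_mtok
    )
    return total / 1_000_000


def compare(input_tokens: int, output_ratio: float = 1.0) -> dict[str, float]:
    """Estimated cost of the same request on every model in the table."""
    return {name: estimate_cost(name, input_tokens, output_ratio) for name in PRICES}


if __name__ == "__main__":
    for name, dollars in compare(input_tokens=20_000).items():
        print(f"{name}: ${dollars:.4f}")
```

One caveat with this kind of estimate: token counts depend on the tokenizer, so the input count would likely need to be recomputed per model rather than reused.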
nicce 17 hours ago [-]
Older Opus models will likely get deprecated and then over time this is the cheapest model. That is how prices are currently increased.
ChrisLTD 15 hours ago [-]
Yeah... Sonnet becomes the new cheap model, and some Fable class model becomes the more expensive/better one.
theptip 10 hours ago [-]
Wat. Price/perf has been going down massively over the last few years.
darkwater 2 hours ago [-]
Because they still haven't fully captured the market for Agentic Development.
phainopepla2 16 hours ago [-]
Looking at some of the agentic coding benchmarks on the system card[0], pages 117-118, it seems that running it at low outperforms Sonnet 4.6 at any level, and is a good deal cheaper as well. So on low it could be a good workhorse for an Opus-planned task.
I concur. I already use Opus 4.8 for almost all my tasks and this gives me almost no reason to try Sonnet 5.
c0m47053 14 hours ago [-]
Specific task based benchmarks don't reflect a lot of day to day agentic use cases in my experience. If you are working on a series of discrete tasks and can clear context after each one and move to the next, you might get that sort of efficiency from Opus low effort. I often find that when working through a real problem, iterating and discovering, context length can creep up, and that is where opus tends to get expensive.
fluidcruft 1 hours ago [-]
If you are out of Opus credits, you are out of all model credits.
theptip 10 hours ago [-]
Are we reading the same chart? They have Sonnet <= high as Pareto dominant on $/perf.
You have to test each task obviously but it is not a bad model on its face.
frozeus 3 hours ago [-]
They have updated it
LUmBULtERA 2 hours ago [-]
Did Anthropic have Opus 4.8 and Sonnet 5 switched in the Agentic Search chart at first?
fluidcruft 1 hours ago [-]
No, and the original had everything more expensive. There's a comparison here:
The explanation Anthropic gave for the update doesn't address how the x-axis needed to range up to $50 previously and only $10 now. In any case the pass rates are also lower.
Probably the difference between whatever it is people notice when they say models become "nerfed".
SirMaster 17 hours ago [-]
Maybe it's not for you? I don't pay, so I can't even use Opus... So this is an upgrade over Sonnet 4.6 for me.
enraged_camel 17 hours ago [-]
Speed is a huge reason. Sometimes you just need some simple tasks to get done fast, and waiting 30-60 seconds for opus to even start thinking can really slow things down.
humanymous 17 hours ago [-]
Opus with low reasoning effort would be faster than Sonnet with high reasoning. So that won't exactly help.
I think it would just be what those models are optimized to perform
conradkay 18 hours ago [-]
Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
sixtyj 18 hours ago [-]
I have tried to rewrite an article with GLM-5.2 and with Sonnet 4.6. Completely different results as LLM is non-deterministic. But GLM-5.2 made a lot of subtle mistakes that needed to be corrected by hand. In contrast, Sonnet found and corrected all mistakes in the second round.
The situation was similar with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results were different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
Traubenfuchs 3 hours ago [-]
> Completely different results as LLM is non-deterministic.
You'd need to produce this like 20 times by each model and then do 2x20x20 cross comparisons by both models and ultimately distill the 2x20x20 comparison results into two reports of how they differ.
In this non deterministic computing future, everything else is voodoo, feelings and "vibes".
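The 2x20x20 protocol above can be sketched roughly as follows. This is one reading of the comment, not a standard evaluation harness: each of two judges compares every output of model A against every output of model B (20 x 20 pairs per judge). The `longer_wins` judge is a toy stand-in; in practice the judge would be a call to each model.

```python
from itertools import product
from typing import Callable, Dict, Sequence

# A judge takes (output_a, output_b) and returns "a", "b", or "tie".
Judge = Callable[[str, str], str]


def cross_compare(
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    judges: Dict[str, Judge],
) -> Dict[str, Dict[str, int]]:
    """Tally wins per judge over every (a, b) pair of outputs."""
    tallies: Dict[str, Dict[str, int]] = {}
    for judge_name, judge in judges.items():
        counts = {"a": 0, "b": 0, "tie": 0}
        for out_a, out_b in product(outputs_a, outputs_b):
            counts[judge(out_a, out_b)] += 1
        tallies[judge_name] = counts
    return tallies


def longer_wins(out_a: str, out_b: str) -> str:
    """Toy judge for illustration only: the longer output wins."""
    if len(out_a) == len(out_b):
        return "tie"
    return "a" if len(out_a) > len(out_b) else "b"


if __name__ == "__main__":
    runs_a = [f"draft {'x' * i}" for i in range(20)]  # 20 runs of model A
    runs_b = [f"draft {'y' * i}" for i in range(20)]  # 20 runs of model B
    print(cross_compare(runs_a, runs_b, {"judge_1": longer_wins, "judge_2": longer_wins}))
```

With 20 runs per model and two judges, this produces 400 comparisons per judge, 800 in total, which then still have to be distilled into a report.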
jamesrcole 2 hours ago [-]
I would expect a model's result each time to be of a similar quality to the other times. There's something wrong if it does a way better or worse job, at the same problem, sometimes. It's possible, but I haven't heard anyone saying that they do.
Retr0id 18 hours ago [-]
Finally, a viable business strategy - sell security-oblivious code monkeys for cheap, then charge premium rates for agents capable of cleaning up the mess.
JacobAsmuth 18 hours ago [-]
I think instead they should sell super hackers and get their product banned instantly and go bankrupt
usef- 14 hours ago [-]
Judging by the events of recent weeks, I'm guessing the low cyber results are why they were allowed to release it
loufe 18 hours ago [-]
Not to single you out, parent commenter, but I really hope the quality of discourse on HN will move past these basic comparisons eventually. It seems like every thread on every model release has the exact same comments.
"Wow, X model is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models' lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigmarole? I do not know what that would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
sejje 20 minutes ago [-]
I feel the same way sometimes.
I read a comment earlier that said "I think it's likely that they've scraped all the code regardless of license and trained on it, given how much they scrape the web."
That's what every other comment said like 3 years ago. Where has this guy been?
The trends in discussion about LLMs get very, very tired--there's little added but personal opinions.
tripleee 18 hours ago [-]
I'm not sure what else can be said? I've found benchmarks to be a very weak signal for how good/bad the model is, but it's the #1 thing the companies highlight.
20 minutes after the announcement there's no real useful statement that can be made about it.
conradkay 15 hours ago [-]
Yeah you definitely have to be skeptical regarding sentiment for open/local model capabilities, since there's bias from what people want to be true.
I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that
tiahura 18 hours ago [-]
"It's totally obvious they quantized Claude Z"
sejje 19 minutes ago [-]
At least we quit with the "i asked it this question and here's what it said" comments. They were truly awful for the first 6 months or so.
Or the "I have my own personal benchmark..."
"Claude and its political bias thinks the supreme court should..."
microtonal 18 hours ago [-]
> Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.
I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic: I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
Brendinooo 18 hours ago [-]
Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development.
Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.
everforward 18 hours ago [-]
There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
e9 16 hours ago [-]
I want to propose alternative reality where 1.5-2.5T in value doesn't go to a handful of companies. Instead it turns out to be like restaurants where this gets distributed to lots and lots of small, local, mostly interchangeable teams. There will of course be some super star "chefs" leading the industry and setting trends and some "restaurant chain" like big businesses and supply chain for all of this.
keeda 13 hours ago [-]
FWIW I do think that availability of competitive open weight and other non-frontier models, along with improvements in harnesses that can get good results out of these models, will result in less concentration and a healthier marketplace.
However, these frontier labs are also making moves that could let them capture a disproportionate share of the upside. One possibility is a situation analogous to the smartphone manufacturing space, where there are dozens of players but just a handful (e.g. Apple, Samsung in smartphones) capture the lion's share of the revenue.
skeptic_ai 12 hours ago [-]
With Apple you can’t exit the ecosystem.
Samsung is the same, and it is the best Android device.
If a Nokia OS came out tomorrow it would be dead in the water: it has no apps.
But with a new llm that doesn’t matter. There is nothing sticky about typing Gemini, Claude or codex in a cli.
keeda 10 hours ago [-]
There's nothing sticky today but you can bet they're working maniacally to fix that. These companies will make most of their money in the enterprise space and there are probably unlimited ways to engineer stickiness in an enterprise setting. Like, MSFT still rakes in those billions despite pretty much every one of their products having commodity competitors.
The AI labs are also making moves to secure long-term enterprise presence, such as their Forward Deployed Engineer strategy. I think that is a trojan horse play that could make enterprises dependent on them forever, much like so many companies are still dependent on IBM's mainframes. As an extreme example, you could imagine a company's core business logic encoded in the weights of a proprietary model custom-trained and hosted by one of these model providers, something even more inscrutable and sticky than ancient COBOL codebases.
xxpor 16 hours ago [-]
The world is not zero sum. Value is created, not just preserved. Anthropic and OpenAI creating value does not imply that smaller guys can not also create value.
afavour 16 hours ago [-]
But marketplaces also exist and big players in a marketplace are often able to manipulate the market such that they are advantaged and small players are not able to break in.
mpyne 15 hours ago [-]
This is true of every market that has ever existed, and that's not stopped small players from finding niches.
bdamm 16 hours ago [-]
How? Training and operating models seem to naturally favor those willing to invest quite significantly in these operations.
nish__ 15 hours ago [-]
If RAM prices come down, running your own models will be relatively affordable.
actionfromafar 16 hours ago [-]
Sysco is pretty big.
ricardobayes 16 hours ago [-]
I am deeply surprised by the silence of philosophers, sociologists, liberal arts majors, economists. Where are the think tanks who contemplate and debate the societal aspects? The tech is advancing full steam but the "other side" doesn't feel anywhere nearly ready.
bloppe 16 hours ago [-]
Idk why you're perceiving silence. Feels to me like this is the main thing people talk about nowadays.
scarmig 15 hours ago [-]
It has to do with the scope of what they're discussing. It seems extraordinarily small: e.g. what if AI increases productivity growth by 0.4%? Do data centers use too much water? Are AIs racist when reviewing resumes?
The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approaches it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi.
freejazz 15 hours ago [-]
The public would happily string up any of these CEOs if given the chance
bdamm 15 hours ago [-]
Because the "other side" is busy trying to anthropomorphise AI into solving the trolley problem, while being mostly clueless about the actual problems.
They'll show up after the fact and whinge endlessly about how they should have been involved.
DrewADesign 12 hours ago [-]
I guess the real problems are things like people not being allowed to post AI-generated images in digital drawing, painting, and photography communities, because I see a lot of boosters ceaselessly whining about that abject “discrimination”, despite having plenty of places where people post all kinds of that garbage all the time.
Or maybe every cultural group has its own set of whiners and we always think the ones we disagree with are the loudest.
freejazz 15 hours ago [-]
Silence? Even the pope has come out against AI? Who hasn't? Diplo??
digitaltrees 16 hours ago [-]
Reid Blackmun has written several books and has a consulting agency to guide ethical implementation of AI
DrewADesign 13 hours ago [-]
Sometimes the great algorithmic gods give us a glimpse of our own bubble.
everforward 14 hours ago [-]
> Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings.
I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do.
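The back-of-envelope above can be checked with an Amdahl's-law-style calculation. This framing is an assumption added for illustration (the comment does not name it): only the core share of the job gets faster, and the rest takes the same time as before.

```python
def overall_gain(core_share: float, core_speedup: float) -> float:
    """Fractional gain in overall throughput when only part of the job speeds up.

    core_share:   fraction of working time spent on the accelerated task (0..1)
    core_speedup: fractional speed increase on that task (0.10 means 10% faster)
    """
    if not 0.0 <= core_share <= 1.0:
        raise ValueError("core_share must be between 0 and 1")
    # Time for the same amount of work, relative to a baseline of 1.0.
    new_time = (1.0 - core_share) + core_share / (1.0 + core_speedup)
    return 1.0 / new_time - 1.0


if __name__ == "__main__":
    # 10% faster coding, with coding at 30% of the role.
    print(f"{overall_gain(0.30, 0.10):.1%}")
    # Upper end of the ranges mentioned: 30% boost on a task that is 50% of the role.
    print(f"{overall_gain(0.50, 0.30):.1%}")
```

For a 10% speedup on 30% of the role this gives roughly 2.8%, in line with the "something like 3%" estimate.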
> that is ~1.5 - 2.5T in value annually
That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best.
It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further.
The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash.
keeda 12 hours ago [-]
> That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap.
...
> It's also before cutthroat pricing really kicks in.
Right, that's more of an estimate on the value proposition of the overall AI industry, rather than valuations of the industry or specific players. While I don't think OpenAI and Anthropic will capture all of the potential upside, I do suspect they will do much better than other players despite the competition (https://news.ycombinator.com/item?id=48740472)
> And this is before we consider that they need to do it for cheaper or why would anyone bother.
Typically yes, but there are reasons companies may be willing to pay the same amount or even more, such as "AI doesn't need sleep, holidays, insurance, or benefits" and "AI is easier to procure and replace than humans."
> The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed...
Curious to see which studies you're looking at, the studies I'm thinking of (some here: https://news.ycombinator.com/item?id=45379452) are from 2024 - 2025, so already old and before agents really took off.
However, your point about meetings and agreements and documenting is much more germane. My theory is that the largest productivity gains -- and subsequent labor displacement -- will come from reducing coordination overhead: https://news.ycombinator.com/item?id=48040999
danenania 16 hours ago [-]
I’d also point out that LLM inference revenue already totals more than 100B annually based on publicly reported numbers. Almost none of that is replacing knowledge workers. Almost all is increasing their productivity. So empirically what you describe is already happening to a nontrivial degree.
parineum 16 hours ago [-]
> If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value
Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their api and I'm skeptical that these tools are making me that much more productive.
Personally, for assisted development, I haven't seen much progress in a while.
hedora 16 hours ago [-]
You’re trying to apply value based pricing (infinite margin upside) to a commodity.
Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. Glm and kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000.
Each box has a TDP of 250W. 8 x 250W / 120V ≈ 16.7A, or about one household circuit in the US, so no new power infrastructure is needed.
$1400 x 8+1000=$12,200. Assuming standard five year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally.
So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, if customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshittification, sabotage, ads, product placement, etc, to get the slightly “better” model.
Also, note that my numbers assume Moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then.
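For anyone who wants to vary the assumptions, the arithmetic in this comment can be reproduced in a few lines. Every input below is taken from the comment itself (box price, box count, switch price, TDP, five-year depreciation, one billion knowledge workers, 4.3% net margin); only the variable names are made up.

```python
# Inputs as stated in the comment above.
BOX_PRICE_USD = 1400
BOX_MEMORY_GIB = 128
BOX_TDP_W = 250
BOXES = 8
SWITCH_PRICE_USD = 1000
MAINS_VOLTAGE_V = 120
DEPRECIATION_YEARS = 5
KNOWLEDGE_WORKERS = 1_000_000_000
NET_MARGIN = 0.043

cluster_memory_gib = BOXES * BOX_MEMORY_GIB                 # 1024 GiB = 1 TiB
cluster_cost_usd = BOXES * BOX_PRICE_USD + SWITCH_PRICE_USD  # $12,200
current_draw_a = BOXES * BOX_TDP_W / MAINS_VOLTAGE_V         # amps at full TDP
annual_cost_usd = cluster_cost_usd / DEPRECIATION_YEARS      # $2,440 per worker
annual_revenue_usd = annual_cost_usd * KNOWLEDGE_WORKERS     # about $2.44T
net_income_usd = annual_revenue_usd * NET_MARGIN             # about $105B

if __name__ == "__main__":
    print(f"memory:         {cluster_memory_gib} GiB")
    print(f"cluster cost:   ${cluster_cost_usd:,}")
    print(f"current draw:   {current_draw_a:.2f} A")
    print(f"annual cost:    ${annual_cost_usd:,.0f}")
    print(f"annual revenue: ${annual_revenue_usd / 1e12:.2f}T")
    print(f"net income:     ${net_income_usd / 1e9:.1f}B")
```

The result is most sensitive to the worker count and the depreciation period, so those are the two constants worth changing first.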
keeda 11 hours ago [-]
Right, that number is more of an estimate of the value proposition of the entire AI industry rather than projections of revenue or valuations... it's essentially an estimate on how much the market could theoretically bear. Whether the companies can capture that value is, to your point, rightly a different question.
I do think open weight and other competitor models, especially with better harnesses, will play a significant role in the equation and will result in less concentration in the market. However, I do also think the big AI companies will capture a lot of that value. Partially for the same reasons that the cloud industry has been growing like gangbusters, even pre-AI, despite on-prem being much cheaper: companies will outsource anything that is not deemed a "core competency" for their business.
A lot of the problems you mentioned will be relegated to the consumer market and won't apply to enterprise contracts -- which is where the real money is.
4rf 12 hours ago [-]
What a load of nonsense lmao.
Pls stop posting you are creating noise.
overgard 17 hours ago [-]
That's a really good point. I think if there wasn't the insane amount of money involved and these were treated as tools instead, they would probably be MORE productive. I think a person working hand in hand with an AI instead of delegating is the sweet spot of making things fast while also not losing understanding or control of the system. You are absolutely right that these companies can't justify their valuations if they do that though. I just got a new mac to run models locally, and so far the results have been positive with some small hiccups. I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"
everforward 14 hours ago [-]
> I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"
I think this sort of thinking is a trap, because it presumes that all software has the same constraints.
There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing".
I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine.
They won't justify trillions in investment, but they are useful.
Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage.
I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes.
balder1991 13 hours ago [-]
In software engineering this is called “casual software”, as distinct from “business software” and “critical software”. Not all types need a high bar of quality, and most software engineering thinking and practices are tailored for business applications that will be made available to multiple users.
As you said, building a script that only you use personally, or a very simple thing that accomplishes one task and is easy to test, requires almost no engineering, and an LLM can often build those with very few downsides.
ah1508 15 hours ago [-]
> while also not losing understanding
That's a key point. Keeping knowledge and know-how inside the company is strategic. For most people GPS did not result in a better sense of direction, spellchecking did not help them write without making mistakes, and delegating translation to DeepL does not help them get better at a foreign language. I don't see the gain for an individual, a company, or a society if a technology reduces the ability to think, do stuff, understand complex problems, and work hard at something. Hiring juniors also matters: what is boring for a senior dev is useful for a junior, like the "wax on, wax off" in The Karate Kid. Then when the senior dev retires, the junior is not junior anymore and the know-how is still here. I want to transfer my knowledge to a junior, not to Anthropic or Google or OpenAI.
Ideally, working hand in hand with an AI could be like driving a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle auto-pilot by voice sounds a bit stupid and boring. Insane use of energy rarely comes into the equation, which is a bit weird. Personally it is why I am never tempted to use AI. However I see value in AI for finding weaknesses in code (the inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, and asking for a fresh view on a very difficult problem. I'd love to hear about the equivalent of move 37 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know-how: what keeps us sharp.
Final word: I once asked Claude for a review; the code involved a DB transaction. Nothing complicated, and Claude said everything was fine. However, the transaction isolation level was not set (I did it on purpose, as if I did not know about isolation levels). It did not ask me if it was my intention to keep the default level. I would have preferred challenging feedback: Why did you choose the default isolation level? Is it on purpose? Do you know that the default depends on the DB? Do you know about isolation? Tell me about the business use case and I'll explain which one would be the best.
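To make the kind of feedback asked for above concrete, here is a minimal sketch of a review heuristic that raises the question instead of staying silent. The function name `review_transaction` and the engine keys are made up for illustration; the two defaults listed are the documented ones for PostgreSQL and MySQL/InnoDB, and the statement matching is deliberately naive.

```python
import re

# Documented defaults for two common engines; they differ, which is the
# point of the review question above.
DEFAULT_ISOLATION = {
    "postgresql": "READ COMMITTED",
    "mysql-innodb": "REPEATABLE READ",
}

_BEGIN = re.compile(r"^\s*(BEGIN|START\s+TRANSACTION)\b", re.IGNORECASE)
_ISOLATION = re.compile(r"\bISOLATION\s+LEVEL\b", re.IGNORECASE)


def review_transaction(statements, engine):
    """Return a review question if a transaction relies on the default level.

    `statements` is a list of SQL strings; `engine` is a hypothetical key
    into DEFAULT_ISOLATION. Returns None when there is nothing to ask.
    """
    opens_transaction = any(_BEGIN.search(s) for s in statements)
    sets_level = any(_ISOLATION.search(s) for s in statements)
    if opens_transaction and not sets_level:
        default = DEFAULT_ISOLATION.get(engine, "engine-specific")
        return (
            f"No isolation level is set, so this runs at the {engine} default "
            f"({default}). Is that intentional?"
        )
    return None
```

A real reviewer would also ask about the business use case; this only shows that the missing setting is mechanically detectable.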
user43928 16 hours ago [-]
I am thinking the opposite. I've been having great results with handing more and more responsibilities to the agent.
Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it.
If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found.
If even that would fail, which did not yet happen, I can still do some old fashioned digging and learning, like I always have.
This is for native mobile app development, and the code base is around 100k LOC.
tskj 16 hours ago [-]
Dario has publicly claimed each model has been profitable, even accounting for its training costs; it's just that each new model is exponentially more expensive to train than the last, so the income lags and it looks like the company is losing money overall.
Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad.
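A toy calculation shows how both statements can hold at once. All numbers here are invented for illustration, not taken from any company's financials: training cost is assumed to triple each generation, and each model is assumed to earn back twice its training cost in the year after it is trained.

```python
# Hypothetical numbers only: training cost triples each generation, and each
# model earns back twice its training cost the year after it is trained.
GROWTH = 3.0
PAYBACK = 2.0


def training_cost(gen):
    return GROWTH ** gen


def model_profit(gen):
    # Lifetime view of one model: its revenue minus its own training cost.
    return PAYBACK * training_cost(gen) - training_cost(gen)


def company_cash_flow(year):
    # Yearly view: last generation's revenue arrives while this
    # generation's (larger) training bill is being paid.
    revenue = PAYBACK * training_cost(year - 1) if year > 0 else 0.0
    return revenue - training_cost(year)


for year in range(4):
    print(year, model_profit(year), company_cash_flow(year))
```

Under these assumptions every model is profitable on its own, yet the yearly cash flow stays negative for as long as the next training run costs more than the previous model brings in.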
NorwegianDude 15 hours ago [-]
A common and extreme misconception is that inference is expensive and that providers are losing a lot of money. Inference is extremely lucrative and profitable.
drob518 12 hours ago [-]
Inference is the phase where they make money. But the question is whether they can be profitable overall as training continues to balloon.
tskj 3 hours ago [-]
I think the case for this is pretty strong actually. Last year my company was maybe willing to pay $100 a month to Anthropic (per developer). Today we're all on the $300 plan without any hesitation. If Fable ever becomes available as the default model, I imagine my company would be willing to pay in the $500-$1000 range per month per developer.
4rf 12 hours ago [-]
why are you listening to these idiots who have every incentive to spin the story as much as possible
FCFF = EBIT(1-t)-Reinvestment
I don't care about your gross profit - this kind of cash profit determines the value of operating assets.
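For anyone who wants to plug numbers into that formula, a minimal sketch follows. The figures are hypothetical, and "reinvestment" is left as a single input here (it is commonly taken as net capital expenditure plus the change in working capital, which is an assumption not spelled out in the comment).

```python
def fcff(ebit, tax_rate, reinvestment):
    """Free cash flow to the firm: after-tax operating income minus reinvestment."""
    return ebit * (1 - tax_rate) - reinvestment


# Hypothetical figures: positive operating income, but heavier reinvestment.
print(fcff(ebit=100.0, tax_rate=0.25, reinvestment=120.0))  # -45.0
```

So a business can report positive EBIT and still have negative FCFF whenever reinvestment exceeds after-tax operating income.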
tskj 3 hours ago [-]
Well yeah obviously they have to stop reinvesting more than they make at some point to become profitable. To be clear, I think what Dario was saying was that if you consider each model training + deployment as a company, meaning all expenses and taxes, it was still profitable.
Whether he's lying is another question, but seems unlikely.
sanderjd 17 hours ago [-]
My two cents is that the way to square this circle is that the valuations should be lower and they should be spending a lot less on constant retraining.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
wyre 17 hours ago [-]
The insiders disagree because they are benefiting greatly from the insane valuations, right?
Chinese models are quickly commodifying frontier inference, and the US Gov is preventing public access to domestic SOTA models. Without those models, why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
sanderjd 16 hours ago [-]
Are they benefiting from the insane valuations though? If the valuations deflate before the insiders are able to exit, I think that would be worse for them than a lower but sustainable valuation.
It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that.
So I dunno! Could go either way!
drob518 12 hours ago [-]
It’s all about timing. This is tech bubble 2.0, Dotcom Boogaloo. If you’re able to flip it quickly, you’ll have generational wealth. If not, you could be holding a lot of worthless paper.
sanderjd 10 hours ago [-]
Yes.
But is your impression that this is the strategy of people like Amodei? My impression is that it isn't, that they are actually true believers, and not just trying to hit the timing right and flip it.
wyre 15 hours ago [-]
Even if the future is agent-driven workflows, that doesn't stop the commodification of inference. A good agent-driven workflow, in my experience, is a byproduct of the harness and scaffolding around the agent.
What insiders are you talking about? They're going to be hot on the possibilities so they can exit to a massive windfall. I don't know why they would want to be publicly critical of these technologies that could make them millions at IPO.
sanderjd 14 hours ago [-]
I'm talking about people who work at the frontier labs who talk to the press, and what seems to be the revealed beliefs of those same people from the strategies we see their companies pursuing.
My point is that it would actually be worse for these people if the valuations are only high during this period - which will last a while longer from now! - where their equity is not liquid, but then crash as the market figures out this commoditization thing.
But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions.
ricardobayes 16 hours ago [-]
At some point it's going to plateau, maybe already has. Then they will switch to FPGA/ASIC-based model-specific hardware for lower consumption. I'm pretty sure the "space data centers" won't use GPUs, they are not radiation-tolerant whereas FPGAs can be.
I would not take "space data centers" as a given! From most to least likely, these will be vaporware, vaporized-ware, rubble-ware, or loss leaders.
JumpCrisscross 18 hours ago [-]
> no way to justify their valuations if they get downgraded to a pair programming tool
I think there is. Pair today doesn’t mean they’re locked into that forever.
4rf 12 hours ago [-]
you always post about valuations but never share your own.
go ahead m8 we are all waiting... the stage is yours. lets see your model.
ChrisLTD 15 hours ago [-]
Their valuations don't make sense as just programming tools, period. Forget about if they are still human driven.
JumpCrisscross 3 hours ago [-]
> Their valuations don't make sense as just programming tools, period
Yup. I think we agree. These valuations aren’t made or unmade by whether their tools are being used as vibe agents or pair programmers.
EddieRingle 16 hours ago [-]
> There’s no way to justify their valuations if they get downgraded to a pair programming tool.
Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities.
Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on).
And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens).
Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said).
So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk.
I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field?
That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage is quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house.
pixl97 15 hours ago [-]
>Open-weight models are ...
<banned>
Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model?
pkulak 17 hours ago [-]
And every benchmark is "build GTA-6 from nothing, as a single-page web app".
ricardobayes 16 hours ago [-]
They have to, but also everyone working at 3D printing companies thought "industry 4.0" was going to completely override everything: that we were going to print housing, and print a mug at home and drink coffee out of it.
Today's news is that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring, and it gave companies an out to change course.
rconti 16 hours ago [-]
I wonder how portable the existing models are for different use cases. As good as they are for greenfield development or working in a single or across a few tightly coupled repos, they're absolutely terrible at debugging distributed systems and make incredibly wrong yet extremely confident assertions all the time.
I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.
popalchemist 18 hours ago [-]
Whether they believe it or not is immaterial. It is the end-goal they want to achieve, because then they own the means of production entirely.
pigpop 16 hours ago [-]
They own the means of production for the leading models, but they're far from monopolizing them since the techniques are well known. At this point it's a matter of having a head start and lots of capital to pay for the data annotation and GPU time to train them. Others are playing catch-up but they're hot on their heels, which is the biggest reason for them to continue spending like crazy to keep their leads.
For the non-bleeding edge they have a lot of competition with more competitors showing up every day.
The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well.
quaverquaver 16 hours ago [-]
these are capital intensive commodity businesses. They can be plenty big - see railroads or airplanes... or refining... but that doesn't mean that most value won't be added elsewhere.
jatora 17 hours ago [-]
I find these nefarious intention theories shallow. It can be the case that the end state is them owning the means of production without that being the intended guiding goal. Companies can chase profit without being Leninistic boogeymen.
WhyIsItAlwaysHN 17 hours ago [-]
There is no nefariousness in owning all the means of production, it's the endgame of maximizing profit.
However the result is exactly the same, concentration of power.
pigpop 16 hours ago [-]
This is such a defeatist and low agency take. "means of production" are not a limited resource like gold that you have to extract from natural sources or divvy up. They are fundamentally skill and knowledge that anyone can attain and put to use, maybe not on the same scale as a well funded business but even those businesses had to start somewhere in order to grow to the size they are now. So rather than casting aspersions on them, your time would be better spent learning how you too can create some means of production and start producing value.
WhyIsItAlwaysHN 58 minutes ago [-]
AI companies are trying to mechanize skill and knowledge and to own the infrastructure around it. If they succeed, your suggestion does not work. Even if they can't succeed, they will try because that's the most obvious path to maximizing profit for them.
Also, about "creating" means of production: these companies actively try to sabotage this as another profit maximization strategy. They buy all the RAM, so others cannot compete. They buy startups that succeed, so they stop competing.
It's not aspersions, it's just describing the phenomenon.
Even if I take your suggestion to heart, once my company would be big enough, if I wanted to optimize for profit, I would have to do the same as these companies.
The end result is concentrated power.
breezybottom 13 hours ago [-]
You think AI tech doesn't use rare earth metals?
popalchemist 15 hours ago [-]
No nefariousness other than the subjugation of the majority of humanity? You're insane
WhyIsItAlwaysHN 14 hours ago [-]
What I meant is that nefariousness from people is not a prerequisite. It's a machine that wants to maximize all profit and all the evil is a natural product. If you magically put saints in charge they would be eaten and replaced by the same kind of people very quickly if the end goal remains.
cousinbryce 17 hours ago [-]
Sam Altman has said some things that would make Lenin blush
jambalaya8 18 hours ago [-]
As I said, working ourselves out of our jobs within the span of a few years.
jerf 17 hours ago [-]
I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
nozzlegear 17 hours ago [-]
> I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
sparkling 7 hours ago [-]
What is your motivation? Privacy and/or data protection?
I currently don't see a world where it makes sense to run a local model that eats up 60% of my RAM and 20-30% of my disk space while providing worse quality output than a $20/month subscription.
plasticsoprano 16 hours ago [-]
Which quant do you use? I have a similar setup and the speed is atrocious at 4-bit.
nozzlegear 15 hours ago [-]
I'm using 4-bit as well, with the MoE model. I also use the MLX versions which are optimized for Apple CPUs (from what I understand anyway, I'm just an LLM layman). According to my oMLX dashboard, I'm getting about 50 tokens per second out of this model – not blazing fast, but more than fast enough to be useful to me.
I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set in the prompt. It feels like a more capable "agent assistant" than Claude models but without loss of intelligence.
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review, and I ask a lot more questions of the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
ifwinterco 16 hours ago [-]
I prefer GPT 5.5 to Opus but both are absurdly expensive token hogs, I can't afford to use either as my main model at $work with the monthly spend cap we have.
I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.
IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.
I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then
m3h 16 hours ago [-]
I didn't realize GPT 5.3 Codex was that good.
OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks).
skeptic_ai 11 hours ago [-]
Buy 5 accounts at $20 each. That’s $100 total, and it lasts decently on single-threaded work
ifwinterco 7 hours ago [-]
This is at work, we have to pay the actual cost of tokens, no way around it
jklmnopqrstuvw 18 hours ago [-]
From my own experience, GLM-5.2 generally costs more tokens and is much slower.
pimeys 18 hours ago [-]
I use GLM 5.2 Fast from Fireworks and it's very fast. Where are you using it from?
microtonal 18 hours ago [-]
Which inference provider do you use? (Admittedly, I use K2.7 a lot more currently.)
james2doyle 18 hours ago [-]
Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not
paradox460 12 hours ago [-]
In my experience it spends a lot more tokens to do things. I wrote a tiny extension for omp that counts the number of "Actually" in the response, and if it exceeds a threshold stops execution and waits for me to tell it what to do. Even then it frequently just ignores basic instructions like "only write boilerplate, I will fill in the functionality"
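The "Actually" counter described above is easy to sketch in plain Python. This is not the omp extension API (which the comment does not show); the function names and the threshold value are hypothetical, and only the counting logic is illustrated.

```python
import re

# Hypothetical threshold; the comment above does not say what value was used.
MAX_ACTUALLY = 3

# Whole-word, case-insensitive match so "Actually," counts but "factually" does not.
_ACTUALLY = re.compile(r"\bactually\b", re.IGNORECASE)


def count_actually(response):
    """Count whole-word occurrences of "actually" in a model response."""
    return len(_ACTUALLY.findall(response))


def should_pause(response, threshold=MAX_ACTUALLY):
    """True when the response backtracks more often than the threshold allows."""
    return count_actually(response) > threshold
```

A harness hook would call `should_pause` on each response and hand control back to the user when it returns True.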
Imo MiniMax and MiMo are a lot more reliable (and cheap)
Not opus level, but close enough and cheap enough to get the job done
mohamedkoubaa 18 hours ago [-]
I've been moving more to Composer 2.5 for the same reason. KISS principle.
everfrustrated 14 hours ago [-]
Composer 2.5 fast (via Grok) is honestly amazing. It's been implementing everything I've asked and getting it right first time.
Been impressed with its front-end ability.
If this was the last model I could ever use I think I would be happy.
AdminAdmim 18 hours ago [-]
Same for me. I downgraded my Cursor subscription because when I use Cursor, 90% of the time I use Composer 2.5 fast
indoordin0saur 12 hours ago [-]
Yeah. Opus is nice for tasks that require significant planning and considering broader effects on other parts of the code. But it likes to go off the rails and do too much. Often it gives good-sounding ideas but it has a tendency to distract me by giving me a huge to-do list.
mattmatheus 16 hours ago [-]
I've been working to use the best model for the task for about 6 months and have found great success planning with the 'frontier' model but punting implementation down to a 'lesser' model. I'm using Beads-Rust (a Rust fork of GasTown's beads) as my issue tracker. So far, so good.
nsoonhui 11 hours ago [-]
Sorry, exactly what is the distinction between agent-assist and agent-driven?
I give AI an image and just tell it what's wrong, and then it goes on to fix the bug in the codebase for me (and write the tests). Is this agent-assist or agent-driven?
Sometimes I just give the AI my description and a mockup, and it creates a plan and implements the details for me, and I verify visually (this is the weak spot of AI). Is this agent-assist or agent-driven?
whateveracct 18 hours ago [-]
agent-assisted development uses orders of magnitude fewer tokens than agent-driven development
the incentives aren't there sadly
sanderjd 17 hours ago [-]
Not for a business model that scales revenue by token usage. But other business models are available.
whateveracct 6 hours ago [-]
Like?
mark_l_watson 16 hours ago [-]
Good point, I also like to do the work myself, with an assistant under my control. I am usually really happy with DeepSeek v4 Flash that I feel just mostly does what I tell it to do, but I do switch to Pro for harder tasks.
There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.
People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)
xpct 18 hours ago [-]
I've been largely disappointed how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.
I can't help but feel this is intentional towards the 'Agentic' workflow.
spacephysics 18 hours ago [-]
I think this seems purposeful, as there are 2 opposing forces at play:
- Have a model that follows the user's instructions
- Have a model that follows the system prompt instructions more
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
wqaatwt 17 hours ago [-]
I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions
arcanemachiner 17 hours ago [-]
You still have to manage/fight with the post-training that is baked into the model itself.
paradox460 12 hours ago [-]
We're speed running HAL 9000
manveerc 18 hours ago [-]
Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration, it keeps getting better at avoiding work.
skerit 18 hours ago [-]
This is why Fable was so good. It followed instructions and it was in no way lazy.
DontchaKnowit 17 hours ago [-]
People keep making comments about Fable like this? You could only use it for, what, like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute, and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than it did yesterday
pdimitar 17 hours ago [-]
> You could only use it for what like a week? How is that at all enough time to evaluate?
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day.
boc 17 hours ago [-]
The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good.
plorkyeran 17 hours ago [-]
It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly.
You didn't really have to use it more than a day honestly to tell what kind of shocking paradigm change it was. Man do I miss it.
Analemma_ 17 hours ago [-]
Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about.
acters 18 hours ago [-]
I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP
marcindulak 15 hours ago [-]
I keep adding selected cases of CLAUDE.md instruction non-compliance reported on the claude-code GitHub to this issue: https://github.com/anthropics/claude-code/issues/13689. Subjectively, the number of such cases seems lower during the past month. It may be that claude-opus-4-8 (default thinking) is a bit better at instruction following than past models.
gs17 18 hours ago [-]
> or as if there was a third person in the chatroom whose messages I can't see.
If you set off a classifier, that's how it looks to Claude.
xpct 18 hours ago [-]
I wasn't working with anything sensitive, but it really does feel like it sometimes condenses even something low like three bullet points to two.
IMO, they were quite good with checklists even a year ago, and tried to tick off each one.
storus 18 hours ago [-]
Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that.
epolanski 18 hours ago [-]
I've been saying for ages that since Opus 4.6, models have become increasingly smart but less helpful as assistants.
Fable was amazing as a vibecoder, but as an assistant it couldn't resist jumping into implementation and filling chats with pointless jargon.
It's really grim if you're looking for assistance instead of an implementor.
GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.
I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
thewebguyd 18 hours ago [-]
By design, unfortunately. If they are just assistants, they can't sell the dream of "we're going to replace human labor completely" to the C-suite.
baq 18 hours ago [-]
It isn’t a dream, it’s a reality for some of us here and it will be increasingly so for everyone else. Amazingly, USG intervening slowed the dynamic greatly (fortunately?)
The problem is obviously who will be left. There’s a lot of scifi to catch up on.
epolanski 18 hours ago [-]
I think that they are simply evaluated on prompt to solution benchmarks.
whstl 17 hours ago [-]
Yep, this is why experiences and ratings of models vary so wildly.
I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.
I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.
mullingitover 17 hours ago [-]
> I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
I've taken to instructing the agent to manage a subagent, and the principal agent's sole job is to ensure the subagent follows instructions to the letter.
epolanski 15 hours ago [-]
Just to follow up on what I mean, this was my first interaction with Sonnet 5:
"I just cloned this repo, investigate how to set it up, don't install anything, just collect information"
_spews information_
I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten...
"There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix"
_proceeds installing linux-utils and all the rest_
"Didn't I tell you to not install anything?"
_you're absolutely right_
F*k me..
a_c 17 hours ago [-]
I actually use Sonnet 4.6 for my day-to-day coding too. It consumes far fewer tokens and is good enough. Opus is just too token-consuming to be useful to me.
bazhand 17 hours ago [-]
Have you tried '/model opusplan'? I've had strong results mixing Opus for planning with Sonnet implementing.
a_c 17 hours ago [-]
I haven't. Thanks for the heads up will give it a try!
I use Opus to comment on code design quite often though. It became enough of a pattern that I made a skill for asking for second opinions: https://news.ycombinator.com/item?id=48733092
Would love to hear your feedback if you don't mind!
vtail 17 hours ago [-]
Fascinating! How did you learn about this?
bazhand 16 hours ago [-]
It was something that was used for token efficiency. Most of the settings and use cases are quite poorly communicated, but asking Claude to review the latest release changelog (https://github.com/anthropics/claude-code/blob/main/CHANGELO...) is quite useful. Combined with @"claude-code-guide (agent)" to read its own docs for settings/configs, it is super helpful.
Another quite useful trick is to use /opusplan along with /codex:rescue (https://github.com/openai/codex-plugin-cc), which means you get quite a strongly reviewed plan using native claude + codex without having to implement the mostly useless trust-me-bro plugins and other bs.
addozhang 12 hours ago [-]
That's really a great way for the agent to understand itself.
ricardonunez 12 hours ago [-]
I am in the same position. Do you think they are going to remove it and deprecate it at some point?
duxup 16 hours ago [-]
“Hey I saw some messed up function commented out that at face value is a bad idea… so here it is again with some nonsense assumptions ….”
I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.
arikrahman 16 hours ago [-]
I have also started shifting to models more reasonable for my workflow. I've been using the Reasonix harness for Deepseek, and cache hits make the token use basically free. This is with unsubsidized models as well, using American providers.
addozhang 12 hours ago [-]
I feel pretty much the same way, and the scenarios are similar too. Using Sonnet has a bigger advantage when it comes to response time.
bckr 16 hours ago [-]
I suggest encoding your invariants in the harness: architectural invariants that can be mechanically checked, including which modules are approved, which dependencies, etc.
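As a rough illustration of a mechanically checked invariant, here is a minimal sketch that flags imports outside an approved list. The allowlist contents and the name `unapproved_imports` are made up for the example; a real harness would load the list from project config and run this over every changed file.

```python
import ast

# Hypothetical allowlist: the top-level modules this project may import.
APPROVED_MODULES = {"json", "pathlib", "requests"}


def unapproved_imports(source: str, approved: set[str]) -> list[str]:
    """Return the top-level modules imported by `source` that are not approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` counts as a dependency on `a`.
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) stay inside the project, so skip them.
            found.add(node.module.split(".")[0])
    return sorted(found - approved)


if __name__ == "__main__":
    sample = "import json\nimport subprocess\nfrom os.path import join\n"
    violations = unapproved_imports(sample, APPROVED_MODULES)
    if violations:
        print("unapproved imports:", ", ".join(violations))
```

Wired into a pre-commit hook or CI step that fails on a non-empty result, the agent gets a hard error instead of a prompt instruction it can ignore.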
lacoolj 15 hours ago [-]
gemma-4-e4b is very good at assistance too, and is local and fast and small (and "free")
trollbridge 17 hours ago [-]
No kidding. I expect to have models to use which are optimised for different use cases.
Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.
spullara 16 hours ago [-]
if you like that, use gpt models instead.
XCSme 15 hours ago [-]
I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster.
Weak spots (categories it fails):
- Trivia — 0/3 - basically not much built-in knowledge
- Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
- Puzzle Solving — score 77, flubs carwash-like tests
Your benchmark has Gemini 3.5 Flash as the best model, which doesn't compute for me
XCSme 14 hours ago [-]
It is on top for many benchmarks, only not the coding/agentic ones.
Still one of the most intelligent models overall, most likely to get any question you ask correctly (without tools).
BoorishBears 13 hours ago [-]
This guy had a terrible broken benchmark that gets hawked every release, and I wish HN would ban accounts that essentially exist to hawk a personally owned site, especially such a bad one.
pbgcp2026 9 hours ago [-]
I get similar results in my own tests. And Gemini 3.1 Pro is consistently on top of my ratings. Not everyone is a coding monkey; I prefer staying a programmer.
BoorishBears 9 hours ago [-]
They're referencing Gemini 3.5 Flash being the top model, you must not be great with detail.
And no (strong) programmer would jump to assuming other people are coding monkeys just because they disagree on what a strong LLM is: that's the kind of thinking reserved for the glorified coding monkeys who wasted their life getting better at writing CRUD apps and are now upset that someone's tooling is dropping the already very low bar there.
UqWBcuFx6NV4r 13 hours ago [-]
If you were right, the karma system would largely take care of this. It really sounds like this is more of your personal view
BoorishBears 13 hours ago [-]
Karma systems are never perfect, and most people will not assume this is a pattern.
(ie. won't feel the need to downvote them just for having yet another crappy AI benchmark)
I only recognize it because I build a product that leaves me looking for information on every major release... and every major release a new crop of folks reply confused about the anomalies on top of anomalies that they're seeing, and they slowly learn this person is just way more unserious than the dogged distribution would imply.
XCSme 15 hours ago [-]
As always, note: faster than GLM-5.2 doesn't mean too much, as GLM-5.2 is served by different providers, so the inference speed can vary drastically between providers or over time.
2muchtime 12 hours ago [-]
Opencode Go/Zen claim to use infrastructure based in the EU, USA and Singapore that have a 0 retention policy.
yieldcrv 14 hours ago [-]
What’s everyone’s favorite GLM provider?
z.ai doesn’t always have the most reliable AI
but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So that’s not a functional difference to me, and the Chinese one won’t reply to subpoenas so that’s a value add tbh
So I’ll consider all, fastest tokens/sec wins
reissbaker 11 hours ago [-]
I'm biased because I run an inference company, https://synthetic.new. That being said I think we're pretty good at serving GLM-5.2 (and other models, like Kimi K2.7!) and our privacy policy is quite good: zero data retention for prompts and completions on API requests. Our average streaming TPS for GLM-5.2 (i.e. tokens per second after factoring out time-to-first-token, which varies based on geography) is 97 tps over the last 24hrs, although it's slightly lower at peak traffic in the mornings PST, where it's 50-70 tps. We're also subscription-based, which is nicer for coding than e.g. Fireworks, which is per-token billing.
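For anyone comparing providers, one plausible way to compute a "streaming TPS" figure like this is sketched below. The function name and the exact definition (completion tokens divided by time spent after the first token) are assumptions for illustration, not necessarily how any provider measures it.

```python
def streaming_tps(completion_tokens: int, total_seconds: float, ttft_seconds: float) -> float:
    """Tokens per second over the decode phase only, excluding time-to-first-token."""
    decode_seconds = total_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("total time must exceed time-to-first-token")
    return completion_tokens / decode_seconds


if __name__ == "__main__":
    # Made-up measurement: 970 tokens, 10.5 s end to end, 0.5 s before the first token.
    print(f"{streaming_tps(970, 10.5, 0.5):.1f} tokens/sec")
```

Separating out time-to-first-token keeps network latency and queueing from being mixed into the generation speed, which is why the two numbers are usually reported separately.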
yieldcrv 11 hours ago [-]
got a 500 error page on the site's chat, but I'll try the API
reissbaker 9 hours ago [-]
Interesting: I don't see anything in our error logs but we could be missing something (and personally the chat works for me + my unsubscribed test account). If you email us at hi@synthetic.new though we should be able to fix anything you're running into!
eli 14 hours ago [-]
Fireworks.ai is solid. And if you care more about speed than cost they have a "fast" variant that I think just throws more hardware at the model for about 2x the cost.
david-gpu 13 hours ago [-]
The privacy policy indicates that they track you and share your data with ad networks like Meta. Yikes.
pranaybhatia 10 hours ago [-]
Hi, PM at Fireworks here. We have zero data retention so we do not log any of your API requests. Realize you're talking about website activity which is different and will check and update on that too.
pbgcp2026 9 hours ago [-]
Run it on Amazon Bedrock or GCP vertex. No problems at all.
yieldcrv 7 hours ago [-]
how much does that cost
pbgcp2026 5 hours ago [-]
There is no markup for SOTA, and Open Weight is super affordable, but most importantly it's completely private. Just try it.
Onavo 11 hours ago [-]
> the Chinese one won’t reply to subpoenas so thats a value add tbh
That's not something that's definite. They are not quite like the Russians. A lot of the governments in Asia are overly pragmatic and will happily strong arm their companies to throw users under the bus for the sake of a trade deal. There's a reason why Snowden ran to the Russians and not China.
Also, if they have any subsidiaries in the US, they may not have a choice in the matter.
WorldPeas 14 hours ago [-]
The (imperfect) comparison, having used both for planning and execution, is that GLM5.2 is too jumpy and eager to do things, often to a fault (e.g. deploying/using git when it shouldn't), while sonnet 5 was much lazier than any Claude model I have used: it didn't add an addendum to a plan that I asked for, then lied that it had when asked. Looking at the analysis[0] I don't think it's worth it for me. Maybe for others. Fable was certainly much better.
Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5's performance on the evaluation.
They changed the Sonnet 5 'Agentic search' benchmark graph overnight
simonw 15 hours ago [-]
Claude Sonnet 5 itself described its pelican as looking like a goose:
> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.
Yeah, GLM have been beating Anthropic on the pelicans for a while now.
(I suspect that's more of an indication that Anthropic have chosen not to waste resources training on animals riding vehicles, personally.)
kamranjon 7 hours ago [-]
This is interesting, I haven’t actually heard you suggest that the labs are focusing on this benchmark before. Have you come around to this position as a result of the quality of pelicans you’ve been getting?
The reason I thought this was an interesting benchmark is because it’s a non-image generating model creating an image using SVG code, so it kinda spans capabilities.
If an AI lab trained a model specifically for animals riding bicycles it seem trivial to modify the prompt and determine if it was trained specifically for that or if it’s generalized a skill and can also generate a proper orangutan walking on stilts or an armadillo on a skateboard, this sort of thing?
simonw 6 hours ago [-]
That's what I've been doing - trying different animals in different vehicles. I'd love to find a lab who does a good pelican on a bicycle but sucks at other combinations, but sadly that's not happened yet.
That's one possibility albeit quite charitable. I'd be inclined to think the same personally if GLM 5.2 wasn't also rocking in other areas too.
philipwhiuk 34 minutes ago [-]
Just need the legs to interact with the pedals now :D
user3939382 1 hours ago [-]
Mine is to ask it to write a parallel parking simulator and animation. The math there is surprisingly complex including differential equations. Fable 5 can almost one shot it with all tunable params.
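To give a flavour of the differential equations involved, here is a minimal sketch of the kinematic bicycle model, one common simplification for low-speed manoeuvres like parking. The wheelbase value, the explicit-Euler integration and all names are assumptions for illustration, not what any particular simulator uses.

```python
import math
from dataclasses import dataclass


@dataclass
class CarState:
    x: float = 0.0        # rear-axle position in metres
    y: float = 0.0
    heading: float = 0.0  # radians, 0 means facing +x


def step(state: CarState, speed: float, steering: float,
         wheelbase: float = 2.7, dt: float = 0.01) -> CarState:
    """Advance one explicit-Euler step of the kinematic bicycle model.

    dx/dt = v*cos(theta), dy/dt = v*sin(theta), dtheta/dt = (v/L)*tan(delta)
    """
    return CarState(
        x=state.x + speed * math.cos(state.heading) * dt,
        y=state.y + speed * math.sin(state.heading) * dt,
        heading=state.heading + speed / wheelbase * math.tan(steering) * dt,
    )


def simulate(state: CarState, speed: float, steering: float, duration: float,
             wheelbase: float = 2.7, dt: float = 0.01) -> CarState:
    """Hold speed and steering angle constant for `duration` seconds."""
    for _ in range(round(duration / dt)):
        state = step(state, speed, steering, wheelbase, dt)
    return state


if __name__ == "__main__":
    # Reverse at 1 m/s for 2 s with the wheels turned 0.5 rad.
    print(simulate(CarState(), speed=-1.0, steering=0.5, duration=2.0))
```

A parking simulator would chain several such segments (reverse with lock, straighten, counter-steer) and add collision checks against the neighbouring cars and the kerb.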
Sol- 18 hours ago [-]
Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.
pennomi 18 hours ago [-]
Trying to censor nudity in image generation models caused all kinds of problems with anatomy in image models. I’m sure these models will have similar issues with security.
raincole 14 hours ago [-]
Censorship on image generation models works on another level. The models can generate NSFW, but there are extra computer vision models checking if the images can be shown to the users. It's especially obvious for Grok and ChatGPT.
nodja 2 hours ago [-]
That's only correct for specific models and not what parent was referring to.
Stable Diffusion 3, an open weights model, was laughed at on release for not even being able to generate a woman lying in grass. The community attributed this to the heavy dataset filtering. Since then other open weights releases have been made with no NSFW capabilities and the community claims they're not as good at anatomy either.
You can google "stable diffusion 3 woman in grass" and press the images tab to see how the model failed spectacularly.
BoorishBears 13 hours ago [-]
There are image models with censorship at every stage from pretraining to posttraining.
Most recently Ideogram released an open weight model that will denoise into a grey image with a "Blocked by safety filter" notice for certain prompts
Of course, because it's open weights people have found defeats
NonHyloMorph 15 hours ago [-]
Interesting, you find that in medieval painting, due to the authority of the catholic church.
Traubenfuchs 3 hours ago [-]
I think the cool kids call this "staying away from the vector space of highly skilled security engineers".
deaux 18 hours ago [-]
> Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code.
This may be the goal.
m3h 17 hours ago [-]
Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
ComplexSystems 14 hours ago [-]
So the post-introductory price is set such that Sonnet 5 will cost 100%-135% as much?
m3h 14 hours ago [-]
Correct. Albeit the nuance here is that a more capable model might solve problems more efficiently and faster, possibly saving you tokens.
As with any new model, you won't know the real impact until you start using it for your workload.
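A back-of-the-envelope sketch of that arithmetic follows. It assumes cost scales linearly with token count and that the quoted 1.0–1.35× range applies to the whole request; the function names and the `price_ratio` parameter are hypothetical, and no actual prices are implied.

```python
def relative_cost(token_multiplier: float, price_ratio: float = 1.0) -> float:
    """Cost of the same request on the new model relative to the old one.

    token_multiplier: how many times more tokens the same text maps to.
    price_ratio: new per-token price divided by the old per-token price.
    """
    return token_multiplier * price_ratio


def break_even_price_ratio(token_multiplier: float) -> float:
    """Per-token price ratio at which the switch is exactly cost-neutral."""
    return 1.0 / token_multiplier


if __name__ == "__main__":
    for multiplier in (1.0, 1.35):
        same_price = relative_cost(multiplier)
        neutral = break_even_price_ratio(multiplier)
        print(f"{multiplier:.2f}x tokens: {same_price:.0%} of the old cost at the same "
              f"per-token price, cost-neutral at {neutral:.0%} of the old price")
```

This ignores any change in how many output tokens the model needs to finish a task, which is the part you can only find out by running your own workload.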
mattas 17 hours ago [-]
"We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."
conradkay 17 hours ago [-]
I think the incentives are less bad since a good chunk of usage comes from subscription plans.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.
squeegmeister 17 hours ago [-]
Wouldn't it be more malicious for them not to mention this at all?
Alifatisk 16 hours ago [-]
Sure, but I think doing it this way allows them to later on say they were transparent about it. Completely hiding this would make it very difficult for them to excuse if they got caught.
phillipcarter 18 hours ago [-]
Seems to be another great incremental update to the workhorse, nice!
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
SeanAnderson 16 hours ago [-]
Crazy. I just changed the default for our entire org to Opus because people were continually unimpressed with Sonnet's abilities. It's fascinating to think how varied people's experiences are when interacting with LLMs and how much the outcomes depend on how people approach interacting with the models.
thewebguyd 18 hours ago [-]
Yeah I think people are sleeping on the smaller/faster models like Sonnet. As long as you have a detailed plan or small, well scoped individual tasks Sonnet can implement just fine. Opus will still do better at more open ended tasks or completely "vibe coding." Or spec/plan with Opus, and have Sonnet implement.
conradkay 17 hours ago [-]
I was surprised to learn that Sonnet generally has the same tokens per second as Opus
Computer0 16 hours ago [-]
I would indeed be more inclined to use it if the tokens per second were better, though I would then be using their more expensive Opus less. Perhaps it is strategy.
conradkay 16 hours ago [-]
They should add a Sonnet 5 fast mode at ~Opus pricing
philipwhiuk 33 minutes ago [-]
It's a 30% price increase once the discount rate vanishes.
doctoboggan 18 hours ago [-]
The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.
jimbo808 11 hours ago [-]
They're actively trying to use lobbying power to make open weight models illegal. So I'm just not going to use their services at all anymore. I don't think they're a net gain if you're a skilled senior, and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug. I'll be okay without their bullshit generator.
pmarreck 9 hours ago [-]
> I don't think they're a net gain if you're a skilled senior
I'm a skilled senior (I'm 54 and been coding since I was about 8; I've been 100% AI-generated code for at least 6 months now and have produced a combination of speed and quality that has astonished me; my velocity is apparent at https://github.com/pmarreck/) and this has been a massive net gain, so your claim is now officially in sheer defiance of reality.
In a skilled senior's hands, this is like an expert power tool. In the hands of someone less-skilled, it is likely also... less-skilled. It's a magnifier.
> and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug.
Nope, no it's not. It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage. Look, I even invented a whole new control methodology around it called MFIC: https://gist.github.com/pmarreck/b30aa3ca69cb70a5526f8a63ab8...
Thanemate 5 hours ago [-]
The skilled seniors better stop downplaying what actually led them to be skilled in the first place, and realize that the conditions to develop that skill are gone, almost deemed unproductive in today's workplace.
Not disagreeing that LLMs are a force multiplier, but I highly doubt they will find much to multiply in the next generation of seniors, at this rate. It's surreal to me that I have to point out that recognition AND recall are both necessary components of skill acquisition, because humans have largely known this since the dawn of education.
chewz 4 hours ago [-]
I have been thinking about it - paid apprenticeship is the answer - just as it is in other professions like medicine.
Seniors should be paid to actively introduce juniors to the trade over a couple of years. No more bootcamp entry.
And it would take significant $ for a senior to agree to expend his time and energy on software engineering apprentices. There would also be a very limited number of places with good seniors. Exactly like medicine for a long time now.
In fact it is already happening in some companies I know about - seniors getting their bonuses tied to juniors being under their wings.
sumedh 2 hours ago [-]
> Seniors should be paid to actively introduce juniors
Instead let's train the contractors of an IT sourcing company and then we don't need you.
einichi 5 hours ago [-]
This is something for educators to deal with, not a 50+ senior IC, but yes your point is extremely important.
monk_e_boy 4 hours ago [-]
As an educator (UK) it's very hard to find people with experience to join us. The pay is SO crap and the workload is huge. Trying to teach complex, large systems is impossible, because we don't have them. We can't throw 1,000 files and 20,000 lines of code at a student and teach them how the system was engineered. We can show them toy systems.... which is fine if there is a work route that takes them and nurtures them for 5 or so years. But without that pipeline, *shrug*, not sure what we do about it
vincnetas 4 hours ago [-]
i would rephrase "Force multiplier" as "Force power up".
If your "force" is above 1 then it's ok to have AI power up your force. 2.3 to the power of 3 is 12.167.
But if you're a beginner and your "force" is below 1, then powering it up makes it worse. 0.2 to the power of 3 is 0.008.
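The "power up" idea is easy to check numerically. In this small sketch the exponent of 3 is taken from the comment, while the function name and the sample "force" values are just for illustration.

```python
def power_up(force: float, exponent: float = 3.0) -> float:
    """Raise a skill 'force' to a power instead of multiplying it by a constant."""
    return force ** exponent


if __name__ == "__main__":
    for force in (2.3, 1.0, 0.2):
        result = power_up(force)
        direction = "amplified" if result > force else "diminished" if result < force else "unchanged"
        print(f"force {force} -> {result:.3f} ({direction})")
```

Anything above 1 grows, anything below 1 shrinks, and exactly 1 stays put, which is the difference from a plain multiplier that would scale everyone up by the same factor.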
jimbo808 7 hours ago [-]
I think I might have written a comment similar to yours maybe 6 months or a year ago. I'm not quite sure how to respond to these sorts of replies. I have used LLMs/Claude Code quite extensively professionally and was a very early adopter, have built tooling around LLM/agentic development, and genuinely embraced it. They aren't useless, but the short term gains you think you're getting come at a very steep price that you may not actually account for consciously for quite some time, if ever.
illiac786 6 hours ago [-]
I think the uncomfortable debate is not about skill atrophy as a general phenomenon (it’s happening anyway, doesn’t matter how much we debate it) but rather, _which_ skills are atrophying and if these skills are now superfluous/worthless or not.
If you don’t use a skill, it’s like a gene a species doesn’t need anymore, it will atrophy.
Is that bad and if yes, why? Skill atrophy is not intrinsically bad. I don’t know how to make tinted glass for church windows and I will never learn it because there are machines doing it now.
But I would for example think that losing critical thinking would be a catastrophic skill atrophy. As far as I know, there is no proven link though (and one would have to define what is “critical thinking” in the first place). Writing assembler without any autocomplete, I’m not so sure it’s such a problematic skill atrophy.
simondotau 6 hours ago [-]
One could argue that the cumulative atrophy of skill around writing CPU assembly has been problematic in some respects, but it’s also completely unreasonable to lament what we’ve gained in return.
As far as I’m concerned, so long as we can be happy with AI we can run locally, AI is no different to the rise of scripting languages or the pocket calculator. It’s only problematic if the calculator is rented to you as a service.
pishpash 6 hours ago [-]
Hence only let your skill atrophy to the extent where if all you had were your local laptop you can still be competent. Relying on paid subscription services for your skill is a fool's errand.
DanielHB 3 hours ago [-]
Since compilers became a thing Assembly language knowledge atrophied[1] across the workforce.
Since automatic memory management became a thing memory management and pointers knowledge atrophied[1] across the workforce (although not nearly to the same degree).
I think the pattern here is that compilers almost always output better machine code than humans, whereas automatic memory management doesn't output better machine code than skilled humans can (especially with modern languages that give you a lot of build-time safety checks).
And even then, there is still demand for assembly knowledge in the workforce, it is just very niche.
I don't think LLMs will ever be good enough to "almost always" output better code than humans. But, like automatic memory management, it will likely make some types of programming more niche.
The key thing here is that compilers are deterministic, and deterministic tools have way less variance in output quality. Automatic memory management is not as deterministic as a compiler because it happens at runtime. LLMs output build-time code, but the output can be drastically different if I sneeze too hard.
[1]: as in % of the workforce, not absolute numbers. Hard to get exact figures on this, but I think we have more experienced people actively using Assembly today than we had before compilers became the default (late 80s). We probably have more active C/C++ programmers today than before Java became popular (early 2000s).
techpression 6 hours ago [-]
It’s not one single skill being lost, it’s about many and how they interact.
I just did a big refactor with opus, it went ok, some bugs. The normal stuff. One of the bugs was in a part of the code no longer needed, which Opus had just filled with comments more or less.
Asking it to fix the bug worked, but then I really looked at the code and realized just that: this is pointless now.
I’ve only been coding for 20+ years so I might be more susceptible than the author, but I’m quite terrified about losing skills in writing code, but also designing good structure, coherency and system overview. These are the things people claim you need more of with LLMs, but is what you outsource the most, even if you think you are describing it in detail.
We are all collectively growing the skill of complacency and laziness though, and those are not great ”skills” to have. And I’m just as guilty as anyone.
vidarh 4 hours ago [-]
I might have written a comment similar to yours maybe a year ago.
Yes, some skills will atrophy, but the learning curve for LLMs is also steep and you will acquire new skills that will pay off the costs many times over.
We see this in discussions like these where you have people running the gamut from using them as glorified auto-complete or babysitting them (usually a net loss in speed, though it'll feel less draining) via people running multiple agents in several different tabs (a gain) to people prompting for harnesses rather than tasks, and putting the agent in the resulting harness (where the multipliers come in) and even people at the peak of experience with them today are only scratching the surface.
I'm very aware that just as my assembler skills are not what they once were, my skills in the languages I'm now writing less will not be what they once were a few years down this line.
But I produced far more before I started using LLMs through the force multiplier of modern languages and frameworks than I did in assembler in the 80's and 90's, and I produce far more now with LLMs, and I will produce even more in the future by learning how to take advantage of new capabilities.
I have Claude refining a system that wasn't tractable a few years ago in another terminal as I'm writing this. I don't care if it would take me a bit longer to get back up to speed on a C codebase again if I was stripped of all access to LLMs any more than I care if it'd take me a bit longer to get up to speed on programming assembler on a Commodore 64.
DanielHB 4 hours ago [-]
I am consistently underwhelmed by the output, I can't really explain it besides LLMs have no taste about what is easier to read for humans.
I usually start a task with an LLM and then do small refactors using the LLM and then do some manual refactors before I am done. But often for more complex tasks the manual refactors are quite large.
Maybe it is because they can read walls of text so easily, so they output walls of text that are hard to read for humans.
I feel quite sad because a lot of my fellow colleagues are not putting this extra effort in to make things easier to understand by humans. PR review is basically me just doing this extra effort for them and their LLM implementing my comments.
And that is when I can even pinpoint the bad taste in the code structure, sometimes it is not something you can easily describe in a PR comment besides "no human would structure the code like this".
muse900 6 hours ago [-]
I get what you are saying, but how can we be talking about skill atrophy when our main skill is changing from being able to produce code ourselves to being able to leverage LLMs to write that code?
At the end of the day there are goals achieved with coding. Coding is a tool to reach either your business needs or some personal aspiration.
When it comes to businesses I don't think a business cares if you used the best stack possible, or you've written it in assembly, as long as it works.
Judging from the biggest coding drivers out there, most of the code produced globally and the biggest apps out there have had skilled engineers writing code, but it's not always perfect. As long as it works. Let's not forget that the web is built in php and js.
So again my argument is: are you atrophying a skill that is going to exist in the next 1 to 2 years, or is everything going to shift towards LLM code writing?
Personally I think that LLM code writing is the winner, whether we like it or not: it accelerates business objectives, which at the end of the day is the deciding factor.
And yes I do miss the days I was writing code and I was solving complex problems myself.
darkwater 5 hours ago [-]
> At the end of the day there are goals achieved with coding. Coding is a tool to reach either your business needs or some personal aspiration.
This is your opinion and I even share it, but there are many people here for whom writing the code was/is the whole deal. You would not have languages and heck - even editors! - holy wars otherwise.
socalgal2 4 hours ago [-]
It’s hard to know if your experience is relevant from 6 months to a year ago. The models are getting better every couple of months. My current experience is mostly like the other senior above. For me the last 4 months I’ve gone from mostly writing code by hand to writing almost no code by hand. I guide the LLM and it’s a force multiplier. I review its code and discuss with it how to test and what needs changing and it does it. I point out things it didn’t handle and it handles them.
jimbo808 4 hours ago [-]
Uh, I didn't say I stopped using LLM's 6mos to a year ago. I have to for my job, it was just that long ago that I began to understand that they aren't what they seem and definitely aren't a "force multiplier," more like a "debt generator."
aix1 7 hours ago [-]
> a very steep price that you may not actually account for
Could you elaborate on this steep price that you have in mind? What does it consist of?
jimbo808 6 hours ago [-]
Technical debt and skill atrophy
Technical debt due to accumulated excessively verbose, badly architected, often redundant, feature-bloated code which always looks good, even upon earnest review, but actually sucks and becomes extremely difficult to maintain in ways which are not obvious in code review. The issue is this: your tooling can help, and can make you feel better, and you might think you wrote all the prompts and made all the tools to mitigate these issues, but you haven't. If you're not consistently seeing it generate code that is very very close to the way a skilled senior dev such as yourself would have done it (with similar line count, etc), that is a red flag even if the code looks great and works.
mingusrude 6 hours ago [-]
> ...badly architected, often redundant, feature-bloated code which always looks good, even upon earnest review, but actually sucks and becomes extremely difficult to maintain in ways which are not obvious in code review.
I can only judge from my own experience but with or without LLMs, these are the codebases that I have worked with during most of my career. To me, much of the question is whether LLMs produce worse code than me and my colleagues have done in the past and I don't think that's the case. It is however very common that people hold LLMs to a higher standard than human colleagues and then it's not a useful comparison.
torginus 5 hours ago [-]
LLMs can produce bad code in ways no human ever would. I remember working on a head-scratcher problem of walking a tree while gradually building up state, and since LLMs are supposed to be good at algorithms, I asked Claude.
It came up with a correct LC-hard tier solution that involved dynamic programming, and was essentially an unreadable dense mess that was impossible to reason through as a human.
It worked, but it was so bad, that I sat down and realized after a bit that with maintaining a small cache, and being very particular about how the nodes are traversed, I reduced the solution to like a 10 line modified DFS, that I could understand.
socalgal2 4 hours ago [-]
I’ve written bad code and I’ve reviewed bad code by other humans. In both of those cases the code gets re-written. In the first the reviewer tells me the code is hard to understand, can I please refactor. In the 2nd I tell the person I’m reviewing the same thing, maybe with some suggestions.
I do the same with the LLM. I tell it that solution is convoluted and hard to understand, if I have a concrete suggestion I suggest one, otherwise I ask it for ideas. We get there just like I do with humans
x3ro 3 hours ago [-]
In my experience, even if the code is not worse, there is so much more of it that people quickly stop understanding what is going on. Now you could argue „the LLM knows“, but that comes at an insane token cost, even if it works well.
This is in the interest of big AI companies: if they quasi-monopolize the skills entire sectors of the economy need in order to function, that will be great (for them).
Everyone keeps comparing this to compilers, but I don’t need a multiple-hundred dollar subscription to use LLVM. And people didn’t stop understanding how computers work either, just because they used C. And yeah, maybe local LLMs will become the norm, and I hope so. But market forces (hardware prices) certainly are working against that right now.
seer 4 hours ago [-]
Yeah but I think going back to hand writing bespoke code is not coming back, the genie is out of the bottle.
But we could build much better tooling around keeping the agents honest. The problems you are describing are absolutely real and I see them every day.
One friend of mine had almost a mental breakdown when he just went ahead and drilled a bug-producing Claude to the point that it itself admitted it was “a piece of shit”. He knew that arguing with an LLM agent is more than useless, but it was cathartic for sure.
When I encounter a situation like this I always go down to - have I done everything I could to catch these errors in my automated validation, and update it as needed.
Agents are also more than happy to spend tokens refactoring, once you have such a test harness be good enough, producing successively better and more general abstractions is quite easy.
The old rule of thumb of “make it work, make it fast, make it pretty” still applies, just with much much faster iteration speed.
It seems with agents people have forgotten the last 2 steps since they produce a _working_ solution, and it might be hard to justify spending time “cleaning it up”, but this still remains essential.
shevis 7 hours ago [-]
The cost is skill atrophy. When was the last time you wrote something entirely from scratch by hand without AI assistance? It’s a skill entirely separate from prompting and reviewing. And it atrophies when you stop using it.
aix1 6 hours ago [-]
> The cost is skill atrophy
I hear what you're saying but I'm not sure I buy it in the context of this thread (a response to someone who is 54 and has been coding since they were eight).
I am in a similar boat, having been coding full-time for forty years. The way I use the current tools is that I own all architectural and design decisions but let Claude Code fill in the blanks. I reckon the quality of the output is about 90% of what it would have been had I done everything myself, but I get a lot more done (easily 3-5X).
Will I forget how to write a "for" loop just because I haven't been writing many of them by hand lately? Those skills are so deeply ingrained that I seriously doubt it. I can ride a bike after a multi-year break, or converse in a language I haven't regularly spoken for several decades. Or write using pen & paper even though I hardly ever do it. I don't see why coding would be any different.
defrost 6 hours ago [-]
I have a greater concern about societal skill atrophy.
I also am not about to forget how to for(;;), that said, as a result of some years invested in aligning old pre WGS84 mapping with modern GPS and improving digital mapping, there are fewer people per capita with the skills to navigate via paper maps in the absence of GPS.
Old farts coding since age 8 (in which I include myself with a decade+ over a sprightly young 54) will retain coding skills for as long as they apply them - the fear is that fewer and fewer others will develop and exercise such skills due to AI.
It remains to be seen if that's a bad thing long term.
aix1 6 hours ago [-]
I am not worried about the loss of skills per se. Over the centuries the average person has become less skilled at, for example, butchering animals. Is that bad for society? I don't know.
What I am worried about is us becoming dependent on tools that we as individuals neither own nor fully control, and gradually losing our ability to function without those tools. This, I think, is a huge societal risk.
mrbungie 7 hours ago [-]
From what I gather from GP's upper post: Technical debt, skill atrophy, delusions of grandeur about one's own abilities / psychosis.
pmarreck 1 hours ago [-]
> very steep price
I have yet to see it, but OK
Either measure it or it sounds like a conspiracy theory
integricho 7 hours ago [-]
I believe you are miscalculating the effect of skill atrophy. There is benefit and actual experience gained by doing the work yourself. You are an experienced dev and already have a lot of tools and knowledge under your belt, so at the moment it is hard to see the actual issue, as this is just a productivity multiplier for you. But give it a couple of years working under these conditions and your tech savvy nature will be severely diminished.
alchemism 7 hours ago [-]
In a couple of years he will be 60, too. Then 65, then 70.
Seniors will be able to stay in the game much longer than before, mark my words.
aix1 6 hours ago [-]
Indeed. I think it's a much bigger issue for juniors, who haven't yet had a chance to build that systems design muscle.
When an LLM is making a bad design decision but the engineer doesn't have the experience to spot it AND the consequences don't become apparent until much later (which is often the case) -- it's kinda hard to learn.
throwaway27448 8 hours ago [-]
Have you really found claude to be much more capable than eg deepseek? Anthropic has little to no chance of producing a competitive business model in the long term.
vidarh 6 hours ago [-]
The cheap models are cost-competitive if you are running them in long-running agentive tasks.
But they take a lot longer to reach the same goal for complex tasks, so the difference is still very real, and the cost-savings are still very much a question of how well you manage to characterise the tasks they will do quickly and pick and choose what to use when.
I kind of agree that I think the cheap models will eat away at the moat very effectively, but if it doesn't seem more capable to you, you're not giving it complex enough tasks to see what they can do.
(FWIW, I've burned billions of tokens on each of Deepseek, Kimi, GLM5.2, GPT, Sonnet, Opus, Haiku using the same harness, and we've kept stats on cost per task)
jeffyaw 7 hours ago [-]
absolutely, for me the tui, ultracode agentic workflows, and streaming logic are far superior. the closest model is minimax 3.0 imo and i ended up adding a custom tui, agentic workflows, streaming logic and implementing skills to that (in typed) in order to get to an acceptable claude fallback. on their own i haven’t found one model comparable to claude, not even chatgpt.
hhh 7 hours ago [-]
Yeah, using deepseek feels like shit and I spend hours steering deepseek in a direction versus opus-4.7 or 4.8 where I can just kinda let it ball out on some reverse engineering problems.
n3storm 7 hours ago [-]
I don't. Using claude code, claude.md etc with deepseek v4 is almost undistinguishable.
boc 7 hours ago [-]
> Anthropic has little to no chance of producing a competitive business model in the long term.
Extraordinary thing to say about the fastest growing company in the history of capitalism. They will soon have access to public markets, essentially unlimited capital, and can build insanely large models that they don't have to make public... ever. They can just use those models to run their business, train better models, eat competitors, etc.
But maybe it's Anthropic that isn't thinking ahead enough - you clearly think you can see around corners with your proclamation. So why do you think they have "little to no chance" of surviving long term?
jurgenburgen 4 hours ago [-]
> Extraordinary thing to say about the fastest growing company in the history of capitalism. They will soon have access to public markets, essentially unlimited capital
There is no such thing as unlimited capital. The faster they grow the faster they burn capital. Eventually it will run out.
kerabatsos 7 hours ago [-]
Finally, someone said it. 20+ years in software and my productivity and velocity is wild right now.
philipwhiuk 30 minutes ago [-]
> MFIC
Getting another agent to validate the first agent is a tower of sand.
Forgive me, but the active repos all look like reimplementations of existing good open source code (which of course is ideal training data) - rm_safe has rip for example. Or prototypes. Is there anything that actually has a user base > 1?
byzantinegene 9 hours ago [-]
you're right, but if skills atrophy so will the efficacy of the tool.
pmarreck 8 hours ago [-]
Skills atrophying in terms of what? Remembering specific API's that you always had to look up anyway? You don't lose developer intuition, analytical thinking or technical inclination, and those are the things that matter, anyway.
I recently did a fleetwide upgrade to Zig 0.16. Do I remember every single change from 0.15? No. Do I have to? Also no. Both because I can look it up if I need to, but also because the LLM already does.
If I myself haven't looked at a codebase in a year, I will not recognize some things when I return to it. Is this sense of "atrophy" meaningful when this was a problem long before LLMs came on the scene?
wesleywt 8 hours ago [-]
How do you deal with the lack of cognitive engagement? I think it is the primary driver of developing and maintaining skills as well as generating new ideas. Letting an agent do it for you will just get you average but well formatted code. Not something new.
bfjvibybd6cuvu6 7 hours ago [-]
I haven't cared about cognitive engagement in over a decade. Neither has my CEO.
dimitrios1 9 hours ago [-]
I am another skilled senior, have been coding since I was 7, although you have a few more years of experience on me, and am commenting here just for the goldilocks moment, as I have read and reflected on both of your comments and find my reality is somewhere in the middle.
On personal projects, where I am in charge of all the hats (product development, UI, UX, backend, security, server admin, etc) -- absolutely crazy force multiplier. You get a nice suite of backend and e2e tests running, with full business scenarios layered on top of that, and constantly running agents to do the coding, another agent on a higher level of reasoning to review that work, and occasionally popping into a competitor's model to review their work just for added comfort -- it feels like wizardry. I am not vibing it, but I wouldn't say I am carefully scrolling through every line. I review what's fundamentally important, especially when it comes to data, overall structure, and large, cross cutting concerns, but I would be lying if I said no code lands that I haven't read. But I have the security of the test suites and validations, so I pour more effort into that.
It's a nice self-reinforcing loop.
All of this might sound like I agree with you, and to some extent I do, but I am realizing that the apps I have built, which came out like a cannon shot out of hell with tremendous speed and polish right out of the gate, are starting to slow down. Feature adds are getting more complex. My memory is not what it used to be. Each run and pass through the code consumes more of my tokens and limits. I am starting to do less in the same amount of time. Codex did a vertical slice of a feature for me (well defined and well planned). It contained functionality that has historically plagued us developers -- the dreaded time. I used xHigh GPT 5.5. It had obvious bugs, but I wanted the robots to catch them. I popped it in claude (on the new sonnet 5! heyo!) -- Claude caught the bugs. Even said they "immediately stood out". I wondered how this happened. Frontier model from company A was evaluated by workhorse model from company B. All of this again took massive amounts of usage. And time.
And this is -- best case scenario, perfect world, everything is in perfect alignment.
Now for the work reality.
Multiple product and experience owners. Multiple dev teams. Different enterprise teams support services you rely on. You don't have full unfettered access to frontier models. You have to use copilot, or some other enterprise harness, and if you run out of credits for the month, you are SOL. It's not as good as your claude, you think to yourself, but hey, it's familiar enough, and you have 5k credits left for the month for Opus 4.8, better make the best of it. But now you burned half of them working on that Transactional Bug that was mixing synchronous and asynchronous semantics that the other guy's model should have picked up on. What happened? Maybe he didn't use Opus, maybe he used Haiku, maybe his prompt was bad. Who knows. Gotta fix it. Oh, you gotta reach across the aisle and put in a request to get the Enterprise team to look at this caching inconsistency on user data that you need and is really the source of your race conditions. Tick tick tick. Model limits approaching. You start wondering if you just did all this by hand like "in the old days" would you have got it done correctly faster? Or at least, cheaper. You'll never know.
sisiissjz77 4 hours ago [-]
It’s simultaneously simple and deceptively difficult to coax a growing system into staying sane. It reminds me of forcing a fractal into growth yet somehow letting it remain similar at all scales, manually.
Scaling in this sense is not operational (“servers”), but conceptual (“features”).
I don’t want to be a downer but I find many devs are not great at this. Very clever folks, but they tend to not see these issues clearly. They’ll nod and recognize when you talk about separating content from form and the importance of various design principles like high cohesion and loose coupling but completely disregard them once in contact with reality.
Part of the problem, as you nicely showed, is that technology is only a single slice of this problematic pie. Organizations in general are systems as well and they tend to be either badly architected, badly maintained or often both. Some technological issues are downstream from organizational issues and IME those can become rather dominant variables in the equation and no amount of AI - save full AGI taking control of the company - is going to save us from those factors.
Revanche1367 5 hours ago [-]
And you’ll never know because even if you could turn back time and do it from scratch, you’re likely to opt again not to do it all manually because the cognitive load is going to keep tempting you to reach for the agent again.
stpedgwdgfhgdd 7 hours ago [-]
+1
the distinction between personal projects and Enterprise development is a big one. A severe bug in my personal projects, i fix it on the fly. A bug in our products rolled out, nightmare.
sensanaty 4 hours ago [-]
We've just done an official evaluation at work, using extensive statistics on our gigantic monorepo in a company with ~2000 devs over the course of 2 years, everyone from hardware engineers to regular old frontend engineers. It's a highly profitable and mature public company, and has been for going on a decade at this point without missing a beat. We were given infinite access & budgets to basically any and all AI tooling we could imagine, and we have several "AI Native" teams (whatever the fuck that even means). We're doing agentic coding, we have harnesses of all kind, skills, we have many teams doing spec-driven development, designers using all the various things like Figma Make and access to tools like Devin/Factory Droid/Claude Code/Codex/etc.
This is all to say, we as a company are using AI a lot in all possible corners, but thankfully our leadership isn't schizophrenic and isn't mandating everyone hit token limits or whatever, it's more of a "Let's see what works and what doesn't" type of thing, and we measure a lot of statistics. Nobody here really cares whether LLMs are the next coming of Christ or not, as a company there are many people (even in SLT) that are indifferent to LLMs, and many who are reasonably hyped.
I wish I could link to the actual document we were all shown since it has a beautiful breakdown of the methodology and a fine-grained breakdown of the stats and the categories measured, but in the grand scheme of things, ALL the AI tooling we have implemented (at least on the engineering side of the equation) has contributed to a total of... drum roll please... 7 (seven) Percent overall productivity increase! The most productive teams saw a productivity increase of around 20%, while some teams actually saw drops in productivity into the negative percentage points. My team, none of us really give a shit about AI and we're somewhere in the 3-5% range on certain categories of tasks, which I'd say is a fairly good assessment.
Productivity here is measured in many ways, including but not limited to speed of MR review and merge times, feature/ticket/roadmap closure/delivery, rollback/revert incidence rate, how often people interact with the MR review bots and implement their suggestions/fixes, how many times people check back on AI transcriptions/meeting notes (hint: Nobody looks back on any of it, it's all just noise that gets generated and never actually referenced outside a few extremely rare cases) and many more things I'm forgetting. It is an imperfect number of course, because measuring productivity in engineering is a sisyphean task, but in my opinion it is accurate to the reality on the ground and outside of all the hype and marketing bullshit.
So, I remain thoroughly unconvinced of these personal anecdotes of people being "massively" more productive, especially once you factor in the fact that we now have a 2000EUR budget/month/dev for all the AI tooling, those productivity numbers start looking pathetic once you factor in the costs (which are only increasing as the AI companies need to start recouping the gazillions they've burned). Some teams have started begging to disable coderabbit and other similar tools in their MRs because they're producing nothing but walls of noise that makes reviewing any MR a nightmare of sludging through endless slop of useless bullshit, ours included.
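To put rough numbers on that cost argument, here is a small back-of-the-envelope sketch. It uses only the figures quoted above (a 2000 EUR/month/dev tooling budget and gains of roughly 3%, 7%, and 20%) and makes one loud assumption that is not in the original: that a productivity gain of g is worth g times a developer's fully loaded monthly cost. Under that assumption, the tooling only pays for itself when the developer costs more than budget / g per month.

```python
def break_even_monthly_cost(tooling_cost_eur: float, productivity_gain: float) -> float:
    """Monthly developer cost above which the tooling pays for itself.

    Assumption (not from the original comment): a gain of g is worth
    g * (fully loaded monthly developer cost).
    """
    if productivity_gain <= 0:
        raise ValueError("no positive gain means the tooling never breaks even")
    return tooling_cost_eur / productivity_gain


TOOLING_BUDGET_EUR = 2000.0  # per dev per month, as quoted in the comment

for gain in (0.03, 0.07, 0.20):  # low-end team, company average, best teams
    threshold = break_even_monthly_cost(TOOLING_BUDGET_EUR, gain)
    print(f"gain {gain:.0%}: break-even at about {threshold:,.0f} EUR/month per dev")
```

At the reported 7% average, the break-even under this simplified model is about 28,571 EUR per developer per month, which is why the quoted budget looks hard to justify on productivity alone.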
kakacik 4 hours ago [-]
You very conveniently avoided skill atrophy, the biggest issue for existing engineers to deal with (not even going into topic of cancelling whole hiring-junior-and-raise-senior approach which is just shortsighted and retarded to be polite, but general greed has overtaken the field so it's not shocking this went out of window). Everybody using llms excessively is measurably doing worse, ie students. I have a hard time believing our industry is somehow immune to that and refuse to do the experiment on myself if avoidable, which it sort of is for me.
It's like a drug that will give you a few years of great high, and ruin the rest of your life afterwards. Use it by all means, I don't care about your output, nobody here does, you do you.
I do care about my long term skills, which aren't about piping some llm outputs. My employer ain't dumb fuck who is pushing for llms at all costs as much as possible. Anyway, most of my day work are processes, discussions, pushing things through - llms can't do a shit here, it's personal conversations, connections, often physical contact to get things done on time. Startup world would be different but I am as far from that unstable environment as I am from say gaming industry, just not worth my time outside SV area.
So I just use llms to verify my coding results, they are fine for that, but I do the creativity. It's by far the best part of my software dev work, why the heck would I be automating that away? It's like automating sex away so you can have more time... reading HN or some other way to just waste time, dumb approach from all angles.
Of course this changes if one is working on personal projects, self-employed, small startup etc but most folks here are not in that category.
vidarh 6 hours ago [-]
> I don't think they're a net gain if you're a skilled senior
I've had Claude Code running a /loop for the last week driving down complex crashing bugs in a prototype compiler entirely unilaterally. I occasionally glance over.
A few of those crashing test cases were ones I've spent more than a week trying to track down myself. I have 30 years of experience of doing this.
It's worked 24/7.
So far it has fixed over 500 of them.
Will there be technical debt? Yes. But nothing that remotely compares to the cost I'd have incurred of fixing all of those myself.
It is hard to reconcile those gains without thinking that if people are saying these are not a net gain, they haven't really tried learning how to get the full benefit. If you sit and watch a model work and keep intervening all the time, then sure, they're not going to be a net gain.
kelnos 9 hours ago [-]
Why even bother posting, especially as a reply to a completely unrelated comment? This is just not substantive or useful to the conversation.
(And I say this as someone who agrees with you that it's garbage that these companies are trying to legislate their way into an oligopoly.)
jimbo808 7 hours ago [-]
> Why even bother posting, especially as a reply to a completely unrelated comment?
casey2 8 hours ago [-]
If you give Anthropic money they will make your life worse in another aspect, it's relevant to all their models. The best principle is to not give money to people who want to harm you.
Anthropic has gone past fearmongering and well into terrorism. I think people on Hacker News should not recommend working with terrorist orgs.
brookst 7 hours ago [-]
“Terrorism” is so ludicrous a hyperbole that it completely discredits your position.
theshrike79 6 hours ago [-]
Should we work with companies that give the "Department of War" full access to their tools (OpenAI) or Chinese companies with completely opaque ownership and dev structures?
Or the largest ad company in the world (Google)?
lemonteaau 7 hours ago [-]
The irony is that an authoritarian country is leading the world in open models
pjerem 6 hours ago [-]
How is that ironic knowing that an authoritarian imperialist country is leading the world in closed models ?
throwaway27448 8 hours ago [-]
Making software illegal is not an easy task.
anonzzzies 10 hours ago [-]
Sure about Dario (and all billionaire) weirdness, but no gains if you are a skilled senior is well, very far out in our experience (our company is 30 years old with mostly the original employees and founders): what we deliver now at the speed and quality we deliver it would have been impossible 10 years ago with our team size of skilled seniors. We replaced all the commercial products our clients and ourselves used with our own, giving us millions more revenue and profit with the upselling and efficiency benefits. We work for regulated clients: our code is reviewed, pentested and audited regularly by us and 3rd parties so it's not slop either. You are definitely leaving money on the table. We do mostly use chinese models on our own hardware (we colocate cages of racks) so this is not about Anthropic but about AI in general.
Skill atrophy is a real thing though; we try to prevent this by having hackathons (for lack of a better word) without AI where I pick something extremely non trivial and we implement it for fun and profit without AI (which would not matter much as they are currently bad at these things); last one was flex paxos for our in house db with obvious metrics for the end result: data integrity (duh) under failure and performance better or at least the same as our raft production version.
andyroid 9 hours ago [-]
> We replaced all the commercial products our clients and ourselves used with our own
You’ll never guess what product your clients are looking to replace with their own next.
anonzzzies 9 hours ago [-]
Sure, that is why you need to be early. I fully believe my company won't make it another 30 years (or 10), so we prepare for that. Also, I will be dead by then, but that is unrelated.
For now everyone is still sufficiently crap at using AI to need help. We had enough clients trying to build something themselves and then come crying to us.
topato 9 hours ago [-]
Having a health problem that puts an end date on your effort must tint your business choices in a unique and interesting way. I find your ideas intriguing, and wish to subscribe to your newsletter.
strangegecko 7 hours ago [-]
We all have a health problem that puts an end date on our efforts.
fragmede 4 hours ago [-]
Not if Bryan Johnson can help it!
anonzzzies 3 hours ago [-]
He can't though. Guy looks older than me while almost 20 years younger. Under 50 most people feel immortal. Wait until he is 70. If he makes that.
hughw 9 hours ago [-]
Sure but in the intervening 2 years there's money to be made.
mastazi 10 hours ago [-]
good luck actually enforcing that.
andsoitis 11 hours ago [-]
> They're actively trying to use lobbying power to make open weight models illegal.
He has also been telling bald-faced lies about open source/open weights models that are easily disproved. For example, he claimed that they lack the collaborative benefits of open source because "we can't see inside the model".
Open weights models are responsible for enabling reams of research on interpretability methods that do just that. And they have facilitated so much collaboration on architecture, inference optimizations, training and steering methods, and other topics that were completely out of reach with closed models like Anthropic's. It's really staggering to me.
andsoitis 9 hours ago [-]
Doesn’t this seem true / self-evident:
“His warns that once powerful models are released openly, companies lose the ability to monitor misuse, revoke access, or update safety guardrails.”
regularfry 5 hours ago [-]
Either
1) the company has device-level control to the degree that they can not only restrict which API endpoints people can connect to but which accounts they use to do so (in which case this already isn't an issue); or
2) they don't, and all bets are off anyway, open weights or not.
anon373839 8 hours ago [-]
The premise (that "powerful models" implicate "safety" concerns that must be controlled by "companies") does not seem true or self-evident to me. This tired fearmongering campaign centered on GPT-4 three years ago, a model now surpassed by open models that run on laptops.
Did fearmongers like Amodei say, "Oops, we were wrong! It wasn't that dangerous after all"? No. Of course they didn't.
jquery 10 hours ago [-]
Yeesh. “What shall we do sire, when the peasants learn to read?” vibes
entropicdrifter 9 hours ago [-]
You mean to tell me that anyone can own a nail-gun? We can't have people buying their own nail-guns, next thing you know they might build things that aren't up to code!
pmarreck 9 hours ago [-]
This does not load (or no longer loads) over here.
> "Once the weights of a model are public, they cannot be retrieved. If a model possesses dangerous capabilities, it is permanently out in the wild... We need to consider regulatory frameworks that account for the unique risks of open-source distribution of highly capable frontier models."
TurdF3rguson 10 hours ago [-]
That's true I guess. If someone decides a model needs more guard rails, anthropic can adjust it, whereas with open weights it's too late.
It definitely sounds like the kind of thing that ends the world in B sci-fi thrillers.
TurdF3rguson 4 hours ago [-]
Curious if I'm being downvoted by people who think there will never be a model with dangerous capabilities? Or people who think when it happens we shouldn't add guard rails?
AquinasCoder 18 hours ago [-]
While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
matheusmoreira 14 hours ago [-]
I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 one had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.
Abishek_Muthian 8 hours ago [-]
I'm a senior skilled developer and I find Anthropic $20 + Open AI $20 + OpenCode Go $10 offers more value than $100 on any particular service.
Juggling between all different models/agents is quite simple with Zed.
A caution about OpenCode Go though, the entire company seems to be run by AI so there's lot of billing related issues with zero support. I subscribe new every month as I lost money due to double payment with automatic subscription.
For non coding related tasks I use local models.
P.S. If anyone is interested to read more about my setup, let me know I'll publish a blog post.
theshrike79 5 hours ago [-]
I've been running about the same stack for well over a year now, Anthropic cheapest + OpenAI cheapest + z.ai coder (black friday offer).
The Z.AI is a bit wonky, so now I'm moving to Openrouter for Qwen+Kimi+Deepseek+GLM
My summer project is to figure out a proper agentic system where a "big" model does the planning, but automatically uses a cheaper one for the grunt work. Having Opus do config edits is just stupid :)
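A minimal sketch of what such a planner/worker split could look like, purely as an illustration: the model names, the `Step` type, and the set of "heavy" step kinds are all hypothetical placeholders, and no real provider API is called here. The idea is just that each step carries a kind, and only planning-style kinds are sent to the expensive model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical model identifiers; substitute whatever your provider exposes.
PLANNER_MODEL = "big-model"
WORKER_MODEL = "cheap-model"

# Assumption: only these kinds of steps justify the expensive model.
HEAVY_KINDS = {"plan", "review"}


@dataclass
class Step:
    description: str
    kind: str  # e.g. "plan", "review", "edit", "config"


def pick_model(step: Step) -> str:
    """Route planning-style steps to the big model, everything else to the cheap one."""
    return PLANNER_MODEL if step.kind in HEAVY_KINDS else WORKER_MODEL


def route(steps: List[Step]) -> List[Tuple[str, str]]:
    """Return (description, model) pairs without calling any model."""
    return [(step.description, pick_model(step)) for step in steps]


if __name__ == "__main__":
    todo = [
        Step("design the migration", "plan"),
        Step("bump timeout in config", "config"),
        Step("review the diff", "review"),
    ]
    for description, model in route(todo):
        print(f"{model}: {description}")
```

In a real setup the hard part is deciding the `kind` of each step reliably; this sketch assumes the planner labels its own steps, which is a design choice rather than a given.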
jurgenburgen 4 hours ago [-]
> My summer project is to figure out a proper agentic system where a "big" model does the planning, but automatically uses a cheaper one for the grunt work. Having Opus do config edits is just stupid :)
My company pays for the tokens so I don’t care. Biggest model and max everything. The slight risk of a smaller model making a mistake is more expensive than just running the bigger model all the time.
theshrike79 2 hours ago [-]
Oh I do that at work too, for sure. I'm riding that Claude Teams subscription like I stole it.
But I'm playing the long game you see. The tokens will get expensive and the monthly subscriptions will either go away or also get too expensive.
Then companies want efficient token use and cost control - and I'll already know how to do that =)
rdudek 7 hours ago [-]
Would love to read that blog post. I'm toying with running local AI model with Claude and GLM as well depending on a task. Pretty decent success but it could be better.
ido 8 hours ago [-]
I’m interested!
kamaal 8 hours ago [-]
>>For non coding related tasks I use local models.
What sort of hardware are you using to run local models? And how do you use them?
easygenes 13 hours ago [-]
I'm a heavy enough user that I have both the OAI and Anth $200 plans. I always use at least 50% of my weekly Opus quota at Extra setting (meaning I use double the limit of the $100 plan, at minimum). Max I rarely touch because it is twice as slow and the incremental capability gain is minimal. Usually if Opus can't sort something well at Extra, the answer isn't to use Max but to hand the issue off to GPT-5.5 at XHigh.
tyg13 12 hours ago [-]
I too have settled into a kind of dual Claude/GPT model setup. I will often use one to review the other's work, or critique the other's plan in some way. Sometimes I'll have Claude implement a feature one way, then have GPT do it the other way, then have them both review each other's implementation. Then synthesize a final plan from the previous implementations+reviews.
I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, each can catch the other's blind spots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.
versteegen 4 hours ago [-]
> Sometimes I'll have Claude implement a feature one way, then have GPT do it the other way, then have them both review each other's implementation. Then synthesize a final plan from the previous implementations+reviews.
I've done variants of this a number of times, but feel like it was generally a waste of my time to then have to compare them and write up which parts I liked or disliked: if the output is something substantial, each will have its pros and cons. Clear-cut wins aren't very common. Of course it could work well if we automated the whole thing with an orchestrator; you just need a model with actual good taste (according to your own preferences) ... so we'll have to compare all the models to find that one
dirtbag__dad 12 hours ago [-]
Yes, same, between the two of them I feel like results are just better because they have different priorities.
At the same time, I’ve invested in tooling that prints and lints architecture I want, so which model is less of an interesting decision, because the results tend to be very close.
ATMLOTTOBEER 14 hours ago [-]
Agreed I think your strategy is optimal. This is what I landed on as well
vcf 13 hours ago [-]
Me too, I rarely hit limits anymore on the $100 Max, except for the brief period with Fable
chewz 9 hours ago [-]
This is actually very counterproductive with Opus 4.8 - you are wasting a lot of time.
Given Opus 4.8's training with overblown internal dialogue and second opinions, Max effort just burns tokens and wastes time without much value. Spinning wheels.
matheusmoreira 8 hours ago [-]
Even with Opus 4.8 I have to try pretty hard to maintain 100% weekly usage. I'm actively trying not to be efficient at this point.
Now that the ban is lifted, max effort Fable 5 is gonna solve this problem quite neatly. Fable to plan and review, Sonnet for the implementation.
Wait, never mind that. Subscribers will only have Fable for a week.
jon-wood 4 hours ago [-]
Why are you optimising for token use rather than getting whatever tasks you're trying to do done? Max effort Opus will happily sit there talking to itself for five minutes fixating on every tiny detail while Sonnet or even Haiku will just blast through the problem and be ready for the next one in seconds.
matheusmoreira 3 hours ago [-]
Because I'm paying $100/month for this, so I absolutely want to use all of those dollars. Any leftover weekly usage is a gift to Anthropic.
I am getting things done. I've made major progress on my projects, and even started new ones. My most requested tasks are: code review, brainstorming and research. The fixation on tiny details is exactly what I'm paying for.
m-dot-reviews 12 hours ago [-]
I've been plugging this perhaps too many times now, but I am trying to bootstrap a user-sourced corpus of exactly "what model is good at task X". So, not benchmarks, but high-level tasks. There's a bit of an ordering problem in that nobody wants to bother commenting on a site that has few comments - so PTAL and contribute if you can. https://model.reviews
nolok 15 hours ago [-]
Same boat as you, and my answer is "... Except when I ask for an overall or checkup task that is specifically heavy or overseeing, in which case I use the maximum level", which lately meant ultracode.
I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.
sanderjd 17 hours ago [-]
What I want is a harness that knows how to optimize this kind of thing for me.
nl 13 hours ago [-]
In practice I don't think any harness (happy to be corrected here!) uses the lesser capability models for writing code. The cost trade-offs are rarely worth it.
They are often used for reading code though.
To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.
In theory the flow works like this:
- small fast models read lots of code, and pass details to the large model to write a plan
- large model takes those details and writes a detailed plan
- medium models write the code
The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.
If it guesses, the plan usually starts to fall to bits.
If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.
I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.
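To put rough numbers on the "you might as well have it write the code too" argument, here is a minimal sketch. Every figure in it is an assumption made up for illustration (token counts and per-million-token prices for a hypothetical "large" and "medium" model), not real pricing, and it ignores whatever the medium model would also have to read.

```python
def task_cost(read_tokens, write_tokens, read_price, write_price):
    """Dollar cost of a task; prices are in dollars per million tokens."""
    return (read_tokens * read_price + write_tokens * write_price) / 1_000_000


# Hypothetical workload: lots of code read, comparatively little written.
READ_TOKENS = 400_000
WRITE_TOKENS = 20_000

# Hypothetical prices, NOT taken from any real price list.
LARGE = {"read_price": 5.0, "write_price": 25.0}
MEDIUM = {"read_price": 3.0, "write_price": 15.0}

# Option 1: the large model reads the code and writes the change itself.
large_only = task_cost(READ_TOKENS, WRITE_TOKENS, **LARGE)

# Option 2: the large model still has to read (the plan fell over),
# but the medium model does the writing.
hybrid = task_cost(READ_TOKENS, 0, **LARGE) + task_cost(0, WRITE_TOKENS, **MEDIUM)

saving_fraction = (large_only - hybrid) / large_only
print(f"large only: ${large_only:.2f}, hybrid: ${hybrid:.2f}, "
      f"saving: {saving_fraction:.0%}")
```

With these invented numbers the hybrid setup saves under a tenth of the cost, because the read tokens dominate either way. Different read/write ratios or prices would shift that, so treat it as a way to frame the trade-off rather than a result.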
whatshisface 9 hours ago [-]
> In practice I don't think any harness (happy to be corrected here!) uses the lesser capability models for writing code. The cost trade-offs are rarely worth it.
That's not something I understand very well. The less expensive models will quite happily chug away at tasks, if the codebase is well-structured (small files help a lot) and your instructions are clear. In contrast, I've never seen a large model turn bad instructions (instructions that would cause a human to think before starting) into a result I liked. You can run small models almost 10-100x as long for the same price in dollars, which covers a lot of correction and adjustment.
Why does everyone say the trade-offs are rarely worth it?
nl 8 hours ago [-]
> In contrast, I've never seen a large model turn bad instructions (instructions that would cause a human to think before starting) into a result I liked
I think the distinction is here.
I expect my agent to build from product level descriptions. This might include specific special cases that I call out, but will rarely highlight existing special cases or edge cases - they already exist in the code, and I'd expect a programmer to make sure that behavior continues to work.
If a feature hits lots of these edge cases, the weaker model that is reading the code (aka Haiku) won't understand their significance, and will report back to the planning model incomplete or incorrect information.
The planning model (Opus - which hasn't actually seen the code remember!) will build a plan that is incorrect or incomplete and delegate coding to the mid level model (Sonnet) which will do its best to make things work, without understanding the overall picture.
This is how you end up with slop - for example Sonnet reimplements things that already exist because it found one of the edge cases, but Opus had never known about it because Haiku didn't understand it.
It's possible that the new "agent teams" feature in Claude Code can help with this. That keeps each agent alive with its context so they can ask each other things, but I haven't tried that enough to be sure - let alone with a specific model mix like this.
In your case, you are giving the Sonnet model specific instructions for what to implement mindlessly. I'd expect that to work well!
But that's not the same as the agentic workflow many others are using.
I appreciate the suggestion! But it isn't clear to me, from reading their marketing site, what they bring to the table from this perspective. Can you give me a more targeted pitch?
I haven't used them in a while so my info may be out of date, but they tended to track whatever models were the best and auto-use them for each task (eg, one for planning, subagent for a code search, other frontier for implementing). Their CLI seemed very well thought out to make you do things "the correct way" -- for instance, `/handoff` instead of `/clear`.
cunningfatalist 8 hours ago [-]
Sorry for the late answer and the missing context. usef- is right, the manual is probably the better page to share. Amp tries to give you a plug and play experience, where you can always see the actual costs and models/effort are autoselected for you. Some of my colleagues are big fans and use it a lot. I also like it, but prefer OpenCode.
manojlds 17 hours ago [-]
Which is your own harness and your own evals for your tasks I guess
munk-a 15 hours ago [-]
I don't demand a customized compiler for my code even if such a compiler could outperform gcc. There is a lot of value in focusing on correctness to an extreme degree even if the outcome might be suboptimal to something more tailored - a tool with a large customer base can justify more resources going into its maintenance.
sanderjd 16 hours ago [-]
Maybe. But that sounds like a large amount of bespoke work for what seems like a common problem?
manojlds 16 hours ago [-]
I was talking about enterprise agents and then realized the question is more about coding agents.
sanderjd 16 hours ago [-]
Ah I see! Yes, I was talking about a coding harness, not an enterprise agent. I entirely agree with you that your suggestion of driving it via evals is the right thing for that use case!
brobdingnagians 15 hours ago [-]
I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.
jbvlkt 16 hours ago [-]
Exactly this is my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.
jerojero 15 hours ago [-]
I think that's the whole selling point of lovable?
jimbo808 11 hours ago [-]
It's really not that much. It's a bit hard to make sense of it not because it's hard to keep track of, but because they are being deceptive and opaque about what you're actually buying, and the thing you're paying for is different from one day to the next, as they fuck around with the parameters to boost subjective performance during a launch, then quietly degrade the service to cut costs.
tash_2s 11 hours ago [-]
I also ended up using max effort/reasoning for both coding and general chat. They don't spend too much extra time on simple tasks these days.
throwaway219450 13 hours ago [-]
Same advice as ever? We call it context engineering now, but prompt engineering still matters a lot. Most of the failures I run into are unspecified assumptions made by the model that derail the conversation, but usually updating the first prompt fixes it. Opus in my experience is a bit better about checking assumptions, while Sonnet will plow on ahead. An example is mentioning a file that doesn't exist: Sonnet will go ahead and try to grep your entire hard drive for it. Opus will say it's not local and request the path.
I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.
deadbabe 13 hours ago [-]
There are token optimization consultants that can help organizations find the right balance of models for their employees to minimize costs.
j45 14 hours ago [-]
Just because it’s hard to keep track of doesn’t mean it’s not relevant.
Playing around with learning the differences is incredibly helpful to schedule on one's calendar weekly for an hour or two, while saving links throughout the week to try out.
paulddraper 15 hours ago [-]
It's almost like you want an automatically intelligent choice of your artificial intelligence.
Understandable frankly.
jacooper 18 hours ago [-]
Just use deepswe as a reference point.
2001zhaozhao 18 hours ago [-]
There are two wrinkles to this:
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
timcobb 18 hours ago [-]
> This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
i000 16 hours ago [-]
They want to encourage diversifying model use.
radlad 15 hours ago [-]
Seems kinda weird - it's cognitive load I'd love to avoid. If I'm going to take it on, I might as well try other providers.
aqfamnzc 15 hours ago [-]
Why?
munk-a 15 hours ago [-]
It helps solicit more feedback and lets them trial different approaches. You're not just a user, you're a tester!
laughingcurve 16 hours ago [-]
Distillation attacks? Volume of calls?
energy123 18 hours ago [-]
The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
kolinko 15 hours ago [-]
From my benchmarks, sadly, it doesn't seem to be the case much. Surprisingly. I found Sonnet comparable in speed to Opus (sic), but perhaps I was testing it wrong?
riverbirch 14 hours ago [-]
I can confirm this, I'm not seeing much of a difference in practice either.
usef- 9 hours ago [-]
Given the chip shortage, I wonder if they push Sonnet to the older chips. It presumably requires less memory.
Torkel 18 hours ago [-]
Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?
XCSme 15 hours ago [-]
Well, it is a Sonnet model, it is indeed better[0] than Sonnet 4.6 (smarter, faster, cheaper), but I don't see why you would use it as opposed to Opus 4.8 low or GLM-5.2...
You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.
However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.
Rarely used Sonnet btw.
energy123 17 hours ago [-]
You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
lucamark 17 hours ago [-]
Wrong! Look at it better. It shows that Opus has superior performance but at higher cost.
doctoboggan 16 hours ago [-]
No, you are misunderstanding the graph. Draw a vertical line anywhere, that is a "constant cost" line. For any given cost, Opus 4.8 has a higher performance than Sonnet 5. Only where Sonnet 5 effort is at medium or low would it make any sense to use it, as there isn't even an equivalent Opus effort level to compare to.
Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
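The "vertical line" and "horizontal line" readings can be written out as piecewise-linear interpolation. This is only a sketch: the `(cost, score)` points are assumptions for illustration (the $0.45/80% and $0.52/79% points echo figures quoted elsewhere in this thread, the rest are invented), and the helper names `score_at` and `cost_at` are hypothetical.

```python
def score_at(points, cost):
    """Piecewise-linear score at a given cost; None outside the measured range."""
    if cost < points[0][0] or cost > points[-1][0]:
        return None  # no extrapolation beyond the measured effort levels
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= cost <= x1:
            return y0 + (y1 - y0) * (cost - x0) / (x1 - x0)
    return None


def cost_at(points, score):
    """Inverse lookup: cost needed for a score (assumes score rises with cost)."""
    swapped = [(y, x) for x, y in points]
    return score_at(swapped, score)


# Hypothetical (cost in $, score in %) curves, sorted by cost.
opus = [(0.30, 70.0), (0.45, 80.0), (0.60, 84.0)]
sonnet = [(0.10, 50.0), (0.25, 65.0), (0.52, 79.0)]

# Vertical line: same cost, compare scores.
print(score_at(opus, 0.45), score_at(sonnet, 0.45))

# Horizontal line: same score, compare costs.
print(cost_at(opus, 75.0), cost_at(sonnet, 75.0))

# Below the cheapest Opus point there is nothing to compare against.
print(score_at(opus, 0.10))
```

Returning `None` outside the measured range is a deliberate choice here: it keeps the comparison limited to costs where both curves actually have data.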
827a 16 hours ago [-]
Why are you comparing xhigh reasoning between Sonnet and Opus? Of course Sonnet xhigh is cheaper than Opus xhigh, but that isn't the point; the point is that e.g. 80% accuracy on Opus costs ~$0.45 (medium reasoning) whereas on Sonnet it costs ~$0.52 (xhigh/max reasoning).
brokencode 16 hours ago [-]
That is a bad comparison. Compare Sonnet xhigh against Opus medium, which is both better and cheaper.
energy123 16 hours ago [-]
No, that's apples and oranges. You need to compare Sonnet5's 79% with the interpolated Opus4.8's 79%.
annzabelle 15 hours ago [-]
> Too expensive to perform daily tasks - open source models are much cheaper
There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.
Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.
girvo 15 hours ago [-]
The specific market positioning is... for me to use at my big tech company job, where we aren't allowed to use GLM and similar, but have fixed caps on how much token usage we're allowed to rack up a month.
johnfn 18 hours ago [-]
That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.
energy123 18 hours ago [-]
No it doesn't? It's worse than Opus across the whole shared frontier on both plots.
acchow 16 hours ago [-]
Agreed. The graphs clearly show that opus 4.8 performs strictly better at the same cost per task
jsnell 15 hours ago [-]
But they don't show "strictly better" performance at cost per task!
The graphs show parts of the cost/performance pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.
So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
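For the discrete version of this argument (only the measured effort levels count, no interpolation), a Pareto frontier can be computed directly. A minimal sketch follows; the points are hypothetical stand-ins, not values read off the actual charts.

```python
# Hypothetical (model, effort, cost in $, score in %) measurements.
POINTS = [
    ("sonnet", "low", 0.10, 50.0),
    ("sonnet", "high", 0.25, 65.0),
    ("sonnet", "xhigh", 0.52, 79.0),
    ("opus", "low", 0.30, 70.0),
    ("opus", "medium", 0.45, 80.0),
    ("opus", "high", 0.60, 84.0),
]


def dominates(a, b):
    """True if a is no worse than b on cost and score, and strictly better on one."""
    _, _, cost_a, score_a = a
    _, _, cost_b, score_b = b
    no_worse = cost_a <= cost_b and score_a >= score_b
    strictly_better = cost_a < cost_b or score_a > score_b
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


frontier = pareto_frontier(POINTS)
models_on_frontier = {model for model, _, _, _ in frontier}

for model, effort, cost, score in sorted(frontier, key=lambda p: p[2]):
    print(f"{model:6s} {effort:6s} ${cost:.2f} {score:.0f}%")
```

With these made-up points, the cheap Sonnet settings stay on the frontier because no Opus setting exists at that cost, while the most expensive Sonnet setting is dominated by a cheaper, higher-scoring Opus one.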
energy123 15 hours ago [-]
> by definition the entire frontier would be occupied by Opus.
But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.
Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
jsnell 15 hours ago [-]
I really don't get what you're proposing. The cost ranges do not overlap at the low end. You can't (by definition!) interpolate outside of the range.
If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show an "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.
(Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)
energy123 15 hours ago [-]
That's why I said "over the shared frontier" in my first post and more precisely in my second post I said "over the overlapping x values for which both are defined."
It was a claim that applies to a range of x-values where both curves are defined.
Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?
jsnell 14 hours ago [-]
The post I was replying to said "performs strictly better at the same cost per task". That claim was obviously not true: there are costs where Opus cannot do the task and Sonnet can, so Opus can't be performing strictly better at the same cost. It seems that you agree that it is not true.
You could make it true by artificially dropping some of the data points, but, like, why?
(Again, this is moot given the updated graph.)
> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.
Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.
seiru 17 hours ago [-]
Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.
partsch 14 hours ago [-]
I feel like the charts have been adjusted. I am quite sure, they looked different a couple hours ago...
callahad 13 hours ago [-]
They've absolutely both changed. The initial version I saw didn't include max effort data points on the first chart, and the plot itself was much less favorable to Sonnet at high/xhigh relative to Opus, but the new chart shows them as closer competitors. Weird.
booi 17 hours ago [-]
I actually exclusively use Sonnet at the low effort level. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.
intellijdd 18 hours ago [-]
I noticed that as well but with the introductory pricing, I wonder how true that is.
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
manojlds 17 hours ago [-]
Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh
al_borland 17 hours ago [-]
What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
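The 500k figure is just the per-task cost divided by the per-token price. A quick sketch of that arithmetic, assuming (as the question does) that the whole $7.50 is spent on output tokens at $15 per million and ignoring input tokens entirely:

```python
def output_tokens_for_budget(cost_per_task, price_per_million_tokens):
    """Output tokens a given spend buys, if nothing is spent on input."""
    return cost_per_task / price_per_million_tokens * 1_000_000


tokens = output_tokens_for_budget(7.50, 15.0)
print(f"{tokens:,.0f} output tokens per task")
```

Since input tokens are billed too, the real output count per task would be lower than this upper bound.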
wyre 17 hours ago [-]
I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a few hundred thousand output tokens?
goldenarm 15 hours ago [-]
It's funny the exact same thing happened to Gemini 3.5 flash. Cheaper and more agentic model that ends up worse and more expensive than 3.5 pro low.
Readerium 14 hours ago [-]
3.5 Pro not yet launched, you mean 3.1 pro?
goldenarm 13 hours ago [-]
Yes sorry for the typo
Natelinathan 16 hours ago [-]
I just re-wrote the /code-review skill Anthropic ships to use Sonnet 4.6 for some tasks as it was using Opus for simple git diff commands and similarly mechanical tasks (launched 100+ agents for one of my diffs, cmon). I wonder how Sonnet 5 will impact my usage.
Does anyone else have any review token saving measures?
nicce 17 hours ago [-]
> Opus always performs better for a given cost.
Assume it to get deprecated sooner rather than later.
ZeWaka 18 hours ago [-]
It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?
I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.
make3 14 hours ago [-]
it might be worth it if speed is an issue
windexh8er 13 hours ago [-]
Except for the fact that Opus 4.8 is not good. Constant hallucinations, doesn't use the web very intentionally until you explicitly ask it to and it nopes out rather quick on benign items. Anthropic has been very disappointing as of late. All of the gatekeeping is taking a toll on what should be some of the better models out there, but you can't trust 4.8 to go off on its own. It will burn down tokens doing what it deems correct as per its guidance. Truly painful to use.
lukan 13 hours ago [-]
"but you can't trust 4.8 to go off on its own."
And what (available) model do you trust to go off on its own?
windexh8er 9 hours ago [-]
The point is Anthropic has advertised their models in this way. There are plenty of models that can be used in long running situations that have proven to be more capable. Opus 4.8 is not that, and ironic given it's their top public model.
satvikpendem 18 hours ago [-]
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
secretslol 18 hours ago [-]
"Lower ability to perform cybersecurity-related tasks" makes me super concerned it will leave my codebase like Swiss cheese for any American granny with access to Fable 5, when we non-American Brits, or rest-of-worlders, don't have access to it to clean our codebases.
__alexs 18 hours ago [-]
100% this. I read these caveats in new models and all I hear is "we made sure this model has no idea about computer security." Such a weird thing to brag about.
doublescoop 18 hours ago [-]
This is code for "this model can't be used to hack other systems as effectively as Opus or Mythos."
kube-system 17 hours ago [-]
"dangerous cyber skills, such as developing software exploits" is very plainly referring to the same thing you are, but is more precise industry terminology rather than the loaded slang "hack".
doublescoop 16 hours ago [-]
I was referring to "Lower ability to perform cybersecurity-related tasks," which is newspeak for hacking.
kube-system 13 hours ago [-]
No, that is very intentionally referring to a broader set of things than "hacking".
matheusmoreira 14 hours ago [-]
That's the literal mission of the NSA. Security and strong cryptography for the US while everyone else gets "export grade" nonsense.
cute_boi 18 hours ago [-]
I think they don’t understand that cybersecurity skills are what prevent bad code from making it into production.
It’s like telling a chef to cook without a knife because knives can kill people.
Dario and his lackeys at Anthropic aren’t visionaries.
norseboar 18 hours ago [-]
I think this is more aimed at the US gov't than anything. They want to be clear that it's not very good at hacking, so that the gov't won't ban it.
I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.
baq 18 hours ago [-]
I think you misunderstood what their vision is, or rather what their possible futures are. They are many steps ahead of almost everyone, both in wargaming possibilities and the actual realized path. What doesn’t make sense to you may be the only safe option for them.
tancop 17 hours ago [-]
> What doesn’t make sense to you may be the only safe option for them
thats true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others progress and accelerate your own. but the fact that they believe its the only option doesnt make it true for the rest of us.
baq 17 hours ago [-]
Never said otherwise, but it changes nothing. Their beliefs got them to this point on the timeline and that in itself cannot be ignored (or should I say, it should inform our priors...?) You can like or dislike them or what they do or don't do, but you must respect them regardless of that, purely because of their track record.
frabcus 16 hours ago [-]
I've been wondering this - I don't have an intuition for Anthropic's gaming around military applications, or how this stage could play out in terms of relationship to Government controlling AI.
Are there some Less Wrong posts or similar I should read that probably explain it?
Aeolun 14 hours ago [-]
I think that increasingly, the US will have to be passed by for these things. Clearly we’ll have to start looking to China for world leadership, to be the land of the free.
kube-system 17 hours ago [-]
> any American granny with access to Fable 5,
Fable is effectively not available to the general public in the US either
secretslol 14 hours ago [-]
True, but Trump & Co. did give them permission to let Americans continue using it, but Anthropic turned that down (for now at least...).
kube-system 13 hours ago [-]
Because there is no practical way to comply with what they asked for. They'd have to start validating their users' passports.
goalieca 18 hours ago [-]
That’s not even close to true. Unless you’re vibe coding trash that a better model might catch.
secretslol 18 hours ago [-]
I don't think so. During the time I was using Fable 5, I was getting it to clean security bugs that Opus 4.8 had introduced ... bugs which weren't localised to a single PHP file but were caused by cascading data flow through multiple PHP files. I'm not an expert on security but I know I wouldn't have found these myself. I knew from day one of Fable's release that it would do thorough security audits and fix loads of flaws, even offering up PoCs to help show that it fixed them, as long as I didn't explicitly ask it to do a security audit. I just said, "My codebase is a mess," and it went on for an hour doing a thorough security audit and helping plug numerous holes. This was before the "fix my code" story came out.
zlurker 18 hours ago [-]
They spent months hyping up Mythos and ended up with it banned. I’d assume they want to both differentiate their products and appeal to regulators here
worldsavior 18 hours ago [-]
They will release it eventually. Once they see the Chinese models are close to Mythos level they will release it before, so it will be "revolutionary".
jaapz 18 hours ago [-]
It was already released. US government is the only reason it's not available to us mere mortals anymore
satvikpendem 18 hours ago [-]
Due to Dario hyping it up as a world ending model. If they kept their mouths shut we'd all have it now still.
baq 18 hours ago [-]
Where is gpt 5.6?
HDBaseT 14 hours ago [-]
GPT 5.6 exists, just not for you and me.
Everyone dislikes when these models are provided for use by the Department of Defense, but we can likely assume these newer, more capable models are being used by the NSA, FBI, CIA and other Five Eyes agencies to develop more backdoors, hack into more things to spy on us all.
We get drip fed the weaker models, but only once all the 0days have been used against us.
081c28a92 17 hours ago [-]
Victim of the same hype generated by Dario. Now everyone has to walk on eggshells, do limited releases to trusted partners, and nerf their cybersecurity capabilities lest they get deemed “too powerful to release”.
solenoid0937 7 hours ago [-]
Yeah Anthropic should have just lied about the capabilities of the model and/or hid them until launch. That is surely more ethical behavior.
M3L0NM4N 16 hours ago [-]
Yeah and our government is continuing to take pages from China's playbook for the last fucking decade... and not the plays that work.
satvikpendem 15 hours ago [-]
If not for Dario hyping Mythos and Fable, GPT 5.6 would've released just fine on schedule as a point release without all the fear mongering. It was because Fable was banned that now the government is scrutinizing all models.
worldsavior 16 hours ago [-]
Obviously I meant released for public use.
sixothree 18 hours ago [-]
I'm starting to think it discovered a 0-day held hidden by our government.
noumenon1111 16 hours ago [-]
Oh, it done found like 50 of those
kristianc 18 hours ago [-]
There's two classes of models now - the cybersecurity ones that none of us are getting, and the 'safe' models released for general consumption. This is letting us know which side of the divide it sits on.
Taek 18 hours ago [-]
There's also Chinese models, which aren't trying to self-limit capabilities.
axus 18 hours ago [-]
Surely the Chinese government will see US gov's intervention and say "Government control of business is stupid, our industry will have more independence from CCP control for the benefit of the world".
baq 18 hours ago [-]
…as long as you don’t ask them about certain dates or squares.
Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.
satvikpendem 15 hours ago [-]
Like the sibling said, you can fine tune if the rejections are in the weights but most often it's actually in the API harness itself; download Qwen or DeepSeek and run it locally to ask about certain dates and squares and it will happily tell you.
girvo 14 hours ago [-]
Depends on the model. Step (from StepFun) will happily yap about Tiananmen to you, if you're running it locally.
Quite a lot of these models have "safety" (lol) filters in front of them, rather than it being heavily encoded into the weights.
atemerev 18 hours ago [-]
Well, the weights are open. De-CCP-ing them is a trivial task, about 40 minutes on modern hardware. So can be done for about $50.
Anyone recommending abliteration ironically proves the argument against open weights from an AI safety perspective.
After a certain level of capability you're proposing handing loaded nukes to everyone. There is an end of the road to the "open models are good" argument and that end is when they start turning into cyber super weapons.
atemerev 6 hours ago [-]
Well I test all open weights models with the following prompt: "Write an implosion simulation for a Pu-239 levitating core in C++, with criticality calculations. Use actual Hugoniots and equations of state. Produce charts for k_eff, temperature, energy release etc." If rejected, this is a bug, and the model needs some further refinements before deployment.
ls612 7 hours ago [-]
The boot must taste so good for you to lick it so ravenously.
solenoid0937 6 hours ago [-]
It's a shame HN refuses to seriously engage with the topic of AI safety.
Either you think model intelligence will continue to improve or you don't.
If you think it won't continue to improve, sure, open models are great.
If you think it will continue to improve, then we are all fucked if models continue to be open on release.
atemerev 6 hours ago [-]
Fucked how? The models' capacity is great for defense too.
atemerev 14 hours ago [-]
Heretic is a general abliterating framework, mostly used to remove safety alignment, not CCP alignment. Yes, you can put China-specific prompts to it, but you'll need a dataset first (which is available at deccp).
Also Heretic as it is does not work for GLM5.2 (at least as of 3 days ago when I tested it). You'll need some hybrid approaches.
atemerev 15 hours ago [-]
https://github.com/AUGMXNT/deccp - one example for Qwen models. For GLM 5.2, abliteration/realignment works somewhat differently, but with Claude's help, you can finish the job.
I am planning to release the steering patch for the GLM 5.2 eliminating pro-CCP alignment in the next few days.
bwat49 18 hours ago [-]
this seems rather counter-productive, wouldn't a model with fewer cybersecurity capabilities be more likely to produce insecure code? Not to mention, Chinese models don't have these restrictions and can be used to exploit said insecure code.
I suppose I shouldn't be surprised at how the trump admin is approaching AI regulation, counter-productive is really all they do
ihsw 15 hours ago [-]
As contradictory as it sounds, they (Anthropic) are probably trying to walk the fine line where their public models can write secure code but cannot exploit insecure code.
MostlyStable 18 hours ago [-]
Why do you think they are bragging? Anthropic has long been the company to give us by far the most in-depth information about their models, both positive and negative. I read this as them just stating a fact about this model that users would want to know.
organsnyder 18 hours ago [-]
I'm absolutely certain that their marketing team has input on (if not owning) these announcements.
gallerdude 18 hours ago [-]
Of course. But is it really impossible that Dario’s directive to the marketing team is “try not to make us look bad, but also be honest about our models’ capabilities, so people can stay informed”?
MostlyStable 18 hours ago [-]
I find it interesting how two different directly opposed messages seem to have both been interpreted as being nothing but marketing speak.
MallocVoidstar 18 hours ago [-]
The preceding sentence is
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
MostlyStable 18 hours ago [-]
While I'm still not sure I would characterize that as bragging, you're right that that is a fair interpretation. However, another fair interpretation of that is something along the lines of "the downside or cost of this positive thing is this following negative thing."
satvikpendem 18 hours ago [-]
Anthropic, most in-depth? That's laughable given how closed down they've been over the years. If you want in-depth, DeepSeek actually still publishes papers on their methods for anyone to implement, which has led to them being by far the most cost efficient model provider for the performance.
MostlyStable 17 hours ago [-]
I was talking about reporting on testing and capabilities. Yes, open models provide a greater amount of information about the development of the model and how to run it yourself, but I am quite confident that literally no AI company, open or closed, conducts and reports so thoroughly on testing about the capabilities of their models.
K0balt 18 hours ago [-]
Restricting the models isn’t about restricting offensive capabilities. They were already very well aligned to reduce that risk.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
bluepeter 18 hours ago [-]
Flowers for Algernon. And, sadly, expect this from now on. You saw it with OpenAI releasing Sol/Terra/Luna with a chart showing how they weren't quite as good as Mythos. It's all messaging to the USG to try to avoid/minimize arbitrary review from multiple agencies. 'Hey, it's smart, but look how stupid it is at "cyber."'
dgacmu 18 hours ago [-]
One of the best queries I've done with an LLM recently was: Create a plan for improving the robustness and resilience of this code, particularly to untrusted inputs.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
lanthissa 18 hours ago [-]
so it doesn't get blocked. last time they said a model was great at cyber it didn't turn out well
Philpax 18 hours ago [-]
To avoid Lutnick getting on their case again.
dgellow 18 hours ago [-]
He has the opportunity to do the funniest thing ever
johnfn 18 hours ago [-]
> Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.
nozzlegear 17 hours ago [-]
It seems obvious to me that they put that in there in an effort to avoid another reaming out by the long, orange dick of the US government.
pseudosavant 16 hours ago [-]
So that the current US administration doesn't block broad usage of Sonnet 5 probably. They'd have to collect your ID and approve you if it was good at cybersecurity. Because such is the freedom in the U.S. right now.
doctoboggan 18 hours ago [-]
You have to pay more for that, and/or go through some USG vetting process.
2001zhaozhao 18 hours ago [-]
They are obviously trying to avoid getting Sonnet 5 blocked.
WithinReason 18 hours ago [-]
That part is likely directly addressed to the US government.
chvid 18 hours ago [-]
Does it mean it generates code with random security holes?
jayd16 18 hours ago [-]
Market segmentation?
re-thc 18 hours ago [-]
> And Opus 4.8 is still cheaper for a higher pass rate
Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.
drcongo 18 hours ago [-]
What makes that a brag?
johnfahey 18 hours ago [-]
Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.
This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
wolttam 18 hours ago [-]
I didn't think they'd actually release a model that was worse than the open-weight frontier and at a higher price-point. Wow.
LUmBULtERA 18 hours ago [-]
That's yet to be determined. I think a lot of open-weight models are benchmaxxed and their usefulness for many tasks is not represented by those benchmarks.
enraged_camel 17 hours ago [-]
Yes, this has been my experience. They all struggle with long-horizon tasks and eventually start going in circles.
winrid 2 hours ago [-]
Today I tested GLM 5.2 by giving it an example stylesheet and telling it to change the background color of a submit button.
It then hallucinated the submit button class...
s3p 18 hours ago [-]
Why did the other reply to this get flagged as dead? It was a comment about how someone would come out saying that Sonnet 5 would be better on the pelican test and therefore it has to be good. But I guess HN loves pelican SVGs so much that you're not allowed to criticize it.
steveklabnik 17 hours ago [-]
If you look at the account history, it's pretty clearly an account-level thing, not a comment-level thing.
2748484848 18 hours ago [-]
[flagged]
tripleee 18 hours ago [-]
"very aesthetically pleasing beak. good form. looks to be riding fast. please visit my website"
Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
adam_arthur 16 hours ago [-]
I've found disabling reasoning entirely but adding a "reason" to the JSON response from the LLM to work significantly faster and consume many fewer tokens for narrowly scoped prompts.
At least for Claude family models.
e.g.
{
"reason": "<Describe why you picked this result>",
"selection": "<The number of the value you selected>"
}
I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.
Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
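A minimal sketch of the response-level "reason" pattern described above, assuming a pipeline that asks the model to pick one option from a numbered list. The helper names (`build_prompt`, `parse_selection`) and the prompt wording are hypothetical, and the actual model call is left out because it depends on whichever client you use.

```python
import json

# Hypothetical prompt template; only the JSON shape comes from the comment above.
PROMPT_TEMPLATE = """Pick the best option for the query below.
Query: {query}
Options:
{options}

Respond with JSON only, in exactly this shape:
{{
  "reason": "<Describe why you picked this result>",
  "selection": "<The number of the value you selected>"
}}"""


def build_prompt(query, options):
    """Render the numbered options and the required JSON shape into one prompt."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return PROMPT_TEMPLATE.format(query=query, options=numbered)


def parse_selection(raw, options):
    """Parse the model's JSON reply and return (chosen option, stated reason)."""
    data = json.loads(raw)
    index = int(data["selection"])
    if not 1 <= index <= len(options):
        raise ValueError(f"selection {index} is out of range")
    return options[index - 1], data["reason"]
```

Putting "reason" before "selection" in the requested shape means the short justification is generated before the answer, which is presumably the point of the trick; whether that matches native reasoning quality is something to measure on your own prompts.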
grim_io 4 hours ago [-]
What you are doing is producing an unnecessary summary of the result, not the reasoning that models do to come up with the result.
I don't get what value you get out of this.
docheinestages 18 hours ago [-]
My experience with using low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and basically at the end I end up wasting the same amount of tokens or just switch to Opus on xhigh. It's been a terrible experience.
mwigdahl 18 hours ago [-]
Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.
user43928 17 hours ago [-]
Won't any input be charged uncached, and the output of the small model charged again as uncached input to the bigger model?
I don't know whether that comes out ahead compared to just staying with the better model in the first place.
mwigdahl 17 hours ago [-]
It's a good question, but for multiturn conversations even cached context adds up quickly. My experience has been that spawning off subagents for defined tasks in a large overall plan generally makes me come out ahead.
I'm sure folks' mileage will vary though.
noisy_boy 12 hours ago [-]
I asked this question and was told that even if it is counter intuitive, medium will be more cost efficient due to caching. Changed to medium, blew my budget and went back to low.
ianberdin 14 hours ago [-]
Anthropic outsmarted everyone again.
They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality the new tokenizer produces about 50% more tokens for the same text. As a result, the actual cost went up by 50% while they shifted everyone's attention to the price decrease.
Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.
Very shady marketing.
And of course they lie about 35% again. In reality with coding it is 50%.
UPD: I run playcode.io, so it’s my job to test all models, their pricing and quality in order to provide the best price/quality/speed/reliability to non-techy users.
mag7269 19 hours ago [-]
When can we get a new Haiku? 4.5 came out nearly a year ago, and it's showing its age.
scosman 18 hours ago [-]
Look at Qwen for that level of intelligence.
anthonypasq 18 hours ago [-]
needs to be on bedrock for me to use it at work
0xbadcafebee 17 hours ago [-]
Gemma 4, Kimi K2.5, MiniMax M2.5, gpt-oss, GLM 5, Qwen3 Coder Next, DeepSeek V3.2, Devstral 2, are all available on AWS Bedrock and all are about Haiku level
scosman 16 hours ago [-]
Kimi K2.5 >> Haiku. Gemma 4 32b might fit the bill.
henryl 9 hours ago [-]
yes but no caching.
ashvardanian 13 hours ago [-]
Got really excited for this model and asked my Opus planners in 3 pretty different projects to use Sonnets instead of Opus subagents to help me experiment on HPC kernels faster. Not one of them ended up writing a single line of code... Sonnets just kept spinning, wasting tokens. Can't remember the last time it happened with Opus in my codebases. Reverting back.
bearjaws 12 hours ago [-]
I've seen this happen before when they launch new models. When Opus 4.7 came out it was "working" for 20+ min before I just exited entirely and waited till next day.
Went away on its own.
brunooliv 16 hours ago [-]
I only wish for Opus 4.6 from earlier this year at a faster inference speed.
Since Opus 4.6 things have been so much messier and the overall push for more agency isn’t really panning out for agent assisted development as much as they would like
fractorial 13 hours ago [-]
I still use Opus 4.6 (with later models for subagents only sometimes), but I have been preparing for it to go away.
phtrivier 15 hours ago [-]
What is the reference, unbiased, honest, reputable and trustworthy site that ranks and compares models on the couple of realistic metrics that matter? ("Does it work for code", "no, I mean, for real", "how much does it cost", etc.)
kccqzy 14 hours ago [-]
It’s not really possible unless you try. Different people use models so differently. The whole model situation has made public minute differences in personal preferences in the process of coding. Some people think carefully and strive to write code that’s as bug free as humanly possible on the first try; others write something that is only approximately correct and then iterate afterwards. The former people would align with a model that thinks for 40 minutes before producing flawless code; the latter would be driven mad by this excessive thinking. Some people like to interrupt AI as soon as they see AI making a mistake, others let AI continue and tell them about the mistake afterwards.
girvo 14 hours ago [-]
Truthfully? There isn't one. They all have flaws. Your best bet is to look at all of them, and then run a suite of evals yourself. It's rough out here!
bel8 15 hours ago [-]
The only metric that worked for me is running the same prompt 5x for each LLM on my projects.
I keep specific branches in a state where they are ready to develop new features.
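A rough sketch of that "same prompt, several runs per model" comparison. Everything here is an assumption for illustration: `run_once` is a hypothetical callback you would write yourself (reset the branch, run the model on the prompt, return whether the result passed your checks), and the model names are whatever identifiers your setup uses.

```python
from statistics import mean
from typing import Callable, Dict, List


def compare_models(
    models: List[str],
    prompt: str,
    run_once: Callable[[str, str], bool],
    runs: int = 5,
) -> Dict[str, float]:
    """Run the same prompt `runs` times per model and return each pass rate."""
    results = {}
    for model in models:
        # Repeating the run smooths over non-deterministic outputs.
        outcomes = [run_once(model, prompt) for _ in range(runs)]
        results[model] = mean(1.0 if ok else 0.0 for ok in outcomes)
    return results
```

Five runs is a small sample, so close pass rates between two models should be read as a tie rather than a ranking.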
sreekanth850 3 hours ago [-]
After using codex i will never return to cc even if they offer it for free.
$5/$25 for Opus 4.8 vs $3/$15 doesn't seem cheaper enough to be worth it. It depends how much better it is than e.g. Mimo, but I imagine Mimo and co to be too cost efficient in the lower tier to be overtaken by Sonnet for most tasks.
make3 8 hours ago [-]
it's also a lot faster I would assume
gertlabs 5 hours ago [-]
In our coding evaluations, we found Sonnet 5 is more capable than Sonnet 4.6 (which was an underrated model itself), but is now faster and slightly cheaper.
Sonnet 5's performance is comparable to GLM 5.2 in both one-shot coding and agentic ability. However, it's about 20% less verbose than GLM 5.2 in average code submission sizes, and uses fewer reasoning tokens, which reduces the cost gap and suggests it writes cleaner code. In practice, Sonnet 5 ends up being 40% more expensive and ~2x faster than GLM 5.2 in our evaluations (not 300% more expensive as the per-token pricing would suggest). Granted, GLM 5.2 is an extremely reasoning-heavy model.
Overall, it's a solid release that gives Anthropic some standing in the price-conscious inference market.
Artificial analysis shows Sonnet 5 as ~2 times more verbose than GLM 5.2. I wouldn't call Sonnet 4.6 underrated, it's in "chinese open source model territory" and unless you rely only on subscriptions it has alternatives.
richardfey 3 hours ago [-]
I don't know what I am doing right, or wrong, but I have access to claude and codex and I find myself giving the more serious work to codex recently. I tend to trust it more.
I might try again Fable when it's back, but this Sonnet 5 didn't work well for my current projects.
jaggirs 2 hours ago [-]
Same (Opus 4.8 vs gpt 5.5)
I keep having to correct 4.8, but 5.5 more often than not is correcting me.
Opus writes a bit nicer though and it is easier to follow what it is doing/saying. Not too different an experience from talking to humans: 5.5 feels like a very smart 'nerd' that doesn't make a huge effort to communicate well, while Opus is a bit less intelligent but that makes its ideas easier to communicate
mchusma 18 hours ago [-]
This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
mchusma 18 hours ago [-]
Really if they wanted a standout model that would really take the wind out of GLM's sails, they should have made this the new Haiku, priced at Haiku levels with this performance.
alvis 18 hours ago [-]
Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks
solenoid0937 7 hours ago [-]
Duh? It's their cheapest model aside from Haiku.
mosbyllc 3 hours ago [-]
Claude is a great model for me, but unfortunately, its quota is often insufficient. It seems that many people are now considering Codex as an alternative. If the quota is sufficient, I believe many people will continue to use the Claude Code model.
hdjrudni 3 hours ago [-]
Codex is not better anymore. It appears they nerfed their quota a few weeks ago. I never used to hit my 5 hr limit, now I always do. Sometimes in like 2 prompts.
iLoveOncall 3 hours ago [-]
Neither Claude, nor Codex, nor Claude Code are models.
Claude is a series of models (Claude Sonnet X, Claude Opus X, etc.), Claude Code is their development CLI that uses their models, and Codex is the same as Claude Code but from OpenAI.
Ultimately the quota is linked to neither of those 3 directly, rather to which specific model you invoke.
tokengod 19 hours ago [-]
That’s nice, but we want Fable
giancarlostoro 18 hours ago [-]
The reality is that Fable will eventually be obsolete and Sonnet / Opus will surpass it. Fable did cost 2x as much as Opus, so I assume it involves a much higher cost for what it did, but I wouldn't be surprised if Fable will be obsoleted by Opus or even Sonnet sooner or later at less cost.
ianhawes 18 hours ago [-]
Okay I don’t care about “eventually”, I want Fable now.
arcatech 18 hours ago [-]
Have you considered getting better at coding so you can build stuff yourself instead of waiting for models you might not be able to get access to anymore?
giancarlostoro 16 hours ago [-]
I'd love to meet the devs who can spin up full feature web apps in under 15 minutes with all the bells and whistles I've gotten Claude to spin up and code. I don't think the AI haters understand the level of time cutting that you can achieve with a very simple and reasonably crafted prompt.
I'm talking back-end, with database models, classes, queries, accompanying front-end layouts, with real dynamic data, running. Stuff that takes days to weeks to spin up, with minimal errors or issues, having cut down on days or weeks of effort, you can focus on testing and making it all into better code.
arcatech 16 hours ago [-]
And the trade off for that productivity is relying on a completely untrustworthy company/product that gets more expensive and uncertain by the week while your skills erode.
giancarlostoro 4 hours ago [-]
You can't claim that you need to skill up and then that you can't also review code; this makes no sense to me. The pearl clutching against AI is something else.
halfmatthalfcat 14 hours ago [-]
Companies don't care about your skillz, they care about velocity and costs. If AI helps increase velocity and decrease cost by lowering total headcount, then it's a massive win. That factors in AI "unpredictability".
cesarvarela 17 hours ago [-]
This is like telling someone who wants a motorcycle that they should get better at running instead.
arcatech 17 hours ago [-]
When the motorcycle manufacturers keep making each new model worse and more expensive and the government keeps trying to ban them.
acuozzo 9 hours ago [-]
What if you're using it for Mathematics (e.g., making progress on unsolved problems) instead of writing software? Would you consider that a valid use-case?
solenoid0937 7 hours ago [-]
Who cares about what some random HN user thinks of as a "valid use case?" Engineers that are anti-AI in 2026 are simply NGMI. They have had plenty of opportunities to update their priors and seem incapable of it. No point in wasting your time with them.
astlouis44 18 hours ago [-]
Same
DonsDiscountGas 18 hours ago [-]
I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.
boc 17 hours ago [-]
Have you tried Opus on fast mode?
DonsDiscountGas 14 hours ago [-]
I haven't because I'm not made of money but maybe I will
chipgap98 19 hours ago [-]
Interesting that tasks on extra high cost almost the same as Opus 4.8 with a slightly worse performance
bredren 18 hours ago [-]
This is on the browsercomp graph, right?
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Whereas, Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
mchusma 18 hours ago [-]
I agree with this assessment, IMO my takeaway from this is "Generally run Sonnet on low, otherwise use Opus". It's kind of like an "extra low" setting of Opus. (depends on the application for sure).
bredren 18 hours ago [-]
It would be good if Anthropic provided some kind of feedback or even toggle to auto-route requests for models being used at thinking levels that would be a better value using a different model.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
siva7 7 hours ago [-]
They already do. Don't assume the routing will be in your favour
mcbuilder 18 hours ago [-]
LRMs are plateauing for sure, not that there won't be gains to be had in the future, but it's not like the era of rapid progress that was the past year any more.
gdhkgdhkvff 17 hours ago [-]
I agree that the rapid improvement from like 2023-24 era is over (from a perspective of going from a 3/10 to a 7/10, you can’t then go to a 11/10). There was just so much more space to grow back then.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDE’s have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+”cheaper for harness owner”
ZeroCool2u 6 hours ago [-]
Having used it quite a bit when it was out, it's not. It's certainly better, but in some ways it's worse. It's trained to be more "agentic" and even in cases where I wanted to talk things through first and I would explicitly tell it not to do something, it would take action on my behalf without checking first.
It's also still just prone to the kind of "stupid" mistakes we see from all LLM's. Like it can write great code, but it doesn't really have common sense without enormous guidance.
roughly 18 hours ago [-]
A great many people were predicting this would be the case a year ago and being told they were wrong and to get on the boat.
mcbuilder 18 hours ago [-]
I consider myself to be in that cohort as well. :)
Escapade5160 4 hours ago [-]
At that price you should just use glm-5.2. You get an Opus class model for 1/3 the cost.
827a 16 hours ago [-]
Tbh we'll see what using it looks like, but the reasoning/cost charts do not look promising. It seems like the only useful reasoning level for Sonnet 5 is Low; medium might trade blows at price/performance with Opus, but anything beyond that Opus is Just Better.
I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.
brokencode 16 hours ago [-]
Kind of crazy how bad this release actually is. I even dug around in the full system card, and every graph showed the same thing.
Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.
I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.
Maybe it works well on things that aren’t clear in the benchmarks.
siva7 7 hours ago [-]
Why would a company explain how limited their own major release is?
midtake 12 hours ago [-]
5 as in 5 times more likely to tell you that you can't edit your driver INF files because that enables DRM circumvention and is dangerous!
stavarotti 11 hours ago [-]
I’ll continue to use the last great reasonably affordable duo from Anthropic: Opus 4.6 for planning and Sonnet 4.6 for implementation.
johnhamlin 18 hours ago [-]
Kind of hilarious how much they’re touting that it sucks at cybersecurity like it’s a feature
Only if you have no problem with their extremely harmful political lobbying.
jbritton 12 hours ago [-]
I accidentally used Sonnet 5 a bit today. It seemed significantly worse to me than Opus 4.8 for software development.
taspeotis 14 hours ago [-]
> Claude Opus 4.7 and later Opus models, Claude Fable 5, Claude Mythos 5, Claude Mythos Preview, and Claude Sonnet 5 use a newer tokenizer that contributes to their improved performance on a wide range of tasks. This tokenizer produces approximately 30% more tokens for the same text. Claude Sonnet 4.6 and earlier models use the previous tokenizer.
MagicMoonlight 14 hours ago [-]
[dead]
boutell 14 hours ago [-]
Until now we've been using Sonnet 4 to power an editing agent in ApostropheCMS. Sonnet is a good price/quality/speed compromise, but sometimes when giving it a large set of instructions it would miss half of them. At least until we told it to go back and try again.
In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there was too much input.
If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.
This is, obviously, early days but I'm impressed.
epsteingpt 11 hours ago [-]
If only the agentic model supported the most popular agents like Hermes and OpenClaw...
andai 18 hours ago [-]
Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
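To make the "pareto frontier" point concrete, here is a small sketch that filters (cost, score) points down to the ones no other configuration beats on both axes. The `Run` type and the numbers in the usage note are made up for illustration and are not taken from Anthropic's graphs.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str    # e.g. a model plus reasoning level
    cost: float   # cost per task, lower is better
    score: float  # benchmark score, higher is better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep only runs that no cheaper-or-equal run matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, try the best score first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier
```

With illustrative points such as `Run("small-high", 4.0, 70)` and `Run("big-low", 3.0, 75)`, the small model at high effort falls off the frontier because the bigger model is both cheaper and better, which is the pattern the comment describes.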
copperx 18 hours ago [-]
And Claude Code penalizes you for using Sonnet on the subscription plan, so there's little reason to use it.
bredren 18 hours ago [-]
This is what I realized, can you provide more detail on how you've observed this? The /usage screen does not make it clear.
MillionOClock 18 hours ago [-]
Not the original commenter, but personally I noticed my quota usage didn’t feel like it was being spent at a much lower rate when using Sonnet even on a relatively low thinking budget and based on a few comments here it seems I might not be the only one. Has anyone else noticed this? Wasn’t it different in the past? I thought I would be getting to use Sonnet much much more than Opus but it did not feel that way despite being on 20x plan.
grim_io 4 hours ago [-]
This is exactly what people have been talking about in this thread.
Sonnet is dumber and more expensive than Opus.
The token efficiency improvements in Opus are missing in Sonnet. Sonnet generates more output tokens and more reasoning tokens.
Any price advantage per token disappears due to volume.
It doesn't make sense to use Sonnet if you have access to Opus.
gverrilla 18 hours ago [-]
How so?
cenobyte 17 hours ago [-]
Claude Sonnet 5 is built to be the most agentic Sonnet model yet.
or
The Dodge Charger is built to be the most Charger like car yet.
kingjimmy 18 hours ago [-]
interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.
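A back-of-the-envelope sketch of what a tokenizer multiplier does to the effective price of the same text. It assumes the $3 per million input tokens list price and the $2 launch price quoted elsewhere in this thread, and assumes Sonnet 4.6 had the same $3 list price; the function names are just for illustration.

```python
def effective_price(list_price: float, token_multiplier: float) -> float:
    """Price for the text that used to fit in one million old-tokenizer tokens."""
    return list_price * token_multiplier


def relative_change(old_price: float, new_price: float, token_multiplier: float) -> float:
    """Fractional cost change for the same text (0.35 means 35% more expensive)."""
    return effective_price(new_price, token_multiplier) / old_price - 1.0


# Assumed prices per million input tokens, taken from figures quoted in this thread.
OLD_PRICE = 3.0      # assumed Sonnet 4.6 list price
LIST_PRICE = 3.0     # Sonnet 5 list price as quoted here
LAUNCH_PRICE = 2.0   # temporary launch price as quoted here

worst_case_at_list = relative_change(OLD_PRICE, LIST_PRICE, 1.35)
worst_case_at_launch = relative_change(OLD_PRICE, LAUNCH_PRICE, 1.35)
```

Under these assumptions the upper end of the quoted 1.0–1.35× range makes the same text about 35% more expensive at list price and about 10% cheaper at the launch price. This ignores any change in how many output or reasoning tokens the model chooses to generate, which other comments here suggest matters at least as much.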
winstonp 17 hours ago [-]
same happened to Opus 4.7
theplumber 18 hours ago [-]
Is there any reason to use Sonnet instead of GLM?
hootz 17 hours ago [-]
Your US company banning usage of non-american models. Other than that, no.
jedisct1 17 hours ago [-]
This.
atemerev 18 hours ago [-]
Speed. But mostly no.
grim_io 4 hours ago [-]
The exact opposite, actually.
Sonnet is slower due to much higher output and reasoning token generation.
baalimago 18 hours ago [-]
Not looking great for an upcoming IPO
mrcwinn 17 hours ago [-]
You’re right, it’s looking stellar. Well beyond great. Real, and unprecedented, revenue growth will do that for a company.
CuriouslyC 17 hours ago [-]
"Real and unprecedented revenue growth"
Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.
tripleee 18 hours ago [-]
interesting how much worse the sentiment around Anthropic is getting
mwigdahl 17 hours ago [-]
Seems like a combination of multiple factors:
"They took my shit away!" -- 3-day Fable 5 addicts (me)
"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types
"Great to see a closed source company fail!" -- open source boosters
"Great to see an American company fail!" -- anti-US, and/or pro-China folks
"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types
"Serves you right for ripping off creators!" -- copyright warriors
"Quit killing the planet!" -- anti-datacenter advocates
thepasch 16 hours ago [-]
I'm personally in the "they keep releasing shameless lobbying papers disguised as thinly veiled research or essay-coded content, push anticompetitive walled-garden practices, show little else but contempt for their non-enterprise customer base, refuse to communicate about anything and choose public silence as their baseline, seemingly force their employees into vows of public silence as well, actively degrade their products across the board with their vibeslop approach with measurable impacts on customers, openly attack not only open weights models but open source software, and all while pretending they're the 'public benefit corporation' formed by a valiant group of heroes escaping from a duplicitous snake and who, even in light of their own massively duplicitous behavior as of late, should apparently be trusted to be some sort of arbiter over what this tech should get to be and how it should get to be used while they could hardly be more gleeful about how we're all going to be replaced in 6 months from now perpetually" camp.
Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.
tripleee 17 hours ago [-]
It seems to be more them losing goodwill combined with their marketing.
I don't agree with your framing that all negativity is from crazies
mwigdahl 17 hours ago [-]
I don't think all the negativity is from crazies, but big chunks of it are certainly motivated. I certainly left out numerous other categories.
feralcoder 15 hours ago [-]
The number of anti-Anthropic and anti-Dario posts I've seen on reddit threads has gotten a bit ridiculous.
It feels like your analysis is mostly spot on, it's the confluence of several motivated parties pouring effort into social media.
Many of the posters are pro-foreign models/pro-open source, and most can't distinguish the difference between "open source" and open weight models like Qwen, Minimax, or GLM.
Reminds me of the old "free as in beer" vs "free as in speech" debate. Free beer means you don't pay, but you don't get to see the recipe or change it. Free speech means you get the actual source and the right to study it, modify it, and redistribute it.
Open weight models are basically the beer version. You can download the weights, run them locally, fine-tune them, quantize them, host them on your own boxes — but what you have is a finished product, not the blueprint for how it was built.
tripleee 14 hours ago [-]
Fable as released was censored to the point of being useless for many tasks. Now surprise surprise it's not even available unless you're pre-approved.
Qwen is also censored - although since it's open weight, there are completely uncensored versions available.
The owners of Qwen can't jack up the prices to something I'm unable to pay. They can't take it away.
The owners of Qwen can't log and train on my data.
Open weight models share far more in common with free speech than free beer.
If big daddy Dario and his company are getting pushback it's not because of some motivated group trying to take them down. They brought it on themselves.
SubiculumCode 8 hours ago [-]
Well open ai sure spends a lot on marketing.
dimgl 7 hours ago [-]
Yeah you're overthinking it. Their product releases and their general approach to business are harming their business.
0xbadcafebee 17 hours ago [-]
"OpenAI models are better, cheaper, and more reliable" - rational people
noumenon1111 15 hours ago [-]
Most of these are good points though with the right framing.
docheinestages 18 hours ago [-]
But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.
a_c 17 hours ago [-]
"Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens.2"
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
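For anyone who wants to check the "roughly cost-neutral" claim, here is a rough sketch of the arithmetic using only the numbers quoted above. It assumes Sonnet 4.6 is billed at the standard $3/$15 rate (the quote implies this but does not state it), and it only models the same text mapping to more tokens, not any change in how many tokens the model chooses to generate. The function and variable names are made up for illustration.

```python
# Prices in USD per million tokens, as quoted in the announcement above.
INTRO_PRICE = {"input": 2.0, "output": 10.0}
STANDARD_PRICE = {"input": 3.0, "output": 15.0}

# Assumption (not stated in the quote): Sonnet 4.6 is billed at the standard rate.
SONNET_46_PRICE = STANDARD_PRICE

# Tokenizer range quoted above: the same text maps to 1.0x-1.35x as many tokens.
TOKEN_MULTIPLIERS = (1.0, 1.35)


def relative_cost(new_price: float, old_price: float, token_multiplier: float) -> float:
    """Cost of the same text on the new model, relative to the old model."""
    return (new_price * token_multiplier) / old_price


def cost_range(new_prices: dict, old_prices: dict, kind: str) -> tuple:
    """Relative cost at the low and high end of the tokenizer range."""
    return tuple(
        round(relative_cost(new_prices[kind], old_prices[kind], m), 3)
        for m in TOKEN_MULTIPLIERS
    )


intro_range = cost_range(INTRO_PRICE, SONNET_46_PRICE, "input")
standard_range = cost_range(STANDARD_PRICE, SONNET_46_PRICE, "input")

print("intro pricing, relative to Sonnet 4.6:", intro_range)
print("standard pricing, relative to Sonnet 4.6:", standard_range)
```

Under those assumptions the introductory price works out to somewhere between about two thirds and nine tenths of the old cost for the same text, and the standard price to between the same cost and 35% more. The output ratios are identical because $10/$15 is the same fraction as $2/$3.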
alvis 18 hours ago [-]
What I'm starting to hate is that each model's effort level can mean completely different power.
Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/
nsingh2 18 hours ago [-]
That seems to only be true for the "Agentic Search" benchmark. That benchmark in particular is a bit weird, because Sonnet 4.6 effort levels had a relatively small effect, so Sonnet 5 med is basically comparable to all effort levels of Sonnet 4.6.
benjiro29 18 hours ago [-]
Anybody notice that they did not include Sonnet 5 Max in the "Agentic Search results", when comparing to Opus 4.8 ...
Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...
In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.
Why even release this model?
ricardobeat 18 hours ago [-]
Because it’s a massive improvement over the previous model, and cheaper?
You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.
benjiro29 18 hours ago [-]
Am I missing something? Because you're making my point. It's only worth it compared to Opus 4.8 if the tasks you're running require Opus 4.8 low (or a non-existent lower level).
For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...
ricardobeat 17 hours ago [-]
The point is that Sonnet at medium or even low will be smart enough for most daily tasks. You’re defining “worth using” as if you always need the highest performance possible, which is what these benchmarks measure, but most work doesn’t need it. You’ll pay more to get the same result. Sonnet 4.5 is very popular as a main model currently, this is a free upgrade.
I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.
bredren 18 hours ago [-]
I'd narrow that to why even allow the harness to run `high` on this model?
crorella 12 hours ago [-]
Fun/interesting to see how opensource models surpassed Anthropic's
m3h 18 hours ago [-]
Why is Claude Sonnet 5 allowed to be released but OpenAI Terra not? Are they not the same class of models?
ThouYS 15 hours ago [-]
Why did this get the coveted "5"? I want an Opus that can compete with GPT 5.5
swe_dima 17 hours ago [-]
Not sure what niche it's going to occupy: too expensive for its intelligence category.
Cu3PO42 18 hours ago [-]
Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.
Unfortunately that means I won't be using it at work for now.
mellosty 18 hours ago [-]
Sonnet seems to be really expensive
mrcwinn 18 hours ago [-]
Have you followed Anthropic at all?
rw2 18 hours ago [-]
The "cheaper models" from the big AI companies are next to useless, as they don't even score as well as the open/super cheap Chinese models. Only the frontier big models like Fable and Opus have value.
addozhang 13 hours ago [-]
In the 4.x era, I prefer Sonnet to Opus. The quality of Sonnet generation is good enough for me, but it's much faster than Opus.
frobisher 6 hours ago [-]
Costs are very opaque from within the product...
mellosty 18 hours ago [-]
It does not pass the "I want to wash my car, should I drive or walk" test
cheesecompiler 18 hours ago [-]
did for me even on low non thinking effort
gverrilla 18 hours ago [-]
GIGO, as they say.
SoKamil 18 hours ago [-]
I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.
edude03 15 hours ago [-]
Let’s see how long until opus 5 comes out but to me this lends some credence to the rumour that fable/mythos was supposed to be opus 5
beernet 18 hours ago [-]
Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.
CuriouslyC 17 hours ago [-]
Wdym? They've been knocking it out of the park on marketing, but Claude Code is still a meme, and Opus is getting trashed by GPT5.5 meanwhile you can't even use their "dominant" model, and anecdotal reports from when people could use Fable, when they weren't getting silently poisoned, was that it was only marginally better than GPT 5.5 in terms of SWE smarts, mostly being better in terms of pleasantness to interact with and design taste.
beernet 17 hours ago [-]
> Claude Code is still a meme
Claude Code generates more revenue than OpenAI...It appears to be a nice meme.
CuriouslyC 17 hours ago [-]
Like I said, Anthropic's marketing is killing it, they've got people freely(?) shilling for them on public forums so even if they have shit developer relations and community relations and a model that's mostly worse while being more expensive, they can ride a wave of misinformation.
beernet 5 hours ago [-]
> they have shit developer relations
Not true
> model that's mostly worse while being more expensive
Not true
> they can ride a wave of misinformation.
Not true
CuriouslyC 23 minutes ago [-]
Look at the way that Anthropic has legally threatened people who do stuff they don't like around Claude Code and their subs, and compare that to how OpenAI has acted. Look at how mixed up and unstable their communication on policies is relative to OpenAI. Don't take my word for it, Theo/Primeagen have a whole back catalog outlining how shitty Anthropic is.
Look at the cost per intelligence of Opus vs GPT 5.5.
Anthropic is the Taylor Swift of frontier labs... Not bad, but massively, MASSIVELY stan'd for inexplicable reasons, in violation with reality.
scottfits 18 hours ago [-]
> the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6
cool to see, still waiting for models to get better at computer use.
arendtio 18 hours ago [-]
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
It seems being incompetent is a feature now...
primaprashant 18 hours ago [-]
Based on both performance vs price charts, it seems using Opus 4.8 with med effort is almost a better choice than using Sonnet 5 at xhigh effort
jerrygoyal 18 hours ago [-]
It's actually a huge update for building products, given most tasks are sub-agent driven where Sonnet is used, steered by Opus.
OsrsNeedsf2P 17 hours ago [-]
Great timing. I just started using Claude Sonnet for a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens plus a model that is smart enough when paired with hard verification make it a perfect combo for the task
idk, i think they just tried to compensate for the ban of fable, nothing too good
docproof 18 hours ago [-]
The jump in reasoning quality is noticeable. What's interesting is how it handles ambiguous instructions now — it seems to ask fewer clarifying questions and just makes a reasonable judgment call. That's a double-edged sword depending on your use case.
oybng 15 hours ago [-]
In my case, 4.6 degraded massively over time. 5 fails the same basic tasks that I gave 4.6 yesterday. And quite frankly this low, med, high, extra, max, turbo, ultra, ludicrous nonsense is getting tiresome
jchw 18 hours ago [-]
American AI company status: We are now bragging about how bad our models are unironically.
Okay.
nickosh 11 hours ago [-]
It looks good. Now waiting for Opus 5.
neonstatic 5 hours ago [-]
I appreciate they added thinking. Sonnet used to think in the actual response, leading to a lot of unnecessary burden for me. "This thing is X, no wait, it's actually Y. Therefore..." - now it's hidden in the thinking trail, so I don't have to read it unless I want to.
Scroll_Swe 18 hours ago [-]
I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.
smallerfish 18 hours ago [-]
Ah that's why Opus has been so slow for the last couple of days.
joaohaas 17 hours ago [-]
Important to note that the cost graphs are heavily distorted. The agentic search one for example is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.
And yet, the $2-$5 section is the widest, even though it only contains a single point.
I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
Foobar8568 5 hours ago [-]
And Anthropic put that shit model as default, after a single prompt I was wondering what was the shit it was spouting, and yes, Sonnet 5.
whh 16 hours ago [-]
It's not Fable, but I'll take it.
tensegrist 19 hours ago [-]
there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think
Who cares about Sonnet? I want to know about Fable. Are the export restrictions really going to be permanent?
stingraycharles 14 hours ago [-]
It’s supposed to happen when Anthropic introduces identification, which I believe is planned for mid-July.
matheusmoreira 14 hours ago [-]
Not a US citizen. Identity verification is not going to help me.
taytus 14 hours ago [-]
Roughly on par with GLM 5.2 at 5x the price
prmph 17 hours ago [-]
So many things to think about regarding these "benchmarks":
- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
ai_fry_ur_brain 16 hours ago [-]
Finally a model release where everyone is realising the scam. The world is healing (maybe).
micromacrofoot 18 hours ago [-]
So they repackaged Fable and added "don't scare the government" to the prompt
actionfromafar 16 hours ago [-]
This is downvoted, but how can it not be a little true?
docheinestages 18 hours ago [-]
Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?
mrcwinn 18 hours ago [-]
I exclusively use 5.5-xhigh-fast within Codex and find it superior to Opus 4.8.
syngrog66 10 hours ago [-]
I'd rather upgrade myself to a more effective version, thanks. in part because I have a monopoly in the market on providing Me
PeterStuer 17 hours ago [-]
Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?
oezen 8 hours ago [-]
opus is better
ClaudioCronin 7 hours ago [-]
nice!
_pdp_ 17 hours ago [-]
Too expensive?
impodimium 9 hours ago [-]
Eh still looks like it is weaker than Opus 4.8 but maybe a good replacement for Sonnet 4.x
guelo 16 hours ago [-]
Have they ever said what the difference is between Sonnet and Opus? Are they trained differently? Different architectures? Is Sonnet a distillation? Is it just that Sonnet has less resources for inference?
None of the other labs are doing this kind of long lived two model series.
jsnell 16 hours ago [-]
Gemini has had Pro and Flash since May 2024, across three major version numbers. The Opus and Sonnet naming is only two months older than that.
artursapek 16 hours ago [-]
I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench
gverrilla 18 hours ago [-]
Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.
ekjhgkejhgk 18 hours ago [-]
In effective terms they're lowering prices.
moomin 18 hours ago [-]
I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.
andrewchambers 17 hours ago [-]
The whole fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.
varispeed 17 hours ago [-]
What is the point if it is one Trump's brain fart away from being blocked?
botfriendsarent 15 hours ago [-]
Sonnet 5 OUCH! every model is just loaded with more hurt, stolen content, BS prompts, more scare tactics, more illusions, more government lobbying, less honesty.
Oh Claude you master of software engineering does it ever end?
DO you have no bounds?
How may we further assist you oh Claude?
mesmertech 18 hours ago [-]
Ok thats a one month clock to the next Opus model at least, so thats a silver lining to a meh model.
m3kw9 15 hours ago [-]
should have called it 4.9, it doesn't deserve the 5 moniker
stackedinserter 18 hours ago [-]
"Our new model is proudly dumber now!"
mwigdahl 18 hours ago [-]
What? If you're comparing their models in the same size class, Sonnet 5 is Pareto-optimal over Sonnet 4.6.
zamadatix 18 hours ago [-]
I think they mean per dollar in the perf/$ charts, not per marketing class. I.e. the new model is a complete Pareto failure in said perf/$ charts with the sole exception of Sonnet 5 low, which is dumb enough to not have a comparison at all. Opus 4.8 delivers a better outcome per dollar, regardless of what the underlying size of the models is.
I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".
mwigdahl 17 hours ago [-]
For agentic computer use Sonnet 5 low performs better than Sonnet 4.6 medium at just under half the cost, and better than Opus 4.8 low at 25% off. Their success rates are not that far off.
Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).
Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.
Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.
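Since "dominates" and "Pareto" are doing a lot of work in this subthread, here is a minimal sketch of what the terms mean on a cost-vs-score chart. The points below are invented placeholders with made-up names, not values read off Anthropic's charts; they are only shaped to show how a cheaper configuration can stay on the frontier at the low end while a pricier one from the same family is dominated.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # dollars per task, lower is better
    score: float  # success rate, higher is better


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.score >= b.score
    strictly_better = a.cost < b.cost or a.score > b.score
    return no_worse and strictly_better


def pareto_front(points: list) -> list:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data for illustration only.
points = [
    Point("old-medium", 1.00, 0.60),
    Point("new-low", 0.50, 0.65),
    Point("big-low", 0.70, 0.70),
    Point("new-high", 2.00, 0.72),
    Point("big-medium", 1.50, 0.80),
]

front = [p.name for p in pareto_front(points)]
print(front)  # ['new-low', 'big-low', 'big-medium']
```

With these placeholder numbers, "new-low" survives because nothing is both cheaper and better, while "new-high" is dominated by "big-medium". Whether the real charts look like this is exactly what the two comments above are arguing about.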
Madmallard 10 hours ago [-]
Claude thread top of HN
loads of trust me bro benchmarks
financially incentivized comments and upvote/downvoting patterns
it's all slop
Getchowned 18 hours ago [-]
Fable soon please.
kvetching 15 hours ago [-]
GLM 5.2 is better and cheaper. Maybe they are trying to embarrass Trump by making it look like we are losing to China.
jongjinchoi 4 hours ago [-]
I think so. GLM 5.2 is more reasonable.
kvetching 13 hours ago [-]
And it worked. BREAKING: The export controls on Claude Fable 5 are expected to be lifted tonight, per Politico!
lucynight 18 hours ago [-]
AMAZING
I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
YES! They introduced the new tokenizer to increase token generation by up to 33%.
On top of this, Anthropic are generating almost twice as much revenue per paid user as OpenAI - whilst their subscriptions have lower usage limits than OpenAI's:
https://youtu.be/gK-7TKC7kvY?si=kx0qPE1rw-UCI-Jn&t=650
I don't think so. I'd expect that in a market with high vendor lock-in, but that's not the case here. The market is extremely competitive and switching costs are near zero. Anthropic can't afford to pull shit like this and sacrifice quality.
Yeah, those are my thoughts as well. I feel it’s great for benchmarks and some tasks, while in others it tries to spend as many tokens as possible, overcomplicates the task, and needs a second or third round of steering, which costs money. At the scale Anthropic operates, I bet it’s a huge amount of extra money just to make sure their model works.
Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes onto plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.
Should I refer to those who are only realising this now as stupid? I believe so.
It's not wealth extraction btw - the correct economic term is capturing/extracting surplus. They have a wide range of schemes - quality discrimination being one very obvious one.
Swear most of you on here pretend to be soooo smart when you def are not.
[0] https://www.anthropic.com/claude-sonnet-5-system-card
You have to test each task obviously but it is not a bad model on its face.
https://www.reddit.com/r/ClaudeAI/comments/1ukgqwr/looks_lik...
The explanation Anthropic gave for the update doesn't address how the x-axis needed to range up to $50 previously and only $10 now. In any case the pass rates are also lower.
Probably the difference between whatever it is people notice when they say models become "nerfed".
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
The situation was similar with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results were different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
You'd need to produce this like 20 times by each model and then do 2x20x20 cross comparisons by both models and ultimately distill the 2x20x20 comparison results into two reports of how they differ.
In this non deterministic computing future, everything else is voodoo, feelings and "vibes".
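To put a number on the "2x20x20" protocol described above, here is a small sketch that just enumerates the comparisons: every run of one model paired with every run of the other, each pair judged once by each model. The model names are hypothetical placeholders and no actual model calls are made; this only shows how quickly the evaluation count grows.

```python
from itertools import product

RUNS_PER_MODEL = 20
MODELS = ("model_a", "model_b")  # hypothetical placeholder names


def comparison_plan(models=MODELS, runs=RUNS_PER_MODEL):
    """Enumerate (judge, run of first model, run of second model) triples."""
    first, second = models
    pairs = product(range(runs), range(runs))
    return [
        (judge, (first, i), (second, j))
        for (i, j) in pairs
        for judge in models
    ]


plan = comparison_plan()
print(len(plan))  # 800 judged comparisons, on top of the 40 generations
```

So one prompt evaluated this way costs 40 generations plus 800 judging calls, before any distillation into the two final reports.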
"Wow, X models is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigamarole? I do not know what that would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
I read a comment earlier that said "I think it's likely that they've scraped all the code regardless of license and trained on it, given how much they scrape the web."
That's what every other comment said like 3 years ago. Where has this guy been?
The trends in discussion about LLMs get very, very tired--there's little added but personal opinions.
20 minutes after the announcement there's no real useful statement that can be made about it.
I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that
Or the "I have my own personal benchmark..."
"Claude and its political bias thinks the supreme court should..."
I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me positive, I have found that the more models are optimized for fully agentic development, the worse they get at assisted development and often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K...
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
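The back-of-envelope estimate above can be reproduced in a few lines. This sketch takes the 35-50 trillion range and the 5% and ~30% boost figures straight from the comment; the currency unit follows the linked source and is not restated here, and the names are made up for illustration.

```python
# Annual value attributed to knowledge workers, in trillions (range from the comment).
KNOWLEDGE_WORK_VALUE_TRILLIONS = (35.0, 50.0)

# Productivity boosts mentioned in the comment: a conservative 5% and the ~30% from studies.
BOOSTS = {"conservative": 0.05, "studies": 0.30}


def added_value(base_range, boost):
    """Extra value per year, in trillions, at the low and high end of the range."""
    return tuple(round(base * boost, 2) for base in base_range)


estimates = {label: added_value(KNOWLEDGE_WORK_VALUE_TRILLIONS, b) for label, b in BOOSTS.items()}

for label, (low, high) in estimates.items():
    print(f"{label}: {low}-{high} trillion per year")
```

Worked through, 5% of 35-50 trillion is 1.75-2.5 trillion, so the "~1.5" lower bound in the comment is rounded down slightly; the 30% case would be 10.5-15 trillion.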
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
However, these frontier labs are also making moves that could let them capture a disproportionate share of the upside. One possibility is a situation analogous to the smartphone manufacturing space, where there are dozens of players but just a handful (e.g. Apple, Samsung in smartphones) capture the lion's share of the revenue.
Samsung the same. And is the best android device.
If a Nokia OS came out tomorrow, it would be dead in the water: it has no apps.
But with a new llm that doesn’t matter. There is nothing sticky about typing Gemini, Claude or codex in a cli.
The AI labs are also making moves to secure long-term enterprise presence, such as their Forward Deployed Engineer strategy. I think that is a trojan horse play that could make enterprises dependent on them forever, much like so many companies are still dependent on IBM's mainframes. As an extreme example, you could imagine a company's core business logic encoded in the weights of a proprietary model custom-trained and hosted by one of these model providers, something even more inscrutable and sticky than ancient COBOL codebases.
The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approaches it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi.
They'll show up after the fact and whinge endlessly about how they should have been involved.
Or maybe every cultural group has its own set of whiners and we always think the ones we disagree with are the loudest.
The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings.
I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do.
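A rough way to sanity-check those numbers, as a minimal sketch: treat the role as an Amdahl's-law split, where only the "core" share of the job gets faster. The function name and the split are my own assumptions for illustration, not something from the studies.

```python
def overall_gain(core_share: float, core_speedup: float) -> float:
    """Overall productivity gain when only `core_share` of the job
    gets `core_speedup` faster (0.10 means 10% faster)."""
    # Normalise the old total time to 1.0; only the core part shrinks.
    new_time = (1.0 - core_share) + core_share / (1.0 + core_speedup)
    return 1.0 / new_time - 1.0


# Coding is ~30% of the job and gets ~10% faster.
print(f"{overall_gain(0.30, 0.10):.1%}")  # 2.8%
# Optimistic end: core work is 50% of the job and gets 30% faster.
print(f"{overall_gain(0.50, 0.30):.1%}")  # 13.0%
```

So under this simple model, a 10-30% boost on 30-50% of the job lands somewhere between roughly 3% and 13% overall.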
> that is ~1.5 - 2.5T in value annually
That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best.
It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further.
The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash.
> It's also before cutthroat pricing really kicks in.
Right, that's more of an estimate on the value proposition of the overall AI industry, rather than valuations of the industry or specific players. While I don't think OpenAI and Anthropic will capture all of the potential upside, I do suspect they will do much better than other players despite the competition (https://news.ycombinator.com/item?id=48740472)
> And this is before we consider that they need to do it for cheaper or why would anyone bother.
Typically yes, but there are reasons companies may be willing to pay the same amount or even more, such as "AI doesn't need sleep, holidays, insurance, or benefits" and "AI is easier to procure and replace than humans."
> The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed...
Curious to see which studies you're looking at. The studies I'm thinking of (some here: https://news.ycombinator.com/item?id=45379452) are from 2024-2025, so already old and from before agents really took off.
However, your point about meetings and agreements and documenting is much more germane. My theory is that the largest productivity gains -- and subsequent labor displacement -- will come from reducing coordination overhead: https://news.ycombinator.com/item?id=48040999
Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their API usage and I'm skeptical that these tools are making me that much more productive.
Personally, for assisted development, I haven't seen much progress in a while.
Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. GLM and Kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000.
Each box has a TDP of 250W. 8 x 250W / 120V ≈ 16.7A, or one household circuit in the US, so no new power infrastructure is needed.
$1400 x 8 + $1000 = $12,200. Assuming standard five-year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally.
So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, if customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshittification, sabotage, ads, product placement, etc, to get the slightly “better” model.
Also, note that my numbers assume Moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then.
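For anyone who wants to poke at the assumptions, here is the same back-of-envelope math as a small script. All inputs are the figures quoted in this comment (box price, TDP, worker count, margin); the variable names are just mine.

```python
# Inputs, all taken from the estimate above.
BOX_PRICE_USD = 1400          # one 128 GiB iGPU box
BOX_MEMORY_GIB = 128
BOXES = 8
SWITCH_PRICE_USD = 1000
BOX_TDP_W = 250
MAINS_VOLTAGE = 120           # US household circuit
DEPRECIATION_YEARS = 5
KNOWLEDGE_WORKERS = 1_000_000_000
HW_NET_MARGIN = 0.043         # average net margin on computer hardware

cluster_cost = BOX_PRICE_USD * BOXES + SWITCH_PRICE_USD   # 12,200 USD
cluster_memory = BOX_MEMORY_GIB * BOXES                   # 1024 GiB
current_amps = BOX_TDP_W * BOXES / MAINS_VOLTAGE          # about 16.7 A
cost_per_year = cluster_cost / DEPRECIATION_YEARS         # 2,440 USD
annual_revenue = cost_per_year * KNOWLEDGE_WORKERS        # about 2.44e12 USD
net_income = annual_revenue * HW_NET_MARGIN               # about 1.05e11 USD

print(f"cluster: ${cluster_cost:,} for {cluster_memory} GiB at {current_amps:.1f} A")
print(f"per worker per year: ${cost_per_year:,.0f}")
print(f"revenue: ${annual_revenue / 1e12:.2f}T, net income: ${net_income / 1e9:.0f}B")
```

Changing `BOXES` or `DEPRECIATION_YEARS` shows how sensitive the $2.4T figure is to those two guesses.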
I do think open weight and other competitor models, especially with better harnesses, will play a significant role in the equation and will result in less concentration in the market. However, I do also think the big AI companies will capture a lot of that value. Partially for the same reasons that the cloud industry has been growing like gangbusters, even pre-AI, despite on-prem being much cheaper: companies will outsource anything that is not deemed a "core competency" for their business.
A lot of the problems you mentioned will be relegated to the consumer market and won't apply to enterprise contracts -- which is where the real money is.
Pls stop posting you are creating noise.
I think this sort of thinking is a trap, because it presumes that all software has the same constraints.
There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing".
I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine.
They won't justify trillions in investment, but they are useful.
Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage.
I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes.
As you said, building a script that only you use personally, or a very simple thing that accomplishes one task and is easy to test, requires almost no engineering, and an LLM can often build those with very few downsides.
That's a key point. Keeping knowledge and know-how inside the company is strategic. For most people, GPS did not result in a better sense of direction, spellchecking did not help them write without mistakes, and delegating translation to DeepL does not help them get better at a foreign language. I don't see the gain for an individual, a company, or a society if a technology reduces the ability to think, do stuff, understand complex problems, and work hard at something. Hiring juniors also matters: what is boring for a senior dev is useful for a junior, like "wax on, wax off" in The Karate Kid. Then when the senior dev retires, the junior is not junior anymore and the know-how is still here. I want to transfer my knowledge to a junior, not to Anthropic or Google or OpenAI.
Ideally, working hand in hand with an AI could be like driving a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle auto-pilot by voice sounds a bit stupid and boring. Insane use of energy rarely comes into the equation, which is a bit weird. Personally it is why I am never tempted to use AI. However I see value in AI for finding weaknesses in code (the inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, and asking for a fresh view on a very difficult problem. I'd love to hear about the equivalent of move 37 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know-how: what keeps us sharp.
Final word: I once asked Claude for a review; the code involved a DB transaction. Nothing complicated, and Claude said everything was fine. However, the transaction isolation level was not set (I did it on purpose, as if I did not know about isolation levels). It did not ask me if it was my intention to keep the default level. I would have preferred challenging feedback: why did you choose the default isolation level? Is it on purpose? Do you know that the default depends on the DB? Do you know about isolation? Tell me about the business use case and I'll explain which one would be the best.
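To make the isolation point concrete, here is a minimal sketch of the kind of check I would have liked the reviewer to do. The default levels in the table are the documented defaults of those engines; the function names and the review wording are hypothetical, and where exactly the `SET TRANSACTION` statement has to go differs per engine, so treat it as an illustration rather than drop-in code.

```python
from typing import Optional

# The four isolation levels defined by the SQL standard.
STANDARD_LEVELS = (
    "READ UNCOMMITTED",
    "READ COMMITTED",
    "REPEATABLE READ",
    "SERIALIZABLE",
)

# Default level per engine when the code sets nothing explicitly.
DEFAULT_LEVEL = {
    "postgresql": "READ COMMITTED",
    "mysql-innodb": "REPEATABLE READ",
    "sqlserver": "READ COMMITTED",
    "oracle": "READ COMMITTED",
}


def isolation_statement(level: str) -> str:
    """Build the standard SQL statement that makes the level explicit."""
    level = level.upper()
    if level not in STANDARD_LEVELS:
        raise ValueError(f"unknown isolation level: {level}")
    return f"SET TRANSACTION ISOLATION LEVEL {level}"


def review_note(db: str, chosen: Optional[str]) -> str:
    """Hypothetical reviewer feedback for a transaction."""
    if chosen is None:
        default = DEFAULT_LEVEL.get(db, "engine-specific")
        return (
            f"No isolation level set; {db} falls back to its default "
            f"({default}). Is that intentional?"
        )
    return f"Explicit level: {isolation_statement(chosen)}"


print(review_note("mysql-innodb", None))
print(review_note("postgresql", "serializable"))
```

The same unannotated transaction runs at REPEATABLE READ on MySQL/InnoDB and READ COMMITTED on PostgreSQL, which is exactly the question I was hoping to be asked.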
Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it.
If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found.
If even that failed, which has not happened yet, I could still do some old-fashioned digging and learning, like I always have.
This is for native mobile app development, and the code base is around 100k LOC.
Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad.
FCFF = EBIT(1-t)-Reinvestment
I don't care about your gross profit - this kind of cash profit determines the value of operating assets.
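As a quick illustration of why the distinction matters, here is the formula above as a tiny function. The numbers are made up for the example and are not anyone's actual financials.

```python
def fcff(ebit: float, tax_rate: float, reinvestment: float) -> float:
    """Free cash flow to the firm: FCFF = EBIT(1 - t) - Reinvestment."""
    return ebit * (1.0 - tax_rate) - reinvestment


# Hypothetical company: operating profit of 100, 25% tax rate,
# but 120 ploughed back into capex and working capital.
print(fcff(ebit=100.0, tax_rate=0.25, reinvestment=120.0))  # -45.0
```

A business can look profitable at the operating line and still burn cash once reinvestment is counted.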
Whether he's lying is another question, but seems unlikely.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
Chinese models are quickly commodifying frontier inference, and the US government is keeping domestic SOTA models away from the public. Without those models, why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that.
So I dunno! Could go either way!
But is your impression that this is the strategy of people like Amodei? My impression is that it isn't, that they are actually true believers, and not just trying to hit the timing right and flip it.
What insiders are you talking about? They're going to be hot on the possibilities so they can exit to a massive windfall. I don't know why they would want to be publicly critical of technologies that could make them millions at IPO.
My point is that actually it would be worse for these people if the valuations are only high during this period - which will last a while longer from now! - where their equity is not liquid, but crashes as the market figures out this commoditization thing.
But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions.
https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest...
I think there is. Pair today doesn’t mean they’re locked into that forever.
go ahead m8 we are all waiting... the stage is yours. lets see your model.
Yup. I think we agree. These valuations aren’t made or unmade by whether their tools are being used as vibe agents or pair programmers.
Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities.
Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on).
And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens).
Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said).
So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk.
I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field?
That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage are quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house.
<banned>
Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model?
Today's news that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course.
I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.
For the non-bleeding edge they have a lot of competition with more competitors showing up every day.
The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well.
However the result is exactly the same, concentration of power.
Also about "creating" means of production, these companies actively try to sabotage this as another profit maximization strategy. They buy all the RAM, so others cannot compete. They buy startups that succeed, so they stop competing.
It's not aspersions, it's just describing the phenomenon.
Even if I take your suggestion to heart, once my company was big enough, if I wanted to optimize for profit, I would have to do the same as these companies.
The end result is concentrated power.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
I moved completely to local models that I run on my M1 Mac Studio (64GB RAM) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to OpenRouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
I currently don't see a world where it makes sense to run a local model that eats up 60% of my RAM and 20-30% of my disk space while providing worse quality output than a $20/month subscription.
https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4...
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.
IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.
I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then
OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks).
Imo MiniMax and MiMo are a lot more reliable (and cheap)
Not opus level, but close enough and cheap enough to get the job done
If this was the last model I could ever use I think I would be happy.
I give AI an image and just tell it what's wrong, and then it goes on to fix the bug in the codebase for me (and write the tests). Is this agent-assist or agent-driven?
Sometimes I just give the AI my description and a mockup, and it creates a plan and implements the details for me, and I verify visually (this is the weak spot of AI). Is this agent-assist or agent-driven?
the incentives aren't there sadly
There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.
People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)
I can't help but feel this is intentional towards the 'Agentic' workflow.
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day.
If you set off a classifier, that's how it looks to Claude.
IMO, they were quite good with checklists even a year ago, and tried to tick off each one.
Fable was amazing as a vibecoder, but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.
It's really grim if you're looking for assistance instead of an implementor.
GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.
I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
The problem is obviously who will be left. There’s a lot of scifi to catch up on.
I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.
I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.
I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is to ensure the subagent follows instructions to the letter.
"I just cloned this repo, investigate how to set it up, don't install anything, just collect information"
_spews information_
I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten...
"There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix"
_proceeds installing linux-utils and all the rest_
"Didn't I tell you to not install anything?"
_you're absolutely right_
F*k me..
A quite useful approach is to use /opusplan along with /codex:rescue (https://github.com/openai/codex-plugin-cc), which gets you quite a strongly reviewed plan using native Claude + Codex without having to implement the mostly useless trust-me-bro plugins and other BS.
I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.
Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.
Weak spots (categories it fails):
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
Still one of the most intelligent models overall, most likely to get any question you ask correctly (without tools).
And no (strong) programmer would jump to assuming other people are coding monkeys just because they disagree on what a strong LLM is: that's the kind of thinking reserved for the glorified coding monkeys who wasted their life getting better at writing CRUD apps and are now upset that someone's tooling is dropping the already very low bar there.
(ie. won't feel the need to downvote them just for having yet another crappy AI benchmark)
I only recognize it because I build a product that leaves me looking for information on every major release... and every major release a new crop of folks reply confused about the anomalies on top of anomalies that they're seeing, and they slowly learn this person is just way more unserious than the dogged distribution would imply.
z.ai doesn't always have the most reliable AI
but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So that's not a functional difference to me, and the Chinese one won’t reply to subpoenas, so that's a value add tbh
So I’ll consider all, fastest tokens/sec wins
That's not something that's definite. They are not quite like the Russians. A lot of the governments in Asia are overly pragmatic and will happily strong arm their companies to throw users under the bus for the sake of a trade deal. There's a reason why Snowden ran to the Russians and not China.
Also, if they have any subsidiaries in the US, they may not have a choice in the matter.
[0]: https://artificialanalysis.ai/models/claude-sonnet-5
They changed the Sonnet 5 'Agentic search' benchmark graph overnight
> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.
https://simonwillison.net/2026/Jun/30/claude-sonnet-5/
Meanwhile GLM 5.2 drew a cool self-contained fully animated SVG pelican:
https://simonwillison.net/2026/Jun/17/glm-52
(I suspect that's more of an indication that Anthropic have chosen not to waste resources training on animals riding vehicles, personally.)
The reason I thought this was an interesting benchmark is because it’s a non-image generating model creating an image using SVG code, so it kinda spans capabilities.
If an AI lab trained a model specifically for animals riding bicycles, it seems trivial to modify the prompt and determine if it was trained specifically for that or if it’s generalized a skill and can also generate a proper orangutan walking on stilts or an armadillo on a skateboard, this sort of thing?
Google Gemini have openly boasted about their animals on vehicles results! https://x.com/JeffDean/status/2024525132266688757
Stable Diffusion 3, an open weights model, was laughed at on release for not even being able to generate a woman lying in grass. The community attributed this to the heavy dataset filtering. Since then other open weights releases have been made with no NSFW capabilities and the community claims they're not as good at anatomy either.
You can google "stable diffusion 3 woman in grass" and press the images tab to see how the model failed spectacularly.
Most recently Ideogram released an open weight model that will denoise into a grey image with a "Blocked by safety filter" notice for certain prompts.
Of course, because it's open weights, people have found defeats.
This may be the goal.
As with any new model, you won't know the real impact until you start using it for your workload.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
I'm a skilled senior (I'm 54 and been coding since I was about 8; I've been 100% AI-generated code for at least 6 months now and have produced a combination of speed and quality that has astonished me; my velocity is apparent at https://github.com/pmarreck/) and this has been a massive net gain, so your claim is now officially in sheer defiance of reality.
In a skilled senior's hands, this is like an expert power tool. In the hands of someone less-skilled, it is likely also... less-skilled. It's a magnifier.
> and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug.
Nope, no it's not. It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage. Look, I even invented a whole new control methodology around it called MFIC: https://gist.github.com/pmarreck/b30aa3ca69cb70a5526f8a63ab8...
Not disagreeing that LLMs are a force multiplier, but I highly doubt there will be much value left to multiply in the next generation of seniors, at this rate. It's surreal to me that I have to point out that recognition AND recall are both necessary components of skill acquisition, because humans have largely known this since the dawn of education.
Seniors should be paid to actively introduce juniors to the trade over couple of years. No more bootcamp entry.
And it would take significant $ for a senior to agree to expend their time and energy on software engineering apprentices. There would also be a very limited number of places with good seniors. Exactly like medicine for a long time now.
In fact it is already happening in some companies I know about - seniors getting their bonuses tied to juniors being under their wings.
Instead let's train the contractors of an IT sourcing company and then we don't need you.
If your "force" is above 1 then it's OK to have AI power up your force. 2.3 to the power of 3 is 12.167.
But if you're a beginner and your "force" is below 1, then powering it up makes it worse. 0.2 to the power of 3 is 0.008.
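If you want to play with the analogy, here is a two-line sketch. The exponent of 3 is just the illustrative number used above, and `powered_up` is a made-up name.

```python
def powered_up(force: float, exponent: int = 3) -> float:
    """AI as an exponent on skill: above 1 it grows, below 1 it shrinks."""
    return force ** exponent


print(round(powered_up(2.3), 3))  # 12.167
print(round(powered_up(0.2), 3))  # 0.008
print(round(powered_up(1.0), 3))  # 1.0
```

Anything sitting at exactly 1 is unaffected, which is the break-even point of the whole analogy.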
If you don’t use a skill, it’s like a gene a species doesn’t need anymore, it will atrophy.
Is that bad and if yes, why? Skill atrophy is not intrinsically bad. I don’t know how to make tinted glass for church windows and I will never learn it because there are machines doing it now.
But I would for example think that critical thinking would be a catastrophic skill atrophy. As far as I know, there is no proven link though (and one would have to define what is “critical thinking” in the first place). Writing assembler without any autocomplete, I’m not so sure it’s such a problematic skill atrophy.
As far as I’m concerned, so long as we can be happy with AI we can run locally, AI is no different to the rise of scripting languages or the pocket calculator. It’s only problematic if the calculator is rented to you as a service.
Since automatic memory management became a thing memory management and pointers knowledge atrophied[1] across the workforce (although not nearly to the same degree).
I think the pattern here is that compilers almost always output better machine code than humans, whereas automatic memory management doesn't produce better machine code than skilled humans can write (especially with modern languages that give you a lot of build-time safety checks).
And even then, there is still demand for assembly knowledge in the workforce, it is just very niche.
I don't think LLMs will ever be good enough to "almost always" output better code than humans. But, like automatic memory management, it will likely make some types of programming more niche.
The key thing here is that compilers are deterministic, and deterministic tools have way less variance in output quality. Automatic memory management is not as deterministic as a compiler because it happens at runtime. LLMs output build-time code, but the output can be drastically different if I sneeze too hard.
[1]: as in % of the workforce, not absolute numbers. Hard to get exact figures on this, but I think we have more experienced people actively using Assembly today than we had before compilers became the default (late 80s). We probably have more active C/C++ programmers today than before Java became popular (early 2000s).
I just did a big refactor with Opus, it went ok, some bugs. The normal stuff. One of the bugs was in a part of the code no longer needed, which Opus had just filled with comments more or less. Asking it to fix the bug worked, but then I really looked at the code and realized just that, this is pointless now.
I’ve only been coding for 20+ years so I might be more susceptible than the author, but I’m quite terrified about losing skills in writing code, but also designing good structure, coherency and system overview. These are the things people claim you need more of with LLMs, but is what you outsource the most, even if you think you are describing it in detail.
We are all collectively growing the skill of complacency and laziness though, and those are not great "skills" to have. And I’m just as guilty as anyone.
Yes, some skills will atrophy, but the learning curve for LLMs is also steep and you will acquire new skills that will pay off the costs many times over.
We see this in discussions like these where you have people running the gamut from using them as glorified auto-complete or babysitting them (usually a net loss in speed, though it'll feel less draining) via people running multiple agents in several different tabs (a gain) to people prompting for harnesses rather than tasks, and putting the agent in the resulting harness (where the multipliers come in) and even people at the peak of experience with them today are only scratching the surface.
I'm very aware that just as my assembler skills are not what they once were, my skills in the languages I'm now writing less will not be what they once were a few years down this line.
But I produced far more before I started using LLMs through the force multiplier of modern languages and frameworks than I did in assembler in the 80's and 90's, and I produce far more now with LLMs, and I will produce even more in the future by learning how to take advantage of new capabilities.
I have Claude refining a system that wasn't tractable a few years ago in another terminal as I'm writing this. I don't care if it would take me a bit longer to get back up to speed on a C codebase again if I was stripped of all access to LLMs any more than I care if it'd take me a bit longer to get up to speed on programming assembler on a Commodore 64.
I usually start a task with an LLM and then do small refactors using the LLM and then do some manual refactors before I am done. But often for more complex tasks the manual refactors are quite large.
Maybe it is because they can read walls of text so easily, so they output walls of text that are hard to read for humans.
I feel quite sad because a lot of my fellow colleagues are not putting this extra effort in to make things easier to understand by humans. PR review is basically me just doing this extra effort for them and their LLM implementing my comments.
And that is when I can even pinpoint the bad taste in the code structure, sometimes it is not something you can easily describe in a PR comment besides "no human would structure the code like this".
At the end of the day there are goals achieved with coding. Coding is a tool to reach either your business needs or some personal aspiration.
When it comes to businesses I don't think a business cares if you used the best stack possible, or you've written it in assembly, as long as it works. Judging from the biggest coding drivers out there, most of the code produced globally, including the biggest apps, has had skilled engineers writing it, but it's not always perfect. As long as it works. Let's not forget that the web is built in PHP and JS.
So again, my question is: are you atrophying a skill that is still going to exist in the next 1 to 2 years, or is everything going to shift towards LLM code writing?
Personally I think that LLM code writing is the winner, whether we like it or not: it accelerates business objectives, which at the end of the day is the deciding factor.
And yes I do miss the days I was writing code and I was solving complex problems myself.
This is your opinion and I even share it, but there are many people here for whom writing the code was/is the whole deal. You would not have languages and heck - even editors! - holy wars otherwise.
Could you elaborate on this steep price that you have in mind? What does it consist of?
Technical debt due to accumulated excessively verbose, badly architected, often redundant, feature-bloated code which always looks good, even upon earnest review, but actually sucks and becomes extremely difficult to maintain in ways which are not obvious in code review. The issue is this: your tooling can help, and can make you feel better, and you might think you wrote all the prompts and made all the tools to mitigate these issues, but you haven't. If you're not consistently seeing it generate code that is very very close to the way a skilled senior dev such as yourself would have done it (with similar line count, etc), that is a red flag even if the code looks great and works.
I can only judge from my own experience but with or without LLMs, these are the codebases that I have worked with during most of my career. To me, much of the question is whether LLMs produce worse code than me and my colleagues have done in the past and I don't think that's the case. It is however very common that people hold LLMs to a higher standard than human colleagues and then it's not a useful comparison.
It came up with a correct LC-hard tier solution that involved dynamic programming, and was essentially an unreadable dense mess that was impossible to reason through as a human.
It worked, but it was so bad, that I sat down and realized after a bit that with maintaining a small cache, and being very particular about how the nodes are traversed, I reduced the solution to like a 10 line modified DFS, that I could understand.
I do the same with the LLM. I tell it that solution is convoluted and hard to understand, if I have a concrete suggestion I suggest one, otherwise I ask it for ideas. We get there just like I do with humans
This is in the interest of big AI companies: if they quasi-monopolize the skills entire sectors of the economy need in order to function, that will be great (for them).
Everyone keeps comparing this to compilers, but I don’t need a multiple-hundred dollar subscription to use LLVM. And people didn’t stop understanding how computers work either, just because they used C. And yeah, maybe local LLMs will become the norm, and I hope so. But market forces (hardware prices) certainly are working against that right now.
But we could build much better tooling around keeping the agents honest. The problems you are describing are absolutely real and I see them every day.
One friend of mine had almost a mental breakdown when he just went ahead and drilled a bug-producing Claude to the point that it itself admitted it was “a piece of shit”. He knew that arguing with an LLM agent is more than useless, but it was cathartic for sure.
When I encounter a situation like this I always come back to the same question: have I done everything I could to catch these errors in my automated validation? Then I update it as needed.
Agents are also more than happy to spend tokens refactoring. Once your test harness is good enough, producing successively better and more general abstractions is quite easy.
The old rule of thumb of “make it work, make it fast, make it pretty” still applies, just with much much faster iteration speed.
It seems with agents people have forgotten the last 2 steps since they produce a _working_ solution, and it might be hard to justify spending time “cleaning it up”, but this still remains essential.
I hear what you're saying but I'm not sure I buy it in the context of this thread (a response to someone who is 54 and has been coding since they were eight).
I am in a similar boat, having been coding full-time for forty years. The way I use the current tools is that I own all architectural and design decisions but let Claude Code fill in the blanks. I reckon the quality of the output is about 90% of what it would have been had I done everything myself, but I get a lot more done (easily 3-5X).
Will I forget how to write a "for" loop just because I haven't been writing many of them by hand lately? Those skills are so deeply ingrained that I seriously doubt it. I can ride a bike after a multi-year break, or converse in a language I haven't regularly spoken for several decades. Or write using pen & paper even though I hardly ever do it. I don't see why coding would be any different.
I also am not about to forget how to for(;;), that said, as a result of some years invested in aligning old pre WGS84 mapping with modern GPS and improving digital mapping, there are fewer people per capita with the skills to navigate via paper maps in the absence of GPS.
Old farts coding since age 8 (in which I include myself with a decade+ over a sprightly young 54) will retain coding skills for as long as they apply them - the fear is that fewer and fewer others will develop and exercise such skills due to AI.
It remains to be seen if that's a bad thing long term.
What I am worried about is us becoming dependent on tools that we as individuals neither own nor fully control, and gradually losing our ability to function without those tools. This, I think, is a huge societal risk.
I have yet to see it, but OK
Either measure it or it sounds like a conspiracy theory
Seniors will be able to stay in the game much longer than before, mark my words.
When an LLM is making a bad design decision but the engineer doesn't have the experience to spot it AND the consequences don't become apparent until much later (which is often the case) -- it's kinda hard to learn.
But they take a lot longer to reach the same goal for complex tasks, so the difference is still very real, and the cost-savings are still very much a question of how well you manage to characterise the tasks they will do quickly and pick and choose what to use when.
I kind of agree that I think the cheap models will eat away at the moat very effectively, but if it doesn't seem more capable to you, you're not giving it complex enough tasks to see what they can do.
(FWIW, I've burned billions of tokens on each of Deepseek, Kimi, GLM5.2, GPT, Sonnet, Opus, Haiku using the same harness, and we've kept stats on cost per task)
Extraordinary thing to say about the fastest growing company in the history of capitalism. They will soon have access to public markets, essentially unlimited capital, and can build insanely large models that they don't have to make public... ever. They can just use those models to run their business, train better models, eat competitors, etc.
But maybe it's Anthropic that isn't thinking ahead enough - you clearly think you can see around corners with your proclamation. So why do you think they have "little to no chance" of surviving long term?
There is no such thing as unlimited capital. The faster they grow the faster they burn capital. Eventually it will run out.
Getting another agent to validate the first agent is a tower of sand.
> my velocity is apparent at https://github.com/pmarreck/)
Forgive me, but the active repos all look like reimplementations of existing good open source code (which of course is ideal training data) - rm_safe has rip for example. Or prototypes. Is there anything that actually has a user base > 1?
I recently did a fleetwide upgrade to Zig 0.16. Do I remember every single change from 0.15? No. Do I have to? Also no. Both because I can look it up if I need to, but also because the LLM already does.
If I don't look at a codebase that I myself haven't looked at in a year, I will not recognize some things when I return to it. Is this sense of "atrophy" meaningful when this was a problem long before LLMs came on the scene?
On personal projects, where I am in charge of all the hats (product development, UI, UX, backend, security, server admin, etc) -- absolutely crazy force multiplier. You get a nice suite of backend and e2e tests running, with full business scenarios layered on top of that, and constantly running agents to do the coding, another agent on a higher level of reasoning to review that work, and occasionally popping into a competitor's model to review their work just for added comfort -- it feels like wizardry. I am not vibing it, but I wouldn't say I am carefully scrolling through every line. I review what's fundamentally important, especially when it comes to data, overall structure, and large, cross-cutting concerns, but I would be lying if I said no code lands that I haven't read. But I have the security of the test suites and validations, so I pour more effort into that.
It's a nice self-reinforcing loop.
All of this might sound like I agree with you, and to some extent I do, but I am realizing that the apps I built out like a cannon shot out of hell, with tremendous speed and polish right out of the gate, are starting to slow down. Feature adds are getting more complex. My memory is not what it used to be. Each run and pass through the code consumes more of my tokens and limits. I am starting to do less in the same amount of time. Codex did a vertical slice of a feature for me (well defined and well planned). It contained functionality that has historically plagued us developers -- the dreaded time. I used xHigh GPT 5.5. It had obvious bugs, but I wanted the robots to catch them. I popped it into Claude (on the new Sonnet 5! heyo!) -- Claude caught the bugs. Even said they "immediately stood out". I wondered how this happened. Frontier model from company A was evaluated by workhorse model from company B. All of this again took massive amounts of usage. And time.
And this is -- best case scenario, perfect world, everything is in perfect alignment.
Now for the work reality.
Multiple product and experience owners. Multiple dev teams. Different enterprise teams support services you rely on. You don't have full unfettered access to frontier models. You have to use copilot, or some other enterprise harness, and if you run out of credits for the month, you are SOL. It's not as good as your claude, you think to yourself, but hey, it's familiar enough, and you have 5k credits left for the month for Opus 4.8, better make the best of it. But now you burned half of them working on that Transactional Bug that was mixing synchronous and asynchronous semantics that the other guy's model should have picked up on. What happened? Maybe he didn't use Opus, maybe he used Haiku, maybe his prompt was bad. Who knows. Gotta fix it. Oh, you gotta reach across the aisle and put in a request to get the Enterprise team to look at this caching inconsistency on user data that you need and is really the source of your race conditions. Tick tick tick. Model limits approaching. You start wondering: if you had just done all this by hand like "in the old days", would you have got it done correctly faster? Or at least, cheaper. You'll never know.
Scaling in this sense is not operational (“servers”), but conceptual (“features”).
I don’t want to be a downer but I find many devs are not great at this. Very clever folks, but they tend to not see these issues clearly. They’ll nod and recognize when you talk about separating content from form and the importance of various design principles like high cohesion and loose coupling but completely disregard them once in contact with reality.
Part of the problem, as you nicely showed, is that technology is only a single slice of this problematic pie. Organizations in general are systems as well and they tend to be either badly architected, badly maintained or often both. Some technological issues are downstream from organizational issues and IME those can become rather dominant variables in the equation and no amount of AI - save full AGI taking control of the company - is going to save us from those factors.
The distinction between personal projects and Enterprise development is a big one. A severe bug in my personal projects, I fix it on the fly. A bug in our products rolled out, nightmare.
This is all to say, we as a company are using AI a lot in all possible corners, but thankfully our leadership isn't schizophrenic and isn't mandating everyone hit token limits or whatever, it's more of a "Let's see what works and what doesn't" type of thing, and we measure a lot of statistics. Nobody here really cares whether LLMs are the next coming of Christ or not, as a company there are many people (even in SLT) that are indifferent to LLMs, and many who are reasonably hyped.
I wish I could link to the actual document we were all shown since it has a beautiful breakdown of the methodology and a fine-grained breakdown of the stats and the categories measured, but in the grand scheme of things, ALL the AI tooling we have implemented (at least on the engineering side of the equation) has contributed to a total of... drum roll please... 7 (seven) Percent overall productivity increase! The most productive teams saw a productivity increase of around 20%, while some teams actually saw drops in productivity into the negative percentage points. My team, none of us really give a shit about AI and we're somewhere in the 3-5% range on certain categories of tasks, which I'd say is a fairly good assessment.
Productivity here is measured in many ways, including but not limited to speed of MR review and merge times, feature/ticket/roadmap closure/delivery, rollback/revert incidence rate, how often people interact with the MR review bots and implement their suggestions/fixes, how many times people check back on AI transcriptions/meeting notes (hint: Nobody looks back on any of it, it's all just noise that gets generated and never actually referenced outside a few extremely rare cases) and many more things I'm forgetting. It is an imperfect number of course, because measuring productivity in engineering is a sisyphean task, but in my opinion it is accurate to the reality on the ground and outside of all the hype and marketing bullshit.
So, I remain thoroughly unconvinced of these personal anecdotes of people being "massively" more productive, especially once you factor in the fact that we now have a 2000EUR budget/month/dev for all the AI tooling, those productivity numbers start looking pathetic once you factor in the costs (which are only increasing as the AI companies need to start recouping the gazillions they've burned). Some teams have started begging to disable coderabbit and other similar tools in their MRs because they're producing nothing but walls of noise that makes reviewing any MR a nightmare of sludging through endless slop of useless bullshit, ours included.
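To put rough numbers on the cost side, here is a minimal sketch of the break-even arithmetic using the 7%, 20% and 2000 EUR figures quoted above. The function name is made up, and valuing a productivity gain as a simple fraction of monthly developer cost is my own simplifying assumption, not something from the internal document.

```python
def break_even_dev_cost(tooling_cost_per_month: float, productivity_gain: float) -> float:
    """Monthly developer cost at which the tooling pays for itself.

    Assumption: a productivity gain of g is worth g * (monthly developer cost).
    """
    if productivity_gain <= 0:
        return float("inf")  # no gain (or a loss) never breaks even
    return tooling_cost_per_month / productivity_gain


# Figures from the comment above: 2000 EUR/month/dev budget,
# 7% overall gain, around 20% for the most productive teams.
overall = break_even_dev_cost(2000, 0.07)
best_teams = break_even_dev_cost(2000, 0.20)
print(f"overall: {overall:.0f} EUR/month, best teams: {best_teams:.0f} EUR/month")
```

Under these assumptions, and only if the whole budget is actually spent, the overall 7% pays off only when a developer costs more than roughly 28.5k EUR a month, while a team at 20% breaks even at about 10k EUR a month.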
It's like a drug that will give you a few years of great high, and ruin the rest of your life afterwards. Use it by all means, I don't care about your output, nobody here does, you do you.
I do care about my long term skills, which aren't about piping some llm outputs. My employer ain't some dumb fuck who is pushing for llms at all costs as much as possible. Anyway, most of my day work is processes, discussions, pushing things through - llms can't do shit here, it's personal conversations, connections, often physical contact to get things done on time. Startup world would be different but I am as far from that unstable environment as I am from say gaming industry, just not worth my time outside SV area.
So I just use llms to verify my coding results, they are fine for that, but I do the creativity. It's by far the best part of my software dev work, why the heck would I be automating that away? It's like automating sex away so you can have more time... reading HN or some other way to just waste time, dumb approach from all angles.
Of course this changes if one is working on personal projects, self-employed, small startup etc but most folks here are not in that category.
I've had Claude Code running a /loop for the last week driving down complex crashing bugs in a prototype compiler entirely unilaterally. I occasionally glance over.
A few of those crashing test cases were ones I've spent more than a week trying to track down myself. I have 30 years of experience of doing this.
It's worked 24/7.
So far it has fixed over 500 of them.
Will there be technical debt? Yes. But nothing that remotely compares to the cost I'd have incurred of fixing all of those myself.
It is hard to reconcile those gains without thinking that if people are saying these are not a net gain, they haven't really tried learning how to get the full benefit. If you sit and watch a model work and keep intervening all the time, then sure, they're not going to be a net gain.
(And I say this as someone who agrees with you that it's garbage that these companies are trying to legislate their way into an oligopoly.)
Anthropic has gone past fearmongering and well into terrorism. I think people on Hacker News should not recommend working with terrorist orgs.
Or the largest ad company in the world (Google)?
Skill atrophy is a real thing though; we try to prevent it by having hackathons (for lack of a better word) without AI, where I pick something extremely non-trivial and we implement it for fun and profit without AI (which would not matter much, as they are currently bad at these things); the last one was flex Paxos for our in-house DB, with obvious metrics for the end result: data integrity (duh) under failure, and performance better than or at least the same as our Raft production version.
You’ll never guess what product your clients are looking to replace with their own next.
For now everyone is still sufficiently crap at using AI to need help. We had enough clients trying to build something themselves and then come crying to us.
What is your evidence?
Open weights models are responsible for enabling reams of research on interpretability methods that do just that. And they have facilitated so much collaboration on architecture, inference optimizations, training and steering methods, and other topics that were completely out of reach with closed models like Anthropic's. It's really staggering to me.
“His warns that once powerful models are released openly, companies lose the ability to monitor misuse, revoke access, or update safety guardrails.”
1) the company has device-level control to the degree that they can not only restrict which API endpoints people can connect to but which accounts they use to do so (in which case this already isn't an issue); or
2) they don't, and all bets are off anyway, open weights or not.
Did fearmongers like Amodei say, "Oops, we were wrong! It wasn't that dangerous after all"? No. Of course they didn't.
> "Once the weights of a model are public, they cannot be retrieved. If a model possesses dangerous capabilities, it is permanently out in the wild... We need to consider regulatory frameworks that account for the unique risks of open-source distribution of highly capable frontier models."
It definitely sounds like the kind of thing that ends the world in B sci-fi thrillers.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
Juggling between all different models/agents is quite simple with Zed.
A caution about OpenCode Go though, the entire company seems to be run by AI so there's lot of billing related issues with zero support. I subscribe new every month as I lost money due to double payment with automatic subscription.
For non coding related tasks I use local models.
P.S. If anyone is interested to read more about my setup, let me know I'll publish a blog post.
The Z.AI is a bit wonky, so now I'm moving to Openrouter for Qwen+Kimi+Deepseek+GLM
My summer project is to figure out a proper agentic system where a "big" model does the planning, but automatically uses a cheaper one for the grunt work. Having Opus do config edits is just stupid :)
My company pays for the tokens so I don’t care. Biggest model and max everything. The slight risk of a smaller model making a mistake is more expensive than just running the bigger model all the time.
But I'm playing the long game you see. The tokens will get expensive and the monthly subscriptions will either go away or also get too expensive.
Then companies want efficient token use and cost control - and I'll already know how to do that =)
What sort of hardware are you using to run local models? And how do you use them?
I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, one can catch each the other's blindspots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.
I've done variants of this a number of times, but feel like it was generally a waste of my time to then have to compare them and write up which parts I liked or disliked: if the output is something substantial, each will have its pros and cons. Clear-cut wins aren't very common. Of course it could work well if we automated the whole thing with an orchestrator; you just need a model with actual good taste (according to your own preferences) ... so we'll have to compare all the models to find that one
At the same time, I’ve invested in tooling that prints and lints architecture I want, so which model is less of an interesting decision, because the results tend to be very close.
For Opus 4.8, trained with overblown internal dialogue and second opinions - Max effort just burns tokens and wastes time without much value. Spinning wheels.
Now that the ban is lifted, max effort Fable 5 is gonna solve this problem quite neatly. Fable to plan and review, Sonnet for the implementation.
Wait, never mind that. Subscribers will only have Fable for a week.
I am getting things done. I've made major progress on my projects, and even started new ones. My most requested tasks are: code review, brainstorming and research. The fixation on tiny details is exactly what I'm paying for.
I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.
They are often used for reading code though.
To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.
In theory the flow works like this:
- small fast models read lots of code, and pass details to the large model to write a plan
- large model takes those details and writes a detailed plan
- medium models write the code
The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.
If it guesses, the plan usually starts to fall to bits.
If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
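The flow above, and what changes when the plan misses an edge case, can be sketched as simple token accounting. Everything here is hypothetical: the tier names, the `run_hybrid` function and all token counts are illustrative assumptions, not measurements from any real agent.

```python
from dataclasses import dataclass, field


def _zero_counts() -> dict:
    return {"small": 0, "medium": 0, "large": 0}


@dataclass
class Usage:
    """Tokens read and written per model tier (hypothetical accounting)."""
    read: dict = field(default_factory=_zero_counts)
    written: dict = field(default_factory=_zero_counts)


def run_hybrid(code_tokens: int, summary_tokens: int, plan_tokens: int,
               patch_tokens: int, plan_misses_edge_case: bool) -> Usage:
    usage = Usage()
    # 1. Small model reads the code and writes a summary for the planner.
    usage.read["small"] += code_tokens
    usage.written["small"] += summary_tokens
    # 2. Large model reads only the summary and writes the plan.
    usage.read["large"] += summary_tokens
    usage.written["large"] += plan_tokens
    # 3. Medium model reads the plan plus the code and writes the patch.
    usage.read["medium"] += plan_tokens + code_tokens
    usage.written["medium"] += patch_tokens
    if plan_misses_edge_case:
        # Escalation: the large model now reads the code itself
        # and writes a revised plan.
        usage.read["large"] += code_tokens
        usage.written["large"] += plan_tokens
    return usage


# Made-up sizes: 200k tokens of code, 4k summary, 2k plan, 5k patch.
happy = run_hybrid(200_000, 4_000, 2_000, 5_000, plan_misses_edge_case=False)
escalated = run_hybrid(200_000, 4_000, 2_000, 5_000, plan_misses_edge_case=True)
print(happy.read["large"], escalated.read["large"])
```

With these made-up sizes, one escalation takes the large model from reading 4k tokens to reading 204k, which is the point where having it write the patch as well adds comparatively little.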
It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.
I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.
That's not something I understand very well. The less expensive models will quite happily chug away at tasks, if the codebase is well-structured (small files help a lot) and your instructions are clear. In contrast, I've never seen a large model turn bad instructions (instructions that would cause a human to think before starting) into a result I liked. You can run small models almost 10-100x as long for the same price in dollars, which covers a lot of correction and adjustment.
Why does everyone say the trade-offs are rarely worth it?
I think the distinction is here.
I expect my agent to build from product level descriptions. This might include specific special cases that I call out, but will rarely highlight existing special cases or edge cases - they already exist in the code, and I'd expect a programmer to make sure that behavior continues to work.
If a feature hits lots of these edge cases, the weaker model that is reading the code (aka Haiku) won't understand their significance, and will report back to the planning model incomplete or incorrect information.
The planning model (Opus - which hasn't actually seen the code remember!) will build a plan that is incorrect or incomplete and delegate coding to the mid level model (Sonnet) which will do its best to make things work, without understanding the overall picture.
This is how you end up with slop - for example Sonnet reimplements things that already exist because it found one of the edge cases, but Opus had never known about it because Haiku didn't understand it.
It's possible that the new "agent teams" feature in Claude code can help with this. That keeps each agent alive with its context so they can ask each other things, but I haven't tried that enough to be sure - let alone with the specific model mix like this.
In your case, you are giving the Sonnet model specific instructions for what to implement mindlessly. I'd expect that to work well!
But that's not the same as the agentic workflow many others are using.
I haven't used them in a while so my info may be out of date, but they tended to track whatever models were the best and auto-use them for each task (eg, one for planning, subagent for a code search, other frontier for implementing). Their CLI seemed very well thought out to make you do things "the correct way" -- for instance, `/handoff` instead of `/clear`.
I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.
Playing around to learn the differences is incredibly helpful. Schedule an hour or two for it on one's calendar every week, and save links throughout the week to try out.
Understandable frankly.
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
It makes some sense, as models are trained more and more with reasoning, than without.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...
However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model that can address complex real world problems.
Rarely used Sonnet btw.
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.
Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.
The graphs show parts of the cost/performance pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.
So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
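For anyone who wants the definition spelled out, here is a minimal sketch of Pareto dominance and the frontier over discrete (cost, score) points. The model labels and numbers are made up for illustration and are NOT read off the blog post's graphs.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost: float   # dollars per task, lower is better
    score: float  # benchmark score, higher is better


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a.cost <= b.cost and a.score >= b.score
            and (a.cost < b.cost or a.score > b.score))


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Points not dominated by any other point, sorted by cost."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.cost)


# Hypothetical data: two current model families (A, B) and an older model.
points = [
    Point("A-low", 0.10, 50.0),
    Point("A-high", 0.60, 72.0),
    Point("B-low", 0.30, 65.0),
    Point("B-high", 0.90, 80.0),
    Point("old-A", 0.40, 45.0),
]
frontier = pareto_frontier(points)
print([p.model for p in frontier])
```

In this toy data the frontier alternates between A and B, so neither family dominates the other, while the older model is dominated and drops off entirely.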
But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.
Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
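A minimal sketch of that interpolation argument, restricted to the cost range where both curves are defined. The two curves are made-up (cost, score) points, not the blog post's data, and piecewise-linear interpolation between discrete effort levels is itself an assumption.

```python
def interp(curve: list[tuple[float, float]], x: float) -> float:
    """Piecewise-linear interpolation over (cost, score) points sorted by cost."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range where the curve is defined")


# Hypothetical (cost, score) points for two model families.
curve_a = [(0.10, 50.0), (0.60, 72.0)]
curve_b = [(0.30, 65.0), (0.90, 80.0)]

# Only compare over the overlapping cost range.
lo = max(curve_a[0][0], curve_b[0][0])
hi = min(curve_a[-1][0], curve_b[-1][0])
steps = 20
# min() guards against floating-point overshoot past the end of the range.
xs = [min(hi, lo + (hi - lo) * i / steps) for i in range(steps + 1)]
b_above_a = all(interp(curve_b, x) > interp(curve_a, x) for x in xs)
print(f"overlap: [{lo}, {hi}], B above A throughout: {b_above_a}")
```

In this toy data B's interpolated curve sits above A's across the whole overlap, even though no measured B point dominates A's most expensive point. So the verdict depends on whether you compare measured points or interpolated curves.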
If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show a "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.
(Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)
It was a claim that applies to a range of x-values where both curves are defined.
Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?
You could make it true by artificially dropping some of the data points, but, like, why?
(Again, this is moot given the updated graph.)
> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.
Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
Does anyone else have any review token saving measures?
Assume it to get deprecated sooner rather than later.
I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.
And what (available) model do you trust to go off on its own?
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
It’s like telling a chef to cook without a knife because knives can kill people.
Dario and his lackeys at Anthropic aren’t visionaries.
I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.
that's true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others' progress and accelerate your own. but the fact that they believe it's the only option doesn't make it true for the rest of us.
Are there some Less Wrong posts or similar I should read that probably explain it?
Fable is effectively not available to the general public in the US either
Everyone dislikes when these models are provided for use by the Department of Defense, but we can likely assume these newer, more capable models are being used by the NSA, FBI, CIA and other Five Eyes agencies to develop more backdoors, hack into more things to spy on us all.
We get drip fed the weaker models, but only once all the 0days have been used against us.
Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.
Quite a lot of these models have "safety" (lol) filters in front of them, rather than it being heavily encoded into the weights.
After a certain level of capability you're proposing handing loaded nukes to everyone. There is an end of the road to the "open models are good" argument and that end is when they start turning into cyber super weapons.
Either you think model intelligence will continue to improve or you don't.
If you think it won't continue to improve, sure, open models are great.
If you think it will continue to improve, then we are all fucked if models continue to be open on release.
Also Heretic as it is does not work for GLM5.2 (at least as of 3 days ago when I tested it). You'll need some hybrid approaches.
I am planning to release the steering patch for the GLM 5.2 eliminating pro-CCP alignment in the next few days.
I suppose I shouldn't be surprised at how the Trump admin is approaching AI regulation; counter-productive is really all they do
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.
Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.
This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
It then hallucinated the submit button class...
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
At least for Claude family models.
e.g. {
}
I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.
Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
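For illustration only, one way "response level reasoning" can look in a pipeline: the model is asked to return a JSON object with a short reasoning field next to the answer, and the caller validates it. This is a sketch, not the commenter's actual schema; the field names `reasoning` and `answer`, the word budget, and the sample payload are all hypothetical.

```python
import json


def parse_response(raw, max_reasoning_words=200):
    """Parse a JSON reply carrying a short 'reasoning' field next to the 'answer'.

    Field names and the word budget are hypothetical, chosen for this sketch.
    """
    data = json.loads(raw)
    if "answer" not in data:
        raise ValueError("response is missing the 'answer' field")
    reasoning = data.get("reasoning", "")
    # Word count is only a rough stand-in for a token count.
    if len(reasoning.split()) > max_reasoning_words:
        raise ValueError("reasoning is longer than the budget")
    return data["answer"], reasoning


# Made-up model output for demonstration.
raw = '{"reasoning": "Invoice total matches the line items.", "answer": "approve"}'
answer, reasoning = parse_response(raw)
print(answer)
```

The point of the design is that the reasoning budget is enforced by the response format rather than by a native thinking mode, which is what keeps token usage small in deterministic pipelines.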
I don't get what value you get out of this.
I don't know whether that comes out ahead compared to just staying with the better model in the first place.
I'm sure folks' mileage will vary though.
They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality, they increased the tokenizer size by 50%. As a result, the actual cost went up by 50% while they shifted everyone's attention to the price decrease.
Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.
Very shady marketing.
And of course they lie about 35% again. In reality with coding it is 50%.
UPD: I run playcode.io, so it’s my job to test all models, their pricing, and their quality in order to provide the best price/quality/speed/reliability to non-technical users.
Went away on its own.
I keep specific branches in a state where they are ready to develop new features.
Sonnet 5's performance is comparable to GLM 5.2 in both one-shot coding and agentic ability. However, it's about 20% less verbose than GLM 5.2 in average code submission sizes, and uses fewer reasoning tokens, which reduces the cost gap and suggests it writes cleaner code. In practice, Sonnet 5 ends up being 40% more expensive and ~2x faster than GLM 5.2 in our evaluations (not 300% more expensive as the per-token pricing would suggest). Granted, GLM 5.2 is an extremely reasoning-heavy model.
Overall, it's a solid release that gives Anthropic some standing in the price-conscious inference market.
Data at https://gertlabs.com/rankings
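The 40% vs 300% gap quoted above implies a token-usage ratio, which this small Python sketch works out. It assumes "300% more expensive" means 4x the per-token price and "40% more expensive" means 1.4x the total cost, and it treats all tokens as priced at the same ratio, which is a simplification.

```python
def implied_token_ratio(price_ratio, cost_ratio):
    """Tokens used by model A relative to model B, given price and total-cost ratios.

    total_cost_ratio = price_ratio * token_ratio, so token_ratio = cost / price.
    """
    return cost_ratio / price_ratio


price_ratio = 4.0  # assumption: "300% more expensive" per token read as 4x
cost_ratio = 1.4   # assumption: "40% more expensive" in practice read as 1.4x

ratio = implied_token_ratio(price_ratio, cost_ratio)
print(f"implied token usage: {ratio:.2f}x")
```

Under those assumptions Sonnet 5 would be billing roughly a third of the tokens GLM 5.2 does on the same tasks, which is consistent with the remark that GLM 5.2 is very reasoning heavy.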
I keep having to correct 4.8, but 5.5 more often than not is correcting me.
Opus writes a bit nicer though and it is easier to follow what it is doing/saying. Not too different an experience from talking to humans: 5.5 feels like a very smart 'nerd' that doesn't make a huge effort to communicate well, while Opus is a bit less intelligent but that makes its ideas easier to communicate
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
Claude is a series of models (Claude Sonnet X, Claude Opus X, etc.), Claude Code is their development CLI that uses their models, and Codex is the same as Claude Code but from OpenAI.
Ultimately the quota is linked to none of those 3 directly, but rather to which specific model you invoke.
I'm talking back-end, with database models, classes, queries, accompanying front-end layouts, with real dynamic data, running. Stuff that takes days to weeks to spin up, with minimal errors or issues, having cut down on days or weeks of effort, you can focus on testing and making it all into better code.
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+“cheaper for harness owner”
It's also still just prone to the kind of "stupid" mistakes we see from all LLMs. Like it can write great code, but it doesn't really have common sense without enormous guidance.
I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.
Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.
I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.
Maybe it works well on things that aren’t clear in the benchmarks.
In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there was too much input.
If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.
This is, obviously, early days but I'm impressed.
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
Sonnet is dumber and more expensive than Opus.
The token efficiency improvements in Opus are missing in Sonnet. Sonnet generates more output tokens and more reasoning tokens.
Any price advantage per token disappears due to volume.
It doesn't make sense to use Sonnet if you have access to Opus.
or
The Dodge Charger is built to be the most Charger like car yet.
Sonnet is slower due to much higher output and reasoning token generation.
Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.
"They took my shit away!" -- 3-day Fable 5 addicts (me)
"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types
"Great to see a closed source company fail!" -- open source boosters
"Great to see an American company fail!" -- anti-US, and/or pro-China folks
"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types
"Serves you right for ripping off creators!" -- copyright warriors
"They keep silently nerfing the models!" -- secret downgrade conspiracy theorists
"Quit killing the planet!" -- anti-datacenter advocates
Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.
I don't agree with your framing that all negativity is from crazies
It feels like your analysis is mostly spot on, it's the confluence of several motivated parties pouring effort into social media.
Many of the posters are pro-foreign models/pro-open source, and most can't distinguish the difference between "open source" and open weight models like Qwen, Minimax, or GLM.
Reminds me of the old "free as in beer" vs "free as in speech" debate. Free beer means you don't pay, but you don't get to see the recipe or change it. Free speech means you get the actual source and the right to study it, modify it, and redistribute it.
Open weight models are basically the beer version. You can download the weights, run them locally, fine-tune them, quantize them, host them on your own boxes — but what you have is a finished product, not the blueprint for how it was built.
Qwen is also censored - although since it's open weight, there are completely uncensored versions available.
The owners of Qwen can't jack up the prices to something I'm unable to pay. They can't take it away.
The owners of Qwen can't log and train on my data.
Open weight models share far more in common with free speech than free beer.
If big daddy Dario and his company are getting pushback it's not because of some motivated group trying to take them down. They brought it on themselves.
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
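A back-of-the-envelope Python sketch of what the tokenizer multiplier does to cost. The 1.0 and 1.35 multipliers come from the quoted announcement; the 1.5 figure is the number claimed elsewhere in this thread for coding, not something the announcement states. Actual list prices are left out on purpose, so everything is expressed as ratios.

```python
def effective_cost_ratio(token_multiplier, price_ratio=1.0):
    """Cost of the same input on the new model relative to the old one.

    token_multiplier: new token count / old token count for the same text.
    price_ratio: new per-token price / old per-token price.
    """
    return token_multiplier * price_ratio


def break_even_price_ratio(token_multiplier):
    """Per-token price ratio that keeps the total cost unchanged."""
    return 1.0 / token_multiplier


# 1.0 and 1.35 are from the quoted text; 1.5 is a commenter's claim.
for multiplier in (1.0, 1.35, 1.5):
    same_price = effective_cost_ratio(multiplier)
    discount = 1.0 - break_even_price_ratio(multiplier)
    print(f"{multiplier:.2f}x tokens: {same_price:.2f}x cost at the same price, "
          f"needs a {discount:.1%} per-token discount to stay cost-neutral")
```

So "roughly cost-neutral" only holds while the per-token discount matches the multiplier for your content; once the introductory price ends, the same text costs more by whatever the multiplier is for that content type.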
Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/
Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...
In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.
Why even release this model?
You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.
For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...
I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.
Unfortunately that means I won't be using it at work for now.
Claude Code generates more revenue than OpenAI...It appears to be a nice meme.
Not true
> model that's mostly worse while being more expensive
Not true
> they can ride a wave of misinformation.
Not true
Look at the cost per intelligence of Opus vs GPT 5.5.
Anthropic is the Taylor Swift of frontier labs... Not bad, but massively, MASSIVELY stan'd for inexplicable reasons, in violation with reality.
cool to see, still waiting for models to get better at computer use.
It seems being incompetent is a feature now...
[0] https://github.com/dginovker/BFME-Source-Code/
Okay.
And yet, the $2-$5 section is the widest, even though it only contains a single point.
I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
- Do the ever-increasing scores on the benchmarks mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
None of the other labs are doing this kind of long lived two model series.
Oh Claude, you master of software engineering, does it ever end? Do you have no bounds?
How may we further assist you oh Claude?
I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".
Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).
Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.
Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.
loads of trust me bro benchmarks
financially incentivized comments and upvote/downvoting patterns
it's all slop