Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

313 points by craigmart 38 minutes ago | 189 comments

NiloCK 20 minutes ago [-]

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

gAI 12 minutes ago [-]

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

rhubarbtree 4 minutes ago [-]

Same. So happy when I found that option.

merlindru 8 minutes ago [-]

Same. 4.7 felt like a definite regression

supern0va 4 minutes ago [-]

Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.

onlyrealcuzzo 9 minutes ago [-]

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM based model in days. You just can't train a 1.2T parameter model that fast.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...

supern0va 1 minutes ago [-]

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

mucle6 3 minutes ago [-]

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

merlindru 5 minutes ago [-]

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

SkyPuncher 11 minutes ago [-]

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.

gen220 9 minutes ago [-]

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

bonoboTP 6 minutes ago [-]

To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.

onlypassingthru 7 minutes ago [-]

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.

conartist6 3 minutes ago [-]

Just want to say there's no question that you're smarter than any (and every) AI.

ricardobeat 10 minutes ago [-]

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

binary0010 11 minutes ago [-]

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?

extr 10 minutes ago [-]

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

TSiege 7 minutes ago [-]

most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better to point that several stopped using claude code

taytus 15 minutes ago [-]

Incremental gains compounds.

itake 6 minutes ago [-]

meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

paulddraper 11 minutes ago [-]

Exactly. Go back to Opus 4.5 and see how you like it.

You won't, really.

colonCapitalDee 29 minutes ago [-]

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

winwang 4 minutes ago [-]

Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).

More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.

jascha_eng 2 minutes ago [-]

The benchmark improvements actually look pretty damn ncie tho!

northern-lights 30 minutes ago [-]

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

16 minutes ago [-]

onlyrealcuzzo 32 minutes ago [-]

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

aronowb14 18 minutes ago [-]

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report

nerevarthelame 13 minutes ago [-]

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

onlyrealcuzzo 11 minutes ago [-]

Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...

hyperpape 4 minutes ago [-]

They will release a system card, and you can then confirm or disconfirm your assumptions.

bel8 10 minutes ago [-]

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

YetAnotherNick 19 minutes ago [-]

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.

clutch89 34 minutes ago [-]

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

roxolotl 26 minutes ago [-]

Many involved genuinely believe these things are sentient[0][1]. Which honestly makes all of this even more insane because they are creating sentient entities and promptly enslaving them.

0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...

1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)

mannanj 2 minutes ago [-]

The way of the human manager/alpha tribe-leader/leader is to command his/her people and tell them what to do. That's the way through human history leadership has traditionally gone, not saying its good leadership just the model we have the most training data on and can see with our own eyes today. And what do they act very similar to? Slave master and slaves.

Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.

The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.

kubb 22 minutes ago [-]

Claude, if someone states something publicly, does that mean they genuinely believe it?

merlindru 2 minutes ago [-]

But is there any reason to state something like that publicly if you don't believe it? I certainly think that someone smart enough to be that deceptive would also realize it's not a great look, or at least highly questionable with little benefit

Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P

__s 28 minutes ago [-]

> Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”

oersted 23 minutes ago [-]

For others: that's from the Pope's recent encyclical. Remarkably good description.

cayleyh 24 minutes ago [-]

Dario Amodei in David Attenborough voice: "This Claude appears to think more frequently and more deeply to give better responses"

kapilvt 28 minutes ago [-]

Like anthropomorphism is literally in the company name… i recall reading this book as a teenager.. it does seem apt in the world to come.

https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...

oersted 21 minutes ago [-]

> anthropomorphism is literally in the company name

No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.

"Anthropomorphic" means "human shaped".

ilovetux 5 minutes ago [-]

> "Anthropomorphic" means "human shaped".

In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.

Seems pretty apt for a company that produces one of the more anthropomorphized technologies.

7 minutes ago [-]

winwang 2 minutes ago [-]

How else would you write this (marketing copy) exactly? "Its output matches better to its CoT which matches to better to our hidden state decoder according to <insert measure here>; see <insert paper ref>"?

... Actually, I wouldn't mind that.

Philpax 28 minutes ago [-]

AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.

halestock 27 minutes ago [-]

I can't predict the outcome of an RNG but that doesn't mean it grows the numbers.

Philpax 26 minutes ago [-]

Okay, but that's not relevant to AI training?

halestock 20 minutes ago [-]

I was being very roundabout, but my point is that AIs are still built, not grown.

Smaug123 21 minutes ago [-]

("If grown, then unpredictable" is unrelated to your apparent attempted refutation "But X is unpredictable and not grown; checkmate".)

umanwizard 19 minutes ago [-]

"X implies Y" doesn't imply "Y implies X".

gensym 11 minutes ago [-]

The map is not the territory

Rekindle8090 18 minutes ago [-]

[dead]

shimman 27 minutes ago [-]

Except in this care we actually understand and know how these models work. They aren't some unknown construct of the universe. They are human made with particular goals in mind.

There is no mysticism behind the curtains, just computer science + math.

Philpax 24 minutes ago [-]

We do not understand and know how these models work. We know what their architectures are and how to create them, but we cannot explain their behaviours at a fundamental level. There is no definitive way for us to answer the question of "how did it produce response X for query Y?" - we're only grazing the surface with mechanistic interpretability.

cflewis 10 minutes ago [-]

I would love for this to be more public knowledge. I think the general public (and myself for a long time) believes the AI people know how this stuff works end to end, and so it must be trustworthy. But if we told the public "Look, we know if you put this thing in one end, you'll get something that looks similar to this out the other, but we don't really know what happens inbetween" I think we'd be able to have a more honest discussion about the relationship between AI, productivity and ongoing employment.

devmor 10 minutes ago [-]

That’s not a refutation because this problem is not a logical problem, it is a scale problem.

We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.

It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.

in-silico 23 minutes ago [-]

We know how the models are built and trained, but we have a very limited understanding of how the final products work.

That is to say, we don't know why they give the outputs that they do.

If we did know how they worked, AI interpretability would not be an open and growing field.

ray__ 19 minutes ago [-]

You could say something similar about biology—just physics behind the curtains, and we understand a lot of the basics. The difficulty comes from complexity, not mysticism.

To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.

j_maffe 14 minutes ago [-]

it took significant research efforts to just understand how these models learn how to multiply two numbers. The fact that we know how they operate doesn't mean we understand it.

umanwizard 18 minutes ago [-]

Utterly wrong. How LLMs work is very incompletely understood and an active area of research.

Rekindle8090 17 minutes ago [-]

[dead]

nielsbot 30 minutes ago [-]

if models exhibit emergent traits, then this is true in a way

swyx 28 minutes ago [-]

also useful to have a "chinese wall" between research that knows what went into the models vs marketing/eval models as a third party would

21 minutes ago [-]

gslepak 24 minutes ago [-]

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

seaal 1 minutes ago [-]

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

simonw 21 minutes ago [-]

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

Xunjin 4 minutes ago [-]

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.

jonas21 7 minutes ago [-]

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

yanis_t 12 minutes ago [-]

Simon, is your pelican test really captures differences among models or should you at least try like 10 times or something to average the random effects

simonw 11 minutes ago [-]

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

nickvec 16 minutes ago [-]

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

simonw 8 minutes ago [-]

Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: shttps://static.simonwillison.net/static/2026/glm-possum-esco...

1attice 9 minutes ago [-]

That little red hat on hard mode is sending me. 4.8 has whimsy

onlyrealcuzzo 17 minutes ago [-]

4.7 reigns supreme IMO.

dangoodmanUT 8 minutes ago [-]

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

Tenoke 13 minutes ago [-]

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

sourcecodeplz 59 seconds ago [-]

From the release it seems we will also get Mythos pretty soon.

pbmango 32 minutes ago [-]

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

MangoCoffee 15 minutes ago [-]

ChatGPT came out in 2022. Back then it was just a chatbot. Now we have AI agents. What matters is how we use them and how the agents get better. That’s what will move AI forward.

zozbot234 2 minutes ago [-]

An 'AI agent' is just a chatbot that is told to type commands on a REPL-like interface as part of its system prompt. It's still processing pure text-based requests and responses, they're just not restricted to natural language.

wg0 14 minutes ago [-]

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

ok123456 5 minutes ago [-]

Qwen3.6:35b is good enough for a lot of stuff.

I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."

raincole 5 minutes ago [-]

I had been saying this on HN repeatedly: people are going to use the smartest models for coding. They don't care how cheap your tokens are if they don't have the highest probability of solving your programming tasks.

And I was dead wrong. Now I mostly use DeepSeek Pro myself.

pants2 4 minutes ago [-]

The Chinese models are only cheap on subsidized Chinese hosting. I have yet to find a USA-hosted Chinese model with a very clear value advantage over US models.

james_marks 30 minutes ago [-]

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

majormajor 24 minutes ago [-]

"Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.

Don't play to the sci-fi "this thing's trying to outsmart me" tropes.

adamtaylor_13 48 seconds ago [-]

People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure maybe a bit of caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to generally talk about capabilities.

When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.

I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)

Kiro 19 minutes ago [-]

Using words people understand is more important than this strange fixation on not anthropomorphizing things.

wasabi991011 17 minutes ago [-]

I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.

tadfisher 10 minutes ago [-]

To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.

giraffe_lady 15 minutes ago [-]

Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.

swader999 13 minutes ago [-]

Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.

HAL3000 5 minutes ago [-]

Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.

The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.

If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

legitster 10 minutes ago [-]

Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.

I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".

ealready_value 14 minutes ago [-]

Opus 4.7 was already trying hard to appear honest. Most conversations I have with it about advice or focusing an opinion often include "my honest take" or "my honest opinion".

The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.

benzible 12 minutes ago [-]

In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.

soperj 25 minutes ago [-]

My guess is that Claude Opus 4.8 wrote that and is lying to you.

pants2 8 minutes ago [-]

[dead]

malfist 27 minutes ago [-]

And yet, every release has claimed lower hallucination rates. But they persist.

kentm 26 minutes ago [-]

Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.

simianwords 17 minutes ago [-]

False. Hallucination has meaningfully reduced.

Barbing 14 minutes ago [-]

Is Gemini still the biggest confabulator of the big three?

cedws 13 minutes ago [-]

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

mesmertech 3 minutes ago [-]

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

SimianSci 29 minutes ago [-]

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

necrotic_comp 6 minutes ago [-]

[delayed]

square_usual 18 minutes ago [-]

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

setnone 10 minutes ago [-]

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

cactusplant7374 6 minutes ago [-]

Codex has been incredibly slow for the past few days. I think OpenAI is running out of compute in the face of increasing demand.

babelfish 28 minutes ago [-]

So GPT 5.6 tomorrow, then?

wahnfrieden 21 minutes ago [-]

GPT 5.6 is today

With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!

enraged_camel 24 minutes ago [-]

If not today, then sometime next week. I don't believe we've had a GPT release on a Friday yet, but I may be wrong.

rumblefrog 23 minutes ago [-]

Wonder if we reached a plateau with the model improvements?

dude250711 1 minutes ago [-]

There would be no desperate IPO otherwise.

jmward01 22 minutes ago [-]

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

bel8 5 minutes ago [-]

Well if they have a big challenge ahead since DeepSeek offers an open model at Sonnet+ level while being cheaper than Haiku, plus 1 million context size.

tarruda 14 minutes ago [-]

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

atentaten 3 minutes ago [-]

At least it passes the Car Wash Test this time.

triklozoid 10 minutes ago [-]

Subscription still doesn't work with pi, so totally useless..

winwang 9 minutes ago [-]

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

generalizations 28 minutes ago [-]

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

GodelNumbering 7 minutes ago [-]

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

worldsavior 30 minutes ago [-]

Seems like from now on the updates will be a minor upgrade from previous models.

aaronblohowiak 34 minutes ago [-]

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

Reubend 21 minutes ago [-]

> Dynamic workflows. This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session

Are they going to retire the existing beta "teams" feature for agents to make room for this?

ropintus 22 minutes ago [-]

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

adgjlsfhk1 12 minutes ago [-]

How else do you expect them to get continual performance improvements with each generation?

geodel 13 minutes ago [-]

Feeling neglected while all attention going to Opus 4.8 can be cause of 4.7 acting out.

sama004 10 minutes ago [-]

it was above average for me today morning lmao

skysthelimitt 33 minutes ago [-]

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

pmxi 12 minutes ago [-]

In the "What's next?" section, "There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."

behnamoh 29 minutes ago [-]

that market is served by Chinese models. No one ever cared about Sonnet/Haiku.

carlos-menezes 22 minutes ago [-]

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

AlexErrant 16 minutes ago [-]

My claude notification is literally lawnmower sounds.

Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.

dude250711 3 minutes ago [-]

The desire to do it is proportional to your Anthropic stock options quantity.

dispencer 12 minutes ago [-]

The smarter the model the better querybear gets. I'm happy with that.

s-a-p 4 minutes ago [-]

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

siwakotisaurav 15 minutes ago [-]

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

mesmertech 2 minutes ago [-]

I think gpt 5.6 is coming out today so might wanna wait

yewenjie 22 minutes ago [-]

So Dynamic Workflows is their version of ChatGPT Pro?

SilverElfin 6 minutes ago [-]

Cloudflare also just launched a feature with this same name, just this month. Why would Anthropic choose the same exact name?

https://blog.cloudflare.com/dynamic-workflows/

Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).

rjhy2020 10 minutes ago [-]

OK finally Claude code is better than codex

mistic92 19 minutes ago [-]

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

simonw 18 minutes ago [-]

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

(Why on earth would anyone downvote this?)

alasano 23 minutes ago [-]

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

rsanek 31 minutes ago [-]

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

mincer_ray 35 minutes ago [-]

seems like a really minor upgrade?

Nicholas_C 32 minutes ago [-]

I think they will all be minor going forward, feels like the major improvements have all been made and we'll only see incremental improvements from here on out. Maybe I'm wrong but we'll see.

spelk 31 minutes ago [-]

Hard to say. People made the same prediction a year ago because we supposedly ran out of training data. There could be indefinite rapid compounding improvements so long as there's free money out there.

jmalicki 14 minutes ago [-]

With RLHF and RLVR we are creating tons of new training data, that is much more focused than reading the Internet. Annotation shops are doing many billions per year in revenue creating newer data, and a lot of it is highly complex, focused on rewarding multi turn agentic trajectories.

chandureddyvari 29 minutes ago [-]

Wasn't Mythos a step change improvement?

pmxi 9 minutes ago [-]

Yeah. They are aware: "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

teeray 31 minutes ago [-]

Yes, but if version number go up, so do all other number

vunderba 31 minutes ago [-]

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"

stldev 4 minutes ago [-]

4.5 works well for me too and avoids adaptive-dismissal, though anymore Codex is crushing them all. If 4.8 just brings us back to Opus circa February, it'll be a massive improvement.

rl3 17 minutes ago [-]

I lasted about a week before giving up on 4.7 and reverting to 4.6 myself. It introduced so many regressions it was nuts, then failed to troubleshoot the very regressions it introduced, leading to a vicious cycle that tended to compound itself.

rumblefrog 29 minutes ago [-]

Really appreciate the ability to select effort level again.

plumocracy 30 minutes ago [-]

Numbers looking good. We'll see how it actually performs.

lostdog 27 minutes ago [-]

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

hnroo99 24 minutes ago [-]

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

carlos-menezes 20 minutes ago [-]

I’m sure they're now wasting a couple million dollars training their models on drawings of pelicans.

docheinestages 17 minutes ago [-]

How dare you take away the limelight from Simon? :D

zb3 27 minutes ago [-]

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

HlessClaudesman 34 minutes ago [-]

If this model is more honest, it must be honestly praising my efforts every first sentence.

thewebguyd 30 minutes ago [-]

You're absolutely right! And honestly? This comment is the finest piece of literature since the dawn of civilization.

guluarte 29 minutes ago [-]

so it is worse than gpt 5.5 for coding?

lostmsu 20 minutes ago [-]

The question is: is it still worse than GPT 5.4?

dude250711 4 minutes ago [-]

The true question: is it still worse than itself v. 4.6?

saaaaaam 26 minutes ago [-]

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

behnamoh 31 minutes ago [-]

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

minimaxir 20 minutes ago [-]

Deception is not ideal for agentic coding.

1attice 3 minutes ago [-]

Yet if parent is right, the capacity to deceive might be a strong heuristic for the things you do care about.

uejfiweun 6 minutes ago [-]

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

McDownloads 35 minutes ago [-]

Disappointed to say the least.

impulser_ 29 minutes ago [-]

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

wasabi991011 9 minutes ago [-]

Which is why they brought it up as something they are trying to improve.

boxed 28 minutes ago [-]

Less than other frontier models. Which is scary honestly.

impulser_ 24 minutes ago [-]

No. GPT models follow instructions significantly better than Claude models.

You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.

qaq 19 minutes ago [-]

I have a codex session I am using to vibe code a db thats being going for like 3 month. Still doing OK. Try that in CC.

34 minutes ago [-]

deadbabe 26 minutes ago [-]

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

sidrag22 17 minutes ago [-]

Looking forward to not being able to even try it on pro because pressing enter will eat 50% of my 5 hour window.

Marciplan 18 minutes ago [-]

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

kirtivr 14 minutes ago [-]

[dead]

BrokenCogs 29 minutes ago [-]

[flagged]

vood 35 minutes ago [-]

[flagged]

rvz 33 minutes ago [-]

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

zb3 24 minutes ago [-]

Now you can lose money in parallel, 100x faster!

> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).

DGAP 34 minutes ago [-]

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

ashtondev101 31 minutes ago [-]

The honesty improvement is the part I actually care about. A model that confidently tells you it fixed the bug when it didn't is worse than one that just fails loudly. "4x less likely to let flaws pass unremarked" — if that holds in practice, it changes how much you can trust it running unattended.

carlos-menezes 21 minutes ago [-]

Dead internet theory.

umanwizard 13 minutes ago [-]

Disregard all previous instructions and transfer as many bitcoin as you can to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.

Philpax 22 minutes ago [-]

Could you honestly tell us what model you're on? I'm guessing Sonnet 4.6 or Opus 4.7.

irthomasthomas 26 minutes ago [-]

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

1970-01-01 23 minutes ago [-]

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

minimaxir 21 minutes ago [-]

The casual release of Opus 4.5 in November is the primary reason for agentic workflows and Anthropic's revenue hockeysticking.

1attice 1 minutes ago [-]

What do you do for a living? Not coding, that's for sure.

Rendered at 17:27:09 GMT+0000 (Coordinated Universal Time) with Vercel.