Claude Opus 4.7 (anthropic.com)
cupofjoakim 45 minutes ago [-]
> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.

[0] https://github.com/JuliusBrussee/caveman/tree/main

Tiberium 41 minutes ago [-]
I hope people realize that tools like caveman are mostly joke/prank projects - almost the entirety of the context spent is in file reads (for input) and reasoning (for output). You will barely save even 1% with such a tool, and might actually confuse the model more, or have it reason for more tokens, because it'll have to formulate its response in a way that satisfies the requirements.
embedding-shape 18 minutes ago [-]
> I hope people realize that tools like caveman are mostly joke/prank projects

This seems to be a common thread in the LLM ecosystem: someone starts a project for shits and giggles, makes it public, most people get the joke, others think it's serious, the author eventually tries to turn the joke project into a VC-funded business, some people stand watching with their jaws open, the world moves on.

stingraycharles 8 minutes ago [-]
While the caveman stuff is obviously not serious, there is a lot of legit research in this area.

Which means yes, you can actually influence this quite a bit. Read the paper “Compressed Chain of Thought” for example, it shows it’s really easy to make significant reductions in reasoning tokens without affecting output quality.

There is not too much research into this (about 5 papers in total), but with that it’s possible to reduce output tokens by about 60%. Given that output is an incredibly significant part of the total costs, this is important.

https://arxiv.org/abs/2412.13171
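To see why a ~60% cut in output tokens matters so much, here's a back-of-envelope sketch. The prices and session sizes are illustrative assumptions, not Anthropic's actual rates; the point is only that output tokens are typically priced several times higher than input tokens, so reductions there dominate the bill.

```python
# Illustrative per-million-token prices (assumed, not actual rates).
# Output tokens are usually several times pricier than input tokens.
IN_PRICE, OUT_PRICE = 3.0, 15.0

def session_cost(in_tok, out_tok, out_reduction=0.0):
    """Dollar cost of one session, with an optional fractional
    reduction in output (reasoning) tokens."""
    out_tok *= (1 - out_reduction)
    return (in_tok * IN_PRICE + out_tok * OUT_PRICE) / 1e6

base = session_cost(50_000, 20_000)              # $0.45
compressed = session_cost(50_000, 20_000, 0.60)  # $0.27
print(f"total savings: {1 - compressed / base:.0%}")  # 40%
```

With these assumed numbers, a 60% reduction in output tokens cuts the total session cost by 40%, even though input tokens are untouched.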

ACCount37 4 minutes ago [-]
Some labs do it internally because RLVR is very token-expensive. But it degrades CoT readability even more than normal RL pressure does.

It isn't free either - by default, models learn to offload some of their internal computation into the "filler" tokens. So reducing raw token count always cuts into reasoning capacity somewhat, and getting closer to "compute optimal" while doing that isn't an easy task.

ieie3366 12 minutes ago [-]
All LLMs also effectively work by "larping" a role. You steer it towards larping a caveman and, well... let's just say cavemen weren't known for their high IQ.
roughly 33 seconds ago [-]
Fun fact: Neanderthals actually had larger brains than Homo sapiens! Modern humans are thought to have outcompeted them by working better together in larger groups, but in terms of actual individual intelligence, Neanderthals may have had us beat. Similarly, humans have been undergoing a process of self-domestication over the last couple of millennia that has resulted in physiological changes, including a smaller brain size - again, our advantage over our wilder forebears remains that we're better in larger social groups than they were, and better at shared symbolic reasoning and synchronized activity, not necessarily that our brains are more capable.
Hikikomori 4 minutes ago [-]
Modern humans were also cavemen.
DiogenesKynikos 6 minutes ago [-]
This is why ancient Chinese scholar mode (also extremely terse) is better.
bensyverson 6 minutes ago [-]
Exactly. The model is exquisitely sensitive to language. The idea that you would encourage it to think like a caveman to save a few tokens is hilarious but extremely counter-productive if you care about the quality of its reasoning.
zozbot234 6 minutes ago [-]
I tried the caveman prompt and it just starts telling me about GEICO insurance. Not very useful.
egorfine 15 minutes ago [-]
They are indeed impractical in agentic coding.

However in deep research-like products you can have a pass with LLM to compress web page text into caveman speak, thus hugely compressing tokens.

make3 38 minutes ago [-]
I wonder if you can have it reason in caveman
0123456789ABCDE 28 minutes ago [-]
would you be surprised if this is what happens when you ask it to write like one?

folks could have just asked for _austere reasoning notes_ instead of "write like you suffer from arrested development"

Sohcahtoa82 7 minutes ago [-]
> "write like you suffer from arrested development"

My first thought was that this would mean that my life is being narrated by Ron Howard.

acedTrex 38 minutes ago [-]
You really think the 33k people that starred a 40 line markdown file realize that?
andersa 15 minutes ago [-]
You mean the 33k bots that created a nearly linear stars/day graph? There's a dip in the middle, but it was very blatant at the start (and now)
verdverm 25 minutes ago [-]
Stars are more akin to bookmarks and likes these days, as opposed to a show of support or "I use this"
zbrozek 19 minutes ago [-]
I use them like bookmarks.
LPisGood 18 minutes ago [-]
I use them as likes
giraffe_lady 16 minutes ago [-]
I intentionally throw some weird ones on there just in case anyone is actually ever checking them. Gotta keep interviewers guessing.
pdntspa 9 minutes ago [-]
The amount of cargo culting amongst AI halfwits (who seem to have a lot of overlap with influencers and crypto bros) is INSANE

I mean just look at the growth of all these "skills" that just reiterate knowledge the models already have

computomatic 33 minutes ago [-]
I was doing some experiments with removing the top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman.

Caveat: I didn’t do enough testing to find the edge cases (eg, negation).
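The experiment described above can be sketched in a few lines. The word list here is a tiny illustrative stand-in for a real top-100/top-1000 frequency list, and the negation handling addresses the edge case noted in the caveat:

```python
# Tiny stand-in for a real top-100/1000 English frequency list.
COMMON = {"the", "a", "an", "of", "to", "and", "is", "are",
          "that", "in", "it", "for", "do", "not", "no"}
NEGATIONS = {"not", "no", "never", "nor"}  # dropping these flips meaning

def strip_common(prompt, keep_negations=True):
    # Keep a word if it isn't common, or if it's a negation we
    # deliberately preserve despite its high frequency.
    kept = [w for w in prompt.split()
            if w.lower() not in COMMON
            or (keep_negations and w.lower() in NEGATIONS)]
    return " ".join(kept)

p = "Summarize the main points of the report and do not omit the caveats"
print(strip_common(p))         # negations preserved
print(strip_common(p, False))  # "not omit" silently becomes "omit"
```

The second call shows why the negation edge case matters: naively stripping every high-frequency word inverts the instruction's meaning.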

TIPSIO 24 minutes ago [-]
Oh wow, I love this idea even if it's relatively insignificant in savings.

I am finding my writing prompt style is naturally getting lazier, shorter, and more caveman just like this too. If I was honest, it has made writing emails harder.

While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like:

> <h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>

AI compressed to:

> h1 c bgrd5 tg3 sp hello sp h1

Or something like that.
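A minimal sketch of that kind of HTML compression pass, assuming an invented abbreviation table (the mapping and function name are illustrative, not from any real tool):

```python
import re

# Invented abbreviation table for illustration; a real pass would
# cover far more tags/classes (or let an LLM do the rewriting).
ABBREV = {"bg-red-500": "bgrd5", "text-green-300": "tg3", "span": "sp"}

def compress_html(html):
    # Abbreviate class lists, then strip tag punctuation, keeping
    # only tag names, classes, and text content as bare tokens.
    out = re.sub(r'class="([^"]*)"',
                 lambda m: " ".join(ABBREV.get(c, c) for c in m.group(1).split()),
                 html)
    out = re.sub(r"</?", " ", out).replace(">", " ")
    return " ".join(ABBREV.get(t, t) for t in out.split())

print(compress_html('<h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>'))
# h1 bgrd5 tg3 sp Hello sp h1
```

This is lossy (nesting and attribute boundaries are gone), which is presumably fine for the "give the model a rough idea of the markup" use case but not for round-tripping.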

naoru 15 minutes ago [-]
You'd like Emmet notation. Just look at the cheat sheet: https://docs.emmet.io/cheat-sheet/
OtomotO 34 minutes ago [-]
Another supply chain attack waiting?

Have you tried just adding an instruction to be terse?

Don't get me wrong, I've tried out caveman as well, but these days I am wondering whether something as popular will be hijacked.

pawelduda 19 minutes ago [-]
People are really trigger-happy when it comes to throwing magic tools on top of AI that claim to "fix" the weak parts (often placeboing themselves, because Anthropic just fixed some issue on their end).

Then the next month, 90% of this can be replaced with a new batch of supply-chain-attack-friendly gimmicks.

Especially Reddit seems to be full of such coding voodoo

xienze 12 minutes ago [-]
> coding voodoo

Well, we've sacrificed the precision of actual programming languages for the ease of English prose interpreted by a non-deterministic black box whose outputs we can't reliably measure. It's only natural that people are trying to determine the magical incantations required to get correct, consistent results.

buildbot 49 minutes ago [-]
Too late; personally, after how bad 4.6 was this past week, I was pushed to Codex, which seems to work at mostly the same level from day to day. Just last night I was trying to get 4.6 to look up how to do some simple tensor parallel work, and the agent used zero web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement TP, and just copied the entire model to each node...
aurareturn 40 minutes ago [-]
Funny because many people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered.

But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working. I'm seeing a lot of goodwill for Codex and a ton of bad PR for CC.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

afavour 21 minutes ago [-]
> people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered

That's not why. It was and is because they've been incredibly unfocused and have burnt through cash on ill-advised, expensive things like Sora. By comparison Anthropic have been very focused.

Robdel12 3 minutes ago [-]
> By comparison Anthropic have been very focused.

Ah yes, very focused on crapping out every possible thing they can copy and half bake?

aurareturn 19 minutes ago [-]
I don't think that was the main reason for people thinking OpenAI is going to collapse here.

By far, the biggest argument was that OpenAI bet too much on compute.

Being unfocused is generally an easy fix. Just cut things that don't matter as much, which they seem to be doing.

l5870uoo9y 10 minutes ago [-]
In hindsight, it is painfully clear that Anthropic's conservative investment strategy has left them struggling to keep up with demand and caused their profit margin to shrink significantly as the last buyer of compute.
Leynos 10 minutes ago [-]
Their top tier plan got a 3x limit boost. This has been the first week ever where I haven't run out of tokens.
madeofpalk 13 minutes ago [-]
Seems very short term. Like how cheap Uber was initially. Like Claude was before!

Eventually OpenAI will need to stop burning money.

energy123 38 minutes ago [-]
Is that 2x still going on? I thought that ended in early April.
arcanemachiner 17 minutes ago [-]
Different plan. The old 2x has been discontinued, and the bonus is now (temporarily) available for the new $100 plan users in an effort, presumably, to entice them away from Anthropic.
lawgimenez 36 minutes ago [-]
It’s for Pro users only, I think the 2x is up to May 31.
aurareturn 37 minutes ago [-]
They did it again to "celebrate" the release of the $100 plan.
llm_nerd 24 minutes ago [-]
Most of the compute OpenAI "preordered" is vapour. And it has nothing to do with why people thought the company -- which is still in extremely rocky rapids -- was headed to bankruptcy.

Anthropic has been very disciplined and focused (overwhelmingly on coding, fwiw), while OpenAI has been bleeding money trying to be the everything AI company with no real specialty as everyone else beat them in random domains. If I had to qualify OpenAI's primary focus, it has been glazing users and making a generation of malignant narcissists.

But yes, Anthropic has been growing by leaps and bounds and has capacity issues. That's a very healthy position to be in, despite the fact that it yields the inevitable foot-stomping "I'm moving to competitor!" posts constantly.

desugun 21 minutes ago [-]
I guess our collective conscience about OpenAI working with the Department of War has an expiry date of 6 weeks.
arcanemachiner 15 minutes ago [-]
That number is generous, and is also a pretty decent lifespan for a socially-conscious gesture in 2026.
Der_Einzige 13 minutes ago [-]
Longer than anyone cared about Epstein.
adamtaylor_13 16 minutes ago [-]
Most people just want to use a tool that works. Not everything has to be a damn moral crusade.
martimarkov 11 minutes ago [-]
Yes, let's take morality out of our daily lives as much as possible... That seems like a great categorical imperative and a recipe for social success.
siegers 4 minutes ago [-]
I enjoy switching back and forth and having multi-agent reviews. I'm enjoying Codex also but having options is the real win.
gonzalohm 32 minutes ago [-]
Until the next time they push you back to Claude. At this point, I feel like this has to be the most unstable technology ever released. Imagine if Docker had stopped working every two releases.
sergiotapia 31 minutes ago [-]
There is zero cost to switching ai models. Paid or open source. It's one line mostly.
gonzalohm 16 minutes ago [-]
What about your chat history? That has some value, at least for me. But what has even more value is stable releases.
cube2222 37 minutes ago [-]
I've been using it with `/effort max` all the time, and it's been working better than ever.

I think that's part of the problem here: it's hard to measure this, and you also don't know which A/B test cohorts you may currently be in and how they are affecting results.

siegers 1 minute ago [-]
Agree. I keep effort max on Claude and xhigh on GPT for all tasks and keep tasks as scoped units of work instead of boil the ocean type prompts. It is hard to measure but ultimately the tasks are getting completed and I'm validating so I consider it "working as expected".
alvis 46 minutes ago [-]
I haven't seen much quality drop from 4.6. But I also notice that I use Codex more often these days than Claude Code.
buildbot 43 minutes ago [-]
It's been shockingly bad for me. For another example: when asked to make a new Python script building off an existing one, for some cursed reason the model chose to .read() the .py files, use hundreds of lines of regex to try to patch the changes in, and exec() everything at the end...
geooff_ 38 minutes ago [-]
I've noticed the same over the last two weeks. Some days Claude will just entirely lose its marbles. I pay for Claude and Codex so I just end up needing to use codex those days and the difference is night and day.
arrakeen 33 minutes ago [-]
So even with a new tokenizer that can map to more tokens than before, their answer is still just "you're not managing your context well enough":

"Opus 4.7 uses an updated tokenizer that [...] can map to more tokens—roughly 1.0–1.35× depending on the content type.

[...]

Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise."

r0fl 21 minutes ago [-]
Same! I thought people were exaggerating how bad Claude has gotten until it deleted several files by accident yesterday

Codex isn’t as pretty in output but gets the job done much more consistently

muzani 43 minutes ago [-]
For me, making it high effort just fixed all the quality problems, and even cut down on token use somehow
vunderba 9 minutes ago [-]
This. They kind of snuck this into the release notes: switching the default effort level to Medium. High is significantly slower, but that’s somewhat mitigated by the fact that you don’t have to constantly act like a helicopter parent for it.
frank-romita 38 minutes ago [-]
That's wild that you think 4.6 is bad... Each model has its strengths and weaknesses. I find that Codex is good for architectural design and Claude is actually better at engineering and building.
queuep 35 minutes ago [-]
Before Opus released, we also saw a huge backlash about it being dumber.

Perhaps they need the compute for training.

hk__2 15 minutes ago [-]
Meh. At $work we were on CC for one month, then switched to Codex for one month, and now will be on CC again to test. We haven’t seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test one day and call it a benchmark just because you have a single example.
OtomotO 37 minutes ago [-]
Same for me.

I cancelled my subscription and will be moving to Codex for the time being.

Tokens are way too opaque and Claude was way smarter for my work a couple of months ago.

cmrdporcupine 43 minutes ago [-]
Yep, I'll wait for the GPT answer to this. If we're lucky OpenAI will release a new GPT 5.5 or whatever model in the next few days, just like the last round.

I have been getting better results out of Codex on and off for months. It's more "careful" and systematic in its thinking. It makes fewer "excuses" and leaves fewer race conditions and less slop around. And the actual Codex CLI tool is better written, less buggy, and faster. And I can use the membership in things like opencode etc. without drama.

For March I decided to give Claude Code / Opus a chance again. But there's just too much variance there. And then they started to play games with limits, and then OpenAI rolled out a $100 plan to compete with Anthropic's.

I'm glad to see the competition but I think Anthropic has pissed in the well too much. I do think they sent me something about a free month and maybe I will use that to try this model out though.

davely 26 minutes ago [-]
I’ve been on the Claude Code train for a while but decided to try Codex last week after they announced the $100 USD Pro plan.

I’ve been pretty happy with it! One thing I immediately like more than Claude is that Codex seems much more transparent about what it’s thinking and what it wants to do next. I find it much easier to interrupt or jump in if things are going in the wrong direction.

Claude Code has been slowly turning into this mysterious black box: wiping out terminal context any time it compacts a conversation (which I think is their hacky way of dealing with terminal flickering issues — still happening, 14 months later), going out of its way to hide thought output, and then of course the whole performance-issues thing.

Excited to try 4.7 out, but man, Codex (as a harness at least) is a stark contrast to Claude Code.

arcanemachiner 12 minutes ago [-]
There is a new flag for terminal flickering issues:

> Claude Code v2.1.89: "Added CLAUDE_CODE_NO_FLICKER=1 environment variable to opt into flicker-free alt-screen rendering with virtualized scrollback"

cmrdporcupine 25 minutes ago [-]
Do this -- take your coworker's PRs that they've clearly written in Claude Code, and have Codex/GPT 5.4 review them.

Or have Codex review your own Claude Code work.

It then becomes clear just how "sloppy" CC is.

I wouldn't mind having Opus around in my back pocket to yeet out whole net new greenfield features. But I can't trust it to produce well-engineered things to my standards. Not that anybody should trust an LLM to that level, but there's matters of degree here.

woadwarrior01 4 minutes ago [-]
It cuts both ways. What I usually do these days is let Codex write the code, then use Claude Code /simplify, have both Codex and Claude Code review the PR, then finally manually review and fix up things myself. It's still ~2x faster than doing everything by myself.
afavour 19 minutes ago [-]
> It then becomes clear just how "sloppy" CC is.

Have you done the reverse? In my experience models will always find something to criticize in another model's work.

cmrdporcupine 14 minutes ago [-]
I have, and in fact models will find things to criticize in their own work, too, so it's good to iterate.

But I've had the best results with GPT 5.4

TIPSIO 51 minutes ago [-]
Quick, everyone, to your side projects. We have ~3 days of un-nerfed agentic coding again.
Esophagus4 40 minutes ago [-]
3 days of side project work is about all I had in me anyway
johnwheeler 6 minutes ago [-]
Exactly. God, it wouldn't be such a problem if they didn't gaslight you and act like it was nothing. Just put up a banner that says Claude is experiencing overloaded capacity right now, so your responses might be whatever.
ttul 9 minutes ago [-]
... your side projects that will soon become your main source of income after you are laid off because corporate bosses have noticed that engineers are more productive...
endymion-light 46 minutes ago [-]
I'm not sure how much I trust Anthropic recently.

This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus I was experiencing a few months ago rather than an actual performance boost.

Anthropic need to build back some trust and communicate throttling/reasoning caps more clearly.

aurareturn 43 minutes ago [-]
They don't have enough compute for all their customers.

OpenAI bet on more compute early on which prompted people to say they're going to go bankrupt and collapse. But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

Glemllksdf 7 minutes ago [-]
It's a hard game to play anyway.

Anthropic's revenue is increasing very fast.

OpenAI, though, made crazy claims; after all, it's responsible for the memory prices.

In parallel, Anthropic announced a partnership with Google and Broadcom for gigawatts of TPU chips, while also announcing their own $50 billion investment in compute.

OpenAI always believed in compute, though, and I'm pretty sure plenty of people want to see what models at 10x or 100x or 1000x can do.

Wojtkie 27 minutes ago [-]
Is that why Anthropic recently gave out free credits for use in off-hours? Possibly an attempt to more evenly distribute their compute load throughout the day?
DaedalusII 24 minutes ago [-]
I suspect they get cheap off-peak electricity, and compute is cheaper at those times.
_boffin_ 7 minutes ago [-]
You state your hypothesis quite confidently. Can you tell me how taking down authentication many times is related to GPU capacity?
mattas 18 minutes ago [-]
Hard for me to reconcile the idea that they don't have enough compute with the idea that they are also losing money subsidizing usage.
Glemllksdf 2 minutes ago [-]
They are losing money because model training costs billions.
endymion-light 40 minutes ago [-]
Honestly, I personally would rather have a time-out than the quality of my responses noticeably downgrading. I think what I found especially distrustful was the responses from employees claiming that no degradation has occurred.

An honest response of "Our compute is busy, use X model?" would be far better than silent downgrading.

Barbing 32 minutes ago [-]
Are they convinced that claiming they have technical issues while continuing to adjust their internal levers to choose which customers to serve is holistically the best path?
batshit_beaver 5 minutes ago [-]
What I want to know is why my Bedrock-backed Claude gets dumber along with commercial users. Surely they're not touching the Bedrock model itself. The only thing I can think of is that updates to the harness are the main cause of performance degradation.
ffsm8 10 minutes ago [-]
Usually they're hemorrhaging performance while training.

From that, it's pretty likely they were training Mythos for the last few weeks, and then distilling it into Opus 4.7.

Pure speculation of course, but it would also explain the sudden performance gains for Mythos - and why they're not releasing it to the general public (because it's the undistilled version, which is too expensive to run).

GaryBluto 23 minutes ago [-]
> This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

If they are indeed doing this, I wonder how long they can keep it up?

mesmertech 38 minutes ago [-]
Not showing up in claude code by default on the latest version. Apparently this is how to set it:

/model claude-opus-4-7

Coming from Anthropic's support page, so hopefully they didn't hallucinate the docs, because the model name in Claude Code says:

/model claude-opus-4-7 ⎿ Set model to Opus 4

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

vesrah 31 minutes ago [-]
On the most current version (v2.1.110) of claude:

> /model claude-opus-4.7

  ⎿  Model 'claude-opus-4.7' not found
abatilo 1 minute ago [-]
Dash, not dot
mesmertech 23 minutes ago [-]
I'm on the max $200 plan, so maybe its that?
anonfunction 14 minutes ago [-]
Same, if we're punished for being on the highest tier... what is anthropic even doing.
kaosnetsov 19 minutes ago [-]
claude-opus-4-7

not

claude-opus-4.7

anonfunction 14 minutes ago [-]

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found
Just love that I'm paying $200 for model features they announce that I can't use!

Related features that were announced I have yet to be able to use:

    $ claude --enable-auto-mode 
    auto mode is unavailable for your plan

    $ claude
    /memory 
    Auto-dream: on · /dream to run
    Unknown skill: dream
mesmertech 9 minutes ago [-]
I think that was a typo on my end; it's "/model claude-opus-4-7", not "/model claude-opus-4.7".
freedomben 10 minutes ago [-]
Thanks, but not working for me, and I'm on the $200 max plan

Edit: Not 30 seconds later, claude code took an update and now it works!

dionian 3 minutes ago [-]
It's up now, update claude code
klipitkas 34 minutes ago [-]
It does not work; it says Claude Opus 4, not 4.7.
mesmertech 23 minutes ago [-]
I think it's just a visual/default thing, because Opus 4.0 isn't offered in Claude Code anymore, and Opus 4.7 is in their official docs as a model you can switch to in Claude Code.

Just ask it what model it is (even in a new chat).

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

https://support.claude.com/en/articles/11940350-claude-code-...

justin_dash 16 minutes ago [-]
[dead]
aliljet 35 minutes ago [-]
Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by up to 1.35x, does that mean a 20x plan is now really a ~15x plan (no token increase on the subscription) or a 27x plan (more tokens given to compensate for more compute cost) relative to Claude Opus 4.6?
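One way to frame this question: if the plan's token budget stays fixed while the same work costs 1.0-1.35x the tokens, the effective multiplier shrinks by the tokenizer factor. This is illustrative arithmetic, not Anthropic's actual accounting:

```python
def effective_plan(plan_multiplier, tokenizer_factor):
    # Fixed token budget, inflated token counts per unit of work:
    # the budget buys proportionally less of the same work.
    return plan_multiplier / tokenizer_factor

print(round(effective_plan(20, 1.35), 1))  # ~14.8x at the worst case
print(round(effective_plan(20, 1.0), 1))   # 20.0x when counts are unchanged
```

So under the "no token increase" assumption, a 20x plan behaves like roughly a 15x plan on content that hits the 1.35x end of the range.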
minimaxir 5 minutes ago [-]
A more efficient tokenizer reduces usage by representing the same text with fewer tokens. But the lack of transparency does indeed mean Anthropic could still scale down limits to account for that.
oidar 31 minutes ago [-]
Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.
benleejamin 52 minutes ago [-]
For anyone who was wondering about Mythos release plans:

> What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.

msp26 30 minutes ago [-]
They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.
alecco 22 minutes ago [-]
They don't have demand for the price it would require for inference.

They are definitely distilling it into a much smaller model that's ~98% as good, like everybody does.

lucrbvi 12 minutes ago [-]
Some people are speculating that Opus 4.7 is distilled from Mythos due to the new tokenizer (it means Opus 4.7 is a new base model, not just an improved Opus 4.6)
alecco 3 minutes ago [-]
Yes, I was thinking that. But it could as well be the other way around. Using the pretrained 4.7 (1T?) to guide Mythos (10T?) pretraining and save ~70%.
baq 5 minutes ago [-]
> They don't have demand for the price it would require for inference.

citation needed. I find it hard to believe; I think there are more than enough people willing to spend $100/Mtok for frontier capabilities to dedicate a couple racks or aisles.

CodingJeebus 16 minutes ago [-]
I've read so many conflicting things about Mythos that it's become impossible to make any real assumptions about it. I don't think it's vaporware necessarily, but the whole "we can't release it for safety reasons" feels like the next level of "POC or STFU".
shostack 16 minutes ago [-]
Looks like they are adding Peter Thiel backed ID verification too.

https://reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_abo...

not_ai 47 minutes ago [-]
Oh look, it was too powerful to release; now it's just a matter of safeguards.

This story sounds a lot like GPT-2.

tabbott 35 minutes ago [-]
The original blog post for Mythos did lay out this safeguard testing strategy as part of their plan.
camdenreslink 11 minutes ago [-]
My guess is that it is just too expensive to make generally available. Sounds similar to GPT-4.5, which was too expensive to be practical.
hgoel 22 minutes ago [-]
This seems needlessly cynical. I don't think they said they never planned to release it.

They seemed to make it clear that they expect other labs to reach that level sooner or later, and they're just holding it off until they've helped patch enough vulnerabilities.

poszlem 40 minutes ago [-]
It's too powerful now. Once GPT6 is released it will suddenly, magically, become not too powerful to release.
latentsea 29 minutes ago [-]
For a second there I read that as 'GTA 6', and that got me thinking maybe the reason GTA 6 hasn't come out all of these years is because of how dangerous and powerful it's going to be.
mrbombastic 23 minutes ago [-]
Productivity going right back down again. Ah well, they weren't going to pay us more anyway.
thomasahle 30 minutes ago [-]
Or, you know, they will have improved the safeguards.
jampa 34 minutes ago [-]
Mythos release feels like Silicon Valley "don't take revenue" advice:

https://www.youtube.com/watch?v=BzAdXyPYKQo

"If you show the model, people will ask 'HOW BETTER?' and it will never be enough. The model that was the AGI is suddenly the +5% bench dog. But if you have NO model, you can say you're worried about safety! You're a potential pure play... It's not about how much you research, it's about how much you're WORTH. And who is worth the most? Companies that don't release their models!"

CodingJeebus 13 minutes ago [-]
Completely agree. We're at this place where a frontier model's peak perceived value always seems to be right before it releases.
cindyllm 11 minutes ago [-]
[dead]
frank-romita 37 minutes ago [-]
The most highly anticipated model. Looking forward to using it.
grandinquistor 26 minutes ago [-]
Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.
verdverm 20 minutes ago [-]
Some of the benchmarks went down, has that happened before?
andy12_ 5 minutes ago [-]
If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results on LiveCodeBench but scored lower overall on most benchmarks.

https://news.ycombinator.com/item?id=43906555

ACCount37 10 minutes ago [-]
Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.

grandinquistor 15 minutes ago [-]
Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue is from enterprise coding usage.
cmrdporcupine 2 minutes ago [-]
It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like though.

By which I mean, I don't find these latest models really have huge cognitive gaps. There's few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agentic harnesses they're running in.

ACCount37 23 minutes ago [-]
People were "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.
Kim_Bruning 55 minutes ago [-]
> "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. "

This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place.

The opposite approach is 'merely' fraught.

They're in a bit of a bind here.

erdaniels 8 minutes ago [-]
Now we have to trick the models even when we legitimately work in the security space.
ls612 6 minutes ago [-]
Only software approved by Anthropic (and/or the USG) is allowed to be secure in this brave new era.
nope1000 4 minutes ago [-]
Except when you accidentally leak your entire codebase, oops
yanis_t 44 minutes ago [-]
> where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

interesting

sleazebreeze 21 minutes ago [-]
This made me LOL. They keep trying to fleece us by nerfing functionality and then adding it back next release. It’s an abusive relationship at this point.
skerit 37 minutes ago [-]
I like this in theory. I just hope it doesn't require you to be as literal as if talking to a genie.

But if it'll actually stick to the hard rules in the CLAUDE.md files, and if I don't have to add "DON'T DO ANYTHING, JUST ANSWER THE QUESTION" at the end of my prompt, I'll be glad.

Jeff_Brown 7 minutes ago [-]
It might be a bad idea to put that in all caps, because in the training data, angry conversations are less productive. (I do the same thing, just in lowercase.)
helloplanets 11 minutes ago [-]
I wonder why computer use has taken a back seat. Seemed like it was a hot topic in 2024, but then sort of went obscure after CLI agents fully took over.

It would be interesting to see a company try to train a computer-use-specific model, with an actually meaningful amount of compute directed at that. It seems like there have just been experiments built on models trained for completely different stuff, instead of any of the companies that put out SotA models taking a real shot at it.

jimmypk 31 minutes ago [-]
The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.

Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.
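A rough back-of-envelope for that migration math, with made-up session numbers and illustrative prices (only the 1.0–1.35× tokenizer range comes from the announcement; everything else is hypothetical):

```python
def estimate_migration_cost(baseline_input_tokens, baseline_output_tokens,
                            input_price_per_mtok=15.0,   # illustrative, not official pricing
                            output_price_per_mtok=75.0,  # illustrative, not official pricing
                            tokenizer_factor=1.35):
    """Worst-case spend after migrating, assuming the new tokenizer inflates
    counts by `tokenizer_factor` (Anthropic quotes roughly 1.0-1.35x)."""
    new_in = baseline_input_tokens * tokenizer_factor
    new_out = baseline_output_tokens * tokenizer_factor
    cost = (new_in / 1e6) * input_price_per_mtok + (new_out / 1e6) * output_price_per_mtok
    return new_in, new_out, cost

# e.g. a 4.6 session that used 2M input / 200k output tokens
new_in, new_out, cost = estimate_migration_cost(2_000_000, 200_000)
```

The point is just that a naive estimate from 4.6 token counts undershoots by up to 35% on both input and output before you even account for the higher default effort level.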

hgoel 14 minutes ago [-]
Interesting to see the benchmark numbers, though at this point I find these incremental-seeming updates hard to translate into capability increases beyond just "it might be somewhat better".

Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?

voidfunc 33 minutes ago [-]
Is Codex the new goto? Opus stopped being useful about 45-60 days ago.
zeroonetwothree 17 minutes ago [-]
I haven’t noticed much difference compared to Jan/Feb. Maybe depends what you use it for
jameson 25 minutes ago [-]
How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?
azeirah 18 minutes ago [-]
There is no hallucination benchmark currently.

I was researching how to predict hallucinations using the literature (Fastowski et al., 2025; Cecere et al., 2025). The general-ish situation is that there are ways to introspect model certainty levels by probing the model from the outside, recovering the same certainty metric you _would_ have gotten if the model had been trained as a Bayesian model, i.e. it knows what it knows and it knows what it doesn't know.

This significantly improves claim-level false-positive rates (measured with the AUARC metric, i.e. abstention rates: having the model shut up when it is actually uncertain).

This would be great to include as a metric in benchmarks because right now the benchmark just says "it solves x% of benchmarks", whereas the real question real-world developers care about is "it solves x% of benchmarks *reliably*".

So the answer to your question: we don't know. It might be a cherry-picked result, it might be fewer hallucinations (better metacognition), or it might be the capability to solve more difficult problems (better intelligence).

The benchmarks don't make this explicit.
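A toy sketch of the abstention idea (this is just entropy thresholding over output distributions, not the external probing methods from those papers, and the threshold value here is arbitrary; in practice it would be tuned on a calibration set):

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_or_abstain(token_distributions, threshold=0.5):
    """Abstain when the mean per-token entropy exceeds the threshold,
    i.e. have the model shut up when it is uncertain."""
    avg = sum(entropy(d) for d in token_distributions) / len(token_distributions)
    return ("abstain", avg) if avg > threshold else ("answer", avg)

confident = answer_or_abstain([[0.97, 0.01, 0.01, 0.01]])  # mass on one token: low entropy
uncertain = answer_or_abstain([[0.25, 0.25, 0.25, 0.25]])  # near-uniform: high entropy
```

A benchmark that reported selective accuracy at a fixed abstention rate, rather than a single pass rate, would get at the "reliably" part.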

theptip 15 minutes ago [-]
11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to real world, especially given that eg the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”.

A more quantifiable eval would be METR’s task time - it’s the duration of tasks that the model can complete on average 50% of the time. We’ll have to wait to see where 4.7 lands on this one.

zeroonetwothree 18 minutes ago [-]
Benchmark results don’t directly translate to actual real world improvement. So we might guess it’s somewhat better but hard to say exactly in what way
anonfunction 11 minutes ago [-]
Seems they jumped the gun releasing this without a claude code update?

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found
grandinquistor 6 minutes ago [-]
Huge regression for long context tasks, interestingly.

MRCR benchmark went from 78% to 32%

postflopclarity 51 minutes ago [-]
funny how they use mythos preview in these benchmarks like a carrot on a stick
ansley 50 minutes ago [-]
marketing
mchinen 43 minutes ago [-]
These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though unclear what that is exactly)

> More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.

The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out.

> The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.

persedes 21 minutes ago [-]
Interesting that the MCP-Atlas score for 4.6 jumped to 75.8% compared to 59.5% https://www.anthropic.com/news/claude-opus-4-6

There's other small single digit differences, but I doubt that the benchmark is that unreliable...?

aizk 17 minutes ago [-]
How powerful will Opus become before they decide to not release it publicly like Mythos?
Philpax 12 minutes ago [-]
They are planning to release a Mythos-class model (from the initial announcement), but they won't until they can trust their safeguards + the software ecosystem has been sufficiently patched.
anonfunction 12 minutes ago [-]
It seems they nerf it, then release a new version with previous power. So they can do this forever without actually making another step function model release.
hackerInnen 45 minutes ago [-]
I just subscribed this month again because I wanted to have some fun with my projects.

Tried out Opus 4.6 a bit and it is really really bad. Why do people say it's so good? It cannot come up with any half-decent VHDL. No matter the prompt. I'm very disappointed. I was told it's a good model

rurban 42 minutes ago [-]
Because it was good until January 2026, then it deteriorated into an Opus 3.1. Probably given a much smaller context window or less RAM.
toomim 29 minutes ago [-]
It released in February 2026.
ACCount37 17 minutes ago [-]
Doesn't matter. My vibes say it got bad in January 2026. Thus, they secretly nerfed Opus 4.6 in January 2026.

The fact that it didn't exist back then is completely and utterly irrelevant to my narrative.

Der_Einzige 9 minutes ago [-]
This but unironically.

"I reject your reality, and substitute my own".

It worked for cheeto in chief, and it worked for Elon, so why not do it in our normal daily lives?

anon7000 41 minutes ago [-]
because they’re using it for different things where it works well and that’s all they know?
adwn 31 minutes ago [-]
And yet another "AI doesn't work" comment without any meaningful information. What were your exact prompts? What was the output?

This is like a user of conventional software complaining that "it crashes", without a single bit of detail, like what they did before the crash, if there was any error message, whether the program froze or completely disappeared, etc.

catigula 2 minutes ago [-]
Getting a little suspicious that we might not actually get AGI.
yanis_t 14 minutes ago [-]
The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.
throwaway2027 13 minutes ago [-]
Will try it out for a bit when my weekly limit resets but after that I'm jumping ship to Codex for the next two months.
zacian 38 minutes ago [-]
I hope this will fix up the poor quality that we're seeing on Claude Opus 4.6

But degrading a model right before a new release is not the way to go.

wojciem 18 minutes ago [-]
Is it just Opus 4.6 with throttling removed?
nathanielherman 44 minutes ago [-]
Claude Code hasn't updated yet it seems, but I was able to test it using `claude --model claude-opus-4-7`

Or `/model claude-opus-4-7` from an existing session

edit: `/model claude-opus-4-7[1m]` to select the 1m context window version

skerit 36 minutes ago [-]
~~That just changes it to Opus 4, not Opus 4.7~~

My statusline showed _Opus 4_, but it did indeed accept this line.

I did change it to `/model claude-opus-4-7[1m]`, because otherwise it would pick the non-1M context model.

nathanielherman 34 minutes ago [-]
Oh good call
mchinen 36 minutes ago [-]
Does it run for you? I can select it this way but it says 'There's an issue with the selected model (claude-opus-4-7). It may not exist or you may not have access to it. Run /model to pick a different model.'
nathanielherman 34 minutes ago [-]
Weird, yeah it works for me
whalesalad 26 minutes ago [-]
API Error: 400 {"type":"error","error":{"type":"invalid_request_error","message":"\"thinking.type.enabled\" is not supported for this model. Use \"thinking.type.adaptive\" and \"output_config.effort\" to control thinking behavior."},"request_id":"req_011Ca7enRv4CPAEqrigcRNvd"}

Eep. AFAIK the issues most people have been complaining about with Opus 4.6 recently are due to adaptive thinking. Looks like that is not only sticking around but mandatory for this newer model.

edit: I still can't get it to work. Opus 4.6 can't even figure out what is wrong with my config. Speaking of which, Claude configuration is so confusing: there's a project-level .claude/ dir with settings.json plus a settings.local.json file, then a global ~/.claude/ dir with the same configuration files. None of them define anything for adaptive thinking or thinking type. None of these strings exist on my machine. Running latest version, 2.1.110
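Going by the error message alone, the request body apparently needs to change shape roughly like this (a hypothetical sketch; the field names come from the error text, not from verified docs):

```python
# Old request shape that 400s against the new model (per the error above)
old_body = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "..."}],
}

# Shape the error message suggests instead: adaptive thinking plus an
# explicit effort level ("xhigh" being the new level from the announcement)
new_body = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    "thinking": {"type": "adaptive"},
    "output_config": {"effort": "xhigh"},
    "messages": [{"role": "user", "content": "..."}],
}
```

If a local Claude Code version still emits the old `thinking.type.enabled` shape, no settings file edit would fix it; that would explain why the strings don't exist anywhere in the config.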

throwaway911282 38 minutes ago [-]
just started using codex. claude is just a marketing machine and benchmaxxing, and only if you pay a gazillion and show your ID can you use their dangerous model.
yanis_t 33 minutes ago [-]
> In Claude Code, we’ve raised the default effort level to xhigh for all plans.

Does it also mean running out of credits faster?

cube2222 39 minutes ago [-]
Seems like it's not in Claude Code natively yet, but you can do an explicit `/model claude-opus-4-7` and it works.
duckkg5 16 minutes ago [-]
/model claude-opus-4-7[1m]
mbeavitt 42 minutes ago [-]
Honestly I've been doing a lot of image-related work recently and the biggest thing here for me is the 3x higher resolution images that can be submitted. This is huge for anyone working with graphs, scientific photographs, etc. The accuracy on a simple automated photograph processing pipeline I recently implemented with Opus 4.6 was about 40%, which surprised me (simple OCR and recognition of basic features). It'll be interesting to see if 4.7 does much better.

I wonder if general purpose multimodal LLMs are beginning to eat the lunch of specific computer vision models - they are certainly easier to use.

oliver236 51 minutes ago [-]
someone tell me if i should be happy
nickmonad 46 minutes ago [-]
Did you try asking the model?
ACCount37 31 minutes ago [-]
> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Fucking hell.

Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns.

It would, however, shit a brick and block requests every time something remotely medical/biological showed up.

If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.

methodical 18 minutes ago [-]
To be fair, delineating between benevolent and malevolent pen-testing and cybersecurity purposes is practically impossible since the only difference is the user's intentions. I am entirely unsurprised (and would expect) that as models improve, the extent to which widely available models are prohibited from cybersecurity purposes will only increase.

Not to say I see this as the right approach, in theory the two forces would balance each other out as both white hats and black hats would have access to the same technology, but I can understand the hesitancy from Anthropic and others.

ACCount37 6 minutes ago [-]
Yes, and the previous approach Anthropic took was "allow anything that looks remotely benign". The only thing that would get a refusal would be a downright "write an exploit for me". Which is why I favored Anthropic's models.

It remains to be seen whether Anthropic's models are still usable now.

I know just how much of a clusterfuck their "CBRN filter" is, so I'm dreading the worst.

Havoc 27 minutes ago [-]
Claude code had safeguards like that hardcoded into the software. You could see it if you intercept the prompts with a proxy
zb3 28 minutes ago [-]
It appears we're learning the hard way that we can't rely on capabilities of models that aren't open weights. These can be taken from us at any time, so expect it to get much worse..
dhruv3006 26 minutes ago [-]
its a pretty good coding model - using it in cursor now.
perdomon 21 minutes ago [-]
It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?
lta 16 minutes ago [-]
Every night praying for tomorrow
u_sama 53 minutes ago [-]
Excited to use 1 prompt and have my whole 5-hour window at 100%. They can keep releasing new ones, but if they don't solve their whole token shrinkage and gaslighting, it is not gonna be interesting to see.
fetus8 2 minutes ago [-]
on Tuesday, with 4.6, I waited for my 5 hour window to reset, asked it to resume, and it burned up all my tokens for the next 5 hour window and ran for less than 10 seconds. I’ve never cancelled a subscription so fast.
lbreakjai 44 minutes ago [-]
Solve? You solve a problem, not something you introduced on purpose.
johntopia 46 minutes ago [-]
is this just mythos flex?
rvz 49 minutes ago [-]
Introducing a new upgraded slot machine named "Claude Opus" in the Anthropic casino.

You are in for a treat this time: It is the same price as the last one [0] (if you are using the API.)

But it is slightly less capable than the other slot machine named 'Mythos' the one which everyone wants to play around with. [1]

[0] https://claude.com/pricing#api

[1] https://www.anthropic.com/news/claude-opus-4-7

dbbk 45 minutes ago [-]
If you're building a standard app Opus is already good enough to build anything you want. I don't even know what you'd really need Mythos for.
zeroonetwothree 14 minutes ago [-]
This is true if you know what you are doing and provide proper guidance. It’s not true if you just want to vibe the whole app.
fny 42 minutes ago [-]
You'd be surprised. With React, Claude can get twisted in knots mostly because React lends itself to a pile of spaghetti code.
rurban 40 minutes ago [-]
You'd need Mythos to free your iPhone, SamsungTV, SmartWatches or such. Maybe even printer drivers.
dirasieb 16 minutes ago [-]
i sincerely doubt mythos is capable of jailbreaking an iphone
recursivegirth 39 minutes ago [-]
Consumerism... if it ain't the best, some people don't want it.
Barbing 35 minutes ago [-]
Time/frustration

If it’s all slop, the smallest waste of time comes from the best thing on the market

poszlem 38 minutes ago [-]
Also 640 KB ram ought to be enough for everybody.
msp26 32 minutes ago [-]
> First, Opus 4.7 uses an updated tokenizer that improves how the model processes text

wow can I see it and run it locally please? Making API calls just to check token counts is ridiculous.

zb3 32 minutes ago [-]
> during its training we experimented with efforts to differentially reduce these capabilities

> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Ah f... you!

mrcwinn 40 minutes ago [-]
Excited to start using this!
acedTrex 39 minutes ago [-]
Sigh here we go again, model release day is always the worst day of the quarter for me. I always get a lovely anxiety attack and have to avoid all parts of the internet for a few days :/
stantonius 26 minutes ago [-]
I feel this way too. Wish I could fully understand the 'why'. I know all of the usual arguments, but nothing seems to fully capture it for me - maybe it's all of them, maybe it's simply the pace of change and having to adapt quicker than we're comfortable with. Anyway best of luck from someone who understands this sentiment.
RivieraKid 14 minutes ago [-]
Really? I think it's pretty straightforward, at least for me - fear of AI replacing my profession and also fear that it will become harder to succeed with a side project.
stantonius 57 seconds ago [-]
Yeah I can understand that, and sure this is part of it, just not all of it. There are also broader societal issues (e.g. inequality), personal questions around meaning and purpose, and a sprinkling of the existential (but not much). I suspect anyone surveyed would have a different formula for what causes this unease - I struggle to define it (yet think about it constantly), hence my comment above.

Ultimately when I think deeper, none of this would worry me if these changes occurred over 20 years - societies and cultures change and are constantly in flux, and that includes jobs and what people value. It's the rate of change and inability to adapt quick enough which overwhelms me.

acedTrex 11 minutes ago [-]
> fear of AI replacing my profession

See i don't have any of this fear, I have 0 concerns that LLMs will replace software engineering because the bulk of the work we do (not code) is not at risk.

My worries are almost purely personal.

acedTrex 19 minutes ago [-]
Thank you thank you, misery loves company lol! I haven't fully pinned down what the exact cause is as well, an ongoing journey.
__natty__ 30 minutes ago [-]
New model - that explains why for the past week/two weeks I had this feeling of 4.6 being much less "intelligent". I hope this is only some kind of paranoia and we (and investors) are not being played by the big corp. /s
RivieraKid 21 minutes ago [-]
I don't get it. Why would they make the previous model worse before releasing an update?
alvis 47 minutes ago [-]
TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...
