GPT-5.4 Thinking System Card (openai.com)
minimaxir 1 hour ago [-]
More discussion here on the blog post announcement, which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005
egonschiele 1 hour ago [-]
The actual card is here: https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... (the link currently goes to the announcement).
Rapzid 1 hour ago [-]
I must have been sleeping when "sheet," "brief," "primer," etc. became known as "cards".

I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card," was the result of vibe slop.

realityfactchex 32 minutes ago [-]
Card is slightly odd naming indeed.

Criticisms aside (sigh), according to Wikipedia, the term was introduced in a paper [0] by a mostly-Google group of authors, submitted in 2018. To quote:

"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""

So that's where they were coming from, I guess.

[0] Margaret Mitchell et al., "Model Cards for Model Reporting," 2018, https://arxiv.org/abs/1810.03993

draw_down 38 minutes ago [-]
[dead]
twtw99 1 hour ago [-]
If you don't want to click in, here's an easy comparison with the other two frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20
Aboutplants 1 hour ago [-]
It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.
observationist 58 minutes ago [-]
Benchmarks don't capture a lot - relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things Grok is better at than ChatGPT even where the benchmarks say the opposite, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.

baq 49 minutes ago [-]
Gemini 3.1 slaps all other models at subtle concurrency bugs and SQL and JS security hardening when reviewing. (Obviously haven't tested GPT-5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.

observationist 27 minutes ago [-]
I have a few standard problems I throw at AI to see if it can solve them cleanly, like visualizing a neural network, then sorting the neurons in each layer by synaptic weights, largest to smallest, while correctly reordering the connections to previous and subsequent neurons so that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest and the prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding Codex a few weeks back and got it done, but conceptually it seems like it should be a one-shot problem.
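
The core trick is small once you see it; here's a rough NumPy sketch of my reading of the problem (layer sizes, the ReLU, and sorting by outgoing weight mass are my own assumptions, not a spec):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [4, 8, 6, 3]  # input, two hidden layers, output
    Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
    bs = [rng.normal(size=sizes[i + 1]) for i in range(len(sizes) - 1)]

    def forward(x):
        for W, b in zip(Ws[:-1], bs[:-1]):
            x = np.maximum(0, W @ x + b)  # ReLU hidden layers
        return Ws[-1] @ x + bs[-1]

    x = rng.normal(size=sizes[0])
    before = forward(x)

    # Reordering a hidden layer's units preserves the function as long as the
    # permutation hits that layer's rows/bias AND the next layer's columns.
    for l in range(len(Ws) - 1):  # hidden layers only
        key = np.abs(Ws[l + 1]).sum(axis=0)  # outgoing weight mass per unit
        perm = np.argsort(-key)              # largest to smallest
        Ws[l], bs[l] = Ws[l][perm], bs[l][perm]
        Ws[l + 1] = Ws[l + 1][:, perm]

    assert np.allclose(before, forward(x))  # network function unchanged

The whole thing hinges on applying each permutation to rows and columns in matched pairs, which is the part that makes it function-preserving.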
adonese 31 minutes ago [-]
Which subscription do you have to use it? Via Google AI Pro and the Gemini CLI I always get timeouts due to the model being under heavy usage. The chat interface is there and I do have 3.1 Pro as well, but I'm wondering if chat is the only way of accessing it.
bigyabai 51 minutes ago [-]
> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

observationist 41 minutes ago [-]
If you look at the difference in quality between GPT-2 and GPT-3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive; it's just that they're both similarly capable and competent. I don't think it's an S-curve; we're not plateauing. Million-token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human-level generality, but at the very least it will help us discover what the next missing piece is for AGI.
kseniamorph 14 minutes ago [-]
Makes sense, but I'd separate two things: models converging in ability vs. hitting a fundamental ceiling. What we're probably seeing is the current training recipe plateauing (bigger model, more tokens, same optimizer); that would explain the convergence. But that's not necessarily the architecture being maxed out. It would be interesting to see what happens when genuinely new approaches get to frontier scale.
thewebguyd 59 minutes ago [-]
Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.
gregpred 58 minutes ago [-]
Memory (model usage over time) is the moat.
energy123 56 minutes ago [-]
Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.
druskacik 55 minutes ago [-]
That has been true for some time now, definitely since the Claude 3 release two years ago.
chabes 1 hour ago [-]
Definitely don't want to click through on X either.
observationist 1 hour ago [-]
[flagged]
anonym00se1 1 hour ago [-]
Ditto, but I did anyway, and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.
dom96 37 minutes ago [-]
Why do none of the benchmarks test for hallucinations?
MarcFrame 24 minutes ago [-]
how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?
nico1207 22 minutes ago [-]
Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?
nimchimpsky 21 minutes ago [-]
[dead]
swingboy 1 hour ago [-]
Why do so many people in the comments want 4o so bad?
cheema33 21 minutes ago [-]
> Why do so many people in the comments want 4o so bad?

You can ask 4o to tell you "I love you" and it will comply. Some people really, really want/need that. Later models don't go along with those requests and instead ask you to focus on human connections.

astrange 1 hour ago [-]
They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.

baq 48 minutes ago [-]
Somebody on Twitter used Claude Code to connect… toys… as MCPs to Claude chat.

We’ve seen nothing yet.

mikkupikku 40 minutes ago [-]
My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.
manmal 32 minutes ago [-]
ding-dong-cli is needed
Herring 24 minutes ago [-]
what.. :o
embedding-shape 1 hour ago [-]
Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs seem to have preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.
drittich 27 minutes ago [-]
I think it's time for an https://hotornot.com for AI models.
MattGaiser 58 minutes ago [-]
The writing with the 5 models feels a lot less human. It is a vibe, but a common one.
karmasimida 1 hour ago [-]
It is a bigger model, confirmed
yanis_t 1 hour ago [-]
These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.
ipsum2 55 minutes ago [-]
The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!
satvikpendem 11 minutes ago [-]
It's more hedonic adaptation; people just aren't as impressed by incremental changes anymore, compared to big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and that's because for most people, most models are good enough and now it's all about applications.

https://news.ycombinator.com/item?id=47232453#47232735

mirekrusin 2 minutes ago [-]
Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!
earth2mars 30 minutes ago [-]
I am actually super impressed with Codex 5.3 extra-high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles trying to get things resolved). I've mostly stopped using Claude and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.
satvikpendem 13 minutes ago [-]
Same, and it also helps that it's way cheaper than Opus in VS Code Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).
cj 51 minutes ago [-]
One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

hex4def6 33 minutes ago [-]
If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy 47 minutes ago [-]
Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
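
"Calibrated to human responses" presumably cashes out as an agreement statistic on a shared set of labeled responses; a toy sketch (labels invented, the sklearn call is real):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical severity ratings of the same responses: 0=none, 1=mild, 2=harmful
    human = [0, 1, 0, 2, 1, 0, 2, 2, 1, 0]
    judge = [0, 1, 0, 2, 1, 1, 2, 2, 1, 0]  # the grader model's labels

    print(cohen_kappa_score(human, judge))  # agreement beyond chance, ~0.85

An older judge is attractive precisely because this kind of number already exists for it; you'd have to redo the whole human study to certify 5.4 as a judge.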
utopiah 48 minutes ago [-]
Benchmarks?

I don't use OpenAI or even LLMs much (despite having tried a lot of models: https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...), but I imagine if I did, I would keep failed prompts (can be as basic as a "last prompt failed" flag, then export). Then whenever a new model comes around I'd throw 5 random ones of MY fails at it (not benchmarks from others; those will come anyway) and see in minutes if it's better, the same, or worse for MY use cases.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to check for regressions.

It really doesn't seem complicated, nor time-consuming, to form a realistic opinion.
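
Something as dumb as this would do (the file name and model string are placeholders; any OpenAI-compatible endpoint works):

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    # failed_prompts.jsonl: one {"prompt": ..., "note": ...} object per line
    with open("failed_prompts.jsonl") as f:
        cases = [json.loads(line) for line in f]

    for case in cases:
        resp = client.chat.completions.create(
            model="gpt-5.4",  # whatever just shipped
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        print("PROMPT:", case["prompt"][:80])
        print("OLD FAILURE:", case.get("note", "n/a"))
        print("NEW ANSWER:", resp.choices[0].message.content[:300], "\n")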

softwaredoug 50 minutes ago [-]
The products are the harnesses, and IMO that's where the innovation happens. We've gotten better at getting good, verifiable work out of dumb LLMs.
iterateoften 43 minutes ago [-]
The product is putting the skills/harness behind the API, instead of the agent living locally on your computer, and iterating on that between model updates. Close off the garden.

Not that I want it, just where I imagine it going.

varispeed 13 minutes ago [-]
The scores keep increasing, yet as new versions are released they feel more and more dumbed down.
jascha_eng 26 minutes ago [-]
When did they stop putting competitor models in the comparison table, btw? And yeah, the benchmark improvements are meh. Context window size and the lack of real memory are still issues.
wahnfrieden 55 minutes ago [-]
5.3 Codex was a huge leap over 5.2 for agentic work in practice. Have you been using both of those, or paying attention more to benchmark news and the ChatGPT experience?
metalliqaz 33 minutes ago [-]
They need something that POPS:

    The new GPT -- SkyNet for _real_
esafak 1 hour ago [-]
That's for you to build; they provide the brains.
simlevesque 58 minutes ago [-]
Nah, the second you finish your build they release their version and then it's game over.
acedTrex 57 minutes ago [-]
Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both
Chance-Device 1 hour ago [-]
I’m sure the military and security services will enjoy it.
varispeed 1 hour ago [-]
prompt> Hi we want to build a missile, here is the picture of what we have in the yard.
jcmontx 1 hour ago [-]
5.4 vs 5.3-Codex? Which one is better for coding?
embedding-shape 52 minutes ago [-]
It was literally just released; I don't think anyone knows yet. Don't listen to people's confident takes until a week or two from now, when people have actually been able to try it; otherwise you'll just get sucked into the bears'/bulls' misdirected "I'm first with an opinion."
Someone1234 28 minutes ago [-]
Related question:

- Do they have the same context usage/cost particularly in a plan?

They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."

vtail 56 minutes ago [-]
Looking at the benchmarks, 5.4 is slightly better. But it also offers a "Fast" mode (at 2x usage), which - if it works and doesn't completely deplete my Pro plan - is a no-brainer at the same or even slightly worse quality for more interactive development.
awestroke 32 minutes ago [-]
Opus 4.6
jcmontx 19 minutes ago [-]
Codex surpassed Claude in usefulness _for me_ since last month
esafak 56 minutes ago [-]
For the price, it seems the latter. I'd use 5.4 to plan.
nickandbro 53 minutes ago [-]
Beat Simon Willison ;)

https://www.svgviewer.dev/s/gAa69yQd

Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better at coding or Excel, given those are part of its measured benchmarks.

GaggiX 52 minutes ago [-]
This pelican is actually bad, did you use xhigh?
nickandbro 51 minutes ago [-]
Yep, just double-checked: I used gpt-5.4 xhigh. Though I had to select it in Codex, as I don't have access to it in the ChatGPT app or web version yet. It's possible that whatever code harness Codex uses messed with it.
rbitar 43 minutes ago [-]
I think the most exciting change announced here is the use of tool search to dynamically load tools as needed: https://developers.openai.com/api/docs/guides/tools-tool-sea...
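
Haven't touched the new API surface yet, but the underlying pattern is easy to sketch client-side: keep a big tool registry, pick a relevant subset per request, and only send that subset (the tool names and scoring heuristic below are made up for illustration):

    from openai import OpenAI

    client = OpenAI()

    # Imagine hundreds of these; sending them all wastes context.
    TOOL_REGISTRY = [
        {"type": "function", "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}},
        {"type": "function", "function": {
            "name": "get_stock_price",
            "description": "Get the latest price for a stock ticker",
            "parameters": {"type": "object",
                           "properties": {"ticker": {"type": "string"}},
                           "required": ["ticker"]}}},
    ]

    def select_tools(query, registry, k=8):
        # Crude relevance: word overlap between the query and each tool's
        # name + description. A real system would use embeddings.
        words = set(query.lower().split())
        def score(t):
            text = (t["function"]["name"] + " " + t["function"]["description"]).lower()
            return len(words & set(text.split()))
        return sorted(registry, key=score, reverse=True)[:k]

    question = "current weather in Lisbon?"
    resp = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": question}],
        tools=select_tools(question, TOOL_REGISTRY),
    )

The hosted version presumably moves that selection server-side, so the model itself can search the registry mid-conversation.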
ZeroCool2u 1 hour ago [-]
Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

oersted 1 hour ago [-]
I believe you are looking at GPT-5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming, and such. But they've had Pro versions of the GPT-5 models (and I believe of o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 per 1M tokens vs $2.5/$15 for regular 5.4), but the performance improvement is marginal.

Not sure what it is exactly; I assume it's probably the non-quantized version of the model, or something like that.

logicchains 1 minute ago [-]
>It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 per 1M tokens vs $2.5/$15 for regular 5.4), but the performance improvement is marginal.

The performance improvement isn't marginal if you're doing something particularly novel/difficult.

nsingh2 45 minutes ago [-]
From what I've read online, it's not necessarily an unquantized version; it seems to go through longer reasoning traces and to run multiple reasoning traces at once. Probably overkill for most tasks.
ZeroCool2u 58 minutes ago [-]
Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.
highfrequency 1 hour ago [-]
Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).
ZeroCool2u 1 hour ago [-]
FrontierMath, GPQA Diamond, and BrowseComp are the benchmarks I noticed this on.
csnweb 1 hour ago [-]
Are you maybe comparing the Pro model to the non-Pro model with thinking? Granted, it's a bit confusing, but the Pro model is 10 times more expensive and probably much larger as well.
ZeroCool2u 59 minutes ago [-]
Ah yes, okay that makes more sense!
andoando 53 minutes ago [-]
The thinking models are additionally trained with reinforcement learning to produce chain-of-thought reasoning.
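
In the API this usually shows up as a knob rather than a separate checkpoint; assuming 5.4 keeps the interface of OpenAI's earlier reasoning models, something like:

    from openai import OpenAI

    client = OpenAI()

    # reasoning_effort exists for earlier OpenAI reasoning models; whether 5.4
    # exposes the same knob and values is an assumption on my part.
    resp = client.chat.completions.create(
        model="gpt-5.4",
        reasoning_effort="high",  # spend more chain-of-thought before answering
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )
    print(resp.choices[0].message.content)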
aplomb1026 1 hour ago [-]
[dead]
dandiep 49 minutes ago [-]
Anyone know why OpenAI hasn't released a new model for fine-tuning since 4.1? Next month it will be a year since their last fine-tunable model update.
zzleeper 46 minutes ago [-]
For me the issue is why there hasn't been a new mini since 5-mini in August.

I have now switched web-related and data-related queries to Gemini and coding to Claude, and will probably try Qwen for less critical data queries. So where does OpenAI fit now?

qoez 47 minutes ago [-]
I think they just did that because of the energy around open-source models. Their heart probably wasn't in it, and the number of people fine-tuning at those prices was probably too low to keep putting attention there.
cj 1 hour ago [-]
I use ChatGPT primarily for health-related prompts: looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.

Interestingly, the "Health" category seems to report worse performance compared to 5.2.

paxys 1 hour ago [-]
Models are being neutered for questions related to law, health, etc. for liability reasons.
cj 56 minutes ago [-]
I'm sometimes surprised how much detail ChatGPT will go into without giving any disclaimers.

I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.

I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.

Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe them, and it spit out blood test results that it found in the project context from an earlier date, not the ones attached to the prompt. That was weird.

bargainbin 32 minutes ago [-]
Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs.

I copied and pasted it into ChatGPT, which told me straight away, and then for a laugh I said it was actually a magical weight-loss drug that I'd bought off the dark web... and it started giving me advice about unregulated weight-loss drugs and how to dose them.

staticman2 14 minutes ago [-]
If you had created a project with custom instructions and/or a custom style, I think you could have gotten Claude to respond the way you wanted just fine.
tiahura 47 minutes ago [-]
Are you sure about that? Plenty of lawyers who use them every day aren't noticing.
partiallypro 29 minutes ago [-]
I've done the same, and when I tested the same prompts with Claude and Google, they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall down on this. Claude and Google are dangerously unusable on the subject of health, in my experience.
OsrsNeedsf2P 34 minutes ago [-]
Does anyone know what website the "Isometric Park Builder" shown off here is?
nthypes 1 hour ago [-]
$30/M input and $180/M output tokens is nuts. Ridiculously expensive for not that great a bump in intelligence compared to other models.
stri8ted 59 minutes ago [-]
Price:
Input: $2.50 / 1M tokens
Cached input: $0.25 / 1M tokens
Output: $15.00 / 1M tokens

https://openai.com/api/pricing/
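
At those rates, even a long request stays under a dollar; e.g. (rates from above, token counts invented):

    # 200k input tokens (half of them cache hits) plus 40k output tokens
    in_rate, cached_rate, out_rate = 2.50, 0.25, 15.00  # $ per 1M tokens
    cost = (100_000 * in_rate + 100_000 * cached_rate + 40_000 * out_rate) / 1e6
    print(f"${cost:.3f}")  # $0.875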

nthypes 1 hour ago [-]
Gemini 3.1 Pro

$2/M Input Tokens $15/M Output Tokens

Claude Opus 4.6

$5/M Input Tokens $25/M Output Tokens

nthypes 59 minutes ago [-]
Just to clarify, the pricing above is for GPT-5.4 Pro. For the standard model, here is the pricing:

$2.5/M Input Tokens $15/M Output Tokens

energy123 1 hour ago [-]
For Pro
joe_mamba 1 hour ago [-]
Better tokens per dollar could be useless for comparison if the model can't solve your problem.
rvz 1 hour ago [-]
You didn't realize they can increase / change prices for intelligence?

This should not be shocking.

nickthegreek 1 hour ago [-]
OP made no mention of not understanding the relation between cost and intelligence. In fact, they specifically call out the lack of value.
moralestapia 1 hour ago [-]
Don't use it?
vicchenai 45 minutes ago [-]
Honestly at this point I just want to know if it follows complex instructions better than 5.1. The benchmark numbers stopped meaning much to me a while ago - real usage always feels different.
iamronaldo 1 hour ago [-]
Notably, 75% on OSWorld, surpassing humans at 72%... (OSWorld measures how well models use operating systems.)
tmpz22 1 hour ago [-]
Does this improve Tomahawk Missile accuracy?
ch4s3 1 hour ago [-]
They're already accurate to within 5-10m at Mach 0.74 after traveling 2k+ km. It's 5m long, so that seems pretty accurate. How much more could you expect?
mikkupikku 36 minutes ago [-]
You could definitely do better than that with image recognition for terminal guidance. But I would assume those published accuracy numbers are very conservative anyway.
prydt 1 hour ago [-]
I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.
Imustaskforhelp 5 minutes ago [-]
I agree with ya. You aren't alone in this. For what it's worth, ChatGPT subscription cancellations have risen ~300% in the last month.

Also, Anthropic/Gemini/even Kimi models are pretty good, for what it's worth. I used to use ChatGPT, and I still sometimes accidentally open it, but I use Gemini/Claude nowadays and personally find them better anyway.

oytis 46 minutes ago [-]
Everyone is mindblown in 3...2...1
HardCodedBias 55 minutes ago [-]
We'll have to wait a day or two, maybe a week or two, to determine whether this is more capable at coding than 5.3, which seems to be the economically valuable capability at this time.

In terms of writing and research, even Gemini, with a good prompt, is close to usable. That's likely not a differentiator.

wahnfrieden 1 hour ago [-]
No Codex model yet
minimaxir 1 hour ago [-]
GPT-5.4 is the new Codex model.
nico1207 43 minutes ago [-]
GPT-5.3-Codex is superior to GPT-5.4 on Terminal-Bench with Codex, so not really.
wahnfrieden 48 minutes ago [-]
Finally
world2vec 1 hour ago [-]
Benchmarks barely improved, it seems.
iamleppert 29 minutes ago [-]
I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also, not including the parameters the models were run at (especially for the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks, plus logs.

Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.

chromic04850 1 hour ago [-]
[dead]
leftbehinds 56 minutes ago [-]
some sloppy improvements
beernet 1 hour ago [-]
Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that, which were absolutely avoidable.