▲Defeating Nondeterminism in LLM Inference (thinkingmachines.ai)

252 points by jxmorris12 15 hours ago | 100 comments

dns_snek 1 hours ago [-]

Why do you care about determinism in a probabilistic system? What difference does it make to the end user if the input "How do I X?" always produces the same deterministic output when semantically equivalent inputs "how do i x?", "how do I x", and "how do I X??" are bound to produce different answers that often won't even be semantically equivalent.

What LLMs need is the ability to guarantee semantically-equivalent outputs for all semantically-equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.

Ratelman 53 minutes ago [-]

That was my thinking exactly, but semantic equivalence is also only relevant when the output needs to be factual, not necessarily for ALL outputs (if we're aiming for LLMs to present as "human", or for interactions with LLMs to be natural and conversational...). This excludes the world where LLMs act as agents, where you would of course always like the LLM to be factual and thus deterministic.

Taek 53 minutes ago [-]

You aren't wrong, but that doesn't mean this level of determinism isn't useful. If you don't even have the level of determinism that the exact same input tokens produce the exact same output tokens, then it's very hard to share reproducible results with peers, which can be useful if you are say, red teaming an LLM to produce a very rare / unreliable output.

lsy 14 hours ago [-]

Fixing "theoretical" nondeterminism for a totally closed individual input-output pair doesn't solve the two "practical" nondeterminism problems, where the exact same input gives different results given different preceding context, and where a slightly transformed input doesn't give a correctly transformed result.

Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.

kazinator 12 hours ago [-]

There is no such thing as "exactly the same input, but with different preceding context". The preceding context is input!

If you were to obtain exactly the same output for a given input prompt, regardless of context, then that would mean that the context is being ignored, which is indistinguishable from the session not maintaining any context such that each prompt is in a brand new empty context.

Now what some people want is requirements like:

- The different wording of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital" the answer should be verbatim identical.

- Prior context should not change responses in ways that don't have any interaction with the context. For instance, a prompt is given "what is 2 + 2", then the answer should always be the same, except if the context instructs the LLM that 2 + 2 is to be five.

These kinds of requirements betray a misunderstanding of what these LLMs are.

Zacharias030 3 hours ago [-]

While I get that this is how LLMs work, I think you should think backwards from the user / from what AI as a field is aiming for and recognize that the „naive“ way of the parent to ask for reliable responses no matter what the „context“ is, is exactly what a good AI system should offer.

„The context is the input“ betrays a misunderstanding of what (artificial) intelligence systems are aiming for.

Helmut10001 3 hours ago [-]

Then we need something else. This is not how LLMs work. They are simple statistical predictors, not universal answering machines.

Zacharias030 2 hours ago [-]

I agree mostly. They are all that you say, but if you think about the conditional distribution that you are learning, there is nothing preventing us in principle from mapping different contexts to the same responses. It is rather a practical limitation that we don’t have sufficient tools of shaping these distributions very soundly. All we can do is throw data at them and hope that they generalize to similar contexts.

We have observed situations where agentic LLM traces on verifiable problems with deterministic (greedy) decoding lead to either completely correct or completely wrong solutions depending on the minutes on the clock which are printed as coincidental output of some tool that the LLM used.

I think there may be some mild fixes to current models available. For example, it is worrying that the attention mechanism can never fully disregard any token in the input, because the softmax will always assign a > 0 weight everywhere (and the NN has no way of setting a logit to -infinity). This directly makes it extremely difficult for the LLM to reliably ignore any part of the context.
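
To make the softmax point concrete, here is a minimal sketch in plain Python. The logit values and the `softmax` helper are illustrative assumptions, not taken from any real model. Mathematically every weight is strictly positive; in finite precision a sufficiently extreme gap between logits can still underflow to exactly zero, so the example uses a moderate gap.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The third token has a very low score, yet it still receives a small
# positive attention weight: finite logits never produce an exact zero
# in the mathematical definition of softmax.
weights = softmax([8.0, 2.0, -20.0])
smallest = min(weights)
```

Here `smallest` is tiny but greater than zero, which is the sense in which the model cannot fully disregard a token.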

However Yann LeCun actually offers some persuasive arguments that autoregressive decoding has some limitations and we may need something better.

skybrian 10 hours ago [-]

I wonder if there's a way to use an LLM to rewrite the prompt, standardizing the wording when two prompts mean the same thing?

kazinator 9 hours ago [-]

It's going to backfire. In real scenarios (not regression testing) users don't want to see the exact same thing twice out of the LLM in the same session in spite of trying to refine the result with more context.

There are going to be false positives: text that is subtly different from a previous response is misidentified as a duplicate such that the previous response is substituted for it, frustrating the user.

skybrian 9 hours ago [-]

Google search rewrites misspelled search queries and also lets you override it if that's not what you want. Maybe something similar would work?

d4mi3n 10 hours ago [-]

Not an expert, but I've been told RAG in combination with a database of facts is one way to get more consistency here. Using one of the previous examples, you might have a knowledge store (usually a vector database of some kind) that contains a mapping of countries to capitals and the LLM would query it whenever it had to come up with an answer rather than relying on whatever was baked into the base model.

taneq 9 hours ago [-]

Deterministically, you mean? ;)

tonyhart7 3 hours ago [-]

oh so you want it to be thinking???? now we talking

kjkjadksj 6 hours ago [-]

If an input along with the context generated some random seed or hash this would certainly be possible. Just paste your seed over to your coworker, they supply it to the model and it contains all contextual information.

raincole 8 hours ago [-]

> where the exact same input gives different results given different preceding context

Why and how is this a problem?

If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why do I want that? It's not how I expect a tool to work (I expect vim responds differently to my input after I switch to the insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.

edflsafoiewq 2 hours ago [-]

An example is translation. I MTLed some text recently where the name of a (fictional) city was translated about a dozen different ways. Sometimes you'd get a calque, sometimes you'd get a transliteration (including several wrong ones). Ironically "dumb" MTLs are often much more consistent about this than LLMs.

saagarjha 14 hours ago [-]

This is really useful in reproducing bugs.

brookst 11 hours ago [-]

I was with you until you said it “doesn’t really help”. Did you mean “doesn’t completely solve the problem”?

jll29 14 hours ago [-]

Sometimes, the reason for non-determinism is implementation-specific. For instance, in GPT-2's source code (I haven't checked other model versions), setting the temperature to 0 in the GUI does not actually yield a value of 0 but "epsilon" (a very small value larger than 0), to avoid a division-by-zero error in the code, which makes sense.

For many applications, non-determinism implies "useless". This has been a long standing issue with LDA topic models. In particular in the legal, financial and regulatory domains, if a method is not deterministic, it may be illegal to use it or it may lead to follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved to be able to go back and reconstruct what exactly happened to a particular user in a particular second).

daralthus 11 hours ago [-]

I thought this was pretty well known (at least in the JAX/XLA world). I've hit this many times and got batch variance explained to me before: https://github.com/google-deepmind/penzai/issues/82 and https://github.com/jax-ml/jax/issues/20047#issuecomment-1975...

Zacharias030 1 hours ago [-]

should be the top comment.

nakamoto_damacy 10 hours ago [-]

"in collaboration with others at Thinking Machines"

If you're old enough, you might remember Danny Hillis' Thinking Machines from the late 80s. I wish they had chosen a different name (I say this for nostalgic reasons, having been in front of one of those cubes glowing with red LEDs back in the late 80s at MIT's AI Lab, renamed to CSAIL at some point). Feynman did some amazing work on that, too: https://longnow.org/ideas/richard-feynman-and-the-connection...

In the U.S., the “THINKING MACHINES” trademarks were owned by Thinking Machines Corporation (the company Hillis co-founded), not Hillis personally, and those registrations were cancelled in 1998–1999.

The company itself went bankrupt in 1994 and its assets were dispersed (e.g., to Sun Microsystems, later Oracle).

There’s a new, pending USPTO application for “THINKING MACHINES” filed in 2025 by Thinking Machines Lab Inc., the company founded by Mira Murati.

Imnimo 8 hours ago [-]

I make this mistake every time I see their name.

orbital-decay 1 hours ago [-]

By setting the temperature to 0 you get greedy decoding, which does a lot more than just making it predictable, and can degrade outputs. Random sampling exists for a reason! Gemini 2.5 Pro in particular doesn't like temp 0, for example.

Focus on correctness, not determinism.

empiko 1 hours ago [-]

Determinism does not require temperature=0. You can have a deterministic behavior even with >0 temperature as long as you fix your random seeds.
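
A minimal sketch of seeded sampling at temperature > 0, using only the Python standard library. The logits, the `sample_token` and `generate` helpers, and the seed value are illustrative assumptions, not any serving stack's API. It also assumes the logits themselves are bitwise identical between runs, which is the harder part the article deals with.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(seed, steps=20, temperature=0.8):
    """Draw a sequence of token indices from a fixed toy distribution."""
    rng = random.Random(seed)          # private, explicitly seeded PRNG
    logits = [2.0, 1.0, 0.5, 0.1]      # hypothetical fixed logits
    return [sample_token(logits, temperature, rng) for _ in range(steps)]

first_run = generate(seed=1234)
second_run = generate(seed=1234)
```

Both runs yield the same token sequence even though sampling is random, because the only source of randomness is the seeded generator.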

riazrizvi 12 hours ago [-]

Natural language is ambiguous. It needs to be. I think the approach here of trying to figure out how to make circles into squares, and argue why circles should be squares, is misguided.

Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.

atoav 11 hours ago [-]

Yes, but determinism != ambiguity, because determinism means: for this exact input the same exact output needs to follow.

If I ask the same model the same question I should be able to deterministically get the same answer.

Now if we phrase the same question slightly differently we would expect to get a slightly different answer.

Jensson 9 hours ago [-]

> Now if we phrase the same question slightly differently we would expect to get a slightly different answer.

You wouldn't get this from an LLM though. A tiny change in the starting point produces a massive change in output; it's a chaotic system.

ares623 4 hours ago [-]

Maybe predictability is what is meant?

riazrizvi 10 hours ago [-]

Me: What’s an example of a dice roll?

LLM: 1

“Language ambiguity with determinism”? Sure I can juxtapose the terms but if it’s semantically inconsistent, then what we mean by that is not a deterministic, definitive thing. You’re chasing your tail on this ‘goal’.

Nevermark 10 hours ago [-]

Ambiguity: The request/prompt leaves a lot of room for interpretation. Many qualitatively different answers may be correct, relative to the prompt. Different or non-deterministic models will return high-variance results.

Determinism: If a model is given the exact same request/prompt twice, its two responses will also be identical. Whether or not the consistent response qualifies as correct.

The two concepts are very different.

(Ambiguous vs. precise prompt) x (Deterministic vs. Non-deterministic model) = 4 different scenarios.

A model itself can be non-deterministic without being ambiguous. If you know exactly how it functions, why it is non-deterministic (batch sensitive for instance), that is not an ambiguous model. Its operation is completely characterized. But it is non-deterministic.

An ambiguous model would simply be a model whose operation was not characterized. A black box model for instance. A black box model can be deterministic and yet ambiguous.

atoav 40 minutes ago [-]

Maybe I got this wrong, but I thought ambiguity referred to the input. So in a deterministic system I would assume that an input of "Give an example of a dice roll" will always output the exact same example (unless the model also gets the context of the message history).

Ambiguity is what happens when you change the prompt slightly, e.g. by adding a word: "Give an example of a single dice roll". Now as a human our expectation would be that this is the same question and should thus (in a deterministic system) receive the same answer. But to an LLM it may not be.

skybrian 9 hours ago [-]

If you really want that to work while being reproducible, maybe give it a random number tool and set the seed?

raincole 8 hours ago [-]

> LLM: 1

A perfectly acceptable answer.

If it answers 1 every time it's still a perfectly acceptable answer.

riazrizvi 3 hours ago [-]

So is ‘2’ or ‘3’ or ‘19’ or ‘99’ or ‘a jam sponge cake with gaming dice for frosting’… The point is that in natural language there are many perfectly acceptable answers. Usually any particular answer is arbitrary, and for a majority of use cases it would probably be undesirable to have the same answer every time.

gajjanag 3 hours ago [-]

As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.

What I would have loved is a discussion around collectives/multi-node setups. And showing how to get determinism at low performance penalty for multi-node reduction collectives.

jasonjmcghee 13 hours ago [-]

I love high quality blog post style research discussion - Anthropic has been leading the charge with this recently and it's great to see it spreading. OpenAI was also doing this during all the RL research days.

PeterStuer 54 minutes ago [-]

THANK YOU! Great work and writeup. Hope it finally silences the "concurrency + floating point" crowd and the "LLMs can never be deterministic" zealots.

gond 11 hours ago [-]

I am still irritated by the name of the company.

What is the reasoning behind these schemes? The hope that bits of the properties of legendary companies will rub off onto the new venture?

As if naming the next best venture PARC will inevitably create a breakthrough in networking just by the arrangement of four letters.

ricardobeat 11 hours ago [-]

Are you talking about the “Thinking Machines” company that shut down in 1994? Took me some digging to figure it out, doesn’t seem well-known enough to be the reason - it’s just a nice (and relatively obvious) name.

gond 11 hours ago [-]

Yes. Danny Hillis’ Thinking Machines Corporation, an AI company which created its own massive parallel processing supercomputer hardware.

“We are building a machine that will be proud of us” was their corporate motto. And that was in 1983.

One of those Machines is on view at the Computer History Museum in Mountain View. Back then, they could be ordered in “Darth Vader Black”, no kidding here. You can also see a couple of them (the CM-5) as the stereotypical supercomputer in the original Jurassic Park.

More here: https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...

kkylin 11 hours ago [-]

And in the original Jurassic Park! https://www.google.com/search?q=jurassic+park+cm-5

kkylin 10 hours ago [-]

[addendum: posted this too quickly & didn't see it in the comment above. duh.]

ewoodrich 10 hours ago [-]

It may not be a household name like Apple or Microsoft but its flagship product the Connection Machine is somewhat iconic in (super)computing history. The physical design of the machine is cool and unforgettable looking, plus recurring HN favorite Richard Feynman contributed to the original architecture.

random3 10 hours ago [-]

The thinking is free marketing and the same reason trademarks were invented

syntaxing 13 hours ago [-]

Super interesting. For those unaware, this is the company Mira Murati (OpenAI's former CTO) started.

mg 14 hours ago [-]

I really hope we will get deterministic LLMs in the future. Even if it causes slightly slower response times.

Nondeterminism is what currently keeps me from working with other developers.

As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".

Similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images. Because most image models will not create the same image when given the same prompt and parameters.

[1]: https://www.gibney.org/prompt_coding

p1necone 12 hours ago [-]

Surely if you end up relying on a given prompt to produce the exact same code every time you should instead just check that code into source control the first time you generate it?

A deterministic LLM isn't going to behave appreciably differently from a non deterministic one if your input or context varies by even a tiny bit (pun intended) each time.

skybrian 9 hours ago [-]

If nothing has changed, caching the result would certainly be cheaper. But if you're doing that as part of a test, it's not really running the test and it might defeat the purpose of the test.

xnx 60 minutes ago [-]

> most image models will not create the same image when given the same prompt and parameters.

Really? If you include the seed as one of the parameters most produce pixel identical output.

E.g. "Generate deterministic images" https://cloud.google.com/vertex-ai/generative-ai/docs/image/...

khimaros 13 hours ago [-]

i tried to create a makefile driven workflow based on this idea and ended up with https://github.com/khimaros/enc -- it suffers from the issues you raised

i'm hoping that it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).

kybernetikos 11 hours ago [-]

For fun over the last few days, I've built a compressor / decompressor that uses the logits from an LLM for each token in the input, then takes the ranks and exponential-Golomb encodes them. Then you work in reverse to regenerate the original.

It took me ages to get the prediction for the second token after "hello" to match the same as the prediction for the second token when running the model on the string "hello world", despite the fact that I was using a causal model. I tried all kinds of things before discovering that `quantized: false` was the important setting.

giveita 11 hours ago [-]

What's the Weissman score? Or, more seriously :) did it perform well? Sounds like it should. If more and more text is AI slop it should do well.

I don't fully understand what you said, but I guess higher-probability logits are encoded with fewer bits. If your text is the LLM output then you may need a bit or two per token?

kybernetikos 10 hours ago [-]

I used exponential golomb coding, so the rank 0 logit is encoded with a single bit, ranks 1 and 2 are encoded with three bits, ranks 3-6 are encoded with 5 bits, etc.
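
For readers unfamiliar with the scheme, here is a minimal sketch of order-0 exponential-Golomb coding that reproduces those bit lengths. The function names and the use of a `'0'`/`'1'` string instead of packed bits are illustrative choices, not the commenter's actual code.

```python
def exp_golomb_encode(n):
    """Encode a non-negative integer as an order-0 exp-Golomb bit string."""
    if n < 0:
        raise ValueError("exp-Golomb codes are defined for n >= 0")
    binary = bin(n + 1)[2:]                 # binary digits of n + 1
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then value

def exp_golomb_decode(bits):
    """Decode a concatenation of exp-Golomb codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":               # count the leading zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values

ranks = [0, 2, 5, 0, 1]                     # hypothetical token ranks
encoded = "".join(exp_golomb_encode(r) for r in ranks)
decoded = exp_golomb_decode(encoded)
```

Because the code is prefix-free, the codewords can be concatenated without separators and still decode unambiguously, which is why a stream of mostly rank-0 tokens costs roughly one bit per token.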

In terms of performance, I've not done any serious testing, but e.g. the wikipedia article on volcanos compresses to about 20% using GPT2. I've seen other strings compress even further.

The big issue is that while encoding is not unreasonable, decoding any significant amount of data is incredibly slow, since I'm doing a model run for every token in the output. It's bad enough that the scheme is probably unworkable as it is. I'm thinking about changing my code so that it streams out the tokens as it decodes them, so you're not just left there waiting for ages.

themeiguoren 5 hours ago [-]

A bit off topic from the technical discussion but does anyone recognize what blog layout or engine this is? I really like the layout with sidenotes and navigation.

ako 4 hours ago [-]

Seems like a Tufte-inspired style, something like this: https://clayh53.github.io/tufte-jekyll/articles/20/tufte-sty...

eldenring 13 hours ago [-]

Very impressive! I guess this still wouldn't affect their original example

> For example, you might observe that asking ChatGPT the same question multiple times provides different results.

even with 0.0 temperature due to MOE models routing at a batch level, and you're very unlikely to get a deterministic batch.

> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

The router also leaks batch-level information across sequences.

boroboro4 13 hours ago [-]

> even with 0.0 temperature due to MOE models routing at a batch level, and you're very unlikely to get a deterministic batch.

I don’t think this is correct - MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance out your experts' load in a batch, but that’s a performance optimization (just like everything in the blog post) and not the way models are trained to work.

eldenring 12 hours ago [-]

Ah interesting, good point. So I guess expert-choice routing leaks across the batch. Now I'm not sure.

quantum_state 11 hours ago [-]

At the bottom of LLM inference, it is sampling the next token based on the probability distribution conditioned on the tokens currently in the context window. If the distribution exhibits degeneracy in probability for more than one token, the outcome of the sampling will naturally, as it should, be nondeterministic. It should be left alone.

bee_rider 11 hours ago [-]

From their code:

    import torch

    # Repeat the same matmul and check it is bitwise identical to the first result.
    A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    ref = torch.mm(A, B)
    for _ in range(1000):
        assert (torch.mm(A, B) - ref).abs().max().item() == 0

I’m sort of surprised that Torch doesn’t have some kind of lazy evaluation thing to avoid computing anything here. I thought that was one of the nice things about all these fancy frameworks (if I wanted the computer to actually do silly things when I asked it to, I would use BLAS directly, right?).

nomel 11 hours ago [-]

Maybe I'm missing something, but in this case, wouldn't being lazy be pure overhead? I don't see anything that can be lazy here. The reference is computed once, nanoseconds before it's needed, and the test cases are computed at the time of comparison, then tossed away.

What would hope to be achieved by making this case lazy? If you wanted these to run in parallel, with a multi-gpu system, you would use the appropriate parallel interface.

bee_rider 10 hours ago [-]

I mean if you wait long enough, it is asking for

  .abs().max().item()

of something that can be identified as definitionally zero.

nomel 9 hours ago [-]

I don't understand. Since it's not using the parallel interface, only one operation can happen at a time. This would be, literally, sequential execution with extra overhead, in this case. Again, in this case, what would hope to be achieved from doing things lazily, since the lazy operations would immediately be followed by their evaluation?

The parallel interface, which is async, is probably what you're looking for.

bee_rider 8 hours ago [-]

The Python commands are encountered sequentially. One could imagine a library where the Python commands build the computation under the hood. Then, the library would be able to take advantage of situations like this one (or, more practically, reorder multiplications and/or avoid unnecessary temporaries).
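
A minimal sketch of that idea, assuming a hypothetical expression-building library. The classes `Leaf`, `MatMul`, `Sub`, `Zero` and the `simplify` function are invented for illustration and are not PyTorch APIs. Operations only record a graph, and a rewrite pass spots `x - x` structurally before anything is computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Leaf:
    name: str          # stands in for an input tensor

@dataclass(frozen=True)
class MatMul:
    left: object
    right: object

@dataclass(frozen=True)
class Sub:
    left: object
    right: object

@dataclass(frozen=True)
class Zero:
    pass               # symbolic "all zeros" result

def simplify(expr):
    """Rewrite x - x to Zero without evaluating either operand."""
    if isinstance(expr, Sub) and expr.left == expr.right:
        return Zero()
    return expr

A, B = Leaf("A"), Leaf("B")
ref = MatMul(A, B)
# Dataclass equality is structural, so two separately built MatMul(A, B)
# nodes compare equal and the subtraction folds away.
diff = simplify(Sub(MatMul(A, B), ref))
```

Note the catch for this particular snippet: the rewrite is only valid if the matmul is assumed to be deterministic, which is exactly the property the original loop is trying to test, so a lazy library that applied it would make the check pass vacuously.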

measurablefunc 15 hours ago [-]

I think this means that the results might also be non-deterministic across hardware revisions b/c I don't think they verified that the kernels will work the same on different GPU & TPU versions b/c how do they know that the compiler will not re-order the operations behind their back?

saagarjha 14 hours ago [-]

Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.

Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn’t to say it doesn’t happen, but when it does, it’s likely because the compiler itself changed: the code a given compiler generates is generally run-to-run identical.

AlotOfReading 14 hours ago [-]

You can prevent reordering with sufficient amounts of compiler abuse.

With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.

reliabilityguy 14 hours ago [-]

> will not re-order the operations behind their back?

Valid point. Floating point summation is not always associative.
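
A small worked example in plain Python (IEEE-754 doubles) of why reordering matters: swapping the two operands of a single addition gives the same result, but regrouping a three-term sum does not. The specific values are just the classic illustration, not anything from the article.

```python
a, b, c = 0.1, 0.2, 0.3

left_grouped = (a + b) + c     # 0.6000000000000001
right_grouped = a + (b + c)    # 0.6

# Swapping operands of one addition is harmless...
swap_is_safe = (a + b) == (b + a)
# ...but changing the grouping changes the rounding, and so the result.
regroup_is_safe = left_grouped == right_grouped
```

So a compiler or kernel that merely changes the order in which partial sums are combined can change the low bits of the output.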

TimorousBestie 15 hours ago [-]

Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.

measurablefunc 14 hours ago [-]

Not only that but heterogeneous clusters (inevitable at a large enough scale) will also have non-deterministic outputs. So it's great that they wrote kernels to make the forward pass deterministic but getting rid of it entirely at data center scale would mean that they'd also have to do this type of work across cluster nodes as well to maintain "cluster" invariance & not just batch invariance.

reasonableklout 3 hours ago [-]

Some great discussion on twitter: https://x.com/thinkymachines/status/1965826369721623001

Seems a buried lede is that on-policy RL is unlocked by bitwise identical results between training and sampling. I'm not an expert here but my understanding is that this would allow for stronger guarantees about deployment/training alignment for the RL training that the labs already do.

I don't fully understand the BigMath example though. They show that off-policy RLVR requires off-policy correction, which avoids divergence, but is suboptimal because it results in noisy rewards. Then they say "we fixed the sampler and trainer numerical mismatch, which allows for on-policy RL, look how much better it is." It's not clear to me whether this is an artificial example that deliberately uses different trainer/sampler setups, or if it's actually impossible to have the same numerics between trainer/sampler without their fixes (even if we use same batch size, no atomics, etc.).

paulbjensen 12 hours ago [-]

It reminded me of this wonderful talk by the late Joe Armstrong (Erlang's creator): https://www.youtube.com/watch?v=lKXe3HUG2l4

Great post.

bendoy 11 hours ago [-]

Where this gets really complicated is when you are chaining many LLM calls together (basically any agent). A slight deviation in the call stack can throw off everything else.

lrvick 14 hours ago [-]

Job one is have every bit of software involved also be deterministic, which stagex takes care of.

I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.

Run two of these with the same prompts and same seed and you get the same results.

Obviously in GPU clusters with different hardware things get more complicated.

https://git.distrust.co/public/llmshell

spindump8930 14 hours ago [-]

That's not what this is about.

"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

Your situation isn't really comparable.
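
A toy illustration of the batch-size dependence described in that quote, in plain Python rather than GPU kernels. It assumes, purely for illustration, that the reduction strategy (here the `chunk_size` argument of a hypothetical `chunked_sum` helper) is picked based on how much work is in flight; the values are chosen to make the rounding visible. It is not the article's implementation.

```python
def chunked_sum(values, chunk_size):
    """Sum contiguous chunks left to right, then sum the partial results."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

values = [1e16, 1.0, 1.0, 1.0, 1.0]

# Same numbers, same hardware, same code: only the split differs.
one_chunk = chunked_sum(values, chunk_size=5)
two_wide = chunked_sum(values, chunk_size=2)
```

The two totals differ even though nothing random happened. Each call is perfectly reproducible on its own; the variation comes only from which reduction split was chosen, which is the sense in which a forward pass can be run-to-run deterministic yet not batch-invariant.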

saagarjha 14 hours ago [-]

What’s stagex?

lrvick 8 hours ago [-]

supply chain security focused linux distro that does not trust its own maintainers by design.

emharsha1812 3 hours ago [-]

I think this is an excellent article which addresses an issue that I personally have been thinking about for a long time. And no, it's not just some slop they put out but an actual engineering blog (with open source code and reproducible results!). I think the company is off to a good start.

threeducks 12 hours ago [-]

It should also be noted that PyTorch has a page about reproducibility: https://docs.pytorch.org/docs/stable/notes/randomness.html

TL;DR

Seed your PRNGs and call torch.use_deterministic_algorithms(True) to get the deterministic kernels. They may be slightly slower, but in practice, you probably will not notice.

Note that results will still differ between different drivers and GPUs. It would be great if NVIDIA tried harder in that regard.

red2awn 11 hours ago [-]

The blog post is about LLM non-determinism in the context of serving at scale (variable batch size). The page you link is only about run-to-run determinism implicitly assuming a fixed batch size.

htrp 12 hours ago [-]

Do we know what Thinking Machines does yet?

cubefox 14 hours ago [-]

His solution still relies on greedy (temperature 0) sampling, which is probably not optimal for model performance on various tasks. For example, Gemini 2.5 uses temperature 1 by default. But deterministic inference with temperature >0 can still be achieved by using pseudorandom sampling with a fixed seed.

red2awn 13 hours ago [-]

Conceptually setting temperature to be >0 doesn't actually introduce any non-determinism. If your sampler is seeded then it will always choose the same next token. Higher temperature only flattens the logit distribution.

mynameismon 13 hours ago [-]

The point of the blog is that even at "supposed" deterministic generative sampling, non-determinism creeps in. This in turn has disastrous effects in very real experiments.

cubefox 13 hours ago [-]

My point is that greedy sampling is not just not sufficient but also not necessary for deterministic inference.

sudohalt 12 hours ago [-]

cool project but if this is what you are producing with $2 billion funding, i doubt you will survive. This is the type of article a grad student would write over a weekend.

lairv 10 hours ago [-]

on the contrary this makes me bullish about their team, it shows that people here care about the craft

sudohalt 3 hours ago [-]

The team is good, and I enjoyed the read. But this is just an engineering blog post. They're promoting it like it's groundbreaking research, and it's on their front page. Ultimately this post is not very meaningful and just a fun debugging session.

I've seen this play out dozens of times. So many startups that have come and gone in the Bay Area were composed of extremely talented individuals, but almost all of them failed.

unit149 3 hours ago [-]

[dead]

TNDnow 14 hours ago [-]

Who needs a working product when you can spend all day designing the most WEWORK looking website and slap some pseud slop on it. It's like crypto "startups" but it's not even fun.

nowittyusername 13 hours ago [-]

I am baffled that I still run into these statements years after LLMs arrived. LLMs are deterministic and always have been. People have trouble with this because they base their assumptions on API-based experiments. Like, my man, how can you be making these statements when you haven't done the due diligence of running the LLM on your own hardware with all of the variables locked down and accounted for? Do just that and it becomes clear that they are deterministic, and that most of the time the non-deterministic behavior you see comes from a variable you have not controlled for, usually prompt caching, batch processing, or some other obvious variable. This is about deterministic behavior within the same system. You might get different answers when running on a different GPU, but on the same system the behavior is 100% identical if you account for all server startup flags and for things like prompt caching, slot contamination, etc.

Voloskaya 13 hours ago [-]

I suggest you look up the name of the main author of TFA before assuming they don’t know what they are talking about.

This is literally one of the most knowledgeable people on the topic. I think you are the one that hasn’t peeled enough layers to connect with what they are saying.

sudohalt 3 hours ago [-]

1. They aren't; they are just popular online. 2. The author has nothing to do with the original comment. Why do you think academic reviews are double-blind?

golol 13 hours ago [-]

Hold on a second. A transformer deterministically produces a probability distribution over the token alphabet from the context. Then one samples from this distribution. This is random and meant to be random.

nowittyusername 12 hours ago [-]

The sampling process isn't random. If you sample with identical sampling parameters and identical values for those parameters, you will always get the same results. You only start getting "non-deterministic" behavior when you use more complex systems outside your control, like multi-GPU systems and batch processing. One LLM sampled with prompt caching off and batch processing off will always generate the same results if all values are the same.

oasisaimlessly 13 hours ago [-]

It's possible to deterministically sample from a probability distribution. For example, just seed your RNG with a constant, or with the SHA256 hash of the context.
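
A rough sketch of the second option, assuming the context is a list of integer token ids. `seed_from_context` and `sample_next` are hypothetical names, and the probabilities are made up:

```python
import hashlib
import random


def seed_from_context(token_ids):
    # Serialize the context and hash it; use the first 8 bytes as the seed.
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big")


def sample_next(probs, token_ids):
    # The draw depends only on the distribution and the context,
    # so repeating the call with the same inputs repeats the choice.
    rng = random.Random(seed_from_context(token_ids))
    return rng.choices(range(len(probs)), weights=probs)[0]


probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
context = [101, 7592, 2088]  # hypothetical token ids

print(sample_next(probs, context) == sample_next(probs, context))
```

One design consequence: the same context always yields the same continuation, so there is no way to draw a second, different sample without mixing something extra (such as a request-level seed) into the hash.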

golol 12 hours ago [-]

Well yes, you can "hack" the pseudorandom number generator, but... that's not really the point when talking about determinism in LLMs is it? I mean the mathematical idea of the standard LLM is certainly truly random.

tossandthrow 13 hours ago [-]

The article literally justifies this in the second paragraph.

nowittyusername 13 hours ago [-]

I suppose I have issues with the way "determinism" is used in the title of this article. It can mean different things to different people, and in my mind the title "Defeating Nondeterminism in LLM Inference" frames it as an actual issue with LLM inference. But it's not; it's an issue with LLM inference at large scale, once you add more complex parts such as multi-GPU inference, batching, and other mechanisms. It is not an issue when using an LLM without those more complex parts. Stating it this way muddies the signal and gives a false sense that this is a fundamental issue with the architecture, when it's an issue with systems at scale...
