▲Surpassing vLLM with a Generated Inference Stack (infinity.inc)

16 points by lukebechtel 5 hours ago | 4 comments

ntonozzi 1 hour ago [-]

Why do they need to run benchmarks to confirm performance? Can't they run a set of example prompts and verify that they get the exact same output token probabilities for each one? The fact that they are not doing this makes me suspect that they are not, in fact, doing the exact same thing as vLLM.
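For illustration, a minimal sketch of the kind of check described above, assuming each engine can return per-position log-probabilities as a list of token-to-logprob dicts. That data shape, the function names, and the tolerance are all hypothetical and are not taken from vLLM or from the post. Different kernels usually differ in floating-point rounding, so this compares within a tolerance rather than bit-for-bit.

```python
import math

def max_logprob_gap(reference, candidate):
    """Largest absolute log-probability difference over all positions and tokens.

    Both arguments are hypothetical: one dict per generated position,
    mapping token string -> log-probability.
    """
    if len(reference) != len(candidate):
        raise ValueError("engines produced a different number of positions")
    worst = 0.0
    for ref_step, cand_step in zip(reference, candidate):
        if ref_step.keys() != cand_step.keys():
            # The two engines disagree on which tokens are in the reported set.
            return math.inf
        for token, ref_lp in ref_step.items():
            worst = max(worst, abs(ref_lp - cand_step[token]))
    return worst

def outputs_match(reference, candidate, atol=1e-3):
    """True if every log-probability agrees within atol (an assumed tolerance)."""
    return max_logprob_gap(reference, candidate) <= atol

# Made-up numbers standing in for two engines' outputs on one prompt.
ref = [{"a": -0.1, "b": -2.4}, {"c": -0.5}]
cand = [{"a": -0.1002, "b": -2.3995}, {"c": -0.5001}]
print(outputs_match(ref, cand))             # loose tolerance
print(outputs_match(ref, cand, atol=1e-4))  # strict tolerance
```

A check like this would have to run over many prompts to say anything useful, and what counts as an acceptable tolerance depends on precision and quantization settings.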

It is also a bit weird that they are not incorporating speculative decoding, which seems like a critical performance optimization, especially for decode-heavy workloads.

lukebechtel 58 minutes ago [-]

Yes, speculative decoding will make both us and vLLM faster, but we believe it would be a relatively even bump on both sides, so we didn't include it in this comparison. Worth another test!

rfw300 1 hour ago [-]

OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.

lukebechtel 60 minutes ago [-]

We currently validate with MMLU and HellaSwag, and are having this independently verified by a third party.

We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.

Also, if you need a rough intuition for why this is possible: this entire inference stack was built for exactly one model, so we can tune the whole framework to it.