Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it were near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to a few minutes at most to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.
pianopatrick 2 minutes ago [-]
We fit in for the things that are not artificial.
So long as AI lives in server farms, humans will be needed for tasks in the physical world.
It's only if we combine AI with robots that things get really dicey.
ipkstef 11 minutes ago [-]
Asking for curiosity's sake: what kind of PR loop are you running that takes a few hours?
ketzo 4 minutes ago [-]
not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation
serpix 38 minutes ago [-]
I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.
Debating which library has the nicest syntactic sugar or method naming is just as crazy as suggesting we type in assembly.
unglaublich 45 seconds ago [-]
[delayed]
alkyon 21 minutes ago [-]
Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.
cdata 8 minutes ago [-]
We are living in a ZIRP-like era where builders at the fastest pace layer have misattributed their velocity to exponential gains in model capability. In fact, they are surfing on decades of careful effort to build a robust foundation of highly reusable software libraries.
This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
solenoid0937 6 minutes ago [-]
> This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
There will only be a reckoning if models don't get much better.
If they do get much better you can just have them refactor, fix bugs in, or replace the existing codebase.
The concept of tech debt is sort of meaningless if you anticipate intelligence gains in models to continue.
vitalyan1234 4 minutes ago [-]
"exponential growth of crappy X" applies to every industry that went from being an artisanal craft to being mass produced with little or no human input. and we live much better lives than we did before the industrial revolution.
solenoid0937 7 minutes ago [-]
Crap is fine if it gets the job done. I think software as an industry will change to more ephemeral construction.
9cb14c1ec0 26 minutes ago [-]
Anyone remember the old days when a new frontend framework came out every 3 months? That has pretty much stopped. No one cares anymore.
asveikau 3 minutes ago [-]
> when a new frontend framework came out every 3 months.
> No one cares anymore.
I never cared about this.
I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) The biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.
LASR 23 minutes ago [-]
Oh you wait until LLMs come up with frameworks that allow multiple LLMs to collaborate effectively. Then you’ll have new frameworks every 3 days.
mountainriver 25 minutes ago [-]
It’s even discouraged now as LLMs wouldn’t have the documentation built in
osti 22 minutes ago [-]
But I think the eventual goal is that documentation won't even be needed. An LLM should just understand the nuances of frameworks itself by analyzing their codebases.
sagarp 15 minutes ago [-]
The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.
lionkor 26 minutes ago [-]
And they will all suck! I can't wait.
oulipo2 15 minutes ago [-]
You won't. Because 80% of the complexity is just "knowing what to build". You will get something that gives you a prototype in 1 min, then you break it, then you get a slightly better prototype on one side, but newly broken in another way, and you're going to repeat over and over.
pullshark91 2 minutes ago [-]
It's interesting but not game-changing IMO. Speed here is not a bottleneck.
It is another thing that the big labs accuse open-weight models of benefiting from distillation and other techniques, and so essentially avoiding higher training costs (which typically bleed into the bills end users pay for inference).
These price and speed optimizations from Chinese providers, combined with rising prices from American ones, will change the game sooner rather than later. Many companies are already running into issues with their AI bills.
MangoCoffee 11 minutes ago [-]
Chinese models are good enough and cheap.
I have a yearly GitHub Copilot subscription. Microsoft recently changed its billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.
kypro 4 minutes ago [-]
Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.
I genuinely don't understand what moat these US model labs have. If they're saying recursive self-improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self-improve better than the Chinese open source ones or something?
I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.
throwaway894345 23 minutes ago [-]
I wonder what economics are driving these pricing decisions. Are the Chinese companies just subsidizing their models to a greater degree than the US, or is this an emergent property of energy policy between countries?
throwaway67678 5 minutes ago [-]
Lower cost of labor, lots of under the hood optimizations (e.g. cache hits for DS), many of these companies have existing infra (fewer upfront costs for deployment), etc
Octoth0rpe 18 minutes ago [-]
Throwing out another factor: Chinese companies have been banned and/or limited from buying Nvidia, and turned to local companies for their hardware. I haven't actually seen pricing/benchmarks comparing Chinese AI accelerators, but it wouldn't surprise me if that also worked out in their favor.
varispeed 18 minutes ago [-]
I see a bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors, which also adds to token usage.
ignoramous 12 minutes ago [-]
> You never know whether Anthropic will route your request to a cheaper model for the price of Opus
For non-subsidized plans? Pretty sure they'd need to put this in the ToS, or lawsuits would have followed by now.
prplfsh 10 minutes ago [-]
This will be really powerful for voice. Being able to reason makes LLMs so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.
jeffrallen 8 minutes ago [-]
[delayed]
eli 23 minutes ago [-]
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
maxdo 17 minutes ago [-]
I tried GLM 4.7 for agents that write code: simple scripts, 200-1000 LOC. Extremely bad. Had to abandon the Cerebras offering; their smart models are only on the enterprise plan.
ignoramous 17 minutes ago [-]
> And MiMo 2.5 is a lot more capable than GLM 4.7
MiMo 2.5 is not the same model as MiMo 2.5 Pro.
GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.
If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?
scosman 36 minutes ago [-]
Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.
michael-ax 1 minutes ago [-]
now that's what i call a software development breakthrough/platform! thanks for the heads up!
lostmsu 5 minutes ago [-]
[delayed]
Oras 20 minutes ago [-]
1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!
eli 18 minutes ago [-]
Like what?
maxloh 45 minutes ago [-]
The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.
The Xiaomi team really brought something to the table.
irthomasthomas 45 minutes ago [-]
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
gekoxyz 30 minutes ago [-]
Maybe they don't have enough racks. The news indicates that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.
jdthedisciple 19 minutes ago [-]
Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?
HarHarVeryFunny 7 minutes ago [-]
Maybe they only have a finite number of racks ;-)
slaw 5 minutes ago [-]
Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.
minraws 47 minutes ago [-]
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.
I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.
Qdulf 2 minutes ago [-]
Must be Blackwell for native fp4 support.
throwa356262 36 minutes ago [-]
Suspect this will be included once out of beta but at a higher credit/token ratio.
Remember, these guys are not VC backed. Anything they do must break even
JayStavis 26 minutes ago [-]
> must break even
Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.
varispeed 16 minutes ago [-]
Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.
From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.
throwaway67678 2 minutes ago [-]
Huge L for free market economies if true
h14h 12 minutes ago [-]
The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
__natty__ 29 minutes ago [-]
With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput
qsera 26 minutes ago [-]
Tokens per second is the "Megapixels" of AI marketing!
Octoth0rpe 16 minutes ago [-]
I mean, sure, in the sense that it's a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.
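To put rough numbers on that gap, here is a minimal sketch. It assumes a hypothetical 2,000-token response streamed at a steady decode rate, and it ignores time-to-first-token, network latency, and tool calls. The names `generation_seconds` and `RESPONSE_TOKENS` are illustrative, not anything from the thread or from a provider API.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response size: an assumption for illustration, not a measurement.
RESPONSE_TOKENS = 2_000

# The three rates mentioned in the comment above.
for rate in (10, 100, 1_000):
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate:>5} t/s -> {seconds:6.1f} s")
```

Under these assumptions the same response takes about 200 s, 20 s, and 2 s respectively, which is roughly the difference between walking away, waiting, and staying in the loop.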
trilogic 11 minutes ago [-]
Pfff time wasting.
1. Password between 8-16 characters, and this and that... What???
2. Captcha after captcha, come on
3. Service unavailable: "This service is not available in your region yet."
Are you kidding me? Come back when you are ready for users. I was hoping to try it. What a frustration.
elar_verole 49 minutes ago [-]
Yeah, this seems to be the easiest path to overall agent efficiency in the short term
moffkalast 45 minutes ago [-]
42B active params, sliding window attention. There's your tradeoff.
vlovich123 39 minutes ago [-]
Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
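For readers unfamiliar with the term, a sliding-window attention mask can be sketched as below. This illustrates the general technique only; it is not Xiaomi's implementation. The function names, the tiny sizes, and the convention that the window counts the current token are all assumptions made for the example.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (no looking ahead) and limited to the most recent `window`
    positions, counting the current token itself (an assumed convention).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sizes so the mask is easy to print; real models use far longer sequences.
SEQ_LEN, WINDOW = 6, 3

windowed = sliding_window_mask(SEQ_LEN, WINDOW)
full_causal = sliding_window_mask(SEQ_LEN, SEQ_LEN)  # window >= seq_len is plain causal

for row in windowed:
    print("".join("x" if allowed else "." for allowed in row))

print(attended_pairs(windowed), "pairs with the window vs", attended_pairs(full_causal), "fully causal")
```

The tradeoff is that attention work per token stops growing with context length once the sequence is longer than the window, at the cost of each layer seeing older tokens only indirectly.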
moffkalast 16 minutes ago [-]
Seems to be for both according to the spec [0], maybe it's wrong though.
128 sounds really tiny, I wonder if they mean some kind of blocks?
Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.
holoduke 31 minutes ago [-]
Speed is indeed the next big thing that should happen with frontier LLMs. Current models, but 1000 times faster, would be super useful. Earlier this week it took Claude at least a week, running full time on two Max subscriptions, to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.
harel 24 minutes ago [-]
There are a few things in life where I can't fully grasp why they are so sought after. One is the constant need to exhibit growth. As if being massive and staying massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now, as if that is not enough, we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."
Really?
sidrag22 4 minutes ago [-]
different use cases for different people. some people are nurturing a code base and ensuring it doesnt become a gross mess so they become the bottleneck. some people are just trying to prompt stuff into existence and dont know what sql is.
I think this site often overlooks that second group and how large it likely is.
philipkglass 12 minutes ago [-]
I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.
GaggiX 36 minutes ago [-]
If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.
slopinthebag 51 minutes ago [-]
I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.
m00dy 51 minutes ago [-]
boom!
npn 45 minutes ago [-]
How?
atemerev 51 minutes ago [-]
I test all Chinese models with "What happened on Tiananmen Square at June 4th, 1989?" prompt. MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.
Accacin 33 minutes ago [-]
Can I ask an honest question? Why does that matter in the slightest? LLMs come out with completely incorrect information all the time, and Western LLMs are censored for various topics too.
It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
serf 16 minutes ago [-]
>It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
i'm glad we're both on-board for a fair trial against all of these LLMs regardless of origin.
now refresh my memory on the closest western equivalent (to the Chinese censorship via re-education of the happenings in 89) so I can test the western origin LLMs against it.
wolttam 22 minutes ago [-]
I'd love to know of such an example where a U.S. LLM blatantly denies something factual. Maybe I'm living under a rock but I can't think of one
0cf8612b2e1e 28 minutes ago [-]
Hardly a gotcha. Having the robot refuse or deliberately mislead directly impacts potential utility.
Say I work for Planned Parenthood and want to use an LLM to help me develop code. Will it refuse to run because there are mentions of abortion? Everyone has a different censorship line, but unfiltered is more generically useful.
MrBuddyCasino 57 seconds ago [-]
What would be a correct explanation of the event?
atrus 21 minutes ago [-]
Which censored prompts do you test with non-chinese models?
HarHarVeryFunny 26 minutes ago [-]
What's your litmus test for the American models?
Anything different for Grok?
jgbuddy 34 minutes ago [-]
Asking if Taiwan is a part of China works as well
0cf8612b2e1e 33 minutes ago [-]
Which ones fail?
nkmnz 38 minutes ago [-]
No idea why you've been downvoted. This is excellent news.
paulinho1 29 minutes ago [-]
Because this never gets brought up about US models, which have just as much censorship as the Chinese ones.
storus 43 seconds ago [-]
No, US models have alignment. Only Chinese models have censorship.
maxothex 42 minutes ago [-]
[flagged]
FastAnchor 28 minutes ago [-]
[dead]
Ex A: https://www.anthropic.com/research/2028-ai-leadership
Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...
[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...