DwarfStar work in progress numbers: I see 14 tokens/sec generation, that slopes to 10 t/s with longer 10k or more context size. Consider that the indexed attention requires evaluating 2048 selected rows, 2x DeepSeek and with less compression, so the performances with larger contexts here to south faster. Prefill can be 180 t/s on small contexts to 150 t/s and less with larger contexts. I used DeepSeek v4 PRO in this conditions, it is usable but it is far from the 35 t/s 400 t/s prefill you get with DeepSeek v4 Flash 2 bit on a MacBook m5 max. But likely my implementation is yet not optimized enough, so a bit more performance can be obtained. I'm using 4 bit quants. The model is also definitely less sparse than DeepSeek v4, so it activates a bigger percentage of parameters. If it works decently at 2-bit, that would be a win even for machines where 4-bit fits, since this would mean 2x memory (equivalent) bandwidth basically for the routed experts.
Local inference needs really hard a 1.2 / 1.5 T/s memory bandwidth system with 512GB and 2/3 times the GPU compute of Mac Studio M3 Ultra, at an affordable 10/15k price point. A variant with 1TB memory would also be welcomed at 20k price point.
reasonabl_human 1 hours ago [-]
Thank you for your work on DwarfStar! It is truly helping democratize access to frontier tech.
segmondy 15 hours ago [-]
I run Q4_K_XL. All it takes to run to get about 6tk/sec is 512gb of ram and 2 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4, 2400mhz, 3200mhz will bring that speed up to about 9tk/sec. I also have ok 32core epyc CPU, a better 64core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware cost and I regret it everyday. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, one shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I can save using cloud API, but the Fable drama has opened up eyes on why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid, if you are going to grab a quant, make sure to get the K_XL variant if it can fit.
effisfor 7 hours ago [-]
I applaud all you tinkerers for pushing on the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters, very few people talk about developing resilience. Or the researchers who will push on open source models in efforts to cram them onto an electric toothbrush or tamagotchi. Bravo to you all.
discordance 10 hours ago [-]
Running that full load is at least 600 W, so in a day ~14 kWh. At $0.2 a kWH, that would be $2.80/day or $1k a year of op-ex in electricity.
Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and has much faster tok/s if you pay a hyper scaler.
That said, I do like the direction we are heading and look forward to seeing what host your own hardware we get in 2 years.
segmondy 3 hours ago [-]
No one locally runs full load all day. The only way to see that is if you're training. We are talking about inference. I limit my GPU to 300watts. You can limit them down to 200w. Since everything is not in GPU and the bottleneck is between CPU/system ram. The GPUs don't even get to spike, I see 160w-180w for each GPU during inference. So redo your calculation again. Figure about 6 hrs of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.
matheusmoreira 6 hours ago [-]
We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.
I think the main reason not to run locally is to get the full models instead of quantized versions.
traceroute66 5 hours ago [-]
> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.
I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.
Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].
Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.
Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.
Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.
Tinfoil have not been independently audited, it is somewhere on their long-term radar.
Privatemode have been thoroughly independently audited with documentation available on request.
Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil or Privatemode will be the place for you.
I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.
traceroute66 3 hours ago [-]
> because their pricing is part of their brand
I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.
However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price-hike like has been seen everywhere else. I am not naïve.
When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.
Ultimately I still blame the chip-hoarding hyperscalers though. :)
walrus01 9 hours ago [-]
Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.
At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
Additionally in a place where people use electric baseboard heating or electric in floor radiant heating, or really any other heating element based system in winter that's less efficient than a heat pump, additional electrical from a computing load is basically "free" since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
Incipient 10 minutes ago [-]
Unless the token estimates I get from using Claude are wayyy out, I burn through 5m+ tokens/day, and I'm not doing a lot of time. 500k tokens in a 24h period for $5k of hardware seems quite poor?
kristjansson 1 minutes ago [-]
Be sure you compare inputs tokens to pre-fill rates and output tokens to generation rates.
discordance 8 hours ago [-]
Where I live prices are often higher than 20c/kWh, but lets take your example and halve it (10c/kWh) so it's ~$1.40/day or ~$500/year.
On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k token/day, so we're in the same ball park, but without the capex for 2 3090's and the machine.
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
walrus01 8 hours ago [-]
That's true, there's a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.
Running a system that will be 600W under max CPU usage on all cores and RAM and a few 3090-class GPUs, that same system might be only 90W or around there when idle at 0.00 unix load.
If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day
But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
wqaatwt 6 hours ago [-]
> person is using it in bursts and intermittently throughout an 8 hour workday.
You can’t do that with 6 tps, though.
AbsurdCensor 6 hours ago [-]
I think that's the biggest difference for most. If you can amortize the hardware costs, then 'burst usage' is cheaper at home to a degree, because you are paying a fixed monthly rate elsewise. Overall thought for most, it is likely cheaper to use the cloud than at home, but really depends on what you want.
nomel 1 hours ago [-]
> because you are paying a fixed monthly rate elsewise
No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.
re-thc 2 hours ago [-]
> But also the 600W figure assumes that it's fully loaded 24x7x365.
It isn't 100% efficient. Even the best PSUs aren't.
tmountain 8 hours ago [-]
Lots of people have solar. Green AI, imagine that!
cultofmetatron 7 hours ago [-]
if only there was a magical place where geothermal and hydroelectric is ubiquitous and the weather is cold enough that no one is going to be complaining about free heating.
nomel 59 minutes ago [-]
The largest geothermal plant in the world is only 1.5GW, in the United States, which is over double all the plants combined in Iceland. The second largest is 1/3 that, in Mexico. [1]
There is no "ubiquitous" geothermal where there also high power usage. Data centers have to go where power is, not can be.
To be fair, Vancouver is such a magical place in terms of electrical cost, but the cost of living and real estate are otherwise through the roof, with decrepit and nasty (would need $100k in renovations immediately if you're not treating it as a teardown) single family detached homes on the east side of the city selling for 3.2 million.
brailsafe 53 minutes ago [-]
Shhh don't forget we have a water shortage. But it is nice to have electricity wrapped into my relatively cheap basement suite rent ;)
fghorow 1 hours ago [-]
You aren't, perchance, from Iceland, are you?
2 hours ago [-]
DrScientist 3 hours ago [-]
Depends on whether you've also gone for self-hosted electricity generation or not.
throwawayffffas 5 hours ago [-]
So in my experience with 2 7900XTs with models that sit fully in VRAM it's more like 400W the gpus spend a lot of time waiting for each other.
SXX 9 hours ago [-]
I guess you missed recent news. Problem is that cloud LLM might just sliently sabotage your work by downgrading output model with no notice.
Or cloud LLM might just refuse to sell to you because it dont like your passport.
yorwba 9 hours ago [-]
So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?
brookst 5 hours ago [-]
It’s rationalization for what people want to do anyway.
Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.
drptech 5 hours ago [-]
[dead]
9 hours ago [-]
drptech 5 hours ago [-]
[dead]
swiftcoder 9 hours ago [-]
This is not really a problem for the open-weight models, you can always give your money to an inference provider in a different jurisdiction
poulpy123 6 hours ago [-]
which hyper scaler would you suggest ?
dzjkb 8 hours ago [-]
how do you rent 2 3090s for $2.80/day?
zozbot234 12 hours ago [-]
AIUI the llama.cpp implementation for this model is still quite half-baked due to missing the support for DSA sparse attention mechanism. This leads to running the model with a different mechanism that it has not been trained for, which has been shown to lead to lower quality and performance.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
trollbridge 3 hours ago [-]
Particularly DeepSeek 4.1, which they appear to be A/B testing on the API and which also seems available on the free chat interface.
It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.
Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.
MiMo-2.5-Pro has gone with UltraSoeed, pumping out 1000t/s for a 3X price hike.
GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.
DeepSeek will probably keep their pricing model and just keep getting better and better.
Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.
The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.
SalariedSlave 1 hours ago [-]
Competing and innovating in the fast moving SOTA end of the llm space requires a ruthless disregard for copyright, IP, bureaucracies, formalities, risk assurances and other slowdowns. It requires a risk tolerant, quick and large flowing investment of capital. It requires a scoped focus that is pragmatic and sharp about key concerns, and efficiently dismissive of meaningless details.
Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.
I say this as a software engineer from Europe.
leansensei 55 minutes ago [-]
Europe was never at the frontier of any tech? Huh what now?
SalariedSlave 53 minutes ago [-]
A hyperbole born of frustration, I admit.
Qualify it to software, rather than all tech, if you will.
dxuh 12 hours ago [-]
"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.
segmondy 11 hours ago [-]
512gb 2400mhz ddr4 ram = $1600 not $5000. https://www.ebay.com/itm/188284985172
You can get creative and source 2-3 2080ti 22gb from China for about $250 a piece. You can either be resourceful and find a way or find a whole bunch of excuses.
officialchicken 9 hours ago [-]
> You can either be resourceful and find a way or find a whole bunch of excuses.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
pizza234 5 hours ago [-]
LOL, sure this works if one has a time machine or a LOT of money to burn.
32 CPU Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in best case, it will just downgrade to "very" expensive at some point in the future.
This makes sense only if 1. one is paranoid about privacy or 2. they have money to smoke or 3. they need to workaround cloud model restrictions, AND they have to do it routinely (because if not, a oneshot cloud bare metal setup is way cheaper, faster, and allows more powerful models, due to VRAM offering).
I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.
fsuts 12 hours ago [-]
6 tokens per second?
Can you put up with that? As seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger slower ones
segmondy 12 hours ago [-]
I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.
manmal 12 hours ago [-]
But what is it doing for you that you couldn’t do yourself at that speed? I‘m really curious and on the fence of partly going local.
all2 12 hours ago [-]
Is think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model, you can offload a lot of chores to smaller models.
AussieWog93 7 hours ago [-]
Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.
wqaatwt 6 hours ago [-]
In this case it would be 20-40 hours to accomplish the same amount in f work when running locally
Mashimo 11 hours ago [-]
Run one task, while you do another? Or while you sleep / eat / rave?
manmal 4 hours ago [-]
While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don’t you think I‘d get fired after a few weeks of that?
nijave 4 hours ago [-]
I agree single digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups are using them for work. Likely fire off and check back later. That said, I've had terrible results one-shotting so you'd need to design with a faster model or have extreme patience during the discovery/design phase.
Mashimo 3 hours ago [-]
Why would you use this when your company has access to actual SOTA? I don't get it.
segmondy 2 hours ago [-]
Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?
Incase it's not clear, you will be generating 10,000,000 a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing a predictable work and have a good workflow and massive dataset to process, then speed of token matters. If you are performing creative work like coding, it doesn't.
froh 12 hours ago [-]
do you use caveman or similar?
walrus01 9 hours ago [-]
I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And often to remember to give it something of more consequence to do for at least 3-4 hours of wall clock runtime before heading to bed.
radku 7 hours ago [-]
I have pretty much almost this exact setup with 2x3090s and with slightly faster DDR4 512GB and 64 core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.
Apart of running local models I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers can't be happier not having to deal with constantly hot laptop. Not to mention that Claude Code is such a battery hog.
How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?
Or maybe the model itself only runs at gpus, and the cpu memory only store the weights for experts not corrently activated? If so, then what's the 32 or 64 cpu cores for?
I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software
nodja 12 hours ago [-]
Pipeline parallelism. Instead of splitting layers by row/column. You split at the layer edges. So instead of having this huge bottleneck of bandwidth you only need to transfer about 4KB per token when changing devices on a model like Qwen 3 30BA3.
xrd 11 hours ago [-]
This is a good place to start reading about dual gpus.
checkout llama.cpp, the entire point of the project is for us mere mortals and GPU poor.
edg5000 13 hours ago [-]
Very cool. So it's not just about GPU VRAM which I incorrectly thought. I though you'd need 512 GB GPU VRAM. I don't think it cost only 2400; 512GB ram would be more expensive though back in the day. But not mortgage-grade 200.000 which I estimated myself (which assumed running in 100% VRAM; overkill for a single user probably).
segmondy 12 hours ago [-]
you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of system bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had a 12 ddr channel, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so a 12 ddr at 4800mhz will provide 3x the speed for token generation or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.
nijave 4 hours ago [-]
Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels showing in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.
Cloud offerings are 80-200tk/sec versus single digit tk/sec.
That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.
edg5000 3 hours ago [-]
I see. So not quite usable apart for specific use cases. Maybe in a few years we'll see new hardware players and better prices.
redox99 15 hours ago [-]
That's crazy good for $2400.
draginol 2 hours ago [-]
The most interesting part of this to me is not the benchmark table, but the packaging.
A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being an impressive open model exists and starts becoming this is something a small team can actually put into its development stack.
For instance, company buys an RX6000 setup for say $15k total. They could use this for handling data heavy sifting that would otherwise be a lot of Claude tokens.
It doesn't need to be as good as frontier-best. Just good enough.
I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.
giancarlostoro 2 hours ago [-]
> For instance, company buys an RX6000 setup for say $15k total. They could use this for handling data heavy sifting that would otherwise be a lot of Claude tokens.
Considering they might be spending thousands per month on API costs already, dropping 15K to save on one process might not be bad. On the other hand, also an opportunity to sell GLM 5.2 inference at near cost to other companies for less than whatever Claude costs. In theory it costs anywhere from $0.51 to less than $2 an hour to run it and use it 24/7 that's still wildly cheaper than calling Opus which doesn't bill per hour, but per million tokens, drastically higher. Hell, you could probably bill at $5 per GPU hour and still be cheaper. Whether you're looking to self-host or sell hosting for it, it looks way cheaper regardless. I think most decent open models will continue to fit in at least 32GB of VRAM so a 6000 Pro GPU is more than enough. alternatively, even on a 5090 you can get a reasonable amount of inference for way less than paying for Opus, Qwen would be your friend there though.
xrd 18 hours ago [-]
So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
hbbio 16 hours ago [-]
Yes, a single GB300 workstation also does it, probably even more than 120tok/s.
Official price 85k...
simpaticoder 3 hours ago [-]
Actual price $100k and everything is very closed and proprietary. Oddly this MSI system provides "only" 252G vram and 500G ram. I would have expected more vram for this price. Also why 252 instead of 256? https://www.centralcomputer.com/msi-xpertstation-ws300-ai-wo...
The problem is the backplane I have not managed to find a single baseboard, and getting a random baseboard to work with random modules is probably a crap shoot.
__m 17 hours ago [-]
How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?
easygenes 16 hours ago [-]
M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
hajile 1 hours ago [-]
It might be M6 Ultra and I think the real reason for stopping selling top-tier units was to avoid mid-generation price hikes and increasing demand for the more expensive next-gen systems that I assume will come with 512gb (maybe 1TB) of RAM and a massive markup to match.
jiqiren 13 hours ago [-]
I hope all this speculation comes true. Right now this ram crunch is ridiculous.
16 hours ago [-]
digitaltrees 15 hours ago [-]
I feel like the models are good enough for a decade of future work. So Once you have a working set up you can keep using it to do the work at the same level. There will be better stuff and may make that type of work obsolete but if you can do useful things it won’t be worth less.
Tepix 12 hours ago [-]
I think there is a gap right now for running large models such as GLM 5.2 in Q4 or Q8.
My hope is on Intel Crescent Island 480GB cards. Let‘s see how expensive they‘ll be.
npodbielski 3 hours ago [-]
480GB? Probably like 100k$ each? :D
segmondy 15 hours ago [-]
P40 was release 2016 and still selling like hotcakes!
easygenes 16 hours ago [-]
[dead]
mgambati 18 hours ago [-]
With 2 wouldn’t have good results. Ideal range for coding is at least Q8.
kibibu 18 hours ago [-]
According to this very article, 4-bit dynamic is essentially lossless
Aurornis 17 hours ago [-]
Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
ijidak 16 hours ago [-]
Crossing my fingers that this boom jumpstarts 90's like improvements in computing hardware.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
0xbadcafebee 14 hours ago [-]
Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.
omnimus 11 hours ago [-]
The natural progression when performance is enough would be price. We were starting to see that but not anymore. I wonder if somebody is afraid the future where generally useful computation is cheap.
gruez 15 hours ago [-]
>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
horsawlarway 14 hours ago [-]
It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.
So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.
I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).
BobbyTables2 13 hours ago [-]
Yeah, even Windows managed to not drive terribly dramatic upgrades in general computing
(besides Windows’ absurd RAM usage and now requiring a TPM).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
linzhangrun 15 hours ago [-]
Physical limitation of the manufacturing process may be more significant factor, starting from the TSMC 10nm ten years ago
bbor 5 hours ago [-]
I’m kinda lost here… do y’all really have machines in your houses with hundreds of gigs of RAM?? Am I just behind the times?
The page advertises the 8-bit quant as taking ~800GB, which seems like it would require at least 3 consumer motherboards fully stacked w/ 4x64GB cards each.
Maybe “locally” has slowly come to imply “…on your homelab”?
numpad0 4 hours ago [-]
DRAM prices at mid-2025 rates were ~$2.5/GB for DDR5, and ~$1.5/GB for DDR4. "Hundreds of gigs" of RAM used to be under $500. 128GB of cheapest RAM used to be like $200. It seemed to go over heads for a lot of people that you could get hypothetical future machines on CS/CE textbooks were attainable for that little, for some reason - there seemed to be some fixation on the idea that 16GB is all you need.
Gracana 4 hours ago [-]
You don't have to have a server, workstation motherboards support lots of memory channels.
I was lucky to buy a lot of RAM before prices skyrocketed. I knew I wanted to play with this stuff, so I spent what felt like a lot of money at the time to buy 8x96GB DDR5-6400 RDIMMs. Now the same RAM costs at least 6x more.
woodrowbarlow 3 hours ago [-]
[dead]
oceanplexian 3 hours ago [-]
As soon as Llama came out I had a realization what was coming and went all-in on hardware with the assumption open source would catch up with GPT4. Surprise, it did, we now have small models that absolutely crush GPT4 in performance.
It wasn’t that absurdly expensive for a hobby, I bought 64GB DDR4 ECC sticks between $70-$100 on eBay before everything took off. Now everyone is in here debating if open source is 1 month or 3 months behind SOTA. The future is obviously local.
nijave 3 hours ago [-]
I got a 2U rackmount with 192Gi DDR4 for $1.1k USD in 2023. Around 1.5 yrs ago, server RAM could be had pretty cheap--especially slower LRDIMMs (I wanna say 512Gi DDR4 was <$500 USD). I checked a couple old ServeTheHome threads and seeing maybe around $50/32GB RDIMM although thought it was cheaper than that for a little while
cpburns2009 4 hours ago [-]
RAM wasn't expensive even a year ago. I maxed out a used Dell Precision T5610 with 128 GB DDR3 for $250 in 2021.
cheema33 18 hours ago [-]
I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GBs of RAM? I am somewhat tempted to pick a GPU with 24GBs of RAM.
ekidd 9 hours ago [-]
A GPU with 24GBs of RAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config, and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it some poorly conceived spec with lurking design headaches, then it goes on an endless thinking binge and ultimately fails.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
phamilton 17 hours ago [-]
Generation is basically just memory bandwidth math.
Each token has to read all the active weights. I think that's around 40B parameters active. At a 4-bit quant that's 20GB. With 100GB/s (replace with whatever your bandwidth is) and you get 5 tokens per second.
SlavikCA 13 hours ago [-]
And with MTP (or other speculation techniques) you can ~double that.
phamilton 3 hours ago [-]
MTP on a MoE is hit or miss. If you're bottlenecked on memory, MTP can increase the number of active experts (like any batch processing would), which can eat away gains from it.
uberex 17 hours ago [-]
Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.
stymaar 16 hours ago [-]
This is why you shouldn't believe uncritically an answer from an LLM (neither should you do for any answer from a human either though).
andy_ppp 13 hours ago [-]
But I did my research online and the sun cycle is every 11 years and something something global warming is a hoax every single year now.
nijave 3 hours ago [-]
That's fair for new hardware. You probably want to prompt "homelab" or "used hardware" to compare what's in this thread.
colinsane 15 hours ago [-]
i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."
matheusmoreira 6 hours ago [-]
Safety from competition!
digitaltrees 13 hours ago [-]
I asked and it said “403 forbidden - careful peon attempts to bypass the late stage capitalism api with your monetary offerings in exchange for you daily tokens will get you perma banned right to jail”.
j45 15 hours ago [-]
LLMs aren't discrete calcluators or estimators of things unless framed and guided to do so.
uberex 9 hours ago [-]
Good job I didn't use a vanilla LLM without tool use harness then.
skiing_crawling 17 hours ago [-]
"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
gerdesj 17 hours ago [-]
A nvidia spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... ie 2 x 100GB/s ports, they may even be 2 x 200GB/s. Once I've got my paws on one, I'll know more.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
mapontosevenths 16 hours ago [-]
I have one, and I love it. That said my buddies Mac smokes it for inference workloads in terms of tokens per second AND its more usable for other things.
If you are training and doing research it's great, if you want to cluster them it cant be beat, but if you just want local inference on a single box buy a mac or even a strix halo device.
colinsane 15 hours ago [-]
can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...
mapontosevenths 15 hours ago [-]
I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the spark really shines for me.
I'm a Linux guy, but also don't always have alot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to setup and the guides and online resources make getting up and running trivial, for even some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
I think between unboxing mine setting it up to run headless, and generating tokens was like 20 minutes total for me.
Fizz43 16 hours ago [-]
which mac is smoking the spark?
pmarreck 15 hours ago [-]
pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.
dannyw 13 hours ago [-]
The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.
Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.
entrope 4 hours ago [-]
> The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.
I see ~274 GB/sec for the DGX Spark[1], versus 307 GB/sec for M5 Pro and 460 or 614 GB/sec for M5 Max[2]. One might call 90% "basically the same", but there are nominally two tiers above "Pro".
Yes, a MacBook Pro with 128 GB and M5 Max costs $5100 (14") or $5400 (16") versus currently $4700 for the DGX Spark, but the MBP includes keyboard, mouse, battery and portability. I believe its prefill is slower and you get 2 TB vs 4 TB SSD, but overall one gives up a lot to save 10% of the cost.
I looked, but a sibling comment just provided the links. ~274 GB/sec for the DGX Spark, vs. 307 GB/sec for M5 Pro, and max 614 GB/sec (!!!) for M5 Max? Why would you completely friggin’ lie about this, or at minimum, not double-check your facts before bullshitting? Plus, you get a full-fledged computer along with it!
Apple could actually be a good deal and you folks would still make up something to not justify it. In a way, it’s amazing what Apple has accomplished- Baseless negatively-tainted perception in certain influential tech circles.
(To be fair, they’re kind of earning it. I’m glad Tim “Sweet T” Cook is departing.)
Plus, my original comment got downvoted despite being factually-correct. Thanks, Reddit. Oh, wait…
mapontosevenths 15 hours ago [-]
Yep. Memory bandwidth is what decides how fast LLM's generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the m5 ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.
The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
fsuts 12 hours ago [-]
How noisy does his fan get…
pmarreck 4 hours ago [-]
it doesn’t get noisy at all
justincormack 6 hours ago [-]
It is 2x200Gb/s physically but the PCIe bandwidth is basically only 200Gb/s so it may as well be one, and actually its a weird 2xPCIe4 not 1xPCIe8 so it appears in software as dual 100Gb/s. Its a bit odd.
jauntywundrkind 15 hours ago [-]
200 Gb / s (not GB/s)!
(Still potentially very useful! But not magically ultra fast.)
Computer0 17 hours ago [-]
128 gb of much slower ram than Apple.
dannyw 13 hours ago [-]
DGX Spark is ~273GB/s. That’s about M5 Pro territory, and twice as fast as the M5. You’d have to go to the M5 Max, or M3 Ultra, to get higher memory bandwidth than the Spark.
hajile 1 hours ago [-]
If you are trying to get more than 64gb of RAM or doing tons of inferencing, you're getting a Max or Ultra anyway.
Frannky 13 hours ago [-]
There is a push from multiple directions at the same time:
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, amd, intel, Cerebras etc pushing new hardware
- oss models getting crazy good, like glm 5.2
- flash models getting very good like deepseek V4 flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
Tepix 3 hours ago [-]
My hope is on Intel Crescent Island with 480GB. I don't need 8x H200 performance (and cost), but I would like to run GLM 5.2 Q8.
MaKey 28 minutes ago [-]
I'd love to too, but I guess Crescent Island with 480 GB will cost something like $10-12k or even more.
khafra 6 hours ago [-]
I feel like "relatively" is doing a lot of work, there: at about $4k per GB10, that's $36k for a 1TB cluster. Cheap compared to equivalent H200's, but out of reach for home labs that aren't funded with OpenAI or Anthropic RSUs.
snarfy 4 hours ago [-]
When the AI bubble pops those hardware prices will pop too.
matheusmoreira 6 hours ago [-]
Hope you're right! Can't wait!
pheggs 18 hours ago [-]
I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?
UncleOxidant 18 hours ago [-]
If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in about about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less) and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. AliBaba Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.
gpm 17 hours ago [-]
The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey wrights law and come down in cost significantly (from the pre-shortage prices) as we learn to produce them more efficiently.
LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
UncleOxidant 17 hours ago [-]
> The ram/gpu shortage won't last forever though.
No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.
DougN7 14 hours ago [-]
Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.
r0b05 6 hours ago [-]
I think this is the play
mannanj 17 hours ago [-]
> The ram/gpu shortage won't last forever though
Don't underestimate the markets ability to remain irrational
colinsane 15 hours ago [-]
the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated marked will produce more of that asset.
selectodude 15 hours ago [-]
The solution for high prices is high prices.
If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.
Tepix 3 hours ago [-]
What's the irrational part? There's sky high demand.
elorant 17 hours ago [-]
Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.
dannyw 13 hours ago [-]
When a large open weight model is released, a lab, startup, or a rich hoist can easily do logit-level distillation and create a XXb param model or whatever, and in theory you should get a really good distill.
verdverm 17 hours ago [-]
I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models
Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
UncleOxidant 15 hours ago [-]
> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
True, Qwen3.6-27B is amazing for it's size. However, it seems likely that we're not going to see anymore of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.
Infernal 14 hours ago [-]
Do we know where those key players went?
verdverm 14 hours ago [-]
Good to know, I think the trend is clear based on the models coming out of China and well see more capabilities in smaller, more efficient models.
cogman10 18 hours ago [-]
I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.
For companies wanting LLMs in their products in general, I have to think going the local llm route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products.
twelvechairs 17 hours ago [-]
Surely for most the desire is just an LLM provider that doesnt store or sell their queries (including by national actors). As long as that is allowed to happen surely its the answer for the vast majority.
matheusmoreira 6 hours ago [-]
> LLM provider that doesnt store or sell their queries
> As long as that is allowed to happen
It won't be. Only we can provide that, and only for ourselves.
eventualcomp 18 hours ago [-]
Where is $50k coming from again?
stingraycharles 18 hours ago [-]
That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.
Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
cogman10 17 hours ago [-]
The hardware requirements aren't evolving and the local models have only been improving.
It's not like you'd lose capabilities, if anything this solution just gets better with time.
chatmasta 17 hours ago [-]
If the newer models require more/better hardware then you’ll lose capabilities.
I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
cogman10 16 hours ago [-]
The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.
The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.
dannyw 13 hours ago [-]
Correct. The main bottleneck with LLM inference is, and have always been, memory bandwidth.
TPS = active weights in GB / your memory bandwidth.
That’s it for decode. That’s all.
Tepix 3 hours ago [-]
$50K seems low if you want to run, say, GLM 5.2 4bit fast enough for a team for devs.
You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.
cogman10 18 hours ago [-]
As in who pays for it or how did I arrive at that number?
For who pays for it, obviously the employer would.
For "how did I arrive at this number" Ballpark estimate from what I know about part cost. Most of that money will go towards AI cards about $5k for the mb, cpu, power supply, etc. $45k would be for as much ram and as big/expensive nVidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
simplyluke 15 hours ago [-]
You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.
fny 18 hours ago [-]
The RAM requirements are still pretty painful.
yieldcrv 18 hours ago [-]
equilibrium in one or two more years on the consumer/prosumer side
think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM
a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again
denser open source models, packing more experts for smaller active layers
it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
stingraycharles 18 hours ago [-]
Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.
3stacks 16 hours ago [-]
Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity
stingraycharles 8 hours ago [-]
This really sounds like “640kb should be enough”.
I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.
hajile 1 hours ago [-]
We've already hit RAM power and size limits (about 40k electrons which is the limit before we get noise messing up the amplifier).
If a model needs 2x more memory, but serves the same number of customers, the cost is going to go up per customer to cover the increased hardware and power costs. Companies are starting to implement AI limits to keep costs under control.
Anthropic and OpenAI are rumored to be considering cutting inference prices trying to retain customers as LLMs commoditize and race to the bottom. It reminds me of the Chinese bike wars where bike-share companies were losing massive amounts of money, but kept running sales and lowering prices in an attempt to compete and drive out their competitors. The end of that story was a bunch of major bankruptcies and giant bike graveyards.
Nvidia's hard pivot to "in the near future, everyone will run their AI at home" seems to indicate that they also see the market shifting. We've already had AI ingest everything out there. The real challenge becomes how to better optimize their algorithm to get more useful data in less space.
yieldcrv 16 hours ago [-]
have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range
a lot of innovation occurring
scosman 15 hours ago [-]
It's not economic to run them locally. It's amazing for privacy, and fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.
I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.
oceanplexian 2 hours ago [-]
It depends what you’re using it for. Real time interactive Claude code session? No, it’s kind of impractical.
But if you already have agent loops dialed in (For example I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.
leansensei 33 minutes ago [-]
Not really, you can do great things without them. I've been summarizing hundreds of documents. I've added MCP servers to my internal business tools (Elixir apps) and can chat with the Nous Hermes agent over Telegram about pending orders, inventory level, historical product prices, etc., without having to click/dick around with a web UI.
Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!
CamouflagedKiwi 18 hours ago [-]
The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.
stymaar 16 hours ago [-]
Honestly, Qwen3.6 is already what you need for the large majority of tasks.
(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can live easily with Claude's cheapest plan without ever facing usage limit).
notatoad 17 hours ago [-]
locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.
for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
oceanplexian 2 hours ago [-]
Yeah, 20 months of Claude Max until they rugpull you. I’m spending 7-10k/month in raw token costs on Claude Max. Having an alternative is a nice insurance policy.
fc417fc802 11 hours ago [-]
> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.
How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.
You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.
chatmasta 17 hours ago [-]
Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.
tomr75 17 hours ago [-]
people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already
c7b 11 hours ago [-]
You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.
And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.
SXX 9 hours ago [-]
You forget that after 2 years you still gonna have said Mac Studio that can be sold off for 30-50% of the price.
Of course its gonna lose value faster if something magical happen with hardware manufacturing, but you'll likely get 25% back at least.
On other side you cant really predict how valuable claude max gonna be in a year because Anthropic can further enshittify it.
fsuts 12 hours ago [-]
Why do you think they are rushing to IPO!!
storus 2 hours ago [-]
So a minimum of 3x RTX Pro 6000 to run 1-bit at ~76% accuracy or MacStudio 512GB RAM to run 4-bit at ~97% accuracy.
iaw 33 minutes ago [-]
No. Unsloth has CPU offloading. It'll be slow but it'll work even with SSD offloading.
Havoc 9 hours ago [-]
I bet OpenAI and Anthropic hate the timing of glm 5.2.
Kinda shows they have a headstart rather than a magic moat
achrono 2 hours ago [-]
Nope, GLM 5.2 is only the latest and greatest in a long line of open-weights models. There are even fully open source models that are comparable to o1-mini (OLMo), or almost-fully-open ones that are comparable to o3 (Nemotron).
I'm super grateful to the open labs (who, importantly, do not have the word 'Open' in their name), all the more so to the likes of Ai2.
There is no magic moat indeed. It is math, engineering and of course copious amounts of data (and the political maneuvering required to secure it, e.g. how most everyone has trained on Anna's Archive by this point).
jessinra98 4 hours ago [-]
Anyone here tried both Qwen and GLM families on the same setup and found a clear winner for one task vs the other?
9 hours ago [-]
c7b 11 hours ago [-]
Can someone explain the math to me? Why is 1-bit only ten percent less memory than 2-bit?
idonotknowwhy 10 hours ago [-]
2 reasons.
First, it's not really "1 bit", actually much closer to 2-bit.
IQ1_M is actually 1.75bit and IQ2_XXS is 2.06bit
This is from the ./llama-quantize --help with most of the quant types and their size in bpw:
https://pastebin.com/bCUqGfeE
And to elaborate on the "dynamic" aspect inconito said in the other comment, if you click on one of the .gguf files in huggingface:
There are a lot of Q5_K, Q6_K, etc tensors.
Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down_proj is actually iq3_xxs for this model.
incognito124 10 hours ago [-]
Keyword dynamic, the parameters are quantized on a case by case basis
numlock86 12 hours ago [-]
Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand.
There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.
iaw 32 minutes ago [-]
On a model of this size quantization has much less impact on quality of output. I'm running a 3bit version and find it comparable to sonnet, almost opus.
nicman23 12 hours ago [-]
it is not a flat quant but a dynamic
jonathanhefner 15 hours ago [-]
> Runing GLM-5.2 on local hardware
Do the runes make it smarter or just run faster (or both)?
nicman23 11 hours ago [-]
depends on the color
zkmon 9 hours ago [-]
I have high respect for unsloth's work, helping millions to get started with local AI, but this post appears kind of download bait.
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.
iaw 30 minutes ago [-]
I run 3bit GLM5.2 and full precision Qwen3.6-27B. GLM is much much closer to frontier models in it's breadth and ability to plan. If you just need to implement Python code from an existing plan Qwen is your choice but it has problem succeeding with more complex tasks that GLM5.2 does not.
As I type this my local GLM5.2 is troubleshooting bugs that Qwen would not be able to handle.
CGamesPlay 17 hours ago [-]
Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
dannyw 13 hours ago [-]
Generally 97.5% token agreement is very positive. Like the article explains, the difference isn’t the model thinking the capital of France isn’t Paris, but rather maybe saying “The capital of France is Paris” instead of “Paris is the capital of France”.
drudolph914 13 hours ago [-]
GLM 5.2 is the first time I'm actually excited about AI! I'm not the most bullish on AI code for several few reasons, but the biggest reason is the ownership model. We all know we're near the tail end of the "subsidized pricing" window for AI, and I've been hoping for so long to get an open weight model that is _close enough_ to the SOTA before this window closes - and we actually got it! I'm excited to be able to in the near future run GLM locally, and use these things like a tool instead of living in this for-rent model for the rest of my life. I'm excited to actually enjoy programming again
andai 18 hours ago [-]
How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?
walrus01 9 hours ago [-]
I really don't think anyone is going to have a good time trying to run it on anything with 256GB of RAM no matter what the post says. 512 is the much more realistic minimum. I'm fortunate enough to have two 512GB RAM dual xeon workstations in my home office that I bought cheap before the price rise to mess around with things...
edg5000 12 hours ago [-]
One advantage about local LLM: You could serialize the context yourself, without being constrained by APIs. And let's not forget, the Big 2 encrypt their thinking. If you use custom clients, which is a very grey area alreay, being able to produce the context string raw is a big bonus. Takes away a lot of annoying constraints and needless mystique/obfuscation.
But I don't know how usable GLM 5.2 is vs the Big 2.
cjbprime 8 hours ago [-]
I've got access to a 192GB RAM Mac Studio, which is below the stated minimum RAM. Can swapping off fast disk be used to make it work out, especially since it's MoE?
walrus01 8 hours ago [-]
Seems like a good way to shorten the lifespan of an NVME SSD significantly by using up its TB written lifespan, if you let it extensively swap. Also the performance will be absolutely abysmal like 0.1 tok/second.
smallerize 4 hours ago [-]
The LLM tools are smart enough to keep the weights on the disk and read-write stuff in RAM.
jzer0cool 8 hours ago [-]
1 bit requirement (1-bit 223 GB wowza). What you all recommend with 24-48 vram, or is this approach much out dated now.
snootypoot 16 hours ago [-]
if sam altman didnt exist i could afford to run this
numlock86 12 hours ago [-]
if sam altman didn't exists this model would most likely not exist as well
maxignol 8 hours ago [-]
Lucky me, I never go out without my 256gb unified ram mac x)
ramgine 16 hours ago [-]
I have up to 1tb of ddr4 in my server but it only has a 12gb vram 3060. Would getting a 24gb vram make this a viable system or am I throwing money away?
segmondy 15 hours ago [-]
You can run it today with that 12gb vram 3060, but I would suggest getting 2 3090s. Use cmoe option. This will keep the attention/route tensors on the GPU and offload the rest to system memory. Try it now and see the performance.
rnewme 15 hours ago [-]
Should work yes.
suyash 10 hours ago [-]
We really need a quantized version for regular laptop
Wowfunhappy 17 hours ago [-]
> The full model requires 1.51TB of disk space
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
gcr 17 hours ago [-]
There are two forms of compression relevant to LLMs:
1. Reduce the number of parameters
2. Reduce the resolution of each parameter (quantization)
For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
Parameter counts = world knowledge, quantization = “smarts.”
This is a soft rule of thumb, the difference isn’t very strong.
throwdbaaway 13 hours ago [-]
On ZFS with zstd compression, I am getting 1.34x compressratio for the BF16 weights (across multiple models).
Here's the du output for GLM-5.2:
$ du -s -BG /cube/models/zai-org/GLM-5.2/
1099G /cube/models/zai-org/GLM-5.2/
> ...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
TBH this is like the near last ranking consideration in cost for being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.
Even if you fed it into xzip with the most cpu intensive compression options and it didn't compress at all (eg: like trying to xzip an AV1 video, or whatever), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.
But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
redox99 17 hours ago [-]
Probably not at all, considering weights are randomly initialized.
dofm 16 hours ago [-]
Can't run this myself.
But I do like Unsloth Studio, quite a lot. It's nicely designed.
bilekas 4 hours ago [-]
> this can directly fit on a 256GB unified memory Mac
And yet Apple won't sell them to you anymore. And I'm not too confident it will be even possible to hand then 10k to get one again.
hxii 17 hours ago [-]
Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
segmondy 15 hours ago [-]
Completely worth it. At 6tk a second. If I can get 2 hrs of token generation. That's 2hrs * 3600secs * 6tk = 43200 tokens, at about 10tk to a line of code, that's about 4320 lines. Let's even trim it more and slice it by half. That's 2160 lines of code a day. Most professional programmers can't deliver that much consistently in a day.
The key to a model this large is (1) Use it to plan, generate lots of plan and farm out to a smaller model. Then for very specific and complicated portions precisely prompt for what you need.
uberex 13 hours ago [-]
Thats not a complete reasoning. Even frontiers need to revisit and fix things. Add 10 loops to that and it is 20 hours. Still great compared to a 2023 human, but why am I not just paying pocket money for Claude Pro instead?
segmondy 12 hours ago [-]
You're talking about agentic workflow. Agentic is cruise controls. Race car drivers shift manually for more precision and to go faster. If the only way you know how to code with AI is agentic, then you are putting yourself on a crutch.
uberex 12 hours ago [-]
You are saying you can one shot without loops on something like GLM-5.2?
chakintosh 4 hours ago [-]
Breaking even in 2069
nullc 17 hours ago [-]
Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.
zuzululu 18 hours ago [-]
wonder if AMD's new ai chip can run this with ease? I'm seriously consider buying it. GLM 5.2 is just shy of GPT 5.4 so I would welcome offloading any grunt work locally
I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
UncleOxidant 17 hours ago [-]
Are you talking about Medusa Halo? It's going to support up to 256GB unified memory (up from 128GB for Strix Halo and 192GB for Gorgon Halo). That might just be barely enough to run a 2-bit quant GLM-5.2. It will expand memory bus to 384-bits, vs. 256-bits for Strix Halo which will help with bandwidth (projected to be around 500 GB/sec). But don't expect Madusa Halo-based machines to appear until sometime in 2028.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
monksy 14 hours ago [-]
Strix Halo only supports 96gb of video memory then it goes to 32gb to the host system.
zuzululu 16 hours ago [-]
yeah you are correct 2 bit quant won't be enough
guess we'll be paying $200/month for a while
nl 18 hours ago [-]
> I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside it's currently around 40-70,000 EUR to run this with a FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
hsuduebc2 17 hours ago [-]
I wonder, if in the near future any acquisitions of some RAM producers with intent to just keep RAM prices up, will happen from the AI companies. It could seriously hurt their business, if companies will be able to host their AI in some time.
nl 16 hours ago [-]
I think AI companies have enough things to spend capital on already.
zuzululu 16 hours ago [-]
[dead]
Iolaum 18 hours ago [-]
At full quantization GLM 5.2 may be close to GPT 5.4. But at Q2 or whatever one needs in order to run it on a pro-sumer device it will be worse.
Also I m not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
nh43215rgb 18 hours ago [-]
Even with upcoming AI Max+ PRO 495 we are capped with 192GB, so no...
benjiro29 18 hours ago [-]
"GLM 5.2 is just shy of GPT 5.4"... If your running the full model. As in have 750 (FP8) to 1.5TB(FP16) of memory available.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean a accuracy loss of about 3%. Not noticeable but more chance for mistakes.
* FP2 ... what is what most people are able to run at home, for a "reasonable" price. Your looking at over 17% loss in accuracy.
At that point, your running at less then claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonable priced is still in the ~ $5000 range (192GB + GPU 32GB active/kv cache system).
For that price your using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone with a FP2 GLM 5.2 version. And your looking at < 10 fps. A MacStudio with 512GB will net you 18 a 20fps+ with FP4, but ... i mean, those used to be $10.000.
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: Its funny whenever the issue of cost and what you need to give up vs the subscription services, there are always people who downvote in bad faith.
kgeist 15 hours ago [-]
The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money if we kept using AWS Bedrock (infosec reasons) for a couple years but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU, all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
benjiro29 4 hours ago [-]
Yes but unfortunately a lot of the discussion that people participate in, are not done from a corporate point of view, but from a normal consumer level.
And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (take in account the privacy issues).
Everybody wants GPT 5.5/Opus 4.8 Max levels, on a budget that simply is not realistic. And GLM fit in that 4.8 medium/low level.
But then people do not want to be told that running a 750b model in Q2 or Q1 is just going to destroy the models accuracy. And that is still going to cost them 5k+ for that reduced model.
The whole local llm landscape from a consumer point of view, is just filled with odd people. lol.
Corporation really benefit from those models, because spending $90k on a server, is a deductible expense. And they are billed at token prices anyway from all the major providers. So its a even faster ROI on that hardware.
I am surprised that nobody figured out to make a business of selling leftover capacity from corporate llm installations, because there is easily 12h+ just wasted (unless its a large corp that has people in all timezones).
froh 12 hours ago [-]
> GPUs are extremely underutilized if you launch just 1 generation stream
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
cataflutter 9 hours ago [-]
Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. Not sure of the exact mechanism but e.g. llama.cpp's llama-server can do this (you tell it the number of slots to have when starting, then fire HTTP requests at it and it batches them together when it can).
Waiting for the hooman (or tool calls) won't help either, of course.
entrope 4 hours ago [-]
The mechanism is that generating tokens (the "decode" phase) in an LLM is limited by memory bandwidth for the weights, so computing multiple streams amortizes the bandwidth over streams as long as you can keep the contexts in RAM. This is most true for dense models and the always-on expert in MoE models, or when you have significantly more streams than experts for MoE models.
In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.
zuzululu 16 hours ago [-]
you are right that means GLM is still quite far off from truly competitive
i think your answer was perfect not sure why you are being downvoted
kccqzy 18 hours ago [-]
The AMD 395 supports up to 128GB unified RAM. So still not enough even at 1-bit quant unfortunately.
monksy 14 hours ago [-]
96gb vram is the max it supports.
cpburns2009 3 hours ago [-]
That's the max you can statically allocate in the BIOS. It's best to leave that at the minimum (500 MB I think), and let the drivers dynamically allocate. You can use up to about 120 GB on Linux.
selfhoster11 2 hours ago [-]
Under Linux it is allegedly 110GB, but I’m not sure.
viduus 43 minutes ago [-]
[flagged]
jingpostmedia 1 hours ago [-]
[flagged]
c121618 2 hours ago [-]
[flagged]
cws_ai_buddy 15 hours ago [-]
[flagged]
VaporJournalAPP 15 hours ago [-]
[flagged]
tsouth2 17 hours ago [-]
[dead]
lisa_luoyf 8 hours ago [-]
[flagged]
boringspinner 15 hours ago [-]
[dead]
CHUNK_CHUNK 15 hours ago [-]
[flagged]
9 hours ago [-]
stackedinserter 3 hours ago [-]
TLDR: realistically, you can't.
ofcyes 9 hours ago [-]
[dead]
Rendered at 16:50:04 GMT+0000 (Coordinated Universal Time) with Vercel.
Local inference needs really hard a 1.2 / 1.5 T/s memory bandwidth system with 512GB and 2/3 times the GPU compute of Mac Studio M3 Ultra, at an affordable 10/15k price point. A variant with 1TB memory would also be welcomed at 20k price point.
Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and has much faster tok/s if you pay a hyper scaler.
That said, I do like the direction we are heading and look forward to seeing what host your own hardware we get in 2 years.
I think the main reason not to run locally is to get the full models instead of quantized versions.
I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.
Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].
Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.
Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.
Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.
Tinfoil have not been independently audited, it is somewhere on their long-term radar.
Privatemode have been thoroughly independently audited with documentation available on request.
Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil or Privatemode will be the place for you.
[1]https://tinfoil.sh/ [2] https://www.privatemode.ai/
I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.
I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.
However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price-hike like has been seen everywhere else. I am not naïve.
When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.
Ultimately I still blame the chip-hoarding hyperscalers though. :)
At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
Additionally in a place where people use electric baseboard heating or electric in floor radiant heating, or really any other heating element based system in winter that's less efficient than a heat pump, additional electrical from a computing load is basically "free" since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k token/day, so we're in the same ball park, but without the capex for 2 3090's and the machine.
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
Running a system that will be 600W under max CPU usage on all cores and RAM and a few 3090-class GPUs, that same system might be only 90W or around there when idle at 0.00 unix load.
If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day
But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
You can’t do that with 6 tps, though.
No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.
It isn't 100% efficient. Even the best PSUs aren't.
There is no "ubiquitous" geothermal where there also high power usage. Data centers have to go where power is, not can be.
[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...
Or cloud LLM might just refuse to sell to you because it dont like your passport.
Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.
Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.
MiMo-2.5-Pro has gone with UltraSoeed, pumping out 1000t/s for a 3X price hike.
GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.
DeepSeek will probably keep their pricing model and just keep getting better and better.
Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.
The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.
Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.
I say this as a software engineer from Europe.
Qualify it to software, rather than all tech, if you will.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
32 CPU Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in best case, it will just downgrade to "very" expensive at some point in the future.
This makes sense only if 1. one is paranoid about privacy or 2. they have money to smoke or 3. they need to workaround cloud model restrictions, AND they have to do it routinely (because if not, a oneshot cloud bare metal setup is way cheaper, faster, and allows more powerful models, due to VRAM offering).
I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.
Can you put up with that? As seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger slower ones
Incase it's not clear, you will be generating 10,000,000 a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing a predictable work and have a good workflow and massive dataset to process, then speed of token matters. If you are performing creative work like coding, it doesn't.
Apart of running local models I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers can't be happier not having to deal with constantly hot laptop. Not to mention that Claude Code is such a battery hog.
[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...
Or maybe the model itself only runs at gpus, and the cpu memory only store the weights for experts not corrently activated? If so, then what's the 32 or 64 cpu cores for?
I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software
https://github.com/noonghunna/club-3090/blob/master/docs/DUA...
Cloud offerings are 80-200tk/sec versus single digit tk/sec.
That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.
A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being an impressive open model exists and starts becoming this is something a small team can actually put into its development stack.
For instance, company buys an RX6000 setup for say $15k total. They could use this for handling data heavy sifting that would otherwise be a lot of Claude tokens.
It doesn't need to be as good as frontier-best. Just good enough.
I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.
Considering they might be spending thousands per month on API costs already, dropping 15K to save on one process might not be bad. On the other hand, also an opportunity to sell GLM 5.2 inference at near cost to other companies for less than whatever Claude costs. In theory it costs anywhere from $0.51 to less than $2 an hour to run it and use it 24/7 that's still wildly cheaper than calling Opus which doesn't bill per hour, but per million tokens, drastically higher. Hell, you could probably bill at $5 per GPU hour and still be cheaper. Whether you're looking to self-host or sell hosting for it, it looks way cheaper regardless. I think most decent open models will continue to fit in at least 32GB of VRAM so a 6000 Pro GPU is more than enough. alternatively, even on a 5090 you can get a reasonable amount of inference for way less than paying for Opus, Qwen would be your friend there though.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
Official price 85k...
The problem is the backplane I have not managed to find a single baseboard, and getting a random baseboard to work with random modules is probably a crap shoot.
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.
No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.
I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
The page advertises the 8-bit quant as taking ~800GB, which seems like it would require at least 3 consumer motherboards fully stacked w/ 4x64GB cards each.
Maybe “locally” has slowly come to imply “…on your homelab”?
I was lucky to buy a lot of RAM before prices skyrocketed. I knew I wanted to play with this stuff, so I spent what felt like a lot of money at the time to buy 8x96GB DDR5-6400 RDIMMs. Now the same RAM costs at least 6x more.
It wasn’t that absurdly expensive for a hobby, I bought 64GB DDR4 ECC sticks between $70-$100 on eBay before everything took off. Now everyone is in here debating if open source is 1 month or 3 months behind SOTA. The future is obviously local.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
Each token has to read all the active weights. I think that's around 40B parameters active. At a 4-bit quant that's 20GB. With 100GB/s (replace with whatever your bandwidth is) and you get 5 tokens per second.
On top of that, you will still be heavily quantized.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
If you are training and doing research it's great, if you want to cluster them it cant be beat, but if you just want local inference on a single box buy a mac or even a strix halo device.
I'm a Linux guy, but also don't always have alot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to setup and the guides and online resources make getting up and running trivial, for even some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
https://build.nvidia.com/spark
I think between unboxing mine setting it up to run headless, and generating tokens was like 20 minutes total for me.
Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.
I see ~274 GB/sec for the DGX Spark[1], versus 307 GB/sec for M5 Pro and 460 or 614 GB/sec for M5 Max[2]. One might call 90% "basically the same", but there are nominally two tiers above "Pro".
Yes, a MacBook Pro with 128 GB and M5 Max costs $5100 (14") or $5400 (16") versus currently $4700 for the DGX Spark, but the MBP includes keyboard, mouse, battery and portability. I believe its prefill is slower and you get 2 TB vs 4 TB SSD, but overall one gives up a lot to save 10% of the cost.
[1]- https://docs.nvidia.com/dgx/dgx-spark/hardware.html [2]- https://support.apple.com/en-us/126319
Apple could actually be a good deal and you folks would still make up something to not justify it. In a way, it’s amazing what Apple has accomplished- Baseless negatively-tainted perception in certain influential tech circles.
(To be fair, they’re kind of earning it. I’m glad Tim “Sweet T” Cook is departing.)
Plus, my original comment got downvoted despite being factually-correct. Thanks, Reddit. Oh, wait…
The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
(Still potentially very useful! But not magically ultra fast.)
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, amd, intel, Cerebras etc pushing new hardware
- oss models getting crazy good, like glm 5.2
- flash models getting very good like deepseek V4 flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.
Don't underestimate the markets ability to remain irrational
If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.
Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
True, Qwen3.6-27B is amazing for it's size. However, it seems likely that we're not going to see anymore of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.
For companies wanting LLMs in their products in general, I have to think going the local llm route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products.
> As long as that is allowed to happen
It won't be. Only we can provide that, and only for ourselves.
Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
It's not like you'd lose capabilities, if anything this solution just gets better with time.
I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.
TPS = active weights in GB / your memory bandwidth.
That’s it for decode. That’s all.
You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.
For who pays for it, obviously the employer would.
For "how did I arrive at this number" Ballpark estimate from what I know about part cost. Most of that money will go towards AI cards about $5k for the mb, cpu, power supply, etc. $45k would be for as much ram and as big/expensive nVidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM
a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again
denser open source models, packing more experts for smaller active layers
it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.
If a model needs 2x more memory, but serves the same number of customers, the cost is going to go up per customer to cover the increased hardware and power costs. Companies are starting to implement AI limits to keep costs under control.
Anthropic and OpenAI are rumored to be considering cutting inference prices trying to retain customers as LLMs commoditize and race to the bottom. It reminds me of the Chinese bike wars where bike-share companies were losing massive amounts of money, but kept running sales and lowering prices in an attempt to compete and drive out their competitors. The end of that story was a bunch of major bankruptcies and giant bike graveyards.
Nvidia's hard pivot to "in the near future, everyone will run their AI at home" seems to indicate that they also see the market shifting. We've already had AI ingest everything out there. The real challenge becomes how to better optimize their algorithm to get more useful data in less space.
a lot of innovation occurring
I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.
But if you already have agent loops dialed in (For example I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.
Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!
(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can live easily with Claude's cheapest plan without ever facing usage limit).
for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.
You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.
And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.
Of course its gonna lose value faster if something magical happen with hardware manufacturing, but you'll likely get 25% back at least.
On other side you cant really predict how valuable claude max gonna be in a year because Anthropic can further enshittify it.
Kinda shows they have a headstart rather than a magic moat
I'm super grateful to the open labs (who, importantly, do not have the word 'Open' in their name), all the more so to the likes of Ai2.
There is no magic moat indeed. It is math, engineering and of course copious amounts of data (and the political maneuvering required to secure it, e.g. how most everyone has trained on Anna's Archive by this point).
First, it's not really "1 bit", actually much closer to 2-bit. IQ1_M is actually 1.75bit and IQ2_XXS is 2.06bit This is from the ./llama-quantize --help with most of the quant types and their size in bpw: https://pastebin.com/bCUqGfeE
And to elaborate on the "dynamic" aspect inconito said in the other comment, if you click on one of the .gguf files in huggingface:
https://huggingface.co/unsloth/GLM-5.2-GGUF/blob/main/UD-IQ1...
There are a lot of Q5_K, Q6_K, etc tensors. Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down_proj is actually iq3_xxs for this model.
There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.
Do the runes make it smarter or just run faster (or both)?
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.
As I type this my local GLM5.2 is troubleshooting bugs that Qwen would not be able to handle.
But I don't know how usable GLM 5.2 is vs the Big 2.
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
1. Reduce the number of parameters
2. Reduce the resolution of each parameter (quantization)
For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
Parameter counts = world knowledge, quantization = “smarts.”
This is a soft rule of thumb, the difference isn’t very strong.
Here's the du output for GLM-5.2:
TBH this is like the near last ranking consideration in cost for being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.
Even if you fed it into xzip with the most cpu intensive compression options and it didn't compress at all (eg: like trying to xzip an AV1 video, or whatever), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.
But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
But I do like Unsloth Studio, quite a lot. It's nicely designed.
And yet Apple won't sell them to you anymore. And I'm not too confident it will be even possible to hand then 10k to get one again.
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
The key to a model this large is (1) Use it to plan, generate lots of plan and farm out to a smaller model. Then for very specific and complicated portions precisely prompt for what you need.
I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
guess we'll be paying $200/month for a while
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside it's currently around 40-70,000 EUR to run this with a FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
Also I m not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean a accuracy loss of about 3%. Not noticeable but more chance for mistakes.
* FP2 ... what is what most people are able to run at home, for a "reasonable" price. Your looking at over 17% loss in accuracy.
At that point, your running at less then claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonable priced is still in the ~ $5000 range (192GB + GPU 32GB active/kv cache system).
For that price your using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone with a FP2 GLM 5.2 version. And your looking at < 10 fps. A MacStudio with 512GB will net you 18 a 20fps+ with FP4, but ... i mean, those used to be $10.000.
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: Its funny whenever the issue of cost and what you need to give up vs the subscription services, there are always people who downvote in bad faith.
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (take in account the privacy issues).
Everybody wants GPT 5.5/Opus 4.8 Max levels, on a budget that simply is not realistic. And GLM fit in that 4.8 medium/low level.
But then people do not want to be told that running a 750b model in Q2 or Q1 is just going to destroy the models accuracy. And that is still going to cost them 5k+ for that reduced model.
The whole local llm landscape from a consumer point of view, is just filled with odd people. lol.
Corporation really benefit from those models, because spending $90k on a server, is a deductible expense. And they are billed at token prices anyway from all the major providers. So its a even faster ROI on that hardware.
I am surprised that nobody figured out to make a business of selling leftover capacity from corporate llm installations, because there is easily 12h+ just wasted (unless its a large corp that has people in all timezones).
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
Waiting for the hooman (or tool calls) won't help either, of course.
In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.
i think your answer was perfect not sure why you are being downvoted