NHacker Next
  • new
  • past
  • show
  • ask
  • show
  • jobs
  • submit
Qwen 3.6 27B is the sweet spot for local development (quesma.com)
iagooar 21 minutes ago [-]
I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

Arubis 3 minutes ago [-]
Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!
manmal 2 minutes ago [-]
There is no MacBook Pro with OLED (yet).
acters 10 minutes ago [-]
Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

busymom0 2 minutes ago [-]
Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.
SkitterKherpi 11 minutes ago [-]
I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.
awesomeusername 8 minutes ago [-]
It's out, I'm daily driving one. It's great
oceanplexian 13 minutes ago [-]
If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.
verdverm 15 minutes ago [-]
Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)
diseasedyak 60 seconds ago [-]
I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.

For anything else local, including writing some automation scripts and such, it works great.

bensyverson 2 hours ago [-]
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

dofm 2 hours ago [-]
The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.

I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.

The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

pizza234 1 hours ago [-]
> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.

Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.

When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.

dofm 57 minutes ago [-]
Again, I would not argue against any of this.

And I can't say that I won't switch to openrouter (even just for the same models) at some point.

But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.

wahnfrieden 44 minutes ago [-]
Difficult... and wastefully expensive
dofm 41 minutes ago [-]
I mean, it's a (secondhand) computer I bought for other tasks (processing very large photos, compiling large apps quickly). It's running all the time. It can also run LLMs when I want to.

The rest of my life is ultra-frugal so I am relaxed about this.

AlpacaJones 48 minutes ago [-]
The key word there is 'currently'.
smt88 30 minutes ago [-]
Economies of scale are a fact of nature and aren’t going to be subverted in the future by even the most advanced local models
kennywinker 8 minutes ago [-]
Which is of course why, if you want to render 3d scenes to play a video game, you have to rent time on a mainframe system. I don’t see that changing ever - it’s just economies of scale!

(sarcasm, btw)

oceanplexian 4 minutes ago [-]
You know economies of scale doesn’t mean it always gets cheaper at scale right? Things can get both more expensive and cheaper at scale, hence the term.
psychoslave 37 minutes ago [-]
Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues.

That's never the point of keeping local alternatives though.

dofm 26 minutes ago [-]
Right.

For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.

Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.

44 minutes ago [-]
45 minutes ago [-]
codazoda 22 minutes ago [-]
I agree with the learning aspect, but I have another motivation. I suspect that closed models might become too expensive to run for personal hobbyist use. I’ve been planning to buy a 64GB machine just to allow the limited local models this enables.
oceanplexian 7 minutes ago [-]
Honestly your best bet is to buy a $20 Claude subscription, ask Claude to set it all up with Pi and llama.cpp and come back in 20 minutes after a cup of coffee. This is also a good idea because it will help set expectations of what a local model can do vs. a frontier model.
rusk 2 hours ago [-]
> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.

I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)

dofm 1 hours ago [-]
LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.

I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.

A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.

cyanydeez 1 hours ago [-]
I've setup to local paradigms for local coding:

- opencode with it's webui

- deer-flow with it's research/powered front end

They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).

It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.

It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.

dofm 1 hours ago [-]
> - opencode with it's webui

Have you tried Paseo?

I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.

(You can also use the Opencode GUI to frame a remote opencode web interface)

c-hendricks 55 minutes ago [-]
You can also just add OpenCode web as a PWA, if that's what you mean by "frame".

I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs

ddalex 2 hours ago [-]
I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you
dofm 1 hours ago [-]
Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.

I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.

So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.

I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.

For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.

I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.

(Also who am I kidding about the existence of a printer with no problems)

swiftcoder 31 minutes ago [-]
I don't necessarily think your answer is wrong for all people, but if you work in software... how do you plan to differentiate yourself from everyone else out there, if the depth of your understanding is "Claude can do it for me"?
dofm 11 minutes ago [-]
This ultimately is the discussion I am here for.

I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.

I guess I could also ask it to do the work. But where do you draw the line?

The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.

coldtea 50 minutes ago [-]
>no need to learn, just ask it to do it for you

And that's how skills die.

CamperBob2 44 minutes ago [-]
When's the last time you shoed a horse?

The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.

No longer having to sweat all the details is a Good Thing, not a Bad Thing.

WickyNilliams 25 minutes ago [-]
If I worked with horses for 8 hours a day I imagine the answer would be "recently"
dofm 37 minutes ago [-]
I am not sure I disagree, and I certainly don't mean to disagree very fervently.

But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.

verdverm 27 minutes ago [-]
At the same time, most people can drive without understanding how a car works.
saganus 3 minutes ago [-]
[delayed]
psychoslave 32 minutes ago [-]
Having to shoe a horse never was a general skill.

Maybe a more apt analogy would be a skill like making fire without a lighter.

charcircuit 42 minutes ago [-]
Except with AI models it's possible to make a backup of them creating a permanent artifact of a skill.
sorokod 1 hours ago [-]
Then what is the point of ddalex?
dofm 44 minutes ago [-]
I think if you really don't feel the need to know the "why" of everything, sometimes this might be the right approach. It is quick, pragmatic, gets you started.

Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.

So this is really the only way I know how to proceed.

kdkdjduxnd 1 hours ago [-]
[dead]
porphyra 2 hours ago [-]
You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

esperent 2 hours ago [-]
> 48GB of VRAM with, say, two 3090s

So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

fluoridation 1 hours ago [-]
>Plus I assume it's considerably more effort to get it working.

Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

Catloafdev 2 hours ago [-]
The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.
bitexploder 1 hours ago [-]
For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable.

The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.

Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.

Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.

thewebguyd 2 hours ago [-]
I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.

Catloafdev 2 hours ago [-]
Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.
bitexploder 1 hours ago [-]
It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.
gchamonlive 1 hours ago [-]
[dead]
Numerlor 50 minutes ago [-]
And if you go for actual GPUs it'll run much faster, I'd say 24gb may be pushing it for context, but my 5090 with 32GB VRAM is usually somewhere between 60 to 100 tok/s with mtp and 2-3k tok/s for prompt processing. I'm not sure what they cost now but it's definitely still quite far from the macbook, and there's also some other 32GB GPUs that are considerably more affordable
nok22kon 1 hours ago [-]
a computer with 24 GB VRAM is at least $3000
daemonologist 9 minutes ago [-]
A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.

Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.

sleepyeldrazi 1 hours ago [-]
I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.
danilocesar 3 minutes ago [-]
I can't find a 3090 for less than 2k CADs (or 1200 eur). Is this the average price in Germany? It's pretty cheap.
throw1234567891 1 hours ago [-]
But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.

Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

wilsonnb3 42 minutes ago [-]
Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.
throw1234567891 27 minutes ago [-]
Sure, but no one gets everything. Just that one.
DANmode 33 minutes ago [-]
That’s at today-prices.

If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?

wahnfrieden 41 minutes ago [-]
It's much slower, and often quantized
nozzlegear 2 hours ago [-]
Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.
stymaar 2 hours ago [-]
> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.

Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

boutell 1 hours ago [-]
That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:

https://github.com/noonghunna/qwen36-27b-single-3090

Flies though (50-70tps is impressive for a model this smart)

I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.

stymaar 44 minutes ago [-]
> That 3090 is going to burn 750W

The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.

4chandaily 29 minutes ago [-]
This is correct. I have (4) 3090s in my inference server, and they are each capped at 250w. I run Qwen 3.5 122B-A10 at about 45-50tok/s on this and am quite happy with it. At idle it draws around 95-105w for all four, which is a bit high, but tolerable.
redox99 14 minutes ago [-]
I bought 2 used 3090s some years ago for $500 each. They're probably a bit more expensive now, but I guess for something like $2000 you can build a barebones 2x3090 PC which will be way faster than a Macbook. (you're fine with very basic hardware outside the GPUs)
acchow 18 minutes ago [-]
That $6700 is a $5000 upgrade over a base model Macbook Pro.

$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)

dmayle 1 hours ago [-]
For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45
dannyw 2 hours ago [-]
I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.
dom96 1 hours ago [-]
I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.
doodlesdev 2 hours ago [-]
How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).
dannyw 1 hours ago [-]
I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.

I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.

organsnyder 2 hours ago [-]
I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.
andy99 2 hours ago [-]
I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.
cyanydeez 1 hours ago [-]
did you upgrade to MTP?
montebicyclelo 1 hours ago [-]
Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.
elorant 29 minutes ago [-]
You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.
georgeven 2 hours ago [-]
I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)
Dig1t 1 hours ago [-]
How did you buy 3 V100's for $1500??
Insanity 2 hours ago [-]
But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.
petilon 2 hours ago [-]
Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.
roadside_picnic 2 hours ago [-]
My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.

In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.

Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B

So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).

The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

prima-facie 36 minutes ago [-]
> The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!

petilon 57 minutes ago [-]
> insane amount of effort goes into getting more powerful models to run with the same or less RAM

The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.

Insanity 2 hours ago [-]
You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.
bluGill 2 hours ago [-]
Will they? Or will we find ways to optimize models and need less? Only time will tell.
simonw 2 hours ago [-]
It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.

... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.

godwinsonsucks 5 minutes ago [-]
[dead]
cyanydeez 1 hours ago [-]
I think you have too much faith in context AGI.

at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.

Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.

someperson 2 hours ago [-]
In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).
jubilanti 2 hours ago [-]
[flagged]
dvduval 2 hours ago [-]
Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.
cyanydeez 1 hours ago [-]
AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.

I think you might be a little to into the stew here.

zdragnar 59 minutes ago [-]
I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do.

I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.

organsnyder 51 minutes ago [-]
I'm running Lemonade on Nixos on my Framework Desktop. I had been trying other tools out before finding Lemonade, but Lemonade really made it plug-and-play.
1 hours ago [-]
trentor 60 minutes ago [-]
Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.
colinsane 1 hours ago [-]
i like that people are taking the privacy argument seriously, after however many decades. i think there are other arguments to be made for running these locally which are less settled, but IMO the Fable debacle drives it home: the surest way to embrace this technology without worry that it will be taken away from you down the road is to physically own the compute.
r_lee 18 minutes ago [-]
if you need to ensure that, then just back up the model and buy hardware if the need arises
oldfuture 2 hours ago [-]
a lot of credits? we can’t predict any price change for them
AnimalMuppet 2 hours ago [-]
How many credits would it buy? How long would it take to use them up? What's the payback period?

From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.

dminik 52 minutes ago [-]
Using some rough napkin (well, spreadsheet) math, if you ran Qwen 27B for every minute every day at the current price of $0.195/$1.56 with a 2:1 input to output ratio (eg. agentic coding) at the advertised 22 tps it would take you just about 11 years to get to ~$5000 spent.

Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.

eli 1 hours ago [-]
Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.
1 hours ago [-]
h4ny 2 hours ago [-]
What kind of narrative are you trying to push?

Do you know how much VRAM/unified is needed for the 27B model, which is generally regarded as better between the two compared in the article, is needed with little to no KLD loss and at 256k context?

Also, once you worked out how much memory is needed for that, maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?

And when you have answered that, can you tell us how much privacy costs? Maybe also tell us how private OpenRouter is?

Edit: looking at other replies that are basically pointing out the same thing I did, I guess it's my wording. It's frustrating that people who misinform others in some nicely packaged ways or just simply uninformed get to keep doing that if they sound nice. Thanks.

kllrnohj 2 hours ago [-]
> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?

Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.

But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"

h4ny 2 hours ago [-]
That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?
zx76 3 minutes ago [-]
I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).

I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.

I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.

I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to invest in these products.

onion2k 2 hours ago [-]
None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

Aurornis 11 minutes ago [-]
> and it can fall back to similar examples in the training data easily.

This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.

My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.

Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.

If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.

The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.

janalsncm 1 hours ago [-]
> Being able to nail a zero-shot greenfield project is relatively easy even for a small model

Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.

ai_fry_ur_brain 1 minutes ago [-]
Yeah, and we still do take a week for people that actually care.

If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.

I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.

Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much.

Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.

cyanydeez 1 hours ago [-]
I love the ability to spin up any repo on github by pointing a local model at it with zero cost beyond the heat & electricity.
onion2k 9 minutes ago [-]
[dead]
Zambyte 22 minutes ago [-]
I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.
sosodev 2 hours ago [-]
In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that.

Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.

fluoridation 1 hours ago [-]
>Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines".

Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.

verdverm 24 minutes ago [-]
I had good results doing an open box reimplementation. Gave qwen access to my old projects and it rebuilt it on JAX.

https://github.com/verdverm/pge-jax

esafak 45 minutes ago [-]
I don't use local models but have you tried augmenting the model with code intelligence MCPs like https://github.com/DeusData/codebase-memory-mcp ?
h4ny 2 hours ago [-]
> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)

1. Maybe you should tell us what those limited experiments are.

2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.

3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.

snapcaster 32 minutes ago [-]
Nobody owes you a scientifically rigorous write up
doodlesdev 2 hours ago [-]
I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?

(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)

JeremyNT 1 hours ago [-]
I also don't understand why people in this price bracket are buying Mac laptops instead of desktop computers with GPUs? Just to flex that it's portable?
redox99 9 minutes ago [-]
Yeah, it's a much better idea to buy many used 3090s. 4090s or 5090s if you can afford it. Way faster.
jeroenhd 49 minutes ago [-]
A mac with a boatload of RAM can run models that will exceed the limits of any GPU not worth at least twice the Apple hardware itself.

You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend.

When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.

ilogik 49 minutes ago [-]
What GPU can I buy with >100GB of memory?
verdverm 23 minutes ago [-]
DGX Spark is one, but really depends on how much you want to spend
LeBit 44 minutes ago [-]
I think it is because desktop computers with GPUs with enough VRAM to run interesting models are insanely expensive, hard to source and consume a lot of electricity and dissipate a lot of heat.
verdverm 12 minutes ago [-]
It's not always about the price or being the cheapest. For me, it's about freedom, both to play and from the govt/corp censorship.
znpy 23 minutes ago [-]
> Are developers in other countries living in such different worlds?

Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…

adamors 1 hours ago [-]
Yes they are, 6k is peanuts to a lot of people.
bellowsgulch 45 minutes ago [-]
> Are developers in other countries living in such different worlds?

Yes. Your people earn an order of magnitude less income than Americans.

beastman82 2 hours ago [-]
FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

kofu 2 hours ago [-]
My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.
beastman82 1 hours ago [-]
I second unsloth models. I'm using them over blackwell-oriented nvfp4 models as they are (empirically) top quality and performance.
1 hours ago [-]
nozzlegear 26 minutes ago [-]
I can't Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.
accrual 2 hours ago [-]
Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.
nok22kon 1 hours ago [-]
you can probably run Gemma4 26B on your card also at 4 bit. World of a difference compared with 12B.
zingar 23 minutes ago [-]
Where does “big model highly quantized” start getting worse than “smaller model less quantized”? Is there a general formula or is it just trial and error?
MangoCoffee 7 minutes ago [-]
Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.
0x0000000 2 hours ago [-]
> ... on my Macbook Max M5 128 GB

Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

kllrnohj 2 hours ago [-]
You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.
DanHulton 2 hours ago [-]
This is what I run on an M5 MacBook Air 32GB. Works great.

I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.

Definitely the sweet spot for me.

__s 2 hours ago [-]
I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced
rhdunn 2 hours ago [-]
A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).

For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.

wpm 2 hours ago [-]
It wasn't $10k a month ago
mr_mitm 2 hours ago [-]
Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.
spike021 2 hours ago [-]
Certainly won't work on my M4 Pro with 24GB lol
MatthiasPortzel 1 hours ago [-]
I’m using it on a 48GB machine and it causes some lag, so it might be worse on 24, but it should run.

Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).

https://unsloth.ai/docs/models/qwen3.6

whynotmaybe 2 hours ago [-]
I feel you!

Sent from my 8gb M2 Mac mini.

narrator 16 minutes ago [-]
In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.
jjcm 2 hours ago [-]
I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.

Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong

IronWolve 1 hours ago [-]
I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.

Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.

RedCinnabar 2 hours ago [-]
Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.
giancarlostoro 2 hours ago [-]
You need it to run in about 8 GB so you have extra space for the context window.
Catloafdev 2 hours ago [-]
Hello, it's the internet calling, today is that day.

https://github.com/ikawrakow/ik_llama.cpp

Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.

rhgraysonii 2 hours ago [-]
I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
MatthiasPortzel 1 hours ago [-]
I posted this elsewhere, but Unsloth says the 27B model should run in 18GB. That leaves little RAM for other tasks, but it depends on your tolerance for slowness I suppose. I haven’t tried it in 24GB so report back if you do.

https://unsloth.ai/docs/models/qwen3.6

dofm 2 hours ago [-]
You might be interested in Ornith 1.0 9B, which is a new intriguing post-training of Qwen 3.5 9B.

Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.

I don't know about 48GB but 64GB should be enough.

simonw 2 hours ago [-]
I've been trying Ornith 1.0 35B, I'm pretty impressed with it: https://simonwillison.net/2026/Jun/29/ornith/
dofm 1 hours ago [-]
It's the one I have loaded right now.

It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.

It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.

TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.

jensC 22 minutes ago [-]
It is also available with Ollama now and I am equally impressed too.
rhgraysonii 2 hours ago [-]
Thanks! I was thinking of doing the 128gb to have some future proofing. I figure at this point, it's akin to a mechanic keeping great tools around, when it comes to having this sort of homelab and exposing it for your own uses. And great practice for building the next era of user facing computing that will be around as this proliferates.
dofm 2 hours ago [-]
I would not buy a 64GB model again, probably, if this were to remain particularly important to me. But I gather memory bandwidth is pretty important here.

So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.

There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.

The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.

(And I must reiterate that my understanding of this stuff is pretty naïve.)

freehorse 46 minutes ago [-]
Used M1 max is still a good choice because its memory bandwidth only got surpassed by generation m4 and later (except with ultra variants which are more expensive). Its prefill speed is not great though, and that is an issue for running larger contexts, which only substantially improved with m5. Moreover, up to m3 they only have thunderbolt 4, not 5, which means that they lack RDMA support which would make stacking machines more effective. So unless you go higher price for m4+ max, or any m ultra, m1 max is pretty decent still compared to m2 and m3 max, definitely better than pro variants, if you can find in a decent price and want to experiment without caring much about time to first token and large contexts.

A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...

Its sister discussion for nvidia gpus is https://github.com/ggml-org/llama.cpp/discussions/15013

Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.

The issue I have with my m1 max is that with 64gb you cannot run really decent MoE models, ie the ones you can run like qwen 35B-A3B have only 3b active parameters and are much less capable than qwen 27b in my testing. So I end up running the 27b one, but it runs relatively slow (though still usable at 10-20 tok/s) and I would have been better off a used nvidia gpu setup for dense models. I assume 35B-A3B has its use cases, eg as subagents, just that I cannot find them. With a higher amount of ram I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and prob a bigger one). The only hopeful thing is that there are performance hacks appearing (speculative decoding and prefill) that seem to start improving inference speed once getting implemented, so I am mildly hopeful.

(I must also iterate that my understanding is not very deep either)

dofm 19 minutes ago [-]
Good reply, those two links are v. useful and I had missed them.
Otternonsenz 1 hours ago [-]
Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?

I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).

And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.

fumeux_fume 1 hours ago [-]
I suspect with those specs, you're not in the game right now for reliably using local models for code generation. The easiest way in is a MacBook with at least 32GB of RAM. This should be able to run a 4bit quantization of qwen 3.6 using the MLX format really well.
Otternonsenz 34 minutes ago [-]
Now that I’m dipping more into this space, am gonna see what I can upgrade with the motherboard I have, but RAM pricing as it is, I’ll need to be smart about when I upgrade.

I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha

fumeux_fume 3 minutes ago [-]
M series macs are usually used for running these LLMs locally because the GPU and CPU share the same pool of RAM at very low latency. If you upgrade your RAM on a different kind of chipset without the Unified Memory Architecture, then it'll be much slower to produce all the tokens you need. Just another data point to add to your upgrade equation.
fluoridation 58 minutes ago [-]
I think at 16 GB you'd struggle to run the regular development tools nowadays, forget about any interesting inference.
Otternonsenz 31 minutes ago [-]
Fully agreed, and my hope is as open models grow and change, that getting some amount of this working on Pro-sumer hardware will be more attainable.

But certainly seems like we are a few years away from that, sadly.

Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?

fluoridation 26 minutes ago [-]
Training requires even beefier hardware than inference.
jadbox 1 hours ago [-]
[dead]
blopker 2 hours ago [-]
I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.

However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.

Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.

Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.

Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.

While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.

Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.

iwontberude 1 hours ago [-]
> I don't think coding is one

Certainly this is falsifiable easily by any of us doing it on a regular basis

> Qwen stuck in thought loops

This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this

blopker 43 minutes ago [-]
Sure, local coding is clearly _possible_, but it's not practical for most people. I've yet to see a reliable setup, if you have one, I'd love to see.

> creating plans, using subagents and compactions

Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.

kpw94 2 hours ago [-]
> What it does:

>

> --jinja for tool calling support

Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year

dom96 1 hours ago [-]
What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.

I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.

seemaze 2 hours ago [-]
I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.

https://pi-local-coding-bench.dev

jononor 47 minutes ago [-]
Many people in LocalLLaMA Reddit community has been reporting the same, that 3.5 122B-A10B is on par or slightly better. And a 3.6 or 3.7 od the 122B is one of the models people want to see the most.
blueside 42 minutes ago [-]
i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.

don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.

HotGarbage 2 hours ago [-]
And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
dofm 2 hours ago [-]
It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.

Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.

I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.

I would really like to see a 12B or 16B Qwen 3.6.

I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.

Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.

sleepyeldrazi 2 hours ago [-]
I need to ask, since I have desperately wanted to make Gemma 4 12B work, but im not sure if its the quant (i usually up it to q8, which is a lot higher than iq4_nl that i use for 3.6 27B) or the model itself, but it just starts confusing itself really quickly when I give it coding tasks. And quickly starts failing tool calls.

I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).

How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.

dofm 1 hours ago [-]
It sounds to me like you're further along on this than I am, if you are fine tuning?

I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.

I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.

Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.

I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.

It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.

In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.

But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?

Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.

aand16 2 hours ago [-]
I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
alfiedotwtf 14 minutes ago [-]
Qwen 3.7 120B will kill off Antropic’s IPO
lor_louis 2 hours ago [-]
Do no give me hope like that.
layer8 2 hours ago [-]
Are RAM prices down?
mendeza 2 hours ago [-]
I am eagerly waiting!
jensC 24 minutes ago [-]
Me too, I am on a Jetson Orion 64GB (about 50W max). Using the nvidia graphic cards for AI seem to be so power hungry that it was not a choice I could take with todays environmental problems.
prasanthabr 1 hours ago [-]
Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?
LeBit 42 minutes ago [-]
Which components are you thinking about?
SkitterKherpi 1 hours ago [-]
27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.
mbgerring 2 hours ago [-]
Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?
mannyv 36 minutes ago [-]
FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?
kmike84 32 minutes ago [-]
You care if you run it on a laptop. It's getting hot, fans are spinning, and you may want to use laptop for other things while the agent is working.
blobbers 2 hours ago [-]
How does llama.cpp use the GPU efficiently as opposed to MLX?

Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?

TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.

If I can generate voice at the same time as video, that would be useful.

dannyw 2 hours ago [-]
Llama.cpp uses the GPU very effectively because inference of LLMs is very rudimentary and basically as simple as your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP potentially increasing it.

The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.

The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.

markdog12 1 hours ago [-]
I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
beastman82 1 hours ago [-]
I posted elsewhere but if you have more space try gemma4 31b
anonym29 2 hours ago [-]
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
BoredomIsFun 2 hours ago [-]
> I get hallucinated tool call parameters and bizarre invocations

tweaking sampler might help

cat_plus_plus 2 hours ago [-]
Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
verdverm 2 hours ago [-]
Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B

I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark

colinsane 44 minutes ago [-]
AgentWorld is _fantastic_. i just migrated "down" from the 122B A10B qwen model to agentworld (35B A3B) because it feels as capable, easier to steer, and it's 3x faster.

also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.

ascii0eks84 2 hours ago [-]
Very capable lora adapters are surfacing but it seems they are very niche.
DenisM 2 hours ago [-]
Can you share more? It’s the first I hear of lora outside research papers. Practical applications would be great to see.

Lora if effective could be a great reason to run local models.

dmezzetti 1 hours ago [-]
Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
mikert89 2 hours ago [-]
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
jlongr 2 hours ago [-]
skill issue
rusk 2 hours ago [-]
Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.

Qwen on the other hand got straight to work with astonishing competency on the same system.

From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.

culi 2 hours ago [-]
You might find this helpful. llama is not anywhere near the Pareto distribution (performance vs cost)

https://arena.ai/leaderboard/code/webdev/pareto?license=open...

https://arena.ai/leaderboard/text/pareto?license=open-source

k__ 2 hours ago [-]
Llama3.1 instruct seems to be doing okay on that page, mostly because it's dirt cheap.
am17an 2 hours ago [-]
llama 3? Are you from 2023?
suthakamal 1 hours ago [-]
[flagged]
CurbStomper 1 hours ago [-]
[dead]
217 2 hours ago [-]
This is kind of like saying grass is green to be honest
madduci 2 hours ago [-]
Like everybody got 128 GB RAM..
sleepyeldrazi 2 hours ago [-]
I've been running it almost since launch on a 3090 (24gb vram), you really don't need that much. Second hand those are really cheap and i get 50-70 t/s (with MTP at 2), full ctx. IQ4_NL (unsloth) on this model seems suspiciously competent, and after the (by now not so recent) updates to q4 KV on llama.cpp, I just keep going back to it after dsv4pro disappointed me for the 100th time because it gave up on a task.
dofm 2 hours ago [-]
Doesn't need it at Q4 at least; it'll run in 64GB.
intothemild 22 minutes ago [-]
Q6 can run with 256k at Q4 on 32gb easy.

200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact
Rendered at 19:44:30 GMT+0000 (Coordinated Universal Time) with Vercel.