Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Running local models on an M4 with 24GB memory (jola.dev)

149 points by shintoist 4 hours ago | 59 comments

soganess 3 hours ago [-]

Getting so close to good!

I consider Gemma 4 31B (dense / no MoE), the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I’ve run, including GPT OSS 120B and Nemotron Super 120B.

On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32GB (27.5GB usable) with a 32K context window?

Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.

thot_experiment 10 minutes ago [-]

Gemma 4 IS good, I've literally had it get a thing right that Opus 4.7 missed, the edges are ragged and I'm reliably finding usecases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. The first local model that can reliably left to implement features in agentic mode on non-trivial projects.

https://thot-experiment.github.io/gradient-gemma4-31b/

This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.

running Q6_K_XL, 128k context @ q8 ~ 800tok/s read 16tok/sec write

eagerly awaiting turboquant and MTP in llama.cpp, should take me to 256k and 25-30tok/s if the rumors are true

discordance 27 minutes ago [-]

Could you please share your time to first token and tok/s?

ls612 3 minutes ago [-]

I’m on an M2 Max and get 10 tok/s with Gemma 4 8bit MLX

quacker 2 hours ago [-]

I could have used this article before I spent the weekend arriving to the same conclusion!

Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.

GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.

Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction and it fixed every warning with a correct edit.

I tried 4bit MLX quants of Qwen 3.5 9B but it eventually would crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.

It is absolutely not comparable to frontier models. It’s way slower and gets basic info wrong and really can’t handle non trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have and hopefully the local LLM story can get much better on modest hardware over time.

solenoid0937 2 hours ago [-]

> It is absolutely not comparable to frontier models.

This is not said often enough.

Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.

There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.

thot_experiment 3 minutes ago [-]

Very different from my experience, Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general, Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too it doesn't make it that useful for a lot of stuff. Conversely there are so many things that you would use an LLM for that they will both reliably oneshot. Especially in agentic mode where you have ground truth feedback between turns the difference gets quite small for a lot of tasks.

HDBaseT 43 minutes ago [-]

At least in my experience, local models are very far away from models like Opus 4.7 or ChatGPT 5.5 in coding and problem solving areas.

I find them useful in basic research and learning and question asking tasks. Although at the same time, a Wikipedia page read or a few Google searches likely could accomplish the same and has been able to for decades.

layoric 1 hours ago [-]

Honestly surprised to hear that GPT OSS 20B runs slow on mac hardware. It's absolutely one of the fastest models I've run on local GPUs for its size, but only tried Nvidia cards.

Edit: TIL it is MoE and only has 3.6B active, explains a lot.

quacker 1 hours ago [-]

Yeah, I'm probably wrong there. GPT OSS 20B is certainly much faster than some other models I've tried. I actually gave GPT OSS 20B a few prompts just now and it seems to respond as fast or faster than Qwen 3.5 9B. But I needed many more prompts for GPT OSS 20B to complete my contrived task, so progress felt much slower.

nl 3 hours ago [-]

I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.

But the are interesting and fun to play with! I do a LOT of work on local agent harnesses etc, mostly for fun.

My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!

This is under heavy development, needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium based browsers can be made to work with it)

It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.

It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.

There's quite a lot there: press "Tour" to see it all.

Will be open source next week.

ai_fry_ur_brain 3 hours ago [-]

[flagged]

nl 3 hours ago [-]

I think knowledge is power.

I think that the more people who try local models (especially the larger ones) the better.

I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare. They compare something like Qwen 27B with Sonnet and see that it is nearly as good, but miss that the frontier models are a lot better.

That knowledge is power, too.

I personally can help making local models more accessible. I can't make Opus cheaper.

bachmeier 3 hours ago [-]

> I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare.

I sometimes get the impression that people posting comments on HN don't realize that LLMs do more than vibe coding.

BubbleRings 2 hours ago [-]

Yeah no kidding. For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as you can yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)

However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically—and thereby prevent you (or anyone else) from being able to ever patent the invention. Plus, you know, a random unscrupulous employer at the SOTA company could be reviewing logs and notice your great idea, and file a patent on it before you do, and remember, the United States patent office went to “first to file” in 2013.

Oh and don’t take legal advice from random people in the internet by the way.

solenoid0937 2 hours ago [-]

> This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically—and thereby prevent you (or anyone else) from being able to ever patent the invention.

This is simply not true. Even if it were true (and again, it's not) you could simply use zero data retention APIs.

No one at the big model companies is trawling through your chats to steal your patents. It's not only illegal and against their own terms of service, but these people have better uses of their time.

sourc3 3 hours ago [-]

I am running qwen 3.6 9b quantized model on my m4 pro 48gb and it is barely useful to do some basic pi.dev/cc driven development. I think 128gb desktops are the sweet setup to actually get meaningful work done. However, getting your hands on one of these machines is difficult at the moment.

As much fun as it is to run these things locally don’t forget that your time is not free. I am slowly migrating my use cases to openrouter and run the largest qwen model for < $2-3/day with serious use for personal projects.

sjones671 3 hours ago [-]

Thanks for saying this. There's so much nonsense out there online about local models being better than Opus 4.7 and the like. It's just not true for regular users.

I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.

Yukonv 3 hours ago [-]

What models and quantizations have you been trying? I've had great success with the larger Qwen 3.x models at 6-bit levels. Using 6 bit quantization is really the bare minimum to give local models a fair shot at agentic flows. Once you start pushing below that the models become more "dumb" from the limited bit space.

SecretDreams 2 hours ago [-]

The main benefits for local are:

1) control 2) privacy 3) transparent cost model

Cloud has tremendous value for speed, plug and play, and performance. You need to decide how those compete with the benefits of local - both today, and a year from now, e.g.

carbocation 3 hours ago [-]

Was the choice of such a small model driven by a desire for high tok/sec? I ask because an m4 pro 48gb machine can run larger models (if model intelligence is the thing that would make it more useful).

sourc3 3 hours ago [-]

Yes that was my goal. Also noticed a huge performance gain going from ollama to mlx. Your mileage may vary.

elij 3 hours ago [-]

I'm using the 30b MOE model on same spec with 65k tokens as a sub agent with tooling and it absolutely writes decent code. The dense 9b I agree wasn't great.

hparadiz 3 hours ago [-]

How does it (the openrouter version) compare to ChatGPT 5.5 or Claude Opus 4.6?

sourc3 3 hours ago [-]

Good enough. It gets 60-70% of the work I need done for a lot less $ (keep in mind I am using these for personal projects that doesn’t generate revenue). If I was using it with the hopes of making money I think I would just use Codex at this point.

3 hours ago [-]

rapatel0 2 hours ago [-]

I got qwen3.6:27B running on my 4090 (24GB) with ~128K context leveraging some of the recent turboquant/rotorquant memory optimizations for activations. Highly suggest going up to that. the q4_xl+rotorquant combo is pretty good.

Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models

canpan 4 hours ago [-]

Recent models (Qwen 3.6 and Gemma) can really do coding locally. Feels like SOTA from maybe a year ago? But you would want about 32-40GB total memory. 24GB is just a bit short of that. A gaming PC with 16GB graphics card and 32GB RAM brings you very close to a usable coding system.

wktmeow 1 hours ago [-]

That’s the exact ram/vram combo of my desktop - what model would you suggest for that gaming pc setup?

canpan 17 minutes ago [-]

I would recommend to start withQwen 3.6 35B at maybe Q5, it should be fast in that setup. For intelligence Qwen 3.7 27b, is smarter but will run much more slow. Others also mention gemma 4, which might be worth a try.

solenoid0937 2 hours ago [-]

> Feels like SOTA from maybe a year ago?

Agree but only for small projects. SOTA from a year ago still wins on larger projects

DrBenCarson 3 hours ago [-]

How are you using that RAM with the GPU?

canpan 3 hours ago [-]

Llama.cpp with automatic offload to main memory. You can also use Ollama, it is easier, but slower.

ai_fry_ur_brain 3 hours ago [-]

"Coding system" "can really do coding locally"

Vibe coders out here thinking all software development is solved by because they made an (ugly and unoriginal) dashboard for their SaaS clone and their single column with 3x3 feature card landing page thats identical to every other vibe coders "startup"

spike021 2 hours ago [-]

I'll have to try some more. I've been playing with gpt-oss 20b on my M4 24GB but it hasn't been the best experience.

rtpg 3 hours ago [-]

What kinda harness do people use with these local models? I am quite happy with the Claude Code permission model and interface in general for coding stuff (For chat-y interfaces I have no real opinion)

BubbleRings 2 hours ago [-]

People do use SOTA LLM’s for other things besides computer programming.

For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)

However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically… and thereby prevent you (or anyone else) from being able to ever patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs and notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to “first inventor to file” in 2013.

Oh and don’t take legal advice from random people on the internet by the way.

dempedempe 38 minutes ago [-]

It takes people years to learn how to write a good patent. If you gave your lawyer your attempt at writing your own patent, they might use the info to understand what you want (you're right about that), but a good lawyer would probably just start from scratch.

Imagine you're a contractor. You have a client who knows nothing about software development that wants you to write some software for them. They give you some code they generated with an LLM to get you started. Would you use the code or start over?

nu11ptr 3 hours ago [-]

Still trying to understand if a Macbook Pro M5 Max with 128GB is likely going to be able to run coding models well enough that I can cancel my Codex, or even go down to the $20/month plan.

guessmyname 3 hours ago [-]

A 128GiB MacBook Pro in Canada is what, north of CAD $11k after tax? That’s around USD $7k. At $20/month for a cloud AI subscription, you’re looking at almost 30 years of service for the same money.

How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.

And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.

The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.

You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.

tom_ 2 hours ago [-]

But what if you were going to buy a laptop anyway? Obviously you can't do anything with less than 64 GBytes these days, so the question is just whether you go for the jump to 128.

In the UK, it's currently an extra £800 to get a 128 GB vs the 64 GB equivalent. So that's more like 3 years of Claude - I think? - assuming current prices stay the same.

Or: you might just feel like £800 isn't an unjustifiable amount of money (one way or another), and tick the box, on the basis that it might just work out. As the saying goes, in for 459,900 pennies, in for £5,399...

gabagool 26 minutes ago [-]

> Obviously you can't do anything with less than 64 GBytes these days

I don't think that's true. Plenty of people can run basic workflows at 8GB on the MacBook Neo and most others are fine at 16 GB.

nu11ptr 2 hours ago [-]

You are assuming I'd only get it for that. That would probably just be the straw that broke the camels back, but I'm already thinking about a purchase even if that doesn't work out.

knollimar 1 hours ago [-]

You have to use the item a lot, to the point where you'd be exceeding subs a lot

brcmthrowaway 1 hours ago [-]

This is one of the best takedowns of local models I've ever seen.

I just hate paying money for cloud subscriptions, and work has given me a decent laptop

Yukonv 3 hours ago [-]

Have been using Qwen 3.6 27b recently along with various other models the last month and it is very capable for writing code at a level I haven't need to use a subscription for 95% of what I throw at it. Been using it to write extensions for Pi to expand tool kit without much fuss as one example. Is it as fast or SOTA? No, but you can't ignore how functional it is on hardware you own. Where it can begin to struggle is giving too open ended prompts or investigating complex technical issues. At that level its knowledge is not high enough to solve those problems on its own.

reillyse 2 hours ago [-]

so, interested how many people are running higher end AI models locally? Figure if I'm spending $800/month on tokens I can build a pretty beefy local machine for the cost of a few months spend - what is people's experience with say a $5k server custom built (and only for) running an AI model.

entrope 2 hours ago [-]

You will likely have to compromise on memory bandwidth or capacity under a $10k price. The Radeon R9700 has 32 GB of VRAM and is pretty cheap (~$1500 right now), which is what I primarily use. My home desktop has 128 GB RAM and my laptop has 96 GB RAM, but bandwidth limits make most models slow on those CPUs. Models with multi-token prediction are somewhat usable on them: Nemotron 3 Super runs reasonably well on my desktop but does poorly on agentic coding that I've given it; my laptop can run Qwen3.6-27B reasonably well with a version of llama.cpp that is patched for MTP support; but usually I run Qwen3.6-27B on my R9700. vLLM might support two or three R9700s on some OS, but I've not been able to get it to run at all with Ubuntu 26.04: system ROCm version is apparently different than what's in the container images, and system OpenMPI v5.0 finally removed C++ bindings that were deprecated in 2005 but are linked from some Python wheel that vLLM (probably indirectly) imports.

If you are spending $800/month on tokens you are likely to notice degradation for local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+GB RAM, which will need a correspondingly expensive CPU.

shouvik12 1 hours ago [-]

[flagged]

Ngraph 3 hours ago [-]

[dead]

zoomuser 3 hours ago [-]

[dead]

NBJack 4 hours ago [-]

I'm puzzled. The M4, as far as I know, doesn't have 24GB. Did the author mean a M40?

tra3 4 hours ago [-]

There’s definitely an option with 24 gigs of ram: https://support.apple.com/en-ca/121552

NBJack 11 minutes ago [-]

Ah, thank you. I was assuming a Nvidia Tesla M4.

sertsa 4 hours ago [-]

M4 Mac Mini w/24GB sitting right here on my desk.

NBJack 9 minutes ago [-]

Thanks; I assumed the author was talking about an Nvidia Tesla M4 (hence my confusion and assumption that they meant the M40 series, which has 24GB of VRAM).

spoonyvoid7 4 hours ago [-]

M4 = M4 Macbook Pro

teaearlgraycold 3 hours ago [-]

Or Air

sbassi 4 hours ago [-]

A useful data to know about this setup is how many tokens/sec generates.

JBorrow 4 hours ago [-]

It’s started in TFA

NDlurker 4 hours ago [-]

You can't expect someone to read 4 paragraphs into an article before commenting

kennywinker 4 hours ago [-]

@grok is this true?

DrBenCarson 3 hours ago [-]

Sorry, @grok is offline after declaring himself MechaMussolini earlier today

bluequbit 44 minutes ago [-]

The site does not have ssl. Please can you enable it so that I can read the article?

Rendered at 03:32:21 GMT+0000 (Coordinated Universal Time) with Vercel.