NHacker Next
  • new
  • past
  • show
  • ask
  • show
  • jobs
  • submit
Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
pierotofy 23 minutes ago [-]
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
atomicnumber3 4 minutes ago [-]
Same. I have no desire to use Claude at all anymore.
horsawlarway 4 minutes ago [-]
For personal use, yes.

I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.

I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.

To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.

For my personal needs, free beats $100/m.

I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).

Some example projects

- Replacement launcher for android tvs (with usage monitoring and tracking for kids)

- Custom admin portals for my k8s cluster services

- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)

- Grocery list management and meal planning (mostly via openclaw)

- some custom workflows for 3d asset generation in comfyui.

---

Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.

hegdeezy 2 minutes ago [-]
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
stymaar 4 minutes ago [-]
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).

I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.

And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.

Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).

cuttysnark 8 minutes ago [-]
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.

Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.

In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.

cheekygeeky 8 minutes ago [-]
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
sosodev 27 minutes ago [-]
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.

If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.

If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.

The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.

argee 10 minutes ago [-]
I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.

I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)

codinhood 40 minutes ago [-]
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.

Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.

Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.

jrm4 8 minutes ago [-]
But you're pretty much measuring opportunity cost in tokens per second, no?

I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."

I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)

jmichaelson 8 minutes ago [-]
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).

I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.

blurbleblurble 27 minutes ago [-]
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
Insanity 22 minutes ago [-]
Heard good things about pi.dev but haven’t tried it. It might take care of some of those missing features you mentioned.
bityard 2 minutes ago [-]
pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.
HappySweeney 1 hours ago [-]
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
arjie 2 hours ago [-]
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.

I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.

leptons 40 minutes ago [-]
Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.
acc_297 1 hours ago [-]
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies

but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)

I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.

rolisz 49 minutes ago [-]
I'm interested in trying something similar. I was thinking to do this for my OpenClaw agent.

About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that

ecshafer 49 minutes ago [-]
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
dabinat 24 minutes ago [-]
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
mitchell_h 45 minutes ago [-]
Tried. The context windows just weren't big enough.
deadbabe 43 minutes ago [-]
Prompt more directly instead of open ended.
lysace 28 minutes ago [-]
Got a similar result (my RTX 4070 only has 12 GB). I'm curious about whether 24/32 GB meaningfully improves this enough to make it useful.
tobyhinloopen 8 minutes ago [-]
Try it on RAM and CPU.

It’s slower but you can run them.

wuschel 16 minutes ago [-]
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
nfrankel 41 minutes ago [-]
I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...

Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.

Lwerewolf 33 minutes ago [-]
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.

Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.

Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.

Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.

K0balt 58 minutes ago [-]
Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
kadoban 36 minutes ago [-]
What tool do you use to drive things for you, out of curiosity?
kandros 56 minutes ago [-]
I’d rather ask my butcher than Haiku for coding tasks
tumetab1 2 hours ago [-]
Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significant lower than the cloud offering.

Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.

SkitterKherpi 39 minutes ago [-]
It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.
_davide_ 53 minutes ago [-]
i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
sosodev 15 minutes ago [-]
My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.

I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.

anonymousiam 50 minutes ago [-]
This was posted shortly after your Ask HN post:

My Homelab AI Dev Platform

https://news.ycombinator.com/item?id=48542433

ryandrake 51 minutes ago [-]
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
Razengan 56 minutes ago [-]
Related: Are there any viable distributed AI models?

Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.

Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.

Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.

7 minutes ago [-]
joshuamoyers 49 minutes ago [-]
I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
fortyseven 15 minutes ago [-]
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
dude250711 60 minutes ago [-]
Yes, running a local model on a natural wetware substrate here.

Recommended setup: plenty of nutrients, some caffeine and a quiet environment.

Performance - not currently measured in tokens: roughly average.

jasongill 22 minutes ago [-]
I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.
HPsquared 56 minutes ago [-]
I personally get about 50 tokens per hour.
christkv 1 hours ago [-]
Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.
system2 47 minutes ago [-]
Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
temilson 25 minutes ago [-]
[flagged]
phlhar 59 minutes ago [-]
[dead]
iluvcommunism 2 hours ago [-]
[dead]
kertoip_1 1 hours ago [-]
Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations
dada216 30 minutes ago [-]
Local? No. Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.

Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.

Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact
Rendered at 17:49:22 GMT+0000 (Coordinated Universal Time) with Vercel.