Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B (github.com)
jwr 9 hours ago [-]
That is very, very interesting. I've been hoping to have an assistant in the workshop (hands-free!) that I could talk to and have it help me with simple tasks: timers, calculating, digging up notes, etc. — basically, what the phone assistants were supposed to be, but aren't.

"You will have to unlock your iphone first" is kind of a deal-breaker when you are in the middle of mixing polyurethane resin and have gloves and a mask on.

More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers, preventing us from using the technological advances and holding us back years behind the state of the art.

I'll be trying this out on my Macbook, looks very promising!

jamilton 2 hours ago [-]
Siri does have a setting that'll activate it if you say "Hey Siri" while the phone is locked. Obvious privacy and battery usage concerns, though, and it's still Siri, so it's a little clunky.
jwr 1 hour ago [-]
Mhm. I think I use that. But then I say "call my wife" and it says "you'll need to unlock your iPhone first".

It's clear Tim Cook doesn't ever try to use Siri wearing gloves. Or ever, for that matter :-)

mft_ 1 hour ago [-]
Siri (on iOS 18, at least) will call people for me without unlocking, in response to a voice command only - I just double-checked...
gtowey 6 hours ago [-]
The computing power we all have in our pockets is staggering. It could be a tool that truly makes our lives easier, but instead it's mostly a device that is frustrating to use. Companies have decided to make it simply another conduit for advertising. It's a tool for them to sell us more stuff. Basic usability be damned.
huijzer 6 hours ago [-]
> More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers

Yes, same with RSS readers being dropped by large companies. Worked too well, I guess!

mentalgear 6 hours ago [-]
You might be interested in the open-source https://www.home-assistant.io/voice-pe/ .
QuercusMax 2 hours ago [-]
I've been replacing my Google Homes and Chromecasts with Snapcast streamers, and this is the next thing I've been planning to look into.

It's truly absurd how the Google voice assistant USED to work properly for setting timers, playing music, etc., and then they had to break it 15 times and finally replace it with a much slower AI that only kinda does what you want. I'm done.

Self-hosting is the way to go if you want to keep your sanity. My wife has basically given up on any Google/Apple voice assistant being able to do anything useful beyond "set a 10 minute timer".

magzter 6 hours ago [-]
This is so cool. I'm always telling people how the advancements in the SOTA hosted AIs are also happening in the local model space: the SOTA hosted models of 6-12 months ago are what we can now run locally on average hardware. This is such an amazing way to actually demo it.
myultidevhq 4 hours ago [-]
This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.

Curious about one thing though, how does it handle switching between languages? I work with both Greek and English daily and local models usually struggle with that.

Great work, bookmarking this.

karimf 3 hours ago [-]
During my limited testing, it handles multiple languages in a single session better than I expected. Perhaps I just had low expectations, since I've mostly worked with English-only STT models.
crsAbtEvrthng 3 hours ago [-]
If I run this without an internet connection, the localhost site says "loading..." at the bottom and won't work.

If I run it with an internet connection, it works flawlessly. Even if I disconnect afterwards, it keeps working fine.

Why does there have to be an internet connection at the moment I open the localhost site, when all of this should work purely on device?

Despite this, I'm really impressed that it actually works so fast with video input on my M4 Pro 48 GB.

karimf 3 hours ago [-]
Huh that's weird. I just tried it and it works on my machine. Could you perhaps create a GitHub issue and share the reproduction steps and any relevant logs?
crsAbtEvrthng 2 hours ago [-]
Don't have the time right now, but I'll play around with it next weekend for sure and give you more feedback with logs once I can reproduce it.

For now what I did was:

- Tested in Chrome/Safari/Firefox on Tahoe.

- Followed the quick start install instructions from the GitHub repo

- Everything worked

- Closed terminal

- Disconnected internet (Wi-Fi off)

- Opened terminal

- Started server again (uv run server.py)

- Opened localhost in the browser; it asked for camera/mic as normal, I granted access and saw the live camera feed, but "loading..." stayed at the bottom center of the site and the AI did not listen/respond

- Reproduced this about 3 times, switching Wi-Fi on/off before starting the server; always the same (works with internet, doesn't without)

- Figured it also works fine if I start the server with internet connected and disconnect it afterwards

dvt 13 hours ago [-]
Solid work and great showcase, I've done a bunch of stuff with Kokoro and the latency is incredible. So crazy how badly Apple dropped the ball... feels like your demo should be a Siri demo (I mean that in the most complimentary way possible).
karimf 13 hours ago [-]
Thank you. This reminds me of a paragraph from the LatentSpace newsletter [0]

> The excellent on device capabilities makes one wonder if these are the basis for the models that will be deployed in New Siri under the deal with Apple….

https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...

est 8 hours ago [-]
I am making something similar. Also been using Kokoro for TTS. Very cool project!

Gemma 4 is kinda too heavyweight even with E2B. I'm sticking with Qwen 0.8B at the moment.

zerop 10 hours ago [-]
I have been looking forward to building something like this with open models: a voice assistant I can talk to while driving, since I have a long commute. I do use ChatGPT voice mode, and it works great for querying information or having discussions. But I want to do tasks like browsing the web, acting like a social media manager for my business, etc.
rubicon33 3 hours ago [-]
Is there anything unique happening here for the video aspect, or is it just taking snapshots over and over?

I’ve been looking for a good video summarizing / understanding model!

karimf 3 hours ago [-]
Nothing unique; it just takes a snapshot when it's processing the input. Even processing a single image increases the TTFT by ~0.5s on my machine, so for now it seems impossible to feed it live video and expect a real-time response.

As for the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]

[0] https://huggingface.co/blog/gemma4#video-understanding
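For what it's worth, the snapshot-per-turn approach can be sketched like this (Python, with the STT and vision-language components stubbed out; all names here are illustrative, not from the repo):

```python
# Minimal sketch: instead of streaming video, grab ONE frame only when a
# response is being generated, since each extra image adds ~0.5s to TTFT.
def answer_with_snapshot(audio, transcribe, get_frame, generate):
    text = transcribe(audio)   # speech-to-text on the user's utterance
    frame = get_frame()        # single snapshot, not a video stream
    return generate(prompt=text, image=frame)

# Stubs standing in for the real STT / VLM pipeline:
reply = answer_with_snapshot(
    audio=b"...",
    transcribe=lambda a: "what am I holding?",
    get_frame=lambda: "<jpeg bytes>",
    generate=lambda prompt, image: f"model({prompt!r}, {image!r})",
)
```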

rubicon33 2 hours ago [-]
I totally get that these are very hard problems to solve and that we're on the bleeding edge of what's possible, but I can't help but wonder when someone is going to crack real video understanding.

Sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions.

"how packages were delivered over the last hour", etc.

logicallee 7 hours ago [-]
It might interest people to know you can also easily fine-tune the text portion of this specific model (E2B) to behave however you want! I fine-tuned it to talk like a pirate, but you can get it to do anything you have (or can generate) training data for. (This wouldn't carry through to the text-to-speech portion, though.) So you can easily train it to act a certain way or give certain types of responses.

Video: https://www.youtube.com/live/WuCxWJhrkIM

Generated writeup: https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...
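For anyone wondering what "training data you have or can generate" looks like in practice, here's a hedged sketch of building persona data in the chat-style JSONL many SFT toolkits accept (the exact schema your fine-tuning stack expects may differ, and the pairs below are invented examples, not from the writeup):

```python
import json

# Hypothetical helper: turn (question, piratey answer) pairs into
# chat-format records for supervised fine-tuning.
def make_sft_record(user_msg, assistant_msg):
    return {"messages": [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]}

pairs = [
    ("What's the weather like?", "Arr, the skies be clear, matey!"),
    ("Set a timer for ten minutes.", "Aye, ten minutes on the hourglass!"),
]
records = [make_sft_record(u, a) for u, a in pairs]
jsonl = "\n".join(json.dumps(r) for r in records)  # write this out as a .jsonl file
```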

an0n-elem 4 hours ago [-]
Cool work, buddy :)
divan 8 hours ago [-]
Can someone quickly vibe-code a native macOS app for this, so it doesn't require running terminal commands and hunting for that browser tab? (: (Also for iOS, pls.)
duartefdias 8 hours ago [-]
Would you pay $2 for that native macOS desktop app?
k-almuraee 11 hours ago [-]
Amazing, love your work!