Launch HN: Freestyle: Sandboxes for AI Coding Agents (freestyle.sh)

93 points by benswerd 3 hours ago | 52 comments

_jayhack_ 2 hours ago [-]

Would love to understand how you compare to other providers like Modal, Daytona, Blaxel, E2B and Vercel. I think most other agent builders will have the same question. Can you provide a feature/performance comparison matrix to make this easier?

benswerd 2 hours ago [-]

I'm working on an article deep diving into the differences between all of us. I think the goal of Freestyle is to be the most powerful and most EC2-like of the bunch.

Daytona runs on Sysbox (https://github.com/nestybox/sysbox), which is VM-like but has issues when you run low-level things.

Modal is the only provider with GPU support.

I haven't played around with Blaxel personally yet.

E2B/Vercel are both great hardware virtualized "sandboxes"

Freestyle VMs are built based on the feedback our users gave us that things they expected to be able to do on existing sandboxes didn't work. A good example here is that Freestyle is the only provider of the above (haven't tested Blaxel) that gives users access to the boot disk, or the ability to reboot a VM.

rsyring 1 hours ago [-]

I'd also be interested in a comparison with exe.dev which I'm currently using.

benswerd 46 minutes ago [-]

Exe.dev is an individual-developer-oriented service. Freestyle is more oriented at platforms building the next exe.dev.

That's why our pricing is usage based and we have a much larger API surface.

tomComb 2 hours ago [-]

And fly.io sprites

benswerd 2 hours ago [-]

Fly.io sprites is the most similar to us of the bunch. They do hardware virtualization as well, have comparable start times and are full Linux. What we call snapshots they call checkpoints.

The big pros of Sprites over us are their advanced networking stack and the Fly.io ecosystem. The big cons are that Sprites are incredibly bare bones: they don't have any templating utilities. I've also heard that Sprites sometimes become unavailable for extended periods of time.

The big pros of Freestyle over Sprites are fork, advanced templating, and IMO a better debugging experience because of our structure.

0123456789ABCDE 17 minutes ago [-]

sprites have been weird lately, i think fly.io is having trouble with capacity in various locations.

is the experience similar? can i just get console to one machine, work for a bit, logout. come back later, continue?

how does cost work if i log into a machine and do nothing on it? just hold the connection.

benswerd 12 minutes ago [-]

This will just work on us.

We do auto suspend depending on your configured timeout. We'll pause your VM and when you come back the processes will be in the exact same state as when you left.

knowsuchagency 2 hours ago [-]

Thanks for the thoughtful response. I'm predominantly a self-hoster, but I think your product makes a lot of sense for a wide variety of users and businesses. I'm excited to try out freestyle!

benswerd 1 hours ago [-]

Self hosting can be doable for constant small/medium size workloads

You can handroll a lot with: https://github.com/nestybox/sysbox?tab=readme-ov-file https://gvisor.dev https://github.com/containers/bubblewrap?tab=readme-ov-file

For hardware virtualized machines it's much harder but you can do it via: https://github.com/firecracker-microvm/firecracker/ https://github.com/cloud-hypervisor/cloud-hypervisor

Freestyle/other providers will likely provide a better debugging experience but that's something you can probably get past for a lot of workloads.

The time when you/anyone should think about Freestyle/anyone is when the load spikes/the need to create hundreds of VMs in short spikes shows up, or when you're looking for some of the more complex feature sets any given provider has built out (forks, GPUs, network boundaries, etc).

I also highly recommend self hosting anything you do outside of your normal VPC. Sandboxes are the biggest possible attack surface and it is a feature of us that we're not in your cloud; if we mess up security your app is still fine.

indigodaddy 1 hours ago [-]

This is what I do (my project) for self hosting on a VPS/server:

https://GitHub.com/jgbrwn/vibebin

Also I'm a huge proponent of exe.dev

Obviously your service/approach is different than exe, more like sprites but like you said more targeted/opinionated to AI coding/sandboxing tasks it looks like. Interesting space for sure!

sahil-shubham 45 minutes ago [-]

I've been building an open-source, self-hostable Firecracker orchestrator for the past month: https://github.com/sahil-shubham/bhatti (https://bhatti.sh)

Still WIP, but the core works — three rootfs tiers (minimal Ubuntu, headless Chromium with CDP, Docker-in-VM), OCI image support (pull any Docker image), automatic thermal management (idle VMs pause then snapshot to disk, wake transparently on next API call), per-user bridge networking with L2 isolation, named checkpoints, persistent volumes, and preview URLs with auto-wake.

Fair warning: the website is too technical and the docs are mostly AI-generated, both being actively reworked. But I've been running it daily on a Hetzner server for my AI agents' browser automation and deploy previews.

I'd love any feedback if you want to go ahead and try it yourself

stingraycharles 1 hours ago [-]

I’m super interested since it seems like you have given everything a lot of thought and effort but I am not sure I understand it.

When I’m thinking of sandboxes, I’m thinking of isolated execution environments.

What does forking sandboxes bring me? What do your sandboxes in general bring me?

Please take this in the best possible way: I’m missing a use case example that’s not abstract and/or small. What’s the end goal here?

benswerd 55 minutes ago [-]

So isolation is correct. Forking a sandbox gives you multiple exact duplicates of isolated environments.

When your coding agent has 10 ideas for what to do, to evaluate them correctly it needs to be able to evaluate them in isolation.

If you're building a website testing agent and it's halfway through a website, with a form half filled out, a session ongoing, etc., and it realizes it wants to test 2 things in isolation, forking is the only way.

We also envision this powering the next generation of dev cycles: "AI Agent, go try these 10 things and tell me which works best". AI forks the environment 10 times, gets 10 exact copies, does the thing in each of them, evaluates it, then takes the best option.
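
The fork, try, evaluate, pick loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToySandbox`, `try_ideas`, `apply` and `score` are hypothetical names, not the Freestyle API, and a deep copy of a dict stands in for forking a real VM.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToySandbox:
    """Hypothetical in-memory stand-in for a VM; not the Freestyle API."""
    state: dict = field(default_factory=dict)

    def fork(self) -> "ToySandbox":
        # A real fork would duplicate memory and disk; here we deep-copy a dict.
        return copy.deepcopy(self)


def try_ideas(base, ideas, apply, score):
    """Run each idea in its own fork and return (best_idea, best_fork)."""
    results = []
    for idea in ideas:
        fork = base.fork()      # every idea starts from the same state
        apply(fork, idea)       # mutations stay inside this fork
        results.append((score(fork), idea, fork))
    _, best_idea, best_fork = max(results, key=lambda r: r[0])
    return best_idea, best_fork


base = ToySandbox(state={"value": 10})
best_idea, best_fork = try_ideas(
    base,
    ideas=[1, 5, 3],
    apply=lambda sb, n: sb.state.update(value=sb.state["value"] + n),
    score=lambda sb: sb.state["value"],
)
```

The base sandbox is never mutated, so a losing idea can simply be discarded along with its fork.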

indigodaddy 51 minutes ago [-]

Yep I can see this especially when the agent is spinning up test servers/smokes and you don't want those conflicting. How do we reconcile all the potential different git hashes though, upstream I guess etc (this might be an easy answer and I'm not super proficient with git so forgive)

benswerd 49 minutes ago [-]

So we recommend branch per fork, merge what you like.

You currently have to change the branch on each fork individually, and that's unlikely to change in the short term due to the complexity of git internals, but it's not that hard to do yourself: `git checkout -b fork-{whateverDiscriminator}`
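
A small helper for building that command per fork might look like the following Python sketch. `fork_branch_command` is a hypothetical name, and the sanitizer is deliberately conservative rather than a full implementation of git's ref-name rules.

```python
import re


def fork_branch_command(discriminator: str) -> list[str]:
    """Build the `git checkout -b fork-<slug>` argv for one fork."""
    # Keep only characters that are safe in a branch name (an assumption:
    # this is stricter than git requires).
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", discriminator).strip("-")
    if not slug:
        raise ValueError("discriminator has no usable characters")
    return ["git", "checkout", "-b", f"fork-{slug}"]
```

The returned list can be passed to `subprocess.run(..., check=True)` inside each fork's working tree.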

wsve 49 minutes ago [-]

Agreed, the thing I'd be most interested in is the isolated execution environment you mentioned. Agents running autopilot are powerful. Agents running unsupervised on a machine with developer permissions and certificates where anything could influence the agent to act on an attacker's behalf is terrifying

benswerd 48 minutes ago [-]

I recommend running the agent harness outside of the computer. The mental model I like to use is the computer is a tool the agent is using, and anything in the computer is untrusted.

jeremyjh 41 minutes ago [-]

I would recommend not giving an agent the full run of any computing environment. Do you handle fine-grained internet access controls and credential injection like OpenShell does?

benswerd 32 minutes ago [-]

I used to believe this, but I think the next generation of agents is much more autonomous and just needs a computer.

The work of a developer is open ended, so we use a computer for it. We don't try to box developers into small granular screwdrivers for each small thing.

That's what's coming to all agents: they might want to run some analysis with Python, want to generate a website/document in TypeScript, and might want to store data in markdown files or in MongoDB. I expect them to get much more autonomous and with that to end up just needing computers like us.

croes 39 minutes ago [-]

The problem is the agent, which should be treated as untrusted. The computer isn’t the problem.

benswerd 28 minutes ago [-]

Kind of. The chat logs of the agent are trustworthy, as should be any telemetry you have on it or coming out of the VM. Its behavior should be treated as probabilistic and therefore untrustworthy.

vimota 34 minutes ago [-]

This is awesome - the snapshotting especially is critical for long running agents. Since we run agents in a durable execution harness (similar to Temporal / DBOS) we needed a sandboxing approach that would snapshot the state after every execution in order to be able to restore and replay on any failure.

We ended up creating localsandbox [0] with that in mind by using AgentFS for filesystem snapshotting, but our solution is meant for a different use case than Freestyle - simpler FS + code execution for agents all done locally. Since we're not running a full OS it's much less capable but also simpler for lots of use cases where we want the agent execution to happen locally.

The ability to fork is really interesting - the main use case I could imagine is for conversations that the user forks or parallel sub-agents. Have you seen other use cases?

[0] https://github.com/coplane/localsandbox

benswerd 31 minutes ago [-]

Deterministic testing of edge cases. It can be really hard to recreate weird edge cases of running services, but if you can create them we can snapshot them exactly as they are.

MarcelinoGMX3C 1 hours ago [-]

The technical challenges in getting memory forking to deliver those sub-second start and fork times are significant. I've seen the pain of trying to achieve that level of state transfer and rapid provisioning. While "EC2-like" gets the point across for many, going bare metal reveals the practical limits of cloud virtualization for high-performance, complex workloads like these. It shows a real understanding of where cloud abstraction helps and where it just adds overhead.

The cost argument for owning the hardware for this specific use case also makes sense, considering the scale these agent environments will demand. Also worth noting, sandboxes are effectively an open attack surface; architecting them not to be in your main VPC is a sound security decision from the start.

schopra909 2 minutes ago [-]

Honestly never considered the forking use case; but it makes a ton of sense when explained

Congrats on the launch. This is cool tech

stocktech 58 minutes ago [-]

I built something like this at work using plain Docker images. Can you help me understand your value prop a little better?

The memory forking seems like a cool technical achievement, but I don't understand how it benefits me as a user. If I'm delegating the whole thing to the AI anyway, I care more about deterministic builds so that the AI can tackle the problem.

benswerd 52 minutes ago [-]

So first, MicroVM != Container, and a container is not a secure isolation system. I would not run untrusted containers on your nodes without extra hardening.

The memory forking was originally invented because for AI App Builders and first response driven applications it's extremely important that they are instant (difference between running bun dev and the dev server already being running).

However it's much more generally applicable; Postgres is a great example of this. You can't fork the filesystem under Postgres and get consistency. Same thing with a browser state, a weird server state, or anything that exists in memory. The memory forking gives a huge performance boost while snapshotting what's actually going on at one instant.

n2d4 2 hours ago [-]

Cool! I've been using your API for running sandboxed JS. Nice to see you also support VMs now.

    > we mean forking the whole memory of it

How does this work? Are you copying the entire snapshot, or is this something fancy like copy-on-write memory? If it's the former, doesn't the fork time depend on the size of the machine?

benswerd 2 hours ago [-]

We're using copy on write with the memory itself. Fork time is completely decoupled from the size of the machine.

Creating snapshots causes a 2-4 second interruption in the VM due to sheer IO, which we didn't want here.

What's especially cool about this approach is not only is fork time O(1) with respect to machine size, but it's also O(1) with respect to the number of forks.
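
To illustrate why copy-on-write makes fork time independent of machine size, here is a toy Python sketch. It is not Freestyle's implementation: `CowMemory` and its methods are hypothetical names, and memory pages are modeled as dict entries layered with `collections.ChainMap`. Forking adds one empty top layer for each side and copies no page data.

```python
from collections import ChainMap


class CowMemory:
    """Toy copy-on-write memory: pages are dict entries in shared layers."""

    def __init__(self, pages=None):
        self._pages = ChainMap(dict(pages or {}))

    def read(self, addr):
        # Looks through the private top layer first, then the shared layers.
        return self._pages.get(addr, 0)

    def write(self, addr, value):
        # ChainMap writes only ever touch the first (private) mapping.
        self._pages[addr] = value

    def fork(self):
        # Existing layers become shared and are never written again;
        # parent and child each get a fresh private layer on top.
        shared = self._pages
        child = CowMemory.__new__(CowMemory)
        child._pages = shared.new_child()
        self._pages = shared.new_child()
        return child


mem = CowMemory({0: 1, 1: 2})
child = mem.fork()
child.write(0, 99)   # visible only in the child
mem.write(1, 7)      # visible only in the parent
```

In this toy the cost of a fork grows with the number of layers, not with the number of pages, which is the property the comment describes for machine size.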

skybrian 1 hours ago [-]

It doesn't seem very easy to calculate how much it would cost per month to keep a mostly-idle VM running (for example, with a personal web app). The $20/month plan from exe.dev seems more hobbyist-friendly for that. Maybe that's not the intended use, though?

benswerd 1 hours ago [-]

We're not going after hobbyists. We're building the platform for companies like exe.dev to build on. That's why it's all usage based.

That said, our $50 a month plan can be used as an individual for your coding agents, but I wouldn't recommend it.

indigodaddy 49 minutes ago [-]

Ooof, if you are the middleman platform then it's sure gonna get expensive for the end user

rvz 26 minutes ago [-]

> The $20/month plan from exe.dev seems more hobbyist-friendly for that. Maybe that's not the intended use, though?

And you can go even below that by self-hosting it yourself with a very cheap Hetzner box for $2 or $5.

skybrian 3 minutes ago [-]

Can you start up multiple VMs easily on a Hetzner box?

jnstrdm05 52 minutes ago [-]

how many seconds to provision are we talking about here? 1 sec vs 60 is a dealbreaker for me, some clarity on that would be nice.

benswerd 51 minutes ago [-]

500ms. Less than 1 second. We're aiming to get that down to 200ms in the next 3 months.

maxmaio 39 minutes ago [-]

Congrats Ben and Jacob!

Fraaaank 2 hours ago [-]

Your pricing page is broken

benswerd 2 hours ago [-]

Reviewing this now. Our public pricing at www.freestyle.sh/pricing seems to be working, can you point me in a more specific direction?

dominotw 33 minutes ago [-]

dumb question. none of these protect you from prompt injection. yes?

benswerd 32 minutes ago [-]

no, but the goal of these is if you are faced with prompt injection the worst case scenario is the AI uses that computer badly.

dominotw 14 minutes ago [-]

unless i am misunderstanding. not sure how this computer prevents secrets from my gmail leaking. that's the worst case.

rasengan 1 hours ago [-]

Interesting!

We're working on a similar solution at UnixShells.com [1]. We built a VMM that forks, and boots, in < 20ms and is live, serving customers! We have a lot of great tools available, via MIT, on our github repo [2] as well!

[1] https://unixshells.com

[2] https://github.com/unixshells

tomComb 22 seconds ago [-]

Can your service scale RAM? Like the way Docker Desktop does. Manual is fine.

siva7 1 hours ago [-]

I have so many interesting problems on AI, sandboxing isn't one of them. It's a pointless exercise yet disproportionately so many people love to do this. Probably because sandboxing doesn't feel as magic as agents themselves and more like the old times of "traditional" software development.

hobofan 1 hours ago [-]

It is a mostly pointless exercise if the goal is trying to contain negative impact of AI agents (e.g. OpenClaw).

It is a very necessary building block for many common features that can be steered in a more deterministic way, e.g. "code interpreter" feature for data analysis or file creation like commonly seen in chat web UIs.

moezd 51 minutes ago [-]

Believe it or not, once you start working for a regulated industry, it is all you would ever think of. There, people don't care if you are vibing with the latest libraries and harnesses or if it's magic, they care that the entire deployment is in some equivalent of a Faraday cage. Plus, many people just don't appreciate it when their agents go rm -rf / on them.

iterateoften 1 hours ago [-]

Yeah, idk I guess it’s interesting if you are an engineer looking for something to do,

But like I see multiple sandbox for agents products a week. Way too saturated of a market

benswerd 1 hours ago [-]

I disagree (as a sandboxing company).

With respect to the market, every single sandbox sucks. I'm not gonna shit talk competitors but there is not a good sandboxing platform out there yet — including me — compared to where we'll be in 6 months.

We've heard all the platforms have consistent uptime, feature completeness, networking and debugging issues. And in our own platform we're not 1/10th of the way through solving the requests we've gotten.

Next generation of Agents needs computers, and those computers are gonna look really different than "sandboxes" do today.

tcdent 59 minutes ago [-]

I don't think you're wrong, but if you really want to re-think the approach, building an orchestration layer for Firecracker like every other company in the space is doing is probably not it.

borakostem 1 hours ago [-]

[flagged]

benswerd 1 hours ago [-]

So this is an ongoing optimization point, no perfect solution exists. Freestyle VMs work with a network namespace and virtual ethernet cable going into them, so they all think they have the same IP.

This means that while complex protocol connections like remote Postgres can break in the forks, stuff like Websockets just automatically reconnects.
