The trajectory export feature is smart. Evaluation and training data collection in the same tool.
I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?
Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.
frabonacci 39 minutes ago [-]
Thanks - trajectory export was key for us since most teams want both eval and training data.
On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
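To give a rough idea of the verify-before-scoring pattern, here's a simplified sketch (not our exact API - the env helper is made up):

    import time

    def wait_for_state(check_fn, timeout_s=30.0, poll_s=0.5):
        # Poll a predicate until it holds or we time out, so slow
        # renders don't get scored as task failures.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if check_fn():
                return True
            time.sleep(poll_s)
        return False

    def reward(env):
        # Hypothetical end-state check: only score once the expected
        # state is actually observable in the environment.
        done = wait_for_state(lambda: env.file_exists("~/Documents/report.pdf"))
        return 1.0 if done else 0.0

The real checks are richer than a single file test, but that's the gist.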
Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.
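For a feel of what a productivity-flow task looks like, here's an illustrative spec - field names and env helpers are made up, not our real schema:

    # Hypothetical task spec, for illustration only.
    task = {
        "id": "win-files-001",
        "environment": "windows-11",
        "instruction": (
            "Move every .png from Downloads into a new folder named "
            "'screenshots' on the Desktop."
        ),
        "max_steps": 40,
        # Success is checked against the VM's state after the episode,
        # same wait-and-verify pattern as above.
        "success_check": lambda env: (
            env.dir_exists(r"C:\Users\user\Desktop\screenshots")
            and not env.glob(r"C:\Users\user\Downloads\*.png")
        ),
    }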
visarga 5 hours ago [-]
Interesting - a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.
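By code-based evaluation I mean the tasks are checked by scripts that inspect the final page state rather than screenshots - something in this spirit (Playwright just as an example; URL and selector are placeholders):

    from playwright.sync_api import sync_playwright

    def check_cart_total(expected="$42.00"):
        # Returns True if the demo shop's cart shows the expected total.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("https://example.com/cart")      # placeholder URL
            total = page.text_content("#cart-total")   # placeholder selector
            browser.close()
            return total is not None and total.strip() == expected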
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
rfw300 1 hour ago [-]
Interesting project, but the lack of any actual benchmark results on existing models/agents is disappointing.
frabonacci 44 minutes ago [-]
Fair point - we just open-sourced this, so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated by real human workers. If you want to run your agent on it, we'd love to include your results.
https://github.com/UiPath/uipath_enterprise_benchmark
https://arxiv.org/abs/2511.17131