N-Day-Bench – Can LLMs find real vulnerabilities in real codebases? (ndaybench.winfunc.com)
Cynddl 1 hours ago [-]
> Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?

It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, this kind of evaluation produces very noisy, meaningless results.
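
For readers unfamiliar with the terms, the quoted pipeline can be sketched roughly as follows. This is only an illustration under stated assumptions: `Case`, `run_case`, the `REPORT:` convention, and the four callables are hypothetical names, not the benchmark's actual code. Only the three roles, the 24-step budget, and the fact that the Finder starts from sink hints come from the quoted description.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_SHELL_STEPS = 24  # step budget quoted in the submission


@dataclass
class Case:
    advisory: str    # advisory text; only the Curator reads it
    sink_hints: str  # starting hints handed to the Finder


def run_case(
    case: Case,
    curator: Callable[[str], str],
    finder_step: Callable[[str, List[Tuple[str, str]]], str],
    judge: Callable[[str, str], float],
    run_shell: Callable[[str], str],
) -> float:
    """Run one case: Curator -> Finder (bounded shell loop) -> Judge."""
    # The Curator turns the advisory into an answer key.
    answer_key = curator(case.advisory)

    # The Finder sees only the sink hints and its own shell transcript,
    # never the advisory, the answer key, or the patch.
    transcript: List[Tuple[str, str]] = []
    report = ""
    for _ in range(MAX_SHELL_STEPS):
        action = finder_step(case.sink_hints, transcript)
        # Hypothetical convention: a "REPORT:" prefix ends the exploration.
        if action.startswith("REPORT:"):
            report = action[len("REPORT:"):].strip()
            break
        transcript.append((action, run_shell(action)))

    # The Judge scores the report against the answer key.
    return judge(answer_key, report)
```

If the Finder exhausts its 24 steps without reporting, this sketch passes an empty report to the Judge; how the real benchmark handles that case is not stated in the thread.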

rohansood15 20 minutes ago [-]
I worked in AppSec in the past, and it made sense to me. Maybe you aren't the target audience?

You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.

peyton 1 hours ago [-]
> Did you use an LLM to generate this HN submission?

Must have.

> The Finder never sees the patch.

I wasn’t worried that this eval would show the model the answer before evaluating it. It seems the requirements leaked into this post.

mbbutler 2 hours ago [-]
It would be helpful to add in some cases that do not contain any vulnerabilities to assess false-positive rate as well.
mufeedvh 2 hours ago [-]
This is a good idea.

Will incorporate false-positive rates into the rubric from the next run onwards.

At winfunc, we spent a lot of research time taming these models to eradicate false positives (the rate is high!), so this does feel important enough to be documented. Thanks!
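
A minimal sketch of how a false-positive rate could be measured once vulnerability-free control cases are mixed into a run. The `CaseResult` record and its field names are hypothetical, chosen only for illustration; nothing here reflects the benchmark's actual rubric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    has_vulnerability: bool  # ground truth for the case
    reported_finding: bool   # whether the Finder claimed a vulnerability


def false_positive_rate(results: List[CaseResult]) -> Optional[float]:
    """Fraction of vulnerability-free cases where a finding was reported."""
    controls = [r for r in results if not r.has_vulnerability]
    if not controls:
        # Without control cases the rate is undefined.
        return None
    false_positives = sum(1 for r in controls if r.reported_finding)
    return false_positives / len(controls)
```

Returning `None` when there are no control cases keeps an undefined rate from being mistaken for a perfect score of zero.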

cortesoft 2 hours ago [-]
Any code that we can be certain has no vulnerabilities is going to be pretty trivial to verify.
Rohinator 3 hours ago [-]
Very curious how Claude Mythos will perform here.