Two thoughts on how the evaluation is set up.
First, the Bradley-Terry tournament conflates importance with novelty. If Detail finds an issue that three PR-review bots also flagged at similar severity, it earns no extra attention in production: the engineer would have seen the issue from a bot's review anyway. It would be worth measuring what Detail alone surfaces versus the union of all the bots, and reporting that as a separate metric.
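Concretely, the separate metric could look something like this sketch (tool names and finding keys are hypothetical; real findings would need fuzzy matching on file and root cause rather than exact string keys):

    # Hypothetical sketch: fraction of a tool's findings that no other
    # reviewer surfaced. The string IDs stand in for a real matching step
    # (same file / same root cause), which is the hard part in practice.
    def unique_find_rate(findings_by_tool: dict[str, set[str]], tool: str) -> float:
        """Share of `tool`'s findings reported by no other tool."""
        others: set[str] = set().union(
            *(f for name, f in findings_by_tool.items() if name != tool)
        )
        mine = findings_by_tool[tool]
        return len(mine - others) / len(mine) if mine else 0.0

    findings = {
        "detail": {"scheduler-race", "kv-cache-off-by-one", "unchecked-div"},
        "bot_a":  {"kv-cache-off-by-one"},
        "bot_b":  {"unchecked-div", "style-nit"},
        "bot_c":  {"kv-cache-off-by-one"},
    }
    print(unique_find_rate(findings, "detail"))  # ~0.33: only the race is novel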
Second, judging with the same model family introduces self-preference bias. When I have Claude and Codex review the same PRs as I develop, Codex surfaces issues that Claude tends to overlook. And Sonnet 4.6's ratings of code that comes out of a Sonnet-driven pipeline inherit some of that bias, even with the summarization step in between.
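One way to check for this bias, sketched with invented numbers just to show the shape of the comparison: have judges from two model families score the same findings, then compare each judge's mean score for its own family's output against everyone else's.

    # Made-up numbers illustrating a self-preference check: a judge that
    # inflates its own family's output shows a positive own-vs-cross gap.
    from statistics import mean

    # (producer_family, judge_family) -> quality scores on a 1-5 scale
    scores = {
        ("claude", "claude"): [4.6, 4.4, 4.5],
        ("gpt",    "claude"): [3.9, 4.0, 4.1],
        ("gpt",    "gpt"):    [4.4, 4.5, 4.3],
        ("claude", "gpt"):    [4.1, 4.0, 4.2],
    }

    for judge in ("claude", "gpt"):
        own = mean(scores[(judge, judge)])
        cross = mean(
            s
            for (producer, j), vals in scores.items()
            if j == judge and producer != judge
            for s in vals
        )
        print(f"{judge} judge: own {own:.2f} vs cross {cross:.2f} (gap {own - cross:+.2f})")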
lmeyerov
The baseline Claude prompt being compared against feels pretty laughable, so I'm not sure what is learned. Maybe compare to a more realistic baseline for the DIY side for more compelling benchmarketing?
We started with a DIY code review skill because it's inherent to want to customize to our codebase and infra before trying solutions that add layers which may get in our way here. We have a 1-page skill that does separate passes on security, spec conformance, proper DRY & architectural abstractions, etc., and adversarial result-quality passes to prune & prioritize. Others do the same.
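For shape, the passes look roughly like the sketch below; these are not our actual prompts, and `run_agent(prompt) -> str` is a hypothetical stand-in for however your harness invokes a model:

    # Rough shape of a multi-pass review skill (illustrative prompts only).
    REVIEW_PASSES = [
        "Review this diff for security issues: injection, authz, secret handling.",
        "Check the diff against the spec it implements; flag conformance gaps.",
        "Flag DRY violations and abstractions that fight the existing architecture.",
    ]

    def review(diff: str, run_agent) -> str:
        # Separate focused passes, one concern each...
        drafts = [run_agent(f"{p}\n\n{diff}") for p in REVIEW_PASSES]
        # ...then an adversarial pass to prune false positives and prioritize.
        return run_agent(
            "Act as a skeptical second reviewer. Drop speculative or "
            "low-impact findings, then rank the rest:\n\n" + "\n---\n".join(drafts)
        )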
Once quality is fixed, I'd expect the comparisons to hinge less on hits/misses and more on token efficiency. That's a tricky one because developer-local review tokens are heavily subsidized right now.
sachiniyer01
> Maybe compare to a more realistic baseline for the DIY side for more compelling benchmarketing?
This is a fair critique. However, I don't really trust myself to write a great code review skill for vLLM or OpenClaw. I also don't think Claude Code is the right harness for this kind of deep, broad scanning work: we find it struggles to maintain clarity when considering many different bugs at the same time. The coding agents seem really great at single-goal tasks that they can Ralph their way to.
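To illustrate what I mean by single-goal, here is a hypothetical sketch of fanning out one narrowly scoped run per bug class, with `run_agent(prompt) -> str` again standing in for an agent harness; each run holds exactly one goal instead of juggling many candidate bugs:

    # Hypothetical single-goal fan-out: one bug class per invocation,
    # instead of one broad scan tracking many candidate bugs at once.
    BUG_CLASSES = [
        "off-by-one errors in size or index arithmetic",
        "races on shared state mutated from concurrent paths",
        "error paths that leak resources or skip cleanup",
    ]

    def scan(diff: str, run_agent) -> list[str]:
        reports = []
        for bug_class in BUG_CLASSES:
            # Each invocation can iterate toward exactly one goal.
            report = run_agent(
                f"Check this diff ONLY for {bug_class}. "
                f"Reply NONE if you find nothing.\n\n{diff}"
            )
            if report.strip() != "NONE":
                reports.append(report)
        return reports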
> We started with a DIY code review skill because it's inherent to want to customize to our codebase and infra before trying solutions that add layers which may get in our way here.
Being able to tinker deeply with the tools is pretty inherent to my love of dev tools in general. Our job is to make use of all of those customizations (our agent will use that 1-page skill when doing its bug finding). I also still think externalizing part of your dev workflow is the right way to get ahead. You really don't want to do the work of eval-ing/maintaining that skill to make sure it still performs well with each new model or something.
> and more on token efficiency.
I’m really confident in our ability to stretch $20 of tokens ;)
sachiniyer01
Author here!
Lmk if there are any qs I can answer about Detail or the post.
not_right4r987
Is this some kind of paid post? Asking someone with high karma to post it so it makes the front page of HN?
sachiniyer01
Nope, I’m not paying drob to post. Both of us work at Detail.