I find this study quite suspect. I'd have to dive deeper but there's definitely significant alarm bells that should be going off for anyone reading.
Figure 2 (page 6) screams problems. There's only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
volkercraig 37 minutes ago [-]
More than that, the entire structure of the study is pointless. They set it up as a question/response and then had humans rate the response. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on its response.
dragonwriter 23 minutes ago [-]
> There's also really clear bias given that the main results only feature Google models.
The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.
One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.
giancarlostoro 11 minutes ago [-]
I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e. college exams) but in practice, they might 'struggle' in different courts.
...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...
This all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.
Paracompact 9 hours ago [-]
Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?
RataNova 5 hours ago [-]
This is exactly why I'd be cautious about interpreting the preference metric too strongly
gguncth 6 hours ago [-]
Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.
wouldbecouldbe 5 hours ago [-]
The issue is, it almost always outperforms knowledge workers.
IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.
And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
I help run a few marketing websites where I let the CEOs run crazy with Claude cowork, they are making PRs like madmen, but they are not allowed to touch any of the APIs & platforms where there is real user data & sensitive information.
pezgrande 4 hours ago [-]
I kinda disagree. High-risk environments just mean that they will have to keep a human-in-the-loop for a longer time, which will drastically reduce the skill required for such a human (it still requires high skill, just not stupidly high).
ambicapter 14 minutes ago [-]
Doesn't it increase the skill required? You need to be able to jump in at the perfect time, while waiting patiently for 99% of the time. It's like self-driving that requires you to "jump in" at the worst possible time (0.5 seconds from a crash), and stay put the rest of the time--but don't get bored or inattentive. The only way to do that would be to be so naturally good at the danger point that you can do it basically reflexively.
jvvw 1 hours ago [-]
The employers will think it requires less skill, whereas in fact it might actually require more skill to do a good job of being the human-in-the-loop.
For example, my sister is a translator and she says that checking AI translations is actually harder in many ways than doing a translation in the first place, but the agencies pay less for checking than actual translation.
wouldbecouldbe 4 hours ago [-]
I think the opposite, only the most skilled will be required.
But it depends on the skill:
- For landing pages & simple saas solutions: marketeers & founders have more skill, since they understand the user best. The real skill is not the basic coding, but understanding the market.
- For security risks/architecture: senior devs can spot things in seconds
I'm not a doctor or lawyer, but I'm sure there are cases where AI is really good in a similar way and cases where it misses the most crucial aspects.
KaiserPro 3 hours ago [-]
> drastically reduce the skill required for such human
I mean that's what is wanted by some companies.
The problem, especially for things like legal, is that it requires someone more skilled to read through and tell whether the argument is bollocks, or whether the law/precedent they are banking on is in fact the right one.
We have a tool that auto-writes letters to our management companies when they break SLAs. We have a slider that goes from polite to we are going to extract your first born.
That's simple-ish to do for LLMs, and low risk.
Drafting contracts is also something we could probably do, as it's mostly boilerplate. However the consequence of mis-drafting a contract can run to multiple millions of dollars.
volkercraig 32 minutes ago [-]
Man, this comment made me think of a Kafkaesque future where two AI lawyers and an AI Judge are stuck in an infinite loop arguing over a case, meanwhile the defendant is running around trying to get anyone in the legal system to recognize that the AI is stuck.
scotty79 2 hours ago [-]
The end game of this is just a human capable of taking the blame when AI makes an occasional mistake and being paid for that service and risk.
moffkalast 4 hours ago [-]
If the human involved has no skill then they might as well not be there, since they're just a fall guy when things go wrong and won't do anything to prevent it from happening.
pezgrande 1 hours ago [-]
I said that still requires skill, just not as much.
goolz 5 hours ago [-]
Ya, while the tools are really solid and have seen huge leaps these past two years, in no way will an LLM be able to do any of it unguided in two years. Just a humble opinion that I would love to see be wrong.
alberto467 4 hours ago [-]
"in no way will an LLM be able to do any of it unguided in two years"
IDK "not any of it" seems a bit strong, especially thinking towards 2028. For a lot of knowledge professions, there is a surprising amount of tasks that are just dumb work compared to the rest.
ryan_n 4 hours ago [-]
> Just a humble opinion that I would love to see be wrong
Out of curiosity, why would you love to be wrong about that? What possible outcome could you see being a net positive for society if the vast majority of knowledge workers (and ultimately, as robotics progress, most workers in general) are replaced by AI?
ToValueFunfetti 1 hours ago [-]
I believe it was Blink-182 who said, "Work sucks". You have to pay people to do that stuff; they don't want to be there. And then you get into second order effects- costs plummet for anything labor intensive, including medical care, prepared food, cleaning, and private tutors. Then onto tertiary effects- if you can spin up a million genius researchers to attack a problem, you start seeing massive progress in every important area and it isn't tied to population growth.
I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing. Of course it would be good! It's just a matter of dealing with the (huge) side effect of lost income.
volkercraig 27 minutes ago [-]
In that scenario, AI would have to be a public utility, which it is not. Private corporations have no intention to provide services for public good. If they displace a billion jobs, they'll just throw up their hands and go "we're just an Ai company guyz"
nyeah 12 minutes ago [-]
I agree, work sucks. What stops you from quitting your job now?
UBI doesn't need to be impossible to make me want a job. If UBI was equal to the US federal min wage x 40 hours a week, I'd desperately want a job.
koe123 4 hours ago [-]
In a way, we are betraying something here. My reading is: solving the social problems of capitalism feels so impossible, that reducing the need for anyone to do work is a liability. In a way this sentiment should make extremists of us all?
bluefirebrand 2 hours ago [-]
Unfortunately the fact is that society has some massive imbalances around capitalism
It is not hard for me to imagine a world where if my bosses didn't need me, they would rather see me dead than pay me some kind of permanent income. They would prefer to keep that power to themselves
These are already the sort of people who will happily lay you off into a recession, leave you without a way to pay your rent or for food if it improves their bottom line. They do not care if you starve. Or at least they care less than they do about their quarterly bonus
So no, I don't trust these fucks to continue playing nice if they view my value as going to zero
wouldbecouldbe 5 hours ago [-]
Yeah it can do things unguided if the tests to confirm its correctness are very solid. That's where a lot of progress has been made and where agents are good, but this is domain specific, and a chance where startups can shine.
spider-mario 3 hours ago [-]
> And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
Which also happens with humans – does it do so at a lower rate? On its own, it kind of sounds like similar anti-self-driving-car arguments.
wouldbecouldbe 2 hours ago [-]
Yeah, that's why I mentioned it works well IF guided by the correct expert.
I agree that you can create a set of domain-specific rules and reinforcement-layer validation tools, like in self-driving, that vastly improve the accuracy of AI & LLMs, making humans less and less needed. But where the magic of LLMs comes from generic knowledge, this will be the opposite, narrowing it down.
nonethewiser 2 hours ago [-]
Yeah but even what you describe makes it an extremely useful tool and productivity boost. Sure, we're not going to deploy a lawyer agent with full autonomy and no more oversight than a real lawyer. But isn't it wild that's now the frontier?
It's not like self driving cars where better than a human 80% of the time isn't good enough and they aren't really usable until it's 95%, 99% etc.
ambicapter 16 minutes ago [-]
Autopilots have been able to land planes for years (decades?), and yet they still don't land passenger planes at any increased rate.
amelius 5 hours ago [-]
> Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.”
With that kind of logic ... anything is possible.
AlecSchueler 5 hours ago [-]
> the study where it beats our highest caliber of knowledge workers may have some methodological deficits
The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.
bobro 2 hours ago [-]
>the study where it beats our highest caliber of knowledge workers may have some methodological deficits.
That isn’t even remotely what this study is looking at.
taco_emoji 41 minutes ago [-]
I will never trust an AI as much as a person
Forgeties79 6 hours ago [-]
Assuming it keeps improving at the same rate, which I think we are already seeing not play out. If you compare the first six months when GPT truly hit the mainstream to the previous six months, the improvements are not nearly as evident. That isn’t to say they aren’t noticeable, I could definitely tell it’s improving, but not nearly at the pace it once was.
There’s also the fact that they can’t possibly keep improving frontier models at the same rate (I.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.
kenjackson 1 hours ago [-]
The issue is that before GPT models basically were useless for any conversation. We are literally in science fiction realm. From a text conversation perspective the gap between where we are at and what’s left to get to is relatively small.
In my opinion, the main thing we need to do is have training happen continuously. And probably more real world data (from sensors).
the_gastropod 1 hours ago [-]
The ELIZA effect has been around since 1966. I think lots of folks feel “AI” has advanced much more quickly than it really has because of the nature of its many past boom / bust cycles.
nopurpose 6 hours ago [-]
On the other hand we keep seeing only marginal generational improvements in the CPU space, yet performance gains over the last 10 years in CPUs are very material.
Every new model might not be a leap like it used to be, but give it enough time and improvements add up.
Forgeties79 5 hours ago [-]
Nobody is disputing that. I specifically said that I can see the improvements from the last six months. What I’m saying is we can’t assume that every two years it will improve at the same rate.
The further we get into this, the more AI feels like 3-D printing. Significantly bigger and will be more widely used for sure. But nowhere near the “new industrial revolution” that all these companies are making it out to be
nopurpose 3 hours ago [-]
Do you agree that the economic and behaviour shift will be comparable to mobile, and that we are in the Nokia 3310 era? Does that count as an industrial revolution?
Forgeties79 3 hours ago [-]
I think that’s kind of a strange question/parallel that doesn’t have a concrete answer, partially because even the people making these tools don’t really know where it’s going to land or what the ultimate utility is. Hence why they’re begging all of us to figure out the billion dollar applications for them.
Ultimately they are clearly here to stay but I think they are going to be incredibly important in some industries and minimally present in others (a glorified chatbot/summarizing tool for instance). Whatever form it takes it’s definitely not going to be a model where individuals have subscriptions they pay for monthly.
nopurpose 1 hours ago [-]
> even the people making these tools don’t really know where it’s going to land
Exactly my point in comparing it with the pre-iPhone mobile market: wide (and growing fast!) adoption, clear potential (WAP websites, J2ME games), many players in the game, some real market fit discovered already (Blackberry), influx of capital and tinkerers alike, but still a lot of unknowns where it will ultimately land.
Even if no single improvement was revolutionary (even first iPhone was just a fancy phone without App Store), overall mobile made billion dollar industries possible, for better or worse, and changed the way we live. Counts as industrial revolution, comparable to the Internet itself in my eyes.
fragmede 47 minutes ago [-]
What would 3D printing have to do in order for it to be the new industrial revolution to you?
Hfuffzehn 3 hours ago [-]
I agree.
But notice that you assume that there is a metric with which you can measure improvement.
Which is fine if you are measuring against your personal taste.
But it might be that the optimization target itself has a ceiling. If you're training toward human approval ratings from a broad population, you converge toward what median preference selects for. The plateau is baked into what you're measuring against.
byzantinegene 5 hours ago [-]
it's also worth keeping in mind that a lot of the 'improvements' are actually advancements in harnesses and tools.
windexh8er 3 hours ago [-]
This is the hot button right here. Most of the advancements have also come at the cost of excess: exponential token use at the expense of marginal gains.
Context is still a large limiting factor, and we have band aids around that area already. And the further along we go the further distributed LLMs get in terms of additional pieces.
As for the original article and sentiment I'm sure AI will be a boon for law. It's going to be much easier for the general consumer / person / small business to represent themselves which feels like a win. The downside is I feel like we're tracking towards a digital hell of "virtual lawyers" that will be at the whim of any org. Consumer laws really need to change now to help avoid this dystopian path we're on.
snk 2 hours ago [-]
I'll note that having countless superintelligences handy will accelerate problem-solving e.g. dystopia.
skywhopper 4 hours ago [-]
Your “some methodological deficits” is doing a lot of work.
0xDEAFBEAD 3 hours ago [-]
What if the methodological deficits are actually causing the paper to underestimate the quality of the AI responses? Why assume any deficits would bias the AI's competence upwards instead of downwards?
internet_points 4 hours ago [-]
"the study that claims it beats our highest caliber of knowledge workers has methodological deficits" ftfy
so extrapolating from that, in another two years it will continue to bamboozle
esquivalience 8 hours ago [-]
I think your 3k figure comes from here - It is explained:
> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
IshKebab 7 hours ago [-]
So were the answers fact checked? If not that seems like a pretty obvious flaw!
epolanski 3 hours ago [-]
The study deliberately analyzes questions that don't have clear black or white answers, what matters is the reasoning.
vlan121 45 minutes ago [-]
Viewed the other way around, one should ask with what intent the study was designed like this. And for obvious reasons it sounds like the motive is monetary in nature.
skywhopper 4 hours ago [-]
I find it entirely likely that the preference for the AI generated answers is entirely due to the confidence of its assertions. Given the numbers of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing in choosing which of two statements you’d give to students.
RataNova 5 hours ago [-]
Agreed. The study might show something useful, but the headline is doing a lot of work.
ALittleLight 11 hours ago [-]
The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons and they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models but using LLMs to judge instead of human professors.
godelski 11 hours ago [-]
Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.
But is it a surprise law professors aren't great statisticians?
Certhas 7 hours ago [-]
I disagree. 16 isn't necessarily the relevant N here but the number of responses is.
If you have 100 responses from 1 professor, and the AI wins 75% of the time that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.
Further, if you sample 16 profs and the AI beats 10 of them you can be fairly certain that the real percentage of profs it beats isn't 10%. Further, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt N. If you have a coin and it lands heads up 16 times, that tells you something quite robust about the coin.
Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.
A good heuristic is "add 2 successes and 2 failures" which is due to Agresti & Coull.
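To make the "add 2 successes and 2 failures" heuristic concrete, here is a minimal sketch of the Agresti-Coull interval in Python. The function name and the default z = 1.96 (roughly a 95% interval) are my own assumptions, not anything taken from the paper; the two inputs are the hypothetical 75-of-100 and 10-of-16 cases from this comment, not the study's actual data.

```python
import math


def agresti_coull_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a binomial proportion.

    With z = 1.96, z*z is about 4, so this is close to adding
    2 successes and 2 failures before computing a normal interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    n_adj = trials + z * z                      # adjusted number of trials
    p_adj = (successes + z * z / 2) / n_adj     # adjusted success rate
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1] since a proportion cannot leave that range.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


if __name__ == "__main__":
    # Hypothetical examples from the comment above, not the paper's numbers.
    for wins, total in [(75, 100), (10, 16)]:
        low, high = agresti_coull_interval(wins, total)
        print(f"{wins}/{total}: roughly {low:.2f} to {high:.2f}")
```

Under these assumptions the 75/100 interval sits entirely above 50%, while the 10/16 interval is wide enough to include 50% but still rules out something as low as 10%. That matches the two claims above: many responses from one professor say a lot about that professor, and 16 professors say only a little about professors in general.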
But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.
greggoB 7 hours ago [-]
> I wonder how much we are learning with these studies about what is going on.
So your alternative is to not have any studies and everyone can just stump up anecdata as "evidence" for the capabilities of these models?
jstummbillig 5 hours ago [-]
Doing things that are well meaning, but ineffective is not great policy. The simplest alternative to doing things that don't work is always not doing them. Better ideas are of course welcome, but not required.
greggoB 1 hours ago [-]
I don't think that's how science/academia works. There is no such thing as a perfect study, there are always non-idealities and noise in the data. Good studies make well-justified efforts to account for these, OP is saying they don't believe this is the case here.
Regardless, your assertion that "oh well, the models will be totally different in a few months anyway, therefore any study done today is pointless" seems more than a stretch. How do you know they will be so different? How can you verify that today's studies are completely irrelevant?
master-lincoln 6 hours ago [-]
it sounds like you are saying science doesn't matter but your feelings do
suddenlybananas 7 hours ago [-]
Does it matter if a study is fraudulent or incompetent? Yes.
zeristor 7 hours ago [-]
That is the assumed narrative; however it shouldn’t bias any evidence.
runarberg 12 hours ago [-]
The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find because they don’t publish their donations in a place I could find it; but I suspect it is a lot). And the authors did not declare a non-conflict of interest at the end of the paper.
keeda 9 hours ago [-]
Wait, where are you seeing the link to HAI? TFA mentions something called "liftlab" which seems to be something under Stanford Law School and separate from HAI. The study has more than a dozen authors from as many different universities but HAI is not mentioned.
tomjakubowski 8 hours ago [-]
The leader of the study, Julian Nyarko, is Associate Director and Senior Fellow at HAI. I can't say whether that means the study was conducted by HAI, but there is at least a connection to it. https://hai.stanford.edu/people/julian-nyarko
pezgrande 4 hours ago [-]
Do papers need a "non-conflict of interest" disclosure nowadays to not be considered just ads?
net01 3 hours ago [-]
The HAI is also funded with money from OpenAI, Anthropic, and other big tech corporations. I don't know what you are trying to prove.
scotty79 2 hours ago [-]
> That's very high variance
Do you doubt that the educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.
philipwhiuk 4 hours ago [-]
This is the bit I'm suspicious of:
> They calibrated AI responses to match the length and structure of human answers
which I would guess removes AI's hallucinations and errors somewhat.
damnesian 5 minutes ago [-]
Does the "outperforming" conclusion incorporate the appropriateness of decisions? Or just if things are technically correct. Without human eyes on cases, things could easily get very off track. AI can do a lot of data wrangling, but there is no conscience.
aristofun 4 hours ago [-]
In general it is not surprising. Even if this particular study is bad.
There are certain areas of law work that are about analyzing large amounts of texts, drawing conclusions and writing other texts based on that and nothing more. That is literally the bread of LLMs.
Those types of lawyers should be the first in line for unemployment, not programmers, not even close.
alansaber 2 hours ago [-]
"That is literally the bread of LLMs." correct. However, programming has a large number of advantages RE LLM use compared to law:
You can execute the logic, and set up loops from the output. You can set up more useful RL. It's easier to generate synthetic training data. It naturally supports tool use and agent parallelism. It's easier to integrate with APIs (with what few APIs the court systems provide). Programming explicitly encodes abstractions at the function, module levels etc that are easier to KG/reason/build upon than text chunks.
nickburns 2 hours ago [-]
'Bread *and butter'. The English expression requires the second part—but otherwise fits perfectly in your well-stated point, with which I wholeheartedly agree.
Source: AAL.
aristofun 59 minutes ago [-]
Thank you! As a non native speaker I was not sure if “and butter” is a mandatory part but didn’t want (nor had time) to llm the comment for the sake of authenticity :) TIL
nickburns 55 minutes ago [-]
Ah, great! I 'took a shot in the dark' there. Glad it 'panned out'. ;)
conartist6 3 hours ago [-]
I see the same problem with AI in both programming and law though.
AI is like a scab on a wound: it's a temporary filler, it rushes in to fill a void, but it's not going to be the final solution.
Models showed us that there was huuuge unmet demand for literacy, both in software and in law. But now we have a choice to either address the systemic causes of the unmet demand, or just try to paper over them with layers and layers of AI scab.
bluefirebrand 18 minutes ago [-]
> But now we have a choice to either address the systemic causes of the unmet demand, or just try to paper over them with layers and layers of AI scab.
Yeah, but in my experience it won't come down to "which is the better solution" but "which is cheaper/easier"
So I look forward to lots of layers of papered over AI scabs in the future. It won't be cheaper in the long run, but it will pump someone's quarterly numbers enough that they get a promotion before the problem they introduce comes back to them
NoboruWataya 3 hours ago [-]
These are academics. Not to disparage them or their work at all but it is very different to the transaction or litigation work that is done in BigLaw. It is a lot more focused on analysing and summarising existing texts, which are themselves more easily available for LLMs to train on (statutes, case law, legal journals, textbooks). As such it is probably the easiest legal work to LLM-ify but also the least valuable, because I assume law professors aren't getting paid nearly as much as BigLaw lawyers. So this approach won't scale. Not to say AI won't crack BigLaw but it will be a different challenge.
scotty79 2 hours ago [-]
LLMs answered student questions off the top of their heads, without any refresher look into the case law. And systems that were primed with the case law like NotebookLM underperformed when compared to baseline LLMs that you'd ask anything about anything.
It's not about what LLMs can or are suited to do. This study shows strengths of what's already in them, innately.
epolanski 3 hours ago [-]
The more I see the evolution, the more it looks to me that any knowledge worker is going to be impacted.
streetfighter64 3 hours ago [-]
> analyzing large amounts of texts, drawing conclusions and writing other texts based on that and nothing more
The same could be said about programming. Or if you want to be even more reductive, looking at a screen and pressing buttons to make the correct lights light up https://xkcd.com/722/
aristofun 55 minutes ago [-]
Philosophically or metaphorically speaking - yes.
But in my comment it is literally what some subset of lawyers do.
Literally is much more tangible and risky in terms of real impact on employment etc.
causal 14 hours ago [-]
As a software engineer I have some intuition for what the risks are of letting agents do some tasks vs others.
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know. Our legal system is notoriously rife with footguns.
qingcharles 9 hours ago [-]
I've used general purpose LLM AI (e.g. run-of-the-mill Claude, GPT etc) heavily to draft legal documents. The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever. You can get really comfortable after checking its output a few times and getting no false citations, and then BAM, it'll put three in the next motion it writes.
Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from Lexis Nexis/Westlaw are probably better than the general purpose ones.
LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.
Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.
em500 8 hours ago [-]
> The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever.
Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?
timpera 7 hours ago [-]
It depends on the jurisdiction. I'm based in France and all cases here are now freely available online to people and agents [1], but it's very recent for lower courts. However, I recently had to work on Texas case law and we had to purchase access to a (very expensive [2]) database since most of it wasn't public.
It’s a band aid solution because the model can get stuck in a refutation loop, where it argues a point by pulling up a contradicting source ad infinitum. The holy grail, which has not been yet reached, is figuring out how to dynamically align the model to be consistent with all the sources in the first place (and this is a problem of provenance rather than model design)
tiahura 3 hours ago [-]
I’ve been doing AI legal research via a caselaw API with Claude Code for at least a year and I’ve never seen that happen.
thenickdude 5 hours ago [-]
>The biggest trap is the hallucinated citation
The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient don't you think?
I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.
FeepingCreature 5 hours ago [-]
I've had the same experience with programming AI. It is very convenient, but convenient doesn't mean unlikely. The universe appears to have given us a convenient thing here.
lawtalkinghuman 4 hours ago [-]
Just because the citation exists, what the LLM says it stands for and what it actually stands for are not the same.
For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.
(If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)
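The failure mode described above (a real citation from the wrong area of law) can be triaged separately from outright fabrication. A minimal sketch, assuming a case index that carries an area-of-law tag; `CaseRecord`, `CASE_INDEX`, the tag values, and the case names are all hypothetical placeholders, not a real database schema. Passing this check only means a citation deserves a human read, not that it stands for what the model says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    citation: str
    area_of_law: str  # hypothetical metadata tag, e.g. "commercial" or "family"


# Hypothetical index; a real one would come from a case-law database.
CASE_INDEX = {
    "Case A": CaseRecord("Case A", "commercial"),
    "Case B": CaseRecord("Case B", "family"),
}


def triage(citations, expected_area):
    """Bucket citations as missing, off_topic, or plausible."""
    buckets = {"missing": [], "off_topic": [], "plausible": []}
    for citation in citations:
        record = CASE_INDEX.get(citation)
        if record is None:
            # Not in the index at all: possibly hallucinated.
            buckets["missing"].append(citation)
        elif record.area_of_law != expected_area:
            # Real case, but from a different area of law.
            buckets["off_topic"].append(citation)
        else:
            buckets["plausible"].append(citation)
    return buckets
```

Calling `triage(["Case A", "Case B", "Case C"], "commercial")` would put "Case C" under `missing` and "Case B" under `off_topic`; everything in `plausible` still has to be read by someone who knows the field.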
skinfaxi 2 hours ago [-]
ChatGPT regularly hallucinates entire cases out of whole cloth or fabricates an entirely different fact pattern for a given case. Perplexity does much better at citing its sources and providing accurate quotes, at least in my experience.
eunos 8 hours ago [-]
Seems companies like Thomson Reuters or other legal services have an incentive to build LLMs with RAG over legal case texts and robust hallucination detection on references
RataNova 5 hours ago [-]
I think the paralegal analogy is right, but with one important difference: a human paralegal usually knows when they are unsure, or at least can be trained to flag uncertainty
BartjeD 8 hours ago [-]
A legal professional can be personally liable for not finding the most recent case-law.
The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.
I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.
Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable. So it can save you some work; but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.
lukan 5 hours ago [-]
"The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation."
But they can perform live websearches or go directly to a DB specified.
timpera 7 hours ago [-]
You definitely want your AI to search legal databases, and not draw from "memory". This is where AI offerings from Thomson or Lexis could shine, especially in jurisdictions where case law is not freely available online.
tiahura 3 hours ago [-]
Or you can just have Claude code search westlaw / vlex/ courtlistener
thewebguyd 14 hours ago [-]
I think this is probably true for most skilled professions. AI is best used in the hands of folks already knowledgeable in the skills/professions they are using it for.
I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely to make the problem worse, or install something sketchy from the ad riddled results than I am, or one of my help desk employees are.
I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.
godelski 13 hours ago [-]
> I think this is probably true for most skilled professions.
I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into over confidence.
I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having over confidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
stult 13 hours ago [-]
I would agree with this point and as I explained in a comment replying to the GP comment above, that atrophy is far more dangerous in the legal field than it is with code because legal documents do not benefit from the structural safeguards available for code, like automated testing, static typing, static analysis tools, etc. IME with legal LLMs so far, they are easily in that most dangerous valley where they can lull you into a false sense of security while still introducing extremely dangerous mistakes that are frequently difficult to detect without very careful reading.
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
ChrisMarshallNY 14 hours ago [-]
> I wouldn't really want to rely on my lawyer to use AI to write code for me.
Yet that is exactly what a lot of C-Suiters (many of whom are lawyers), are doing.
xiaoyu2006 14 hours ago [-]
Vice versa, there are also a lot of irresponsible programmers doing stupid things with AI. Irresponsible people stay irresponsible; AI just makes them more productive at being irresponsible.
consp 9 hours ago [-]
The problem is the low levels have no influence whatsoever. The higher ups force crap down and none ever comes back.
tiahura 3 hours ago [-]
Corporations are DEMANDING legal ai because it is so much more efficient.
ChrisMarshallNY 2 hours ago [-]
Lawyers creating legal stuff, via LLMs is OK. Programmers creating software through LLMs is OK.
Mixing them, is, not, in my experience, OK. In the future, I am sure that LLMs will reach the point, where their output will be beyond reproach, but we're not there, yet.
That means that someone that knows the context and content, needs to vet the output, before sending it on.
zuzululu 14 hours ago [-]
im not so sure
i think devs overestimate their own role and underestimate others
i am seeing lawyers and doctors roll out their own software with AI
but we dont have their training and experience
thatcat 13 hours ago [-]
So a software engineer could diagnose an illness with ai, even if they happen to be right that doesn't really prove much about how bad of an idea it could be in a long tail scenario.
causal 5 hours ago [-]
Also worth remembering that LLMs have jagged intelligence. They are probably better software developers than anything. Is there a complement to Gell Mann Amnesia- where you assume it’s good at other jobs because it’s good at yours?
stackghost 14 hours ago [-]
It's like that in engineering, for sure. My background is in aerospace and there are lots of things that a reasonably technically-inclined random can probably do passably. It takes an engineer to know which tasks those are, though.
I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.
ubercore 2 hours ago [-]
Agreed, and it's the same in software. Probably the biggest time-sink right now as a tech lead is people going from idea to fully-fleshed-out PR, and then having to go back to have a discussion of "was this the right thing to do". It causes frustration all around (being a "no" much more, and having someone tell you your finished work isn't valuable).
stult 13 hours ago [-]
IME so far (as both a lawyer and a software engineer), LLM error rates when drafting code and legal documents are reasonably comparable, but it's more problematic in the legal context because legal documents do not benefit from many of the structural safeguards available for code. For legal documents, there are no automated tests, no static typing, no test environments, no logging/observability instrumentation, no sandboxing.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
Otterly99 2 hours ago [-]
There is also one thing I would like to add, and you can correct me if you disagree: coding benefits much more from thorough planning. Now, I exclusively work by first writing a plan that has well-defined steps and goals, which can of course change over time.
It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).
Hfuffzehn 3 hours ago [-]
This is a very good comment.
But notice how even in software engineering there is still disagreement about these structural safeguards.
So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.
But experts might disagree what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.
So what are we supposed to train the LLMs towards?
Somebody still has to decide what "good" is.
causal 4 hours ago [-]
Hidden gem of a comment, thanks for writing
calvinmorrison 13 hours ago [-]
Well this is largely the fault of law itself. especially english style law. A legal, parseable code, in which not every single tiny municipality (some less than 1 square mile) has their own set of rules and laws, not all published or available - but which citizens are expected to abide by of course - how could we expect AI to do well and not some typical TV southern lawyer who knows the judge?
stult 13 hours ago [-]
I could not agree more. A simple example: it boggles my mind how every state organizes their statutes in entirely dissimilar ways. I'm not sure there's a need for every state to have slightly different wording for a murder statute in the first place, but even assuming there is, why do they all have to be scattered around in different code sections instead of every state just following some consistent convention like always putting the murder statute at Title V, Section 1.4 (or whatever the case may be, that's just a random invented example).
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
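The "unit testing for legal drafting" idea above can be sketched in a few lines. This is a toy, assuming a lookup table that stands in for a real, consistently addressable case-law database; `KNOWN_OPINIONS`, the citation regex, and the sample opinion text are hypothetical, and real citation formats vary far more than this pattern handles.

```python
import re

# Hypothetical stand-in for a case-law database: maps a normalized
# citation to the full text of the opinion.
KNOWN_OPINIONS = {
    "123 f.3d 456": "The court held that the contract was void for lack of consideration.",
}

# Very rough pattern for reporter citations such as "123 F.3d 456".
CITATION_RE = re.compile(r"\b\d{1,4}\s+[A-Z][A-Za-z0-9.]*\s+\d{1,4}\b")


def normalize(citation):
    """Lowercase and collapse whitespace so lookups are format-tolerant."""
    return " ".join(citation.lower().split())


def check_citations(draft):
    """Return (citation, found_in_database) pairs for every citation in the draft."""
    return [
        (match.group(0), normalize(match.group(0)) in KNOWN_OPINIONS)
        for match in CITATION_RE.finditer(draft)
    ]


def check_quote(citation, quote):
    """True only if the quoted text appears verbatim in the cited opinion."""
    opinion = KNOWN_OPINIONS.get(normalize(citation))
    return opinion is not None and quote in opinion
```

A check like this only catches citations that do not exist and quotations that do not appear in the cited text. It says nothing about whether the case actually supports the proposition it is cited for.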
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.
Merad 12 hours ago [-]
> Some things seem harmless, i.e. drafting a will
Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.
hparadiz 3 hours ago [-]
I recently had to file to become an estate admin with no will at all. And it was literally cheaper for me to fly 3000 miles to do it in person than it was to pay a lawyer. Because lawyers are frankly greedy scumbags half the time. They don't offer an appropriate cost for the service. Instead the conversation immediately goes to "how much" money is in the accounts and suddenly they want a percentage of your father's estate for filing two pieces of paper.
And in my experience if you do actually pay a lawyer for something they will act like you're not worth their time and will literally roll their eyes at you when you're trying to explain the minor details of a case because they are too lazy to listen and zone in like I would when doing my job.
b40d-48b2-979e 12 hours ago [-]
Most people don't have anything that could even be called an "estate".
jcranmer 11 hours ago [-]
Judging from reported figures, roughly 80-90% of households in the US [1] have a household net worth of at least $0. That means that most people do in fact have an estate.
Median household net worth is in fact somewhere in the $100k-200k range, which is definitely something that could be meaningfully called an "estate." (Most of this tends to be the house, the median net equity in which is about $190k as of 2022).
[1] This doesn't mean "homeowners," rather it's a recognition that assets for married or cohabitating couples are usually commingled.
acdha 12 hours ago [-]
It’s just the legal term. If you have a relative die with a bit of stuff and an ancient car, they have an estate and someone needs to execute it even if the total value is less than most lawyers care about.
nocoiner 12 hours ago [-]
Everyone has an estate. Only thing is that you have to die first.
toss1 11 hours ago [-]
Ummm, not quite.
An "estate" is a legal term for property, assets, and liabilities a person leaves behind upon their death. A family member is a top practitioner in the field of estate planning and resolution, and some of the messiest estates they have handled are pro-bono cases of exactly the type of people you would put in italicized "most people": poor, not really able to upkeep a house they inherited from a relative which hadn't had title properly transferred on a previous death because they didn't have money for an attny, now can't get a loan to fix the roof...
Yeah, if you are homeless, carless, and have only the clothes on your back and a shopping cart of stuff, you don't have an estate. Everyone in the middle class in the US has an estate. Much of the time it passes automatically to their spouse on death, but it's still an estate.
And if you are concerned about where it goes, get a GOOD attny. There are many bad ones hanging out their shingle as "Trust & Estate" attnys, and some of the next messiest cases are fixing problems made by those not-so-good attnys.
And NO, AI is not good enough.
onlyrealcuzzo 4 hours ago [-]
As someone who's been sued frivolously...
Believe it or not...
A lot can go wrong if you have real life human lawyers draft a legal document.
_heimdall 14 hours ago [-]
I wouldn't consider drafting a will to be harmless. If its done poorly the next of kin could have to deal with a huge headache and potentially months or years of probate proceedings.
grogenaut 11 hours ago [-]
I had a very well crafted will from my parents, one of whom was a very good lawyer hiring other good lawyers. It was still a pain in the ass for many of the reasons they were trying to make it easy for us.
One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.
Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.
The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot and not that easy to find details. This was pre-AI, which would have helped walk through this shit.
rayiner 14 hours ago [-]
I would think that LLMs would be better at avoiding foot-guns. That’s a situation where you have a list of well known rules and potential pit falls, and the work of the lawyer is to apply those to a fact pattern. That’s something that has been hard to automate programmatically, because the fact patterns are similar but different. LLMs, however, seem to excel at applying general principles to differing fact patterns.
atmavatar 13 hours ago [-]
Instead, the LLMs create entirely new foot guns like citing non-existent cases. You can't go more than a week without encountering another news report of a lawyer submitting an AI-generated legal brief rife with bogus case citations, which even includes briefs submitted to state supreme courts.
I would categorize this in the "expertise that people internalize but never figure out how to verbalize" department, and that is a department we have no way to teach an LLM because if nobody is writing out those unspoken, subconscious rules then the LLM has nothing to read about them in its training data.
visarga 7 hours ago [-]
> and that is a department we have no way to teach an LLM because if nobody is writing out those unspoken, subconscious rules then the LLM has nothing to read about them in its training data.
I think on the contrary, LLM providers accumulate huge logs of interaction with their users, which elicit that tacit knowledge and mine it and humans cooperate willingly in order to solve their tasks. Just imagine the corpus of sessions for scientific research, education or software development, it is probably the largest such collection ever to exist. Trillions of HITL tokens per day flow into those logs, carrying our perspectives, choices, original ideas and tacit knowledge. I call this the "human-AI experience flywheel". It's the new stackoverflow, next model generation is based on interaction data from previous one.
My favorite example of this is knowing how to untangle a big pile of cables. There are robots now which can untie a single knotted cable, but I don't think any can do a pile of cables yet. https://www.youtube.com/watch?v=vp-94rsherE
galaxyLogic 13 hours ago [-]
Good point. Same probably applies to code as well: coders must tell us why they wrote the code the way they did. And if they have comments in their code, those are highly untrustworthy because nobody fixes comments if the code works.
goodmythical 13 hours ago [-]
I don't know the source off hand, but I've seen llms hallucinating case citations in order to "prove" their premises.
can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"
dylan604 14 hours ago [-]
But can an LLM come up with questions like what the definition of is is? Seems to me there's a lot of "depends on how you read it" type of stuff where lawyers excel at finding novel interpretations. So what coders think of as rules are much less straightforward to understand when it comes to laws
rayiner 14 hours ago [-]
I think that’s a different task than the one OP is referring to. To your example, I’m not familiar with the capability of LLMs in that regard. I have struggled with using the AI features of westlaw when it comes to that sort of argument. (Basically, making an argument that strays from the typical route, because that’s the position you happen to find yourself representing.)
xmcp123 13 hours ago [-]
I think that's actually a perfect analogy to AI writing code.
Drafting a will seems like not a big deal, until that will is accepted as "good enough" and is then in court and under fire.
teiferer 9 hours ago [-]
> drafting a will
Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.
RataNova 5 hours ago [-]
I think that's the right intuition. Legal AI feels especially dangerous because the output can look competent while hiding jurisdiction-specific footguns
conception 11 hours ago [-]
This is why I can’t see how college grads are going to survive the AI apocalypse. domain experts driving LLMs are super powerful because they can spot where they make mistakes. Juniors don’t have that insight and the LLMs then cost them productivity.
geraneum 10 hours ago [-]
> domain experts driving LLMs are super powerful because they can spot where they make mistakes
I don’t know if that’ll be true for long. I just had my colleague who’s a very competent engineer IMO hand me a frontier model vibed PR to review (after reviewing it himself, he claims) which contained random variable assignments, conditionals that do nothing, etc. He’d never do such a thing before. People become too comfortable and get confirmation bias as well.
15155 4 hours ago [-]
> drafting a will
Tell me you've never been the executor of an estate in the United States without telling me.
hparadiz 3 hours ago [-]
I think going through this process has made me uniquely qualified to write one.
knollimar 14 hours ago [-]
I'm afraid since claude cheats in benches, what will it do with law?
datsci_est_2015 13 hours ago [-]
Hmm, what’s the law equivalent of using docker to bypass sudo?
knollimar 7 hours ago [-]
can you make really convincing but flawed arguments that are historically able to win despite competent opposition?
dgellow 8 hours ago [-]
The same in every other domain. It’s happening now, not in a future tense
godelski 13 hours ago [-]
Cheat.
Or worse, use historical data to determine the laws of today.
prpl 14 hours ago [-]
there’s really no limit to how many times and ways you can review something with AI, except dollars.
Boss0565 14 hours ago [-]
cannot IMAGINE letting ai write my will rn.
jay_kyburz 14 hours ago [-]
I imagine it's really hard to spot a comma in the wrong place, or a missing sentence in a 10 page contract unless you wrote it yourself, or you assembled it from some battle tested templates.
pojzon 10 hours ago [-]
To give you an example of what can happen if you use AI in a legal battle, you can look at the Valve vs Rothchild case [1].
I understand why the conversation on this article looks like it does, but the study is specifically focused on the potential for LLMs to operate as tutors for law students. I enjoy the extrapolation out to whether LLMs will replace lawyers, but did not find that to be discussed in the study itself.
In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.
The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.
Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.
On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.
scotty79 2 hours ago [-]
> On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel
I think it indicates that LLMs are smart enough to be used in the context of law education.
quantisan 13 hours ago [-]
I'm surprised Stanford Law would go along with this over-reaching press release title. How about "For common first-year contracts-law questions, law professors preferred AI-generated answers to professor-generated answers"
mchl-mumo 9 hours ago [-]
The revised title is spot on. It's odd to me how academics are trying to sound like top research labs' CEOs trying to pump valuations by overreaching claims.
goodcanadian 3 hours ago [-]
It is rarely the academics writing the press release. It is even rarer that the author of the press release chooses the title.
ulrischa 5 hours ago [-]
By its very nature, the field of law is ideally suited for AI language models. Fundamentally, everything is based on interconnected texts. I believe that even larger waves of layoffs could loom here than in the IT sector. However, it is likely that a more powerful lobby will be at work here—one that will grossly inflate the perceived value of their work and shield it from outside intrusion.
tiahura 3 hours ago [-]
As a lawyer, I think your intuition is right re llms. Law is the wordplay that llms thrive at.
However the waves are starting and they ARE going to be huge. Corporate clients are insisting on AI. They don’t want to pay an associate hours to draft anything to be reviewed by a partner. They want top partner to use AI and just proofread.
grosswait 4 hours ago [-]
He who makes the rules…..makes the rules.
chewbacha 14 hours ago [-]
My best guess is that Gemini was trained on the textbooks that the questions are meant to test against, thus they are probably better at explicit recall of those questions or related questions.
This is a pretty limited introductory course based on what it says in the methods of the paper itself.
runarberg 13 hours ago [-]
That and the research is done by Stanford’s HAI institute with an obvious bias and the paper is curiously missing a conflict of interest statement.
EDIT: just found out that Google is a major donor to HAI. So this research is at least partially funded by Google. Which is probably the reason the authors fail to declare no conflict of interest.
TrackerFF 6 hours ago [-]
In many (most?) countries you can defend yourself, waive your court appointed attorney. You are of course highly discouraged to do so. But sometimes people do it, mostly for smaller claims where they don't want to rack up legal bills for things which might cost more than what is at stake.
But, it makes me wonder, will clients be able to use these AI-attorney systems in the future, in the court. Where they basically either just parrot what the model is instructing them to do, or - I dunno - give the model permission to speak for them (while waiving liabilities).
I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.
bonesss 5 hours ago [-]
Pro se litigants are hyper vulnerable to LLM hallucinations.
One wrong advice clump and, like a step onto the wrong path while hiking, all subsequent steps go in the wrong direction. And sycophancy tuning means marginal one-sided takes get presented as sure-fire things.
I’m of the opinion that the big wins aren’t in using the LLMs to do the work (legal, in this case), but rather to refine and improve the dialog and presentation from all parties. A court-centric LLM could give likely procedural needs to a litigant, and a law-firm-centric LLM could help a pro se litigant create a meaningful and refined set of questions for lawyer consideration, condensed and targeted, saving all parties time and confusion while meeting the client’s linguistic needs ‘where they are’.
All the lawyers know things LLMs never will, the law is interpreted, and the written part isn’t engineering grade facts but suggestions interpreted in context. Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule. But as the law stands even with federal statutes and citations from the court’s website, practicing lawyers will frequently end up explaining that in this county/country/court/jurisdiction The Way of Things is different.
TrackerFF 4 hours ago [-]
I think it could work for some things. Years before LLMs became capable of doing anything substantial, people were selling "legal services" via websites where people could dispute trivial stuff like parking tickets, and what have you in the small courts.
Those services were usually just based on NLP + simple decision trees, and people actually won their cases.
Of course, doing huge corporate contract disputes, IP disputes, M&A, and whatever will probably be out of question for a good while. Same with more serious criminal cases where the stakes are very high.
But I think there's potential for automating away less serious cases, especially where there's good structure.
And of course, it all depends on what kind of legal system one is situated in. Immediately I'd think that Civil Law would be easier for AI lawyers, as its inherent structure is a better fit for machine reasoning. So I'd expect to see more AI products start in Civil Law countries.
15155 4 hours ago [-]
> Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule.
The fact that Lexis and WestLaw have such an iron grip on the entirety of the US legal system is exactly why general LLMs are completely unequipped to be useful in this domain.
applicative 13 hours ago [-]
What the LLM cannot do is explain why it said what it said, when cross-examined. It simply hallucinates the best account of why someone would have said such a thing as it said, same as it can give a probable account of why someone else said something different. The question 'But why did you say this not that ...?' does not lead it to make explicit its grounds for what it said, but just to make a new more complicated statement.
U4E4 12 hours ago [-]
This is true in the naive case.
There are however LLM context building techniques that anchor completions in data structures that persist the structure of claims that support the conclusion contained in a completion. Lots of different patterns exist —organizing logic in language is a rich domain— but the one I’ve liked the most is something called a Claim Dependency Graph that models the relationships between atomic claims as graph edges.
There’s a whole suite of operations you can perform on these structures, and “reconstruct how you came to this conclusion” is absolutely one of them.
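For anyone wondering what that could look like in practice, here is a minimal sketch, not the actual technique from any particular library. The `ClaimGraph` class, its method names, and the example claims are all made up for illustration; the only idea taken from above is that atomic claims are nodes, "depends on" relationships are edges, and "reconstruct how you came to this conclusion" is a traversal.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimGraph:
    """Hypothetical claim dependency graph: nodes are atomic claims,
    edges point from a claim to the claims it directly depends on."""
    claims: dict[str, str] = field(default_factory=dict)
    supports: dict[str, list[str]] = field(default_factory=dict)

    def add(self, cid: str, text: str, depends_on: tuple[str, ...] = ()) -> None:
        # Supporting claims must already exist, which keeps the graph acyclic
        # as long as claims are only ever added once.
        for dep in depends_on:
            if dep not in self.claims:
                raise KeyError(f"unknown supporting claim: {dep}")
        self.claims[cid] = text
        self.supports[cid] = list(depends_on)

    def explain(self, cid: str) -> list[str]:
        """Return the ids of all claims behind `cid`, supporting claims first."""
        ordered: list[str] = []
        seen: set[str] = set()

        def visit(node: str) -> None:
            if node in seen:
                return
            seen.add(node)
            for dep in self.supports[node]:
                visit(dep)
            ordered.append(node)  # post-order: dependencies come before the claim

        visit(cid)
        return ordered


g = ClaimGraph()
g.add("c1", "The contract required payment within 30 days.")
g.add("c2", "No payment was received within 30 days.")
g.add("c3", "The defendant breached the contract.", depends_on=("c1", "c2"))
g.add("c4", "The plaintiff is based in another state.")  # unrelated to c3

print(g.explain("c3"))  # ['c1', 'c2', 'c3']
```

The point of the structure is that the explanation is read back from stored edges rather than regenerated, so asking twice gives the same answer.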
mdlman 12 hours ago [-]
I’d love to read more about these types of patterns. Do you have any recommendations?
A human has a motive that exists that frames the thought being expressed. An LLM is going to be creating a “de novo” thought in response to a line of questioning.
Paradigma11 3 hours ago [-]
Psychology has shown that a lot of those motives are just post hoc narratives, similar to LLM.
ashdksnndck 13 hours ago [-]
Same is probably true of humans. In a conversation, we often respond from instinct, then work backwards to a rationalization only when asked. For more considered thoughts, if we’re lucky, we can remember our “reasoning traces” but that’s as deep as our introspection goes. Unless we’re neuroscientists, we don’t even know how many neurons we have, let alone have any understanding of how they generate our thoughts. Motivated reasoning impairs our introspection further, and then dishonesty and communication errors prevent us from relaying the limited remaining information to each other.
Model interpretability work has advanced a lot. Arguably we already can explain AI decision-making better than human brains.
applicative 13 hours ago [-]
No, it happens in the immediate context, where e.g. we say 'No I meant Meredith Jones, not Meredith Smith' - and the possibility of this elaboration is actually part of ordinary communication. I did mean Meredith Jones, not Meredith Smith - thus the use of the past tense. The LLM will just give the best answer for what one might have meant, completely reopening calculation.
Nonsense, some of my friends are lawyers and they're able to give you consistent interpretations on why they think about a certain aspect of a law a certain way.
The whole thing is that they work with this the entire time, so they have a really consistent 'head model' of how things work and why and how considerations should be weighted/ordered/whatever. LLMs just do not have this, there's no consistent underlying reasoning (the 'reasoning' traces in LLMs are really inconsistent)
j45 13 hours ago [-]
LLMs hallucinate, because humans hallucinate.
Asking the LLM in a way where it annotates its sources can greatly increase the pattern matching to closely simulate logic, just like in humans.
I understand the question of why did you say this, not that, I have seen other ways of asking that which do not seem to trigger the LLMs over-response in the other direction.
latentsea 13 hours ago [-]
Humans hallucinate because they take shrooms or have schizophrenia.
applicative 13 hours ago [-]
No, the hallucination of its reasons follows immediately from the technique of probabilistic inference. You can see this in real time, just ask 'why did you use this word, not that word?' It is in the position of a desperate liar. All its responses are essentially 'rationalizations'
rockskon 12 hours ago [-]
I do question at what point AI could be useful as a teaching aid.
The quality of LLMs depends heavily on, among other things, how you word your questions.
Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.
piker 5 hours ago [-]
Having been a law student and practicing lawyer, it's clear to me that law professors aren't really representative of much if any part of private practice. Most of the things they think and reason about are quite theoretical and academic, and it doesn't surprise me that the models would regurgitate a more average response which most human graders would prefer.
That's the entire point, though!
The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't reveal much new information.
songting591 2 hours ago [-]
The interesting shift isn't whether AI beats law professors on tests; it's what happens to the value chain after that threshold is crossed.
When AI clears the knowledge bar in a domain, the remaining moat becomes trust, accountability, and local regulatory context. That's actually good news for niche SaaS builders targeting specific jurisdictions: the generic AI layer commoditizes, but the "AI + local compliance + human accountability" bundle still has real pricing power.
Curious whether anyone has seen this play out already in contract review or compliance tooling outside the US.
galaxyLogic 13 hours ago [-]
I'm going to need some legal help for my startup. But I can't pay much. So I figured I will ask AI all relevant questions, as well as forms filled etc. Perhaps even create a patent-application for me.
THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?" .
That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?
apparent 11 hours ago [-]
I think that within a few years, most lawyers will expect that clients will have run contracts through an LLM prior to sending them to outside counsel. Emails will be along the lines of:
Please see attached contract we received from [counterparty]. ChatGPT says blah, blah and blah should be revised. What do you think? Is there anything else that we should change?
galaxyLogic 10 hours ago [-]
Right. That will reduce workload for the lawyers. But will their fees then go down? I'm kinda worried that if I don't give them the LLM produced legal docs for review they will just use the LLM themselves and then charge me for the work the LLM did :-)
It's a bit like with doctors, you'll want a second opinion, if you can afford it.
apparent 9 hours ago [-]
TBD. Probably depends on whether what you're paying for is access to their lawyer-level LLM, which they would run it through, or for actual expertise.
Probably for important deals, detailed human review will be expected.
Maybe the real value-add will be the insertion of language that LLMs won't be able to figure out, but which will be favorable for the side that inserted them.
dlahoda 13 hours ago [-]
i use codex to do initial research and draft texts (in typst). i use files-output skill so that all research contexts are rendered into md files.
i do second phase on codex, by asking to download all pdfs and extract all text of laws it references. can repeat fully local research step.
after i ask gemini to find issues and criticize.
UPDATE: there are many legal skills on github to try, have not used any yet
galaxyLogic 13 hours ago [-]
Are you a lawyer yourself?
SomaticPirate 13 hours ago [-]
[flagged]
throw7 14 hours ago [-]
Oh, a "Human-Centered" study by AI lover:
Julian Nyarko
Professor of Law
Co-Chair Stanford Law AI Initiative
Senior Fellow, Stanford Institute for Human-Centered AI (HAI)
LOL!
weatherlite 7 hours ago [-]
It is important for society to understand it is not merely programmers and customer support who are at risk of losing their jobs. Clearly A.I can do much more than just program.
motbus3 5 hours ago [-]
As others pointed out, it kind of implies it surpasses professors, but reading more carefully it seems more like the mythos situation. There was a single professor or test that it surpasses.
Reading it makes me extremely suspicious of how cherry-picked this was
epicureanideal 12 hours ago [-]
One way to make legal services more affordable and accessible would be to put the burden of ensuring the AI legal services are accurate on a private-public partnership with the government.
If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.
Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.
Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.
francisdavey 2 hours ago [-]
I'm not a law lecturer. I spend most of my time wrangling contracts and advising about data law. But I did a stint of part-time work teaching a masters in law.
My experience then (this was back before "Attention Is All You Need", I hadn't met the output of generative models) was that students tended to produce work that did not have a proper thread of reasoning in it. There was a tendency to repeat things they had read but rehashed in various ways.
Reviewing some of their texts it was clear that much of the writing - by law tutors - was of the same kind. Much was incorrect. The fact that someone at some time had said a particular case was a proposition for something, meant that got repeated from book to book. Many authors simply didn't read their sources or check their references. Students repeated what they had been told incuriously.
Note: this was a graduate level course. Not wet behind the ears undergraduates.
The worst material was little potted notes produced for law students. Utterly awful material in most cases.
Anyway, when LLM's became a thing, a lot of what did not feel right about their output and many of their error patterns, reminded me of the experience of teaching masters' students.
One of the saving graces of English court room practice (when I did that sort of thing) was that judges would say to you "where does it say that?" in a case you cited. You had better have them all at your fingertips and know exactly where you had cited. That avoided a lot of hallucination.
Just a random remark which might be of interest.
scotty79 2 hours ago [-]
I'm curious what would be your take on the productions of this year's models.
aitchnyu 6 hours ago [-]
Tangential, is there a "test suite/CI" for AI writing legal documents? Long back in terms of AI progress, a lawyer filed something with hallucinated sources. Do new tools prevent this?
gaiagraphia 14 hours ago [-]
Incredible that the common people will be able to wrestle the right to rule of law away from the bloated legal caste, who have built themselves quite the moat.
The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.
hparadiz 3 hours ago [-]
The profession is walking into a court room 90 minutes late because you know the judge's work pattern then going "hey Mike, how are the kids" after 22 years in the same jurisdiction. Then the old boys haggle based on how much the lawyer is charging. You are basically paying for access to the social club. Better outcomes when part of the in-group of course.
gaiagraphia 2 hours ago [-]
Would like to plot attitudes to AI against parental incomes or inheritance. If your value derives from having contacts and access to gatekept materials, rather than pure technical expertise, you've got a lot to lose as the walls come crumbling down.
There was another thread about the impact of AI on maths, and one of the arguments was about peer review... Made me wonder whether the writer was more concerned about the established order and gates being upset, or whether there's actually a valid technical criticism.
mchl-mumo 9 hours ago [-]
16 is such a small number for what they phrase as an important finding. It really couldn't be much harder to coordinate with 100+ professors.
RataNova 6 hours ago [-]
I'd read this less as "AI replaces law professors" and more as "AI may be a surprisingly strong first-pass tutor, especially when the student knows enough to question it"
elnatro 9 hours ago [-]
When I see news pieces like this I wonder about the failures. Maybe the failure percentage is low but what happens if a bot gives bad counseling? Who is responsible then?
Attorneys will be using LLMs for convenience but they will not disappear, because there ultimately needs to be a human responsible for the decisions.
dguest 5 hours ago [-]
I'm not a lawyer, I program.
My understanding is that Civil Law (most of the world excluding UK, US, AU) is like a program: you feed it a situation, it outputs a decision, every once in a while you edit it.
Common Law (UK, US) isn't really a program, but you could stretch and say it's a state machine that has been running since the country started. Every interaction sets a new precedent and changes the state. But the programming analogy falls apart because no one in their right mind would design such a program.
LLMs might actually be the best example of such a program though: Common Law is basically one long chat with an LLM, hundreds of years long.
Before LLMs came along, a Common Law system seemed to have a finite time limit before it's co-opted by wealthy people with the resources to read the whole history. Now I think maybe we can push it a bit further.
But it's still a terrible program.
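To make the analogy concrete, here is a toy sketch of the two models. Everything in it (function names, the example situations and rulings) is invented for illustration, and it obviously ignores how real courts distinguish or overturn precedent: civil law is modelled as a pure lookup, common law as a state that every decided case feeds back into.

```python
def civil_law(statute: dict[str, str], situation: str) -> str:
    """Civil law as a pure function: same statute + same situation -> same decision."""
    return statute.get(situation, "no rule applies")


def decide(precedents: dict[str, str], case: tuple[str, str]) -> tuple[dict[str, str], str]:
    """Common law as one state-machine step.

    `case` is (situation, ruling the court would give if the question is novel).
    An existing precedent wins; otherwise the fresh ruling becomes precedent.
    Returns the new state and the ruling for this case.
    """
    situation, fresh_ruling = case
    ruling = precedents.get(situation, fresh_ruling)
    return {**precedents, situation: ruling}, ruling


def run(history: list[tuple[str, str]]) -> tuple[dict[str, str], list[str]]:
    """Replay the whole case history from an empty state."""
    state: dict[str, str] = {}
    rulings: list[str] = []
    for case in history:
        state, ruling = decide(state, case)
        rulings.append(ruling)
    return state, rulings


history = [
    ("late delivery", "damages"),
    ("late delivery", "no damages"),  # second court is bound by the first ruling
    ("defective goods", "refund"),
]
final_state, rulings = run(history)
print(rulings)  # ['damages', 'damages', 'refund']
```

The outcome of the second case depends on the order of the history, which is exactly the "you have to read the whole chat" problem.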
KnuthIsGod 13 hours ago [-]
In the hands of a domain expert, AI is useful.
In the hands of the naive, it is a foot gun.
I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...
Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.
It was only after I informed Gemini that I was using an NVMe with BTRFS that it finally produced a sane command.
eichi_uehara 13 hours ago [-]
I beat lawyers twice before generative AI even existed.
Recently I asked Gemini a few questions about personal conflicts in everyday life. It's often too conservative, with views too shallow for the problem. So I still handle human conflicts myself. I only outsource the templated stuff like routine chat replies or marketing copy, though it saves me a huge amount of time. People who quote AI in serious conflicts are too weak to handle them on their own.
airstrike 14 hours ago [-]
Yes, LLMs are great at search. That's not news.
gaiagraphia 4 hours ago [-]
Isn't "getting greater" the more accurate representation, though?
In 'critical' industries, the error rate is massively important, and if the quality of search is reaching an acceptable error rate, that's quite big news.
u1hcw9nx 3 hours ago [-]
After a quick look at the study details and statistics, it does not look very definitive one way or another.
I mean, LLMs do OK with tutoring, but it depends more on how unique the questions are, not how difficult they are.
Aperocky 13 hours ago [-]
> rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.
That's the problem, you never know when the 25% deliver a true stink bomb, and that's not considering prompting - while a fair prompt/question may be considered objective, it's very easy to stray.
wilg 14 hours ago [-]
> In a blind evaluation of nearly 3,000 anonymized comparisons, professors rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.
I wonder to what degree the AI was just better at communicating. My experience with attorneys is that they are often some of the worst writers.
applicative 13 hours ago [-]
The writing is always fluid and grammatically flawless. This carries much more weight with us than we believe. I know the illusion well from decades of grading college papers. Many of the highest quality students use English as a second language, and I know this, but an American well trained in writing, grammar, spelling always gives an impression of superiority. (Being well trained in writing, grammar, spelling etc is of course high merit, which is how the illusion forms - it is basically an illusion of global 'intelligence')
falcor84 14 hours ago [-]
Yeah, 75% win rate is a ~200 points Elo difference, which is quite massive.
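For reference, the conversion can be checked with the standard logistic Elo formula, where the expected score is 1 / (1 + 10^(-gap/400)). This is just a sketch of that arithmetic, with function names of my own choosing, and it treats the 75% as an expected score between two fixed "players":

```python
import math


def elo_gap_from_win_rate(p: float) -> float:
    """Rating gap implied by an expected score p under the logistic Elo model."""
    if not 0.0 < p < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(p / (1.0 - p))


def win_rate_from_elo_gap(gap: float) -> float:
    """Expected score for the stronger side given a rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


print(round(elo_gap_from_win_rate(0.75)))      # 191
print(round(win_rate_from_elo_gap(200.0), 2))  # 0.76
```

So 75% works out to about 191 points, and a full 200-point gap would correspond to roughly a 76% win rate.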
jshier 14 hours ago [-]
I do wish they'd used some more objective criteria. Simply being preferable is one of the things LLMs have been trained for since the beginning, hence their sycophantic nature.
adornKey 7 hours ago [-]
Maybe sycophantic nature is a good fit for the legal system. A successful lawyer once told me that the most important thing is to know your judge. Objectivity isn't a big thing in court. They'll cite random newspaper articles as evidence and throw out expert opinions - if they like. There might be a way to appeal - but that road often is not functional.
wilg 14 hours ago [-]
What criteria would you use for judging legal arguments?
mitkebes 14 hours ago [-]
The arguments need to be based on actual law, and any cited reference cases need to be real.
There's been a lot of news stories about lawyers using AI, and then getting in trouble for citing hallucinated laws or cases. It doesn't matter if the AI response is "preferred" over the human one if it gets thrown out when put under the scrutiny of a real case.
wilg 14 hours ago [-]
Who's gonna determine that? A bunch of law professors?
voxl 13 hours ago [-]
But did they? Or did they just go off what answer felt better? Did they put in any work to actually confirm the answer? Or did the busy law professors just click through and move on with their life?
mylifeandtimes 14 hours ago [-]
maybe seeing if the case law it cited was real or imagined? Just one idea, IANAL
gamerDude 14 hours ago [-]
Well, they had the data around if the answer would be harmful to the students learning. AI was scored at 3.5% harmful answers and 12% of law professor answers were considered harmful.
tipsytoad 7 hours ago [-]
Curious how they do a “blind” preference test. To any evaluator I’m sure it’s quite clear which answer is AI vs human.
Esophagus4 14 hours ago [-]
Yeah this could be interesting. A lot of the spotlight has been on “law firm stuff” like demand letters and writing contracts…
But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.
ares623 14 hours ago [-]
Would you take responsibility for missing details about local data retention requirements?
zuzululu 14 hours ago [-]
honestly if you just avoid EU and China
you can get away with anything
jedberg 14 hours ago [-]
California too.
applfanboysbgon 14 hours ago [-]
And with those three places listed you've ruled out literally 40% of the world economy. Great, you can ship your product in bumfuck Nebraska.
Esophagus4 14 hours ago [-]
Yes.
If the only purpose of asking a lawyer is transferring risk (aka cover your ass) while getting the same advice as an LLM, that’s slowing down delivery for purely bureaucratic reasons.
I’ve seen that mentality at big companies where everyone is scared to stick their neck out and be accountable for a decision. And nothing gets done. Drives me crazy.
But the people who move up are the people who take ownership and get shit done (and are right a lot).
(BTW, I have been at companies that were sued by regulators. They never really punish the individual(s) who were in the room when the decision is made. So your worry is kind of misplaced.)
himata4113 8 hours ago [-]
There is quite a simple solution for many of the problems described in the comments: Make drafting legal papers a defined interface.
If you think about it and extract the semantics of any law you get something that looks familiar, sort of like code. Of course there are some complexities where certain phrases can mean different things, but legal papers in a way are written like they're programming languages already especially when it comes to law.
First we would have to define a language that can handle ambiguous operations, and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two party dispute:
This is very simplified and pseudo like language, writing out a full contract would be as long as a real contract.
DEFINE DEFENDANT "A Corp"
DEFINE PLAINTIFF "B Corp"
DEFINE CONTRACT CONTRACT(PLAINTIFF, DEFENDANT, 3054-41-95)
// attaching extracted requirements, definitions and obligations of contract
FACT PLAINTIFF delivered(goods) ON 7054-34-99
FACT DEFENDANT paid(0) OF CONTRACT.amount
CLAIM breach WHEN obligation(DEFENDANT, "pay") IS NOT satisfied
PROVE breach:
  REQUIRE PLAINTIFF performed
  REQUIRE DEFENDANT.paid < CONTRACT.amount
  ASSERT delay WITHIN reasonable(time)
IF PROVE(breach):
  AWARD PLAINTIFF (CONTRACT.amount - DEFENDANT.paid) + interest()
ELSE:
  DISMISS
Then you would run a proof based LLM to generate it into target language and since we already had an example of this from one of the AI labs we know it works. Automatic citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence.
I am sure that many AI labs are working on something similar already and we will see something like that in the near future as proof based llms evolve.
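As a rough illustration of what the pseudo-language above would have to evaluate to, here is a minimal Python sketch of the same breach logic. None of this is a real legal DSL: the class and function names are hypothetical, the contract amount is made up, `interest` is passed in as a plain number because the original `interest()` is left undefined, and `ASSERT delay WITHIN reasonable(time)` is read (as an assumption) as one more boolean precondition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contract:
    plaintiff: str
    defendant: str
    amount: float


@dataclass(frozen=True)
class Facts:
    plaintiff_performed: bool  # FACT PLAINTIFF delivered(goods)
    defendant_paid: float      # FACT DEFENDANT paid(...)
    delay_reasonable: bool     # ASSERT delay WITHIN reasonable(time), assumed boolean


def prove_breach(contract: Contract, facts: Facts) -> bool:
    """PROVE breach: every REQUIRE/ASSERT line must hold."""
    return (
        facts.plaintiff_performed
        and facts.defendant_paid < contract.amount
        and facts.delay_reasonable
    )


def resolve(contract: Contract, facts: Facts,
            interest: float = 0.0) -> tuple[str, Optional[str], float]:
    """IF PROVE(breach): AWARD ... ELSE: DISMISS."""
    if prove_breach(contract, facts):
        owed = (contract.amount - facts.defendant_paid) + interest
        return ("AWARD", contract.plaintiff, owed)
    return ("DISMISS", None, 0.0)


contract = Contract(plaintiff="B Corp", defendant="A Corp", amount=10_000.0)
facts = Facts(plaintiff_performed=True, defendant_paid=0.0, delay_reasonable=True)
print(resolve(contract, facts))  # ('AWARD', 'B Corp', 10000.0)
```

The hard part is of course not this evaluation step but deciding whether each boolean fact is actually true, which is where the interpretation lives.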
vessenes 11 hours ago [-]
* Gemini 2.5 Pro (no outside resources), and
* NotebookLM (not versioned -- with added legal resources).
NotebookLM was considered slightly better than 2.5 Pro by the evaluators.
teiferer 9 hours ago [-]
Question is: if a legal question is answered incorrectly by an LLM, who is going to be held responsible?
king_zee 15 hours ago [-]
I think there will be a market for firms that aggressively market themselves as non-AI, and then as more people turn towards that human connection we'll go full circle
rayiner 14 hours ago [-]
Nobody wants to pay their lawyers more than they have to. There will be a huge market for firms that can use AI to avoid charging clients for $1,000/hour junior associates.
zuzululu 14 hours ago [-]
that worked out for artists and translators right ?
citizenpaul 14 hours ago [-]
If you want human connection the legal system is not where you are going to find it, period.
I don't think there will be any such market for "non ai" law. If I'm involved with the legal system I just want out as quick as possible as cheap as possible.
applfanboysbgon 14 hours ago [-]
Bad legal advice will keep you dealing with the legal system for much longer and at much greater cost. Something being cheap and quick upfront doesn't mean it will be cheap and quick by the end of the process.
Esophagus4 14 hours ago [-]
But isn’t this study saying that the legal advice could actually be better with AI?
A bit of extrapolation from the study, but not a crazy stretch.
applfanboysbgon 14 hours ago [-]
Maybe, although I would be extremely hesitant to extrapolate from this one study and trust my legal life to an LLM. One thing that's worth noting, though, is that regardless of the quality of objective legal advice in the abstract, for a lot of smaller scale stuff the human connection actually is literally what is important. There are ambiguities in the law, which are not resolved deterministically but rather at the individual discretion of judges. Your lawyer, if they're any good at their job, knows the local judges and how they're likely to rule for given circumstances, which can influence their legal advice to you specifically.
Esophagus4 14 hours ago [-]
Fair.
But I could also see a world where that, too, is fed to models for hyper-local results.
Could be a way off, but I could see it.
zuzululu 12 hours ago [-]
I think you are ignoring that there are bad lawyers and they give bad legal advice too
Even the good ones will not step above and beyond what they are paid to do
but an AI ? it will and can go above and beyond
atleastoptimal 9 hours ago [-]
And this was done with Gemini 2.5
By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.
iLoveOncall 5 hours ago [-]
The title of the study "Law Professors Prefer AI Over Peer Answers" is VERY different from the title on HackerNews. This is completely clickbait at this point.
tj_hustler_1966 1 hours ago [-]
interesting
lp4v4n 6 hours ago [-]
Honestly it's not surprising that AI provided answers that were flagged less often as "pedagogically harmful" if we take into account that somehow LLMs create an "average" of all knowledge they ingested.
homeonthemtn 14 hours ago [-]
Personally I think this is very good. One of the hardest things out there is maintaining a society in the face of changing times and it's because law is dense and slow.
I think, in the right hands, this could be huge.
wholinator2 14 hours ago [-]
It turns out everybody has at least one right hand, even the people we trust the least.
gamblor956 10 hours ago [-]
While they provided the questions that professors and LLMs were asked to respond to, they don't include any of the answers from either the humans or the LLMs, so there's no way to independently verify that the LLMs actually returned "better" answers.
Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.
Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.
Thaxll 14 hours ago [-]
AI will never convince a jury though.
jojobas 14 hours ago [-]
A couple of acting classes might be cheaper than a lawyer, then you can go all out representing yourself.
xyzal 7 hours ago [-]
This contradicts my anecdata.
Recently, I tasked Opus 4.6 to study a new Czech building permit law in conjunction with some waste disposal regulations and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the fulltext of the new law. The usual "you are totally right" also applied and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.
I ended with studying the relevant regulations myself over the weekend.
cess11 8 hours ago [-]
I skimmed portions of the study but didn't manage to figure out whether this actually measures a preference for confident mediocrity.
Eufrat 7 hours ago [-]
What is the point of this conclusion? That law professors like the tone and verbosity of AI slop? Okay?
Leptonmaniac 5 hours ago [-]
I had a similar thought. What if the result, statistical and significance critique aside, mostly means that when it comes to first-year tutoring of law students, the vibe, tone and overall presentation of arguments weighs a lot, maybe even more than the factual arguments themselves?
In such a framing I don't find it surprising at all that teachers prefer the more polished answers generated by AI, because if LLMs are good at one thing, it is being confident in whatever they generate and presenting it convincingly.
t0lo 14 hours ago [-]
Library outperforms student... more news at 9
apparent 11 hours ago [-]
Except the library outperformed the professors, which is quite a bit more impressive.
lern_too_spel 11 hours ago [-]
This was an open book test. The real problem with this study is that winning the most head-to-head preference tests is not the right metric. It doesn't much matter if two answers are right, and one is written a little better than the other. It matters quite a lot if one answer is right and another is wrong.
The authors point out that this other metric was computed in prior work and incorrectly dismiss it as being not as good as winning percentage in head to head competitions. The cited prior work shows that the models fare poorly on that metric. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938
34981t 15 hours ago [-]
He is basically an AI professor for law. This study just confirms his existence:
Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.
signatoremo 14 hours ago [-]
This is the state of HN. Created new account. Accused without evidence. Emotional clickbait.
vessenes 11 hours ago [-]
I vibe coded hn10k earlier this year. You could choose to see pages with comments only started by 1k+, 10k+ or 100k+ karma contributors. I'm too lazy to keep it up, but I found 1k and 10k both to be better experiences than "vanilla".
flanked-evergl 9 hours ago [-]
...
frwrfwrfeefwf 9 hours ago [-]
they'll embed it in the weights so it can't be jailbroken
rimliu 10 hours ago [-]
Yes yes, the IPO is near.
dfilppi 6 minutes ago [-]
[dead]
infoinlet 5 hours ago [-]
[flagged]
charliewang0322 9 hours ago [-]
[dead]
steele 14 hours ago [-]
[flagged]
jimbokun 14 hours ago [-]
[flagged]
jatora 14 hours ago [-]
definitely not needed if you're in the middle-man slime trades (law)
jimbokun 14 hours ago [-]
In an advanced economy everyone’s the middle man for something. We’re not self sustaining agrarian farmers anymore.
zuzululu 14 hours ago [-]
what do you think software devs do all day
jatora 11 hours ago [-]
create value and utility without purposely gatekeeping and hamstringing society
Waterluvian 14 hours ago [-]
the memes were nice tho
fgh_ask 14 hours ago [-]
[flagged]
maxbond 14 hours ago [-]
Just so you know, I have nothing to do with Stanford, but I am flagging this as conspiratorial nonsense. So when your comment is flagged, I just want you to know that it doesn't confirm your belief, it's just that this comment harms discussion and so must be removed.
hoppyhoppy2 14 hours ago [-]
>Don't feed egregious comments by replying; flag them instead. If you flag, please don't also comment that you did.
Yes, mea culpa. Occasionally I break that rule on my own judgement. Feel free to flag my comment. (I think it's important to disconfirm conspiracy theories.)
thin_carapace 14 hours ago [-]
for what it's worth I have no idea why it would be nonsense to question institutional motivations especially in the context of an academic article that could easily be corporate propaganda, I also think that shutting conversations down is much more harmful than discussing topics that are potentially harmful
maxbond 13 hours ago [-]
Completely unevidenced conspiracy theories can only harm the discussion. The only possible benefit is to disconfirm conspiracy theories and discourage paranoid thinking. The odds that Stanford as an institution is astroturfing on HN round down to 0.
What they're almost certainly observing is that these critical comments are being flagged as inappropriate. People make inappropriate comments that happen to contain criticism all the time, and I frequently see people edit them to declare that they were flagged because the group they're criticizing is astroturfing. It's virtually never the case. I've never seen it happen.
But to be clear I am completely ambivalent on Stanford and if you want to criticize them, more power to you.
thin_carapace 13 hours ago [-]
may I ask why you effectively said 'conversation over due to harm reasons' instead of asking for evidence to support the conclusion that you believe is not possible? I don't see why it is inherently harmful to discuss the seemingly impossible. I also don't see why it's relevant to bring up your n=1 sample (although it is as relevant as my n=1 sample, which has plenty of astroturfing witnessing [unspecific to Stanford])
maxbond 12 hours ago [-]
[dead]
19skitsch 14 hours ago [-]
uh alright buddy
14 hours ago [-]
aetq51 14 hours ago [-]
[flagged]
rfw300 14 hours ago [-]
A law professor studying AI has an affiliation with the center at their university that studies applications of AI? Scandalous!
dang 14 hours ago [-]
Would you please stop creating accounts to post this?
wilg 14 hours ago [-]
You're suspicious that the person doing academic research on how AI applies to law has a job related to research on law and AI?
runarberg 14 hours ago [-]
You are not? It is at least worth investigating how much this professor benefits from AI companies. In fact this is HN. Let me come back to you in about 10 minutes.
EDIT: 10 min later. I give up. I tried to find who is funding HAI, and came up empty-handed. Usually you can see that in their yearly reports, but no such luck for me. I know Google and Bill Gates are big donors, so take that as you will.
ares623 14 hours ago [-]
Running out of IPO juice. Each bump is less effective and lasts shorter.
t0lo 14 hours ago [-]
More great news from the prestigious university where 40% of students claim they are disabled
Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.
I'm getting more convinced. I mean, sure it makes dumb mistakes sometimes but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obv don't want this behavior but I wouldn't say it's dumb.
It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.
How does everyone else feel?
acdha 14 hours ago [-]
> Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.
He stands to make billions if enough people believe him — unless you also do, consider that you’re the mark. For example, if that was true, it would have to mean that AI companies either aren’t letting customers use the good models or are instructing them to frequently make errors which reveal a fundamental lack of reasoning ability.
Consider also that his wealth means he hasn’t had to defend an idea stringently since the 90s. I wouldn’t be surprised if he does think LLMs give deep answers because it often looks that way until you critically review the response and ask questions like what’s missing which require you to have a decent understanding of the problem domain.
threethirtytwo 2 hours ago [-]
And you stand to lose your job and your identity as a programmer.
He makes billions but he already is a billionaire. Gaining billions more doesn't mean shit. The guy really has nothing to lose and the utility of what he gains contributes little to his lifestyle.
I will tell you this. HN has been comically wrong about everything related to AI. They said driverless cars have no chance of becoming usable. Now Tesla FSD is almost there and I sleep in Waymo cars. HN said AI will never code, now everyone uses it to code.
It's fucking stupid. This is one of the smartest forums on the internet but HN becomes next to stupid when predicting AI. Why? Because humans can't face the truth. When the victim of attack is yourself, it doesn't matter how smart you are... you have to scaffold a rationalization to spare yourself as the victim. You have to lie to yourself and tell yourself that you matter.
The truth of it is, while LLMs are not the end game, AI in general is on a trajectory to take over. It shows us how meaningless our skills are... not only as programmers but as artists. That beautiful song you felt had greater meaning? It's all reproducible via an algorithm because it never really had a greater meaning. It was just a pattern.
coldtea 3 hours ago [-]
>Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.
He has access to employees and yes-men. What he actually needs to hear, nobody will tell him, AI even less so. Every shit idea he has, would be "what a bright idea"-ed by both everyone around him and AI.
And of course there's the little matter that he makes money and increases his power by selling AI. What seller doesn't promote their stuff as the greatest ever?
moregrist 14 hours ago [-]
Marc Andreessen has a strong financial incentive to feel this way and to convince others to feel this way.
I also think it’s easy to think that AI gives good answers if you don’t know the field well. In fields where I know the material, the answers are pretty variable and can be quite bad.
threethirtytwo 2 hours ago [-]
HNers have strong incentive to feel the opposite. Humanity in general has strong incentive to feel the opposite.
AI is not only replacing programmers, but art and the meaning of being human itself. It's showing us how trivial all of human creation is as it's just patterns from an algorithm.
paulmist 14 hours ago [-]
Knowing the question is half of the answer. LLMs are great at scoping your context and answering precisely what you asked; it's also why they go off the rails when they misunderstand a part of your question. Incidentally, they're great at "knowing" and reaching for knowledge.
Humans have the advantage of perspective. We always lack some knowledge and answer broadly. This is bad if you have a particular goal in mind, but better if you're just generally learning, because you see more and learn to discriminate the correct from the wrong. And most importantly, being wrong is part of human ingenuity - because sometimes we turn something "obviously" wrong into something right.
scottfalconer 14 hours ago [-]
Getting the right answers is just half of it, you need to know the right questions to ask. I haven't yet seen AI crack that one.
foolserrandboy 14 hours ago [-]
He would tell you NFTs were AGIs if it might get you to buy them.
rvz 14 hours ago [-]
> Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.
Investor with vested interest in AI companies makes claim of reaching "AGI".
He is one of the last people to listen to about AGI. Unless the term "AGI" means something entirely different to him vs to independent researchers vs to CEOs, since the term has become entirely meaningless.
12AHg 14 hours ago [-]
[flagged]
futuraperdita 14 hours ago [-]
I’m not an AI stan by any means and certainly no fan of Andreessen, but using the term “clanker” immediately biases your statement and can discredit what is a well-referenced or well-meaning comment.
IFC_LLC 3 hours ago [-]
This is exactly what LLMs are designed to do. Double up a lot of data and find connections and patterns in it.
So no wonder on this point.
One thing I want to mention: Law != Justice.
So while LLMs are awesome at the law study they will suck at justice. Just because one has to solve very emotional problems with it at times. And LLMs are not that good at finding the correct emotion.
coldtea 3 hours ago [-]
Also because their reasoning is just a statistical model of whatever they've been fed. No experience of pain, humility, human connection, etc in this.
...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...
This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.
IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.
And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
I help run a few marketing websites where I let the CEOs run crazy with Claude cowork. They are making PRs like madmen, but they are not allowed to touch any of the APIs & platforms where there is real user data & sensitive information.
For example, my sister is a translator and she says that checking AI translations is actually harder in many ways than doing a translation in the first place, but the agencies pay less for checking than actual translation.
But it depends on the skill:
- For landing pages & simple SaaS solutions: marketeers & founders have more skill, since they understand the user best. The real skill is not the basic coding, but understanding the market.
- For security risks/architecture: senior devs can spot things in seconds
I'm not a doctor or lawyer, but I'm sure there are cases where AI is really good in a similar way and cases where they miss the most crucial aspects.
I mean that's what is wanted by some companies.
The problem, especially for things like legal is that it requires someone more skilled to read through and understand that the argument is bollocks, or the law/precedent they are banking on is in fact the right one.
We have a tool that auto-writes letters to our management companies when they break SLAs. We have a slider that goes from polite to we are going to extract your first born.
That's simple-ish to do for LLMs, and low risk.
Drafting contracts is also something we could probably do, as it's mostly boilerplate. However, the consequence of mis-drafting a contract can be millions of dollars.
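For what it's worth, the tone slider part of a letter tool like that could look roughly like the sketch below. Everything here is hypothetical (the tone labels, `tone_for`, `build_prompt`, and the prompt wording are made up for illustration, not the actual tool), and the resulting string would still have to be sent to whatever LLM you use.

```python
# Hypothetical tone labels, ordered from the polite end of the slider
# to the aggressive end. Not taken from any real tool.
TONES = ["polite", "firm", "stern", "threatening legal action"]


def tone_for(slider: float) -> str:
    """Map a slider value in [0, 1] to one of the tone labels."""
    clamped = min(max(slider, 0.0), 1.0)
    # Scale to an index; the min() keeps slider == 1.0 inside the list.
    index = min(int(clamped * len(TONES)), len(TONES) - 1)
    return TONES[index]


def build_prompt(slider: float, company: str, breach: str) -> str:
    """Build the instruction text that would be passed to an LLM."""
    return (
        f"Write a {tone_for(slider)} letter to {company} "
        f"about the following SLA breach: {breach}"
    )


print(build_prompt(0.1, "Example Management Ltd", "lift out of service for 10 days"))
```

The low-risk part is that a human reads the letter before it goes out, and the worst case is an awkwardly worded complaint.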
IDK "not any of it" seems a bit strong, especially thinking towards 2028. For a lot of knowledge professions, there are a surprising number of tasks that are just dumb work compared to the rest.
Out of curiosity, why would you love to be wrong about that? What possible outcome could you see being a net positive for society if the vast majority of knowledge workers (and ultimately, as robotics progress, most workers in general) are replaced by AI?
I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing. Of course it would be good! It's just a matter of dealing with the (huge) side effect of lost income.
UBI doesn't need to be impossible to make me want a job. If UBI was equal to the US federal min wage x 40 hours a week, I'd desperately want a job.
It is not hard for me to imagine a world where if my bosses didn't need me, they would prefer me to be dead than to pay me some kind of permanent income. They would prefer to keep that power to themselves
These are already the sort of people who will happily lay you off into a recession, leave you without a way to pay your rent or for food if it improves their bottom line. They do not care if you starve. Or at least they care less than they do about their quarterly bonus
So no, I don't trust these fucks to continue playing nice if they view my value as going to zero
Which also happens with humans – does it do so at a lower rate? On its own, it kind of sounds like similar anti-self-driving-car arguments.
I agree that you can create a set of domain specific rules and reinforcement layer validation tools, like self driving, that vastly improve the accuracy of AI & LLMs, making humans less and less needed. But where the magic of LLMs comes from generic knowledge, this will be the opposite, narrowing it down.
It's not like self driving cars where better than a human 80% of the time isn't good enough and they aren't really usable until it's 95%, 99% etc.
With that kind of logic ... anything is possible.
The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.
That isn’t even remotely what this study is looking at.
There’s also the fact that they can’t possibly keep improving frontier models at the same rate (i.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.
In my opinion, the main thing we need to do is have training happen continuously. And probably more real world data (from sensors).
Every new model might not be a leap like it used to be, but give it enough time and improvements add up.
The further we get into this, the more AI feels like 3-D printing. Significantly bigger and will be more widely used for sure. But nowhere near the “new industrial revolution” that all these companies are making it out to be
Ultimately they are clearly here to stay but I think they are going to be incredibly important in some industries and minimally present in others (a glorified chatbot/summarizing tool for instance). Whatever form it takes it’s definitely not going to be a model where individuals have subscriptions they pay for monthly.
exactly my point to compare it with pre-iPhone mobile market: wide (and growing fast!) adoption, clear potential (WAP websites, J2ME games), many players in the game, some real market fit discovered already (Blackberry), influx of capital and tinkerers alike, but still a lot of unknowns where it will ultimately land.
Even if no single improvement was revolutionary (even first iPhone was just a fancy phone without App Store), overall mobile made billion dollar industries possible, for better or worse, and changed the way we live. Counts as industrial revolution, comparable to the Internet itself in my eyes.
But it might be that the optimization target itself has a ceiling. If you're training toward human approval ratings from a broad population, you converge toward what median preference selects for. The plateau is baked into what you're measuring against.
Context is still a large limiting factor, and we have band aids around that area already. And the further along we go the further distributed LLMs get in terms of additional pieces.
As for the original article and sentiment I'm sure AI will be a boon for law. It's going to be much easier for the general consumer / person / small business to represent themselves which feels like a win. The downside is I feel like we're tracking towards a digital hell of "virtual lawyers" that will be at the whim of any org. Consumer laws really need to change now to help avoid this dystopian path we're on.
so extrapolating from that, in another two years it will continue to bamboozle
> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
But is it a surprise law professors aren't great statisticians?
If you have 100 responses from 1 professor, and the AI wins 75% of the time that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.
Further, if you sample 16 profs and the AI beats 10 of them you can be fairly certain that the real percentage of profs it beats isn't 10%. Further, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt N. If you have a coin and it lands heads up 16 times, that tells you something quite robust about the coin.
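To put rough numbers on those two claims, here is a minimal sketch using only the standard library. It assumes each comparison (or each professor) is an independent trial with a fixed success probability, which is a simplification the real study may not satisfy; `binom_tail` is a helper name made up for this example.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One professor: the AI wins 75 of 100 comparisons.
# How likely is that if the AI were really no better than a coin flip?
p_one_prof = binom_tail(75, 100, 0.5)

# Sixteen professors: the AI beats 10 of them.
# How likely is that if it truly beat only 10% of professors?
p_ten_of_sixteen = binom_tail(10, 16, 0.1)

print(f"75/100 under p=0.5: {p_one_prof:.2e}")
print(f"10/16 under p=0.1:  {p_ten_of_sixteen:.2e}")
```

Both tail probabilities come out far below one in a hundred thousand, which is the sense in which a small N can still rule out some hypotheses quite firmly.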
Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.
A good heuristic is "add 2 successes and 2 failures", which is due to Agresti & Coull.
See down the page here for source papers:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
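A minimal sketch of the "add 2 successes and 2 failures" heuristic, assuming an approximate 95% interval (z = 1.96) and independent trials. The function name `plus_four_interval` is made up for this example.

```python
from math import sqrt


def plus_four_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a binomial proportion.

    Adds 2 successes and 2 failures before applying the usual
    normal-approximation formula, then clips the result to [0, 1].
    """
    n_adj = n + 4
    p_adj = (successes + 2) / n_adj
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# The coin that lands heads 16 times out of 16.
print(plus_four_interval(16, 16))

# The AI beating 10 of 16 sampled professors.
print(plus_four_interval(10, 16))
```

The point of the adjustment is that the naive formula gives a zero-width interval at 16/16, while this one still admits the true rate could be noticeably below 100%.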
So your alternative is to not have any studies and everyone can just stump up anecdata as "evidence" for the capabilities of these models?
Regardless, your assertion that "oh well, the models will be totally different in a few months anyway, therefore any study done today is pointless" seems more than a stretch. How do you know they will be so different? How can you verify that today's studies are completely irrelevant?
Do you doubt that educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.
> They calibrated AI responses to match the length and structure of human answers
which I would guess removes AI's hallucinations and errors somewhat.
There are certain areas of law work that are about analyzing large amounts of texts, drawing conclusions and writing other texts based on that and nothing more. That is literally the bread of LLMs.
Those types of lawyers should be the first in line for unemployment, not programmers, not even close.
You can execute the logic, and set up loops from the output. You can set up more useful RL. It's easier to generate synthetic training data. It naturally supports tool use and agent parallelism. It's easier to integrate with APIs (with what few APIs the court systems provide). Programming explicitly encodes abstractions at the function, module levels etc that are easier to KG/reason/build upon than text chunks.
Source: AAL.
AI is like a scab on a wound: it's a temporary filler, it rushes in to fill a void, but it's not going to be the final solution.
Models showed us that there was huuuge unmet demand for literacy, both in software and in law. But now we have a choice to either address the systemic causes of the unmet demand, or just try to paper over them with layers and layers of AI scab.
Yeah, but in my experience it won't come down to "which is the better solution" but "which is cheaper/easier"
So I look forward to lots of layers of papered over AI scabs in the future. It won't be cheaper in the long run, but it will pump someone's quarterly numbers enough that they get a promotion before the problem they introduce comes back to them
It's not about what LLMs can or are suited to do. This study shows strengths of what's already in them, innately.
The same could be said about programming. Or if you want to be even more reductive, looking at a screen and pressing buttons to make the correct lights light up https://xkcd.com/722/
But in my comment it is literally what some subset of lawyers do.
Literally is much more tangible and risky in terms of real impact on employment etc.
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know. Our legal system is notoriously rife with footguns.
Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from LexisNexis/Westlaw are probably better than the general purpose ones.
LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.
Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.
Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?
[1] https://www.legifrance.gouv.fr/
[2] https://legal.thomsonreuters.com/en/westlaw/plans-and-pricin...
The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient don't you think?
I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.
For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.
(If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)
The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.
I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.
Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable. So it can save you some work; but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.
But they can perform live websearches or go directly to a DB specified.
I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely to make the problem worse, or install something sketchy from the ad-riddled results, than I am, or one of my help desk employees is.
I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.
I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having over confidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
Yet that is exactly what a lot of C-Suiters (many of whom are lawyers), are doing.
Mixing them is not, in my experience, OK. In the future, I am sure that LLMs will reach the point where their output will be beyond reproach, but we're not there yet.
That means that someone that knows the context and content, needs to vet the output, before sending it on.
i think devs overestimate their own role and underestimate others
i am seeing lawyers and doctors roll out their own software with AI
but we don't have their training and experience
I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).
So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.
But experts might disagree what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.
So what are we supposed to train the LLMs towards? Somebody still has to decide what "good" is.
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
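A very rough sketch of what such a "unit test" could look like. Everything here is an assumption for illustration: `KNOWN_CITATIONS` is a tiny in-memory stand-in for an authoritative database that does not exist in this form, and the regex only handles simple "volume U.S. page" citations, nothing like the real variety of citation formats. The third citation in the sample draft is deliberately made up.

```python
import re

# Hypothetical stand-in for an authoritative, freely searchable citation database.
KNOWN_CITATIONS = {"347 U.S. 483", "410 U.S. 113"}

# Simplified pattern: volume number, "U.S." reporter, first page.
CITATION_RE = re.compile(r"\b(\d{1,3}) U\.S\. (\d{1,4})\b")


def unverified_citations(draft: str, known: set[str]) -> list[str]:
    """Return every citation in the draft that is not in the known set."""
    found = [f"{vol} U.S. {page}" for vol, page in CITATION_RE.findall(draft)]
    return [cite for cite in found if cite not in known]


draft = "See 347 U.S. 483; but compare 999 U.S. 999."
missing = unverified_citations(draft, KNOWN_CITATIONS)
print(missing)
```

Note that this only catches citations that do not exist. A citation that is real but from an irrelevant area of law would pass this check, so it does not replace actually reading the case.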
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.
Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.
And in my experience if you do actually pay a lawyer for something they will act like you're not worth their time and will literally roll their eyes at you when you're trying to explain the minor details of a case because they are too lazy to listen and zone in like I would when doing my job.
Median household net worth is in fact somewhere in the $100k-200k range, which is definitely something that could be meaningfully called an "estate." (Most of this tends to be the house, the median net equity in which is about $190k as of 2022).
Source: https://www2.census.gov/library/publications/2024/demo/p70br...
[1] This doesn't mean "homeowners," rather it's a recognition that assets for married or cohabitating couples are usually commingled.
An "estate" is a legal term for property, assets, and liabilities a person leaves behind upon their death. A family member is a top practitioner in the field of estate planning and resolution, and some of the messiest estates they have handled are pro-bono cases of exactly the type of people you would put in italicized "most people": poor, not really able to upkeep a house they inherited from a relative which hadn't had title properly transferred on a previous death because they didn't have money for an attny, now can't get a loan to fix the roof...
Yeah, if you are homeless, carless, and have only the clothes on your back and a shopping cart of stuff, you don't have an estate. Everyone in the middle class in the US has an estate. Much of the time it passes automatically to their spouse on death, but it's still an estate.
And if you are concerned about where it goes, get a GOOD attny. There are many bad ones hanging out their shingle as "Trust & Estate" attnys, and some of the next messiest cases are fixing problems made by those not-so-good attnys.
And NO, AI is not good enough.
Believe it or not...
A lot can go wrong if you have real life human lawyers draft a legal document.
One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.
Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.
The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot, and the details were not that easy to find. This was pre-AI, which would have helped walk through his shit.
e.g., https://www.npr.org/2026/04/03/nx-s1-5761454/penalties-stack...
I think on the contrary, LLM providers accumulate huge logs of interaction with their users, which elicit that tacit knowledge and mine it and humans cooperate willingly in order to solve their tasks. Just imagine the corpus of sessions for scientific research, education or software development, it is probably the largest such collection ever to exist. Trillions of HITL tokens per day flow into those logs, carrying our perspectives, choices, original ideas and tacit knowledge. I call this the "human-AI experience flywheel". It's the new stackoverflow, next model generation is based on interaction data from previous one.
My favorite example of this is knowing how to untangle a big pile of cables. There are robots now which can untie a single knotted cable, but I don't think any can do a pile of cables yet. https://www.youtube.com/watch?v=vp-94rsherE
can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"
Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.
I don’t know if that’ll be true for long. I just had my colleague who’s a very competent engineer IMO hand me a frontier model vibed PR to review (after reviewing it himself, he claims) which contained random variable assignments, conditionals that do nothing, etc. He’d never do such a thing before. People become too comfortable and get confirmation bias as well.
Tell me you've never been the executor of an estate in the United States without telling me.
Or worse, use historical data to determine the laws of today.
TL;DR It's never a good idea and it will bite you.
1. https://finance.yahoo.com/news/valve-wins-trial-against-pate...
In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.
The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.
Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.
On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.
I think it indicates that LLMs are smart enough to be used in the context of law education.
However, the waves are starting and they ARE going to be huge. Corporate clients are insisting on AI. They don’t want to pay for an associate's hours to draft anything to be reviewed by a partner. They want a top partner to use AI and just proofread.
This is a pretty limited introductory course based on what it says in the methods of the paper itself.
EDIT: just found out that Google is a major donor to HAI. So this research is at least partially funded by Google. Which is probably the reason the authors fail to declare no conflict of interest.
But, it makes me wonder, will clients be able to use these AI-attorney systems in the future, in the court. Where they basically either just parrot what the model is instructing them to do, or - I dunno - give the model permission to speak for them (while waiving liabilities).
I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.
One wrong advice clump and, like a step onto the wrong path while hiking, all subsequent steps go in the wrong direction. And sycophancy tuning means marginal one-sided takes get presented as sure-fire things.
I’m of the opinion that the big wins aren’t in using the LLMs to do the work (legal, in this case), but rather to refine and improve the dialog and presentation from all parties. A court-centric LLM could give likely procedural needs to a litigant, and a law-firm-centric LLM could help a pro se litigant create a meaningful and refined set of questions for lawyer consideration, condensed and targeted, saving all parties time and confusion while meeting the client's linguistic needs ‘where they are’.
All the lawyers know things LLMs never will: the law is interpreted, and the written part isn’t engineering grade facts but suggestions interpreted in context. Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule. But as the law stands, even with federal statutes and citations from the court's website, practicing lawyers will frequently end up explaining that in this county/country/court/jurisdiction The Way of Things is different.
Those services were usually just based on NLP + simple decision trees, and people actually won their cases.
Of course, doing huge corporate contract disputes, IP disputes, M&A, and whatever will probably be out of the question for a good while. Same with more serious criminal cases where the stakes are very high.
But I think there's potential for automating away less serious cases, especially where there's good structure.
And of course, it all depends on what kind of legal system one is situated in. Immediately I'd think that Civil Law would be easier for AI lawyers, as its inherent structure is a better fit for machine reasoning. So I'd expect to see more AI products start in Civil Law countries.
The fact that Lexis and WestLaw have such an iron grip on the entirety of the US legal system is exactly why general LLMs are completely unequipped to be useful in this domain.
There are however LLM context building techniques that anchor completions in data structures that persist the structure of claims that support the conclusion contained in a completion. Lots of different patterns exist —organizing logic in language is a rich domain— but the one I’ve liked the most is something called a Claim Dependency Graph that models the relationships between atomic claims as graph edges.
There’s a whole suite of operations you can perform on these structures, and “reconstruct how you came to this conclusion” is absolutely one of them.
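As a minimal sketch of the idea, assuming nothing about any particular implementation: claims are nodes, "depends on" relationships are edges, and reconstructing a conclusion is a depth-first walk that lists supporting claims before the claims they support. The `ClaimGraph` class and the lease example are hypothetical, made up for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class ClaimGraph:
    claims: dict = field(default_factory=dict)    # claim id -> claim text
    supports: dict = field(default_factory=dict)  # claim id -> ids it depends on

    def add(self, cid, text, depends_on=()):
        self.claims[cid] = text
        self.supports[cid] = list(depends_on)

    def explain(self, cid):
        """Return claim ids in dependency order, ending with cid itself."""
        ordered, seen = [], set()

        def visit(node):
            if node in seen:  # also guards against cycles
                return
            seen.add(node)
            for dep in self.supports.get(node, []):
                visit(dep)
            ordered.append(node)

        visit(cid)
        return ordered


g = ClaimGraph()
g.add("c1", "The lease requires 30 days' notice.")
g.add("c2", "Notice was given 10 days before move-out.")
g.add("c3", "Notice shorter than the lease requires is insufficient.", ["c1"])
g.add("c4", "The notice was insufficient.", ["c2", "c3"])

for cid in g.explain("c4"):
    print(cid, g.claims[cid])
```

The point of keeping the structure outside the completion is that the chain of support can be replayed and checked claim by claim, rather than asking the model to narrate its reasoning after the fact.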
Model interpretability work has advanced a lot. Arguably we already can explain AI decision-making better than human brains.
The point is familiar but there are good illustrations in the Atlantic article by a book editor. At first it seems abstract AI hate, but then she gets to the details. AI text cannot be edited. https://www.theatlantic.com/technology/2026/05/how-to-tell-a... or https://archive.ph/YJsGK
Asking the LLM in a way that makes it annotate its sources can greatly improve the pattern matching, so that it closely simulates logic, just like in humans.
I understand the question of "why did you say this, not that?" I have seen other ways of asking it which do not seem to trigger the LLM's over-response in the other direction.
The quality of LLMs depends heavily on, among other things, how you word your questions.
Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.
That's the entire point, though!
The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't reveal much new information.
When AI clears the knowledge bar in a domain, the remaining moat becomes trust, accountability, and local regulatory context. That's actually good news for niche SaaS builders targeting specific jurisdictions: the generic AI layer commoditizes, but the "AI + local compliance + human accountability" bundle still has real pricing power.
Curious whether anyone has seen this play out already in contract review or compliance tooling outside the US.
THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?" .
That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?
Please see attached contract we received from [counterparty]. ChatGPT says blah, blah and blah should be revised. What do you think? Is there anything else that we should change?
It's a bit like with doctors: you'll want a second opinion, if you can afford it.
Probably for important deals, detailed human review will be expected.
Maybe the real value-add will be the insertion of language that LLMs won't be able to figure out, but which will be favorable for the side that inserted them.
I do a second phase on Codex, by asking it to download all the PDFs and extract all the text of the laws it references. It can then repeat the research step fully locally.
After that I ask Gemini to find issues and criticize.
UPDATE: there are many legal skills on GitHub to try; I haven't used any yet.
Julian Nyarko
LOL! Reading it makes me extremely suspicious about how cherry-picked this was
If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.
Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.
Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.
My experience then (this was back before "Attention Is All You Need", I hadn't met the output of generative models) was that students tended to produce work that did not have a proper thread of reasoning in it. There was a tendency to repeat things they had read but rehashed in various ways.
Reviewing some of their texts it was clear that much of the writing - by law tutors - was of the same kind. Much was incorrect. The fact that someone at some time had said a particular case was a proposition for something, meant that got repeated from book to book. Many authors simply didn't read their sources or check their references. Students repeated what they had been told incuriously.
Note: this was a graduate level course. Not wet-behind-the-ears undergraduates.
The worst material was little potted notes produced for law students. Utterly awful material in most cases.
Anyway, when LLMs became a thing, a lot of what did not feel right about their output, and many of their error patterns, reminded me of the experience of teaching master's students.
One of the saving graces of English court room practice (when I did that sort of thing) was that judges would say to you "where does it say that?" in a case you cited. You had better have them all at your fingertips and know exactly where you had cited. That avoided a lot of hallucination.
Just a random remark which might be of interest.
The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.
There was another thread about the impact of AI on maths, and one of the arguments was about peer review... Made me wonder whether the writer was more concerned about the established order and gates being upset, or whether there's actually a valid technical criticism.
Attorneys will be using LLMs for convenience but they will not disappear, because there ultimately needs to be a human responsible for the decisions.
My understanding is that Civil Law (most of the world excluding UK, US, AU) is like a program: you feed it a situation, it outputs a decision, every once in a while you edit it.
Common Law (UK, US) isn't really a program, but you could stretch and say it's a state machine that has been running since the country started. Every interaction sets a new precedent and changes the state. But the programming analogy falls apart because no one in their right mind would design such a program.
LLMs might actually be the best example of such a program though: Common Law is basically one long chat with an LLM, hundreds of years long.
Before LLMs came along, a Common Law system seemed to have a finite time limit before it's co-opted by wealthy people with the resources to read the whole history. Now I think maybe we can push it a bit further.
But it's still a terrible program.
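The program versus state machine contrast can be made concrete with a toy sketch. Everything here is an assumption for illustration (the `STATUTE` table, the "late_rent" situation, the "most recent matching precedent wins" rule), and real legal systems are far messier than either half suggests:

```python
# Hypothetical one-line "code" provision for the civil law side.
STATUTE = {"late_rent": "tenant pays penalty"}


def civil_law_decide(situation):
    """Stateless: same input, same output, until someone edits STATUTE."""
    return STATUTE.get(situation, "no rule applies")


class CommonLawCourt:
    """Stateful: every decision is recorded and constrains later decisions."""

    def __init__(self):
        self.precedents = []  # list of (situation, outcome), oldest first

    def decide(self, situation, default_outcome):
        # Toy rule: the most recent matching precedent wins.
        result = default_outcome
        for past_situation, outcome in reversed(self.precedents):
            if past_situation == situation:
                result = outcome
                break
        self.precedents.append((situation, result))  # the state changes on every call
        return result


court = CommonLawCourt()
first = court.decide("late_rent", "tenant pays penalty")
second = court.decide("late_rent", "no penalty")  # earlier ruling controls
print(first, "|", second, "|", len(court.precedents))
```

The second call shows the "long chat" quality: the outcome depends on the history of prior calls, not just on the situation passed in.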
I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...
Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.
It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.
In 'critical' industries, the error rate is massively important, and if the quality of search is reaching an acceptable error rate, that's quite big news.
I mean, LLMs do OK with tutoring, but it depends more on how unique the questions are, not how difficult they are.
That's the problem: you never know when the 25% will deliver a true stink bomb, and that's not considering prompting. While a fair prompt/question may be considered objective, it's very easy to stray.
75% win rate seems pretty good!
Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...
There have been a lot of news stories about lawyers using AI, and then getting in trouble for citing hallucinated laws or cases. It doesn't matter if the AI response is "preferred" over the human one if it gets thrown out when put under the scrutiny of a real case.
But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.
you can get away with anything
If the only purpose of asking a lawyer is transferring risk (aka cover your ass) while getting the same advice as an LLM, that’s slowing down delivery for purely bureaucratic reasons.
I’ve seen that mentality at big companies where everyone is scared to stick their neck out and be accountable for a decision. And nothing gets done. Drives me crazy.
But the people who move up are the people who take ownership and get shit done (and are right a lot).
(BTW, I have been at companies that were sued by regulators. They never really punish the individual(s) who were in the room when the decision is made. So your worry is kind of misplaced.)
If you think about it and extract the semantics of any law, you get something that looks familiar, sort of like code. Of course there are some complexities where certain phrases can mean different things, but legal papers are, in a way, already written like programming languages, especially when it comes to law.
First we would have to define a language that can handle ambiguous operations, and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two-party dispute:
This is very simplified and pseudo like language, writing out a full contract would be as long as a real contract.
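As a rough illustration of the general idea of a clause expressed as code, here is a hypothetical sketch in plain Python rather than the kind of proof-oriented DSL the comment imagines. The `Delivery` type, the `seller_in_breach` rule, and the grace period are all invented for the example:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Delivery:
    due: date
    delivered: Optional[date]  # None means the goods never arrived


def seller_in_breach(d: Delivery, grace_days: int = 0) -> bool:
    """Hypothetical clause: the seller is in breach if the goods arrive
    later than the due date plus the grace period, or never arrive."""
    if d.delivered is None:
        return True
    return (d.delivered - d.due).days > grace_days


late = Delivery(due=date(2025, 1, 10), delivered=date(2025, 1, 12))
print(seller_in_breach(late, grace_days=3))  # two days late, inside the grace period
print(seller_in_breach(late, grace_days=1))  # two days late, outside the grace period
```

Even this tiny rule hides judgment calls (what counts as "delivered", whose calendar applies), which is where the ambiguity the comment mentions comes back in.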
Then you would run a proof-based LLM to generate it into the target language, and since we already had an example of this from one of the AI labs, we know it works. Automatic citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence. I am sure that many AI labs are working on something similar already and we will see something like that in the near future as proof-based LLMs evolve.
NotebookLM was considered slightly better than 2.5 Pro by the evaluators.
I don't think there will be any such market for "non ai" law. If I'm involved with the legal system I just want out as quick as possible as cheap as possible.
A bit of extrapolation from the study, but not a crazy stretch.
But I could also see a world where that, too, is fed to models for hyper-local results.
Could be a way off, but I could see it.
Even the good ones will not step above and beyond what they are paid to do.
But an AI? It will and can go above and beyond.
By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.
I think, in the right hands, this could be huge.
Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.
Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.
Recently, I tasked Opus 4.6 to study a new Czech building permit law in conjunction with some waste disposal regulations and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the fulltext of the new law. The usual "you are totally right" also applied and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.
I ended up studying the relevant regulations myself over the weekend.
In such a framing I don't find it surprising at all that teachers prefer the more polished answers generated by AI, because if LLMs are good at one thing, it is being confident in whatever they generate and presenting it convincingly.
The authors point out that this other metric was computed in prior work and incorrectly dismiss it as being not as good as winning percentage in head to head competitions. The cited prior work shows that the models fare poorly on that metric. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938
https://juliannyarko.com/
Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.
https://news.ycombinator.com/newsguidelines.html
What they're almost certainly observing is that these critical comments are being flagged as inappropriate. People make inappropriate comments that happen to contain criticism all the time, and I frequently see people edit them to declare that they were flagged because the group they're criticizing is astroturfing. It's virtually never the case. I've never seen it happen.
But to be clear I am completely ambivalent on Stanford and if you want to criticize them, more power to you.
EDIT: 10 min later. I give up. I tried to find who is funding HAI, and came up empty-handed. Usually you can see that in their yearly reports, but no such luck for me. I know Google and Bill Gates are big donors, so take that as you will.
https://fortune.com/article/rise-in-elite-students-seeking-a...
and where they wanted to ban words such as "chief", "stupid", "karen" and "American"
https://reason.com/2022/12/21/stanford-elimination-harmful-l...
I'm getting more convinced. I mean, sure, it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obv don't want this behavior but I wouldn't say it's dumb.
It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.
How does everyone else feel?
He stands to make billions if enough people believe him — unless you also do, consider that you’re the mark. For example, if that was true, it would have to mean that AI companies either aren’t letting customers use the good models or are instructing them to frequently make errors which reveal a fundamental lack of reasoning ability.
Consider also that his wealth means he hasn’t had to defend an idea stringently since the 90s. I wouldn’t be surprised if he does think LLMs give deep answers because it often looks that way until you critically review the response and ask questions like what’s missing which require you to have a decent understanding of the problem domain.
He makes billions but he already is a billionaire. Gaining billions more doesn't mean shit. The guy really has nothing to lose, and the utility of what he gains contributes little to his lifestyle.
I will tell you this. HN has been comically wrong about everything related to AI. They said driverless cars have no chance of becoming usable. Now Tesla FSD is almost there and I sleep in Waymo cars. HN said AI will never code; now everyone uses it to code.
It's fucking stupid. This is one of the smartest forums on the internet but HN becomes next to stupid when predicting AI. Why? Because humans can't face the truth. When the victim of attack is yourself, it doesn't matter how smart you are... you have to scaffold a rationalization to spare yourself as the victim. You have to lie to yourself and tell yourself that you matter.
The truth of it is, while LLMs are not the end game, AI in general is on a trajectory to take over. It shows us how meaningless our skills are... not only as programmers but as artists. That beautiful song you felt had greater meaning? It's all reproducible via an algorithm because it never really had a greater meaning. It was just a pattern.
He has access to employees and yes-men. What he actually needs to hear, nobody will tell him, AI even less so. Every shit idea he has, would be "what a bright idea"-ed by both everyone around him and AI.
And of course there's the little matter that he makes money and increases his power by selling AI. What seller doesn't promote their stuff as the greatest ever?
I also think it’s easy to think that AI gives good answers if you don’t know the field well. In fields where I know the material, the answers are pretty variable and can be quite bad.
AI is not only replacing programmers, but art and the meaning of being human itself. It's showing us how trivial all of human creation is as it's just patterns from an algorithm.
Humans have the advantage of perspective. We always lack some knowledge and answer broadly. This is bad if you have a particular goal in mind, but better if you're just generally learning, because you see more and learn to discriminate the correct from the wrong. And most importantly, being wrong is part of human ingenuity - because sometimes we turn something "obviously" wrong into something right.
Investor with vested interest in AI companies makes claim of reaching "AGI".
He is one of the last people to listen to about AGI. Unless the term "AGI" means something entirely different to him vs to independent researchers vs to CEOs, since the term has become entirely meaningless.
So no wonder on this point.
One thing I want to mention: Law != Justice.
So while LLMs are awesome at studying the law, they will suck at justice, simply because one has to solve very emotional problems with it at times. And LLMs are not that good at finding the correct emotion.