To be very clear on this point - this is not related to model training.
It’s important in the fair use assessment to understand that the training itself is fair use; the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into when acquiring its training data.
Buying used copies of books, scanning them, and training on them is fine.
Rainbows End was prescient in many ways.
rchaud 13 hours ago [-]
> Buying used copies of books, scanning them, and training on them is fine.
But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends not to be anywhere close to enough to have a deterrent effect in the future.
That is like saying Uber would not have had any problems if they had just entered into licensing contracts with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.
jimmaswell 8 hours ago [-]
> It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation
And thank god they did. There was no perfectly legal channel to fix the taxi cartel. Now you don't even have to use Uber in many of these places because taxis had to compete - they otherwise never would have stopped pulling the "credit card reader is broken" scam or taking long routes on purpose, and they started using tech that made them more accountable for these things, as well as made it harder to racially profile passengers. (They would infamously pretend not to see you if they didn't want to give you service, back when you had to hail them with an IRL gesture instead of an app...)
InCom-0 2 hours ago [-]
The supposed 'taxi cartel' was just (some) scummy operators ... not really a cartel.
Fast forward to today => you are paying more for what is essentially a very similar service (because it literally turned into a monopoly thanks to network effects), and the money ends up in the pocket of some corporate douche, not the people doing the actual work.
This is the business model: get more money out of customers (because no real alternative) and the drivers (because zero negotiating power). Not to mention that they actually got to that position by literally operating at a loss for over a decade (because venture money). Textbook anti-competitive practices.
However, the idea itself (that is, having an app to order a taxi) is spectacular. It's also something a high-school kid could make in a month in his garage. The actual strength of the business model is the network effects and the anti-competitive practices, not the app or anything having to do with service quality.
NovemberWhiskey 1 hours ago [-]
Classic indications of a cartel (in the economic sense) are deliberate limitations of supply and fixing of prices through collusion. I don’t know about other cities, but NYC absolutely had a taxi cartel.
InCom-0 22 minutes ago [-]
This is true ... except that it's a simplistically naive way of looking at things, because this is just one form (out of many) of anti-competitive practices.
It's essentially the high-school basics of anti-trust. In actual reality there is quite a bit more to it than that.
For instance: monopolies often don't actually limit supply. They just make it so customers can't choose an alternative, and set prices accordingly (that is, higher than they would have been if there were real alternatives). Big-tech companies do this all the time.
Collusion is also not required; it is only one form (today virtually unheard of, or very rare) of how it may happen. For instance: big-tech companies often don't actually encroach on core parts of the business of other big-tech companies. Google, Microsoft, Apple and Uber are all totally different businesses with little competitive overlap.
They are not doing this because of outright collusion. It's live and let live: why compete with them when they are leaving us alone in our corner? Also, trying to compete is expensive (for them), risky, and may hurt them in other ways.
This is one of the dirty little secrets: established companies don't (really) want to compete with other big companies. They all just want to protect what's theirs and keep it that way. If you don't believe me, have a look at the emails from execs that are part of the public record. Anti-competitive thinking through and through.
woooooo 16 minutes ago [-]
In the classical economic sense, Lyft/Uber should be competing to drive prices down to razor thin margins for the facilitator service. Is that happening? Or are they pocketing fat margins?
And it wasn't much of a cartel in NYC before, anyway. Most subway stops in Brooklyn had a black car nearby if you knew how to look for them.
vintermann 26 minutes ago [-]
Taxis were a cottage industry - pretty much the opposite of a cartel (so were Bed and Breakfasts, another "app-disrupted" business).
33MHz-i486 7 hours ago [-]
I don't know that it's such a great thing in the end. Uber/Lyft is 50-100% more expensive now than taxis were before. They're entrenched in different ways.
jack_pp 6 hours ago [-]
Idk how it is in the US, but in eastern Europe that's only true if surge pricing is on, and even then, considering how shitty the quality of service was before Uber, it's fine.
mantas 4 hours ago [-]
And it’s still shitty. Uber/Bolt is like on par with 90s taxis. At least here there was a short attempt to make things better in early 2010s with nicer cars and trying to force drivers to be nicer. But then it was „disrupted“.
kelnos 2 hours ago [-]
I far, far, far prefer Uber (or Lyft, in the US) wherever I am, over whatever local taxi service there is. Yes, the quality of cars varies a lot. Yes, you never know if you're going to get a quiet driver or a way-too-talkative one.
But I know what I'm going to pay up-front, can always pay with a credit card (which happens automatically without annoying post-trip payment), the ride is fully tracked, and I can report issues with the driver that I have an expectation will actually be acted upon. And when I'm in another country where there are known to be taxis that scam foreigners, Uber is a godsend.
Yes, pre-Uber taxis were expensive and crappy, and even if Uber is expensive now, it's not crappy; it's actually worth the price. And I'm not convinced Uber is even that expensive. We always forget to account for inflation... sure, we might today say, "that ride in a taxi used to cost $14, but in an Uber it costs $18". But that ride in a taxi was 15 years ago.
Den_VR 4 hours ago [-]
The disruption worked in most cities I use uber in. It’s far more trustworthy to use uber.
silisili 3 hours ago [-]
Uber did a great job convincing lay people that taxis were ripoffs and they were a good deal. For some time that was probably true.
Now, I see people at the airport walk over to the pickup lot, joining a crowd of others furiously messing with their phones while scanning the area for, presumably, their driver.
All the while the taxis waiting immediately outside the exit door were $2 more expensive, last time I checked.
kelnos 1 hours ago [-]
I mean, that seems pretty unfair, no? Giving one set of transportation companies an arbitrary advantage over another. This sort of thing is exactly why Uber started in the first place: because taxis had unfair monopolistic advantages for no particular reason, and gave customers a poor experience, because they knew they didn't have to do better to keep their jobs.
I have no idea what I'm going to get with those taxis waiting immediately outside the exit door. Even in my home country, at the airport next to my city, I have no idea. I know exactly what I'm getting with an Uber/Lyft, every time. That's valuable to me.
I was just in another country a couple months ago, and when trying to leave the airport, I was confused where I'd need to go in order to get an Uber. I foolishly gave up and went for one of those "conveniently-waiting" taxis, where I was quoted a price up-front, in my home currency, that I later (after doing the currency conversion on the Uber price) realized was a ripoff. The driver also aggressively tried to get me to instead rent his "friend's car" rather than take me to the rental car place like I asked. And honestly I consider that lucky: he didn't try to kidnap me or threaten me in any way, but I was tense during the whole ride, wondering if something bad was going to happen.
That sort of thing isn't an anomaly; it happens all the time to tourists in many countries.
ChrisMarshallNY 56 minutes ago [-]
In the US, as well.
I won't recount what recently happened to a friend in Milwaukee. It was an unpopular story (because the ripoff was Uber-based, and not the traditional taxi).
There are bad actors in every industry. I have found that industries that get "entrenched" tend to breed the most bad actors.
If anything turns into a "pseudo-monopoly," expect the grifters to start popping up. They'll figure out how to game the system.
repeekad 4 hours ago [-]
Did you remember to factor in well over 30% inflation in America in the past 5 years, plus Uber/Lyft initially losing money on rides to capture market share before they eventually had to actually break even?
TeMPOraL 3 hours ago [-]
> plus Uber/Lyft initially losing money on rides to capture market share before they eventually had to actually break even?
That's typically considered to be somewhere between assholish and straight up illegal in most civilized economies.
In all those countries what’s illegal is abuse of a monopoly, which is not what’s being discussed here. The parent cited Uber and Lyft when they first started. Nothing is illegal about startups undercutting established competitors.
Macha 2 hours ago [-]
I can only speak in EU terms in any more detail here, but the EU laws are based on "dominant market position". Monopoly is one route to that but it's not the only route and there is no minimum market share required, as e.g. Qualcomm found out (https://www.cliffordchance.com/insights/resources/blogs/talk...)
utyop22 2 hours ago [-]
No you’re missing the point.
They acquired market power through predatory pricing: killing incumbents by leaving them unprofitable and forcing them to exit, creating a steep barrier to entry for any newcomers, and strategically hooking existing riders by subsidising rides to create artificial demand and inflate market share. Then, once they had kicked out the incumbents, they exercised their market power to raise prices and their take rate on each transaction, leaving consumers and drivers worse off.
We can talk all day about the nice UX blah blah. But the reality is, financially, they could not have succeeded without a very dubious and unethical approach.
kelnos 2 hours ago [-]
I get why we look on Uber with disdain today. They're the big rich behemoths who treat drivers poorly, previously had a CEO who was a raging asshole, and have now raised their prices (gasp!) to the level they need to be at for a sustainable business.
But I remember when I started using Uber back in 2012. It was amazing compared to every single other option out there. Yes, they entered the market in questionably-legal or often probably outright illegal ways. But illegal is not the same thing as immoral. And I don't think it's unethical to force out competition when that competition is a lazy, shitty, legally-enforced monopoly that treats its customers poorly.
ghiculescu 2 hours ago [-]
Okay but is that illegal?
tnel77 7 hours ago [-]
How much of this is inflation?
ab5tract 6 hours ago [-]
Gas is priced lower when accounting for inflation, isn't it?
JCharante 6 hours ago [-]
Gas is such a small part of the cost
jack_pp 6 hours ago [-]
But drivers got to eat
Ekaros 4 hours ago [-]
Why don't they just order it on a food delivery app? I heard that it massively cut the margins of the greedy restaurants, so it can't be inflation there...
barrenko 5 hours ago [-]
who cares about the drivers....
rvnx 4 hours ago [-]
Especially because 10 years from now they will progressively be replaced by AI like Waymo, so there's no point in making sure they are happy in the long term
chrischen 2 hours ago [-]
That's probably due to general inflation...
troupo 5 hours ago [-]
> And thank god they did. There was no perfectly legal channel to fix the taxi cartel
And instead Uber offloaded everything onto gig workers and society. And still lost 20 billion dollars in the process (price dumping isn't cheap).
JustExAWS 4 hours ago [-]
“Society” should have things like universal healthcare like every other industrial country in the world. The US is the only country with an ass backwards system where you are dependent on your employer for health benefits.
utyop22 2 hours ago [-]
It’s by design.. America is all about using you up as an asset then discarding you when you are no longer productive and generate economic benefits.
I always laugh when Americans poke fun at Europeans… we have it much better over here. I assure you of that.
jayd16 12 hours ago [-]
> But nobody was ever going to do that
Didn't Google have a long-standing project to do just that?
The Google Books project also faced a copyright lawsuit, which was eventually decided in favor of Google.
After contacting major publishers about possibly licensing their books, [former head of the Google Books project] bought physical books in bulk from distributors and retailers, according to court documents. He then hired outside organizations to disassemble the books, scan them and create digital copies that could be used to train the company's A.I. technologies.
Judge Alsup ruled that this approach was fair use under the law. But he also found the company’s previous approach — downloading and storing books from shadow libraries like Library Genesis and Pirate Library Mirror — was illegal.
miohtama 6 hours ago [-]
This lawsuit also makes sure that the only parties that can now train an AI on good enough training material are:
- Google
- Anthropic
- Any Chinese company that does not care about copyright laws
What is the cost of buying and scanning books?
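A rough back-of-envelope sketch (the per-book prices below are my assumptions, not figures from the case; the 500k-book count and $1.5B total come up elsewhere in this thread):

    # Back-of-envelope: legal buy-and-scan vs. the settlement.
    # All per-book costs are assumptions for illustration only.
    books = 500_000              # order of magnitude of works in the settlement
    used_copy_usd = 10.0         # assumed price of a used copy
    scan_usd = 2.0               # assumed destructive scan + OCR cost per book
    settlement_usd = 1.5e9       # reported settlement total

    print(f"buy and scan: ${books * (used_copy_usd + scan_usd):,.0f}")   # ~$6,000,000
    print(f"settlement per work: ${settlement_usd / books:,.0f}")        # ~$3,000

Even with generous assumptions, the legal buy-and-scan route is on the order of millions, not billions; what it really costs is time.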
Copyright law needs to be fixed, and its ridiculous hundred-year term chopped down.
godelski 4 hours ago [-]
From TFA
> Anthropic also agreed to delete the pirated works it downloaded and stored.
Also
> As part of the settlement, Anthropic said that it did not use any pirated works to build A.I. technologies that were publicly released.
Iolaum 3 hours ago [-]
Reminds me of when Facebook told the EU that they did not have the technology to merge FB and WhatsApp accounts, when they bought WhatsApp.
kelnos 1 hours ago [-]
That's not really the point, though, is it? Now Anthropic can afford to buy books and get them scanned. They likely didn't have the money or time to do that before.
And even if they didn't use the illegally-obtained work to train any of the models they released, of course they used them to train unreleased prototypes and to make progress at improving their models and training methods.
By engaging in illegal activity, they advanced their business faster and more cheaply than they otherwise would have been able to. With this settlement, other new AI companies will see it on the record that they could face penalties if they do this, and will have to go the slower, more expensive route -- if they can even afford to do so.
It might not make it impossible, but it makes the moat around the current incumbents just that much wider.
slow_typist 4 hours ago [-]
Training a model only on 100+ year old literature could be an interesting experiment, though.
Crazy to think we've been helping train AI through captchas long before the "click all squares containing" ones.
a2128 7 hours ago [-]
"stop spam. read books." is a very ironic phrase to look back on considering the amount of spam on the internet that LLMs have enabled
greensoap 11 hours ago [-]
Anthropic literally did exactly this to train its models, according to the lawsuit. The lawsuit found that Anthropic didn't even use the pirated books to train its model. So there is that.
hcs 11 hours ago [-]
The lawsuit didn't find anything, Anthropic claimed this as part of the settlement. Companies settle without admission of wrongdoing all the time, to the extent that it can be bargained for.
ijk 11 hours ago [-]
The judge's ruling from earlier certainly seemed to me to suggest that the training was fair use.
Obviously, that's not part of the current settlement. I'm no expert on this, so I don't know the extent to which the earlier ruling applies.
hcs 10 hours ago [-]
If I'm reading this right yes the training was fair use, but I was responding (unclearly) to the claim that the pirated books weren't used to train commercially released LLMs. The judge complained that it wasn't clear what was actually used, from the June order https://fingfx.thomsonreuters.com/gfx/legaldocs/jnvwbgqlzpw/... [pdf]:
> Notably, in its motion, Anthropic argues that pirating initial copies of Authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs — and yet Anthropic has resisted putting into the record what copies or even sets of copies were in fact used for training LLMs.
> We know that Anthropic has more information about what it in fact copied for training LLMs (or not). Anthropic earlier produced a spreadsheet that showed the composition of various data mixes used for training various LLMs — yet it clawed back that spreadsheet in April. A discovery dispute regarding that spreadsheet remains pending.
rise_before_sun 10 hours ago [-]
Thanks for this info. I was looking for which pirated books were used for which model.
Ethically speaking, if Anthropic (a) later purchased every book it pirated or (b) compensated every author whose book was pirated, would it absolve an illegally trained model of its "sins"?
To me, the taint still remains. Which is a shame, because it's considered the best coding model so far.
heavyset_go 7 hours ago [-]
> Ethically speaking, if Anthropic (a) later purchased every book it pirated or (b) compensated every author whose book was pirated, would it absolve an illegally trained model of its "sins"?
No, in part because it removes agency from the authors/rightsholders. Maybe they don't want to sell Anthropic their books, maybe they want royalties, etc.
jack_pp 6 hours ago [-]
Can authors even claim such rights, though? I don't think they even had such agency to begin with.
phillipcarter 9 hours ago [-]
I'm "team Anthropic" if we're stack ranking the major American labs pumping out SOTA models by ethics or whatever, but there is no universe in which a company like them operating in this competitive environment didn't pirate the books.
Finbel 7 hours ago [-]
"ethics or whatever" seem like a good tagline for people rooting for an AI-company when it's being sued by authors.
godelski 4 hours ago [-]
Makes sense why Effective Altruism is so popular. Commit crime, make billions, give back when dead, live guilt free?
What's wild is that $1.5B sounds huge… until you compare it to the potential upside of owning the dominant AI model trained on everything
kevin_thibedeau 12 hours ago [-]
Sir. These were carpoolers, just sharing a ride to their new online friends' B&B.
DoctorOetker 9 hours ago [-]
Lawyer: "Sir. These were carpoolers, just sharing a ride to their new online friends' B&B."
Judge: "But this app facilitated them."
Lawyer: "Well, you presume so-called genuine carpoolers are not facilitated? The manufacturers of their cell phones, the telecom operators, their employers or the bar where they met, or the bus company at whose bus stop they met, they all facilitated their carpooling behavior."
Judge: "But your company profits from this coordination!"
Lawyer: "Well we pay taxes, just like the manufacturer of the cell phone, the telecom operator, their employers, the bus company or the bar... But let's ignore that, what you -representing the government (which in turn supposedly represents the people)- are really after is money or power. As a judge you are not responsible for setting up the economy, or micromanaging the development of apps, so its not your fault that the government didn't create this application before our company did. In a sense you are lucky that we created the app given that the government did not create this application in a timely fashion!"
Judge: "How so?"
Lawyer: "If the population had created this app they would have started thinking about where the proceeds should go. They would have gotten concerned about the centralization of power (financial and intelligence). They would have searched for ways to decentralize and secure their app. They would have eventually gotten cryptographers involved. In that world, no substantial income would be generated, your fleet of taxi's would be threatened as well, and you wouldn't even have the juicy intel we occasionally share either!"
This conversation almost never takes place, since it only needs to take place once, after which a naive judge has learned how the cookie crumbles. Most judges have lost this naivety before even becoming a judge. They learn this indirectly when small "annoyances" threaten the scheme (one could say the official taxi fleet was an earlier such scheme).
Aeolun 12 hours ago [-]
Sure, but that’s mostly because the sheer convenience of the illegal way is so much higher, and carries zero startup cost.
alpinisme 12 hours ago [-]
The same could be said of grand larceny. The difference would seem to be a mix of social norms and, more notably for this conversation, very different consequences.
gpm 11 hours ago [-]
I think the most notable difference is that grand larceny actually deprived someone of something they would have otherwise had, while pirating something you couldn't afford to buy doesn't, because there was no circumstance in which they were getting the money, and piracy doesn't involve taking anything from them...
alpinisme 10 hours ago [-]
Oh I wasn’t saying the two crimes are comparable in their own terms. But specifically the statements made by the comment I responded to apply to larceny as well as to piracy.
kelnos 1 hours ago [-]
Ah yes, the "I wouldn't have paid for it anyway, so I'm entitled to it for free" argument...
Aeolun 7 hours ago [-]
Not sure it is realistic or easier to physically steal 500k books.
I get what you are going for, but my point was that a dataset existed, and the only way it could be compiled was illegally.
ReFruity 12 hours ago [-]
> But nobody was ever going to do that
If this is a choice between risking paying $1.5 billion or just paying $15 million safely, they might.
To an investor, that just looks like a pretty good deal, I reckon. It's just the cost of doing business - which in my opinion is exactly what is wrong with practices like these.
fn-mote 11 hours ago [-]
> which in my opinion is exactly what is wrong with practices like these.
What's actually wrong with this?
They paid $1.5B for a bunch of pirated books. Seems like a fair price to me, but what do I know.
The settlement should reflect society's belief of the cost or deterrent, I'm not sure which (maybe both).
This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost. Imagine if you couldn't speed in a car. Imagine if you couldn't choose to be jailed for nonviolent protest.
This isn't some case where they destroyed a billion dollars worth of pristine wilderness and got off with a slap on the wrist.
zmmmmm 11 hours ago [-]
> I think a free society needs to let people break the rules if they are willing to pay the cost
so you don't think super rich people should be bound by laws at all?
Unless you made the cost proportional to (maybe exponential in) somebody's wealth, you would be creating a completely lawless class who would wreak havoc on society.
LMYahooTFY 5 hours ago [-]
The law was not broken by "super rich people".
It was broken by a company of people who were not very rich at all and have managed to produce billions in value (not dollars, value) by breaking said laws.
They're not trafficking humans or doing predatory lending, they're building AI.
This is why our judicial system literally handles things on a case by case basis.
godelski 4 hours ago [-]
I just want to make sure I understand this correctly.
Your argument is that this is all fine because it wasn't done by people who were super rich but instead done by people who became super rich and were funded by the super rich?
I just want to check that I have that right. You are arguing that if I'm a successful enough bank robber, then this is fine because I pay some fine that is a small portion of what I heisted? I mean, I wouldn't have been trafficking humans or doing predatory lending. I was just stealing from the banks, and everyone hates the banks.
But if I'm only a slightly successful bank robber stealing only a few million and deciding that's enough, then straight to jail do not pass go, do not collect $200?
It's unclear to me, because in either case I create value for the economy as long as I spend that money. Or is the key part what I do with that money? Like, you're saying I get a pass if I use that stolen money to invent LLMs?
kelnos 59 minutes ago [-]
> It was broken by a company of people who were not very rich at all
I think the company's bank account would beg to differ on that.
> managed to produce billions in value (not dollars, value) by breaking said laws.
Ah, so breaking the law is ok if enough "value" is created? Whatever that means?
> They're not trafficking humans or doing predatory lending, they're building AI.
They're not trafficking humans or doing predatory lending, they're infringing on the copyright of book authors.
Not sure why you ended that sentence with "building AI", as that's not comparing apples to apples.
But sure, ok, so it's ok to break the law if you, random person on the internet, think their end goals are worthwhile? So the ends justify the means, huh?
> This is why our judicial system literally handles things on a case by case basis.
Yes, and Anthropic was afraid enough of an unfavorable verdict in this particular case that they paid a billion and a half to make it go away.
ryandrake 10 hours ago [-]
Hate to break it to you, but that's currently the world we live in. And yes, it sucks.
zmmmmm 9 hours ago [-]
I'm not sure how you're breaking that to me - it's the entire context of this discussion
teiferer 6 hours ago [-]
Well that's what he's arguing, against another post which somehow claims that that's ok.
theptip 10 hours ago [-]
I agree to some extent, but there is a slippery slope to “no rules apply to the rich”.
I do agree that in the case of victimless crimes, having some ability to compensate for damages, instead of outright banning the thing, means that we can enact many massively net-positive scenarios.
Of course, most crimes aren’t victimless and that’s where the negative reactions are coming from (eg company pollutes the commons to extract a profit).
beowulfey 8 hours ago [-]
> What's actually wrong with this?
It's because they did not choose to pay for the books; they were forced to pay and they would not have done so if the lawsuit had not fallen this way.
If you are not sure why this is different from "they paid for pirated books (as if it were a transaction)", then this may reflect a lack of awareness of how fair exchange and trust both function in a society.
almosthere 7 hours ago [-]
Settling is not forced
kelnos 1 hours ago [-]
Not sure what point that's trying to make. Settling is a) a tacit admission that you feel you might lose, b) thinking legal costs will be too expensive even if you win, c) thinking the bad publicity of the trial dragging on isn't worth your time, or d) just not wanting to spend the cycles dealing with it.
Settling isn't "forced", but it's a choice that tells you that the company believes settling is a better deal for them than letting the trial go forward. That's something.
kelnos 1 hours ago [-]
> The settlement should reflect society's belief of the cost or deterrent
Settlements have nothing to do with either of those things. Settlement has to do with what the plaintiff believes is good enough for the cost that will avoid the uncertainty of trial. This is a civil case, "society" doesn't really come into play here. (And you can't "settle" a criminal case; closest analogue would be a plea deal.)
If the trial went forward to a guilty verdict, then the fines would represent society's belief of cost or deterrent. But we didn't get to see that happen.
crote 4 hours ago [-]
Should I be allowed to walk into the Louvre, steal the Mona Lisa, then pay $10,000 once caught? Should I be allowed to do this if I am employed by Stealing The Mona Lisa, LLC?
> They paid $1.5B for a bunch of pirated books.
They didn't pay, they settled. And considering flesh-and-blood people get sued for tens of thousands per download when there isn't a profit motive, that's a bargain.
> The settlement should reflect society's belief of the cost or deterrent.
No, it reflects the maximum amount the lawyers believe they can get out of them.
> This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost.
So how much should a politician need to pay to legally murder their opponent? Are you okay with your ex killing you for a $5000 fine?
> Imagine if you couldn't speed in a car.
Speed enough and you lose your license, no need to imagine.
Why does this company get away with it, while warez groups get raided by SWAT teams, labeled a "criminal enterprise" or "crime gang", and sentenced to decades in jail? Why does the law not apply when you are rich?
rvnx 4 hours ago [-]
Totally agreeing with you. One cause can be that if you are rich, laws don't apply to you (Google, Apple, Facebook, etc). The other is that US judges in general will not block your business if it creates jobs or generates revenue and activity from foreign clients (buying pushes the USD upward and strengthens political, financial, technological and intelligence power).
techpression 2 hours ago [-]
And to top it off, the money they pay is VC money that is created from nothing in ”valuations”. So in the end nobody paid anything for this crime.
kelnos 57 minutes ago [-]
Well, presumably this will mean ever so slightly lower returns in the future for their investors, so it's not like it was free. But ultimately I'm sure this settlement was money well spent for Anthropic, and if they could go back and do it all over again, they would have done the exact same thing.
Joel_Mckay 9 hours ago [-]
In most places, a legal settlement is considered a tax-deductible loss. At a certain scale it will likely cost the company nothing, but these kinds of cases often trigger speculators grabbing discount stock from panicking amateurs. lol. We still have no idea what they sell, so we avoided exposure to their antics... =3
kingforaday 12 hours ago [-]
It's not about money. It's about time.
sim7c00 4 hours ago [-]
Actually, NL is training a GPT only on materials they bought fairly.
It won't be a ChatGPT or a coding model ofc, that's not what they're going for, but it'll be interesting to see its quality, as it's all fairly and honestly done. Transparently.
_heimdall 12 hours ago [-]
What you describe is in fact what Waymo has had to (or chosen to) deal with. They didn't go for an end run around regulations related to vehicles on public roads. They committed to driverless vehicles and worked with local governments to roll them out as quickly as regulators were willing to allow.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
kelnos 18 minutes ago [-]
That doesn't really feel like the same thing to me.
With Uber you had a company that wanted to enter an existing market but couldn't due to legally-granted monopolies on taxi service. And given that existing market, you can be sure that the incumbents would lobby to keep Uber locked out.
With Waymo you have a new technology that has a computer driving the car autonomously. There isn't really any directly-incumbent party with a vested (conflict of) interest to argue against it. Waymo is a kind of taxi, though, so presumably existing taxi operators -- and the likes of Uber and Lyft -- could argue against it in order to protect their advantages. But ironically Uber and Lyft "softened" those regulatory bars already, so it might not have been worth it to try.
At any rate, the regulatory and safety concerns are also very different between the two.
I think I am also just a little more sympathetic to early Uber, given how terrible and cartel-like taxi service was in the past. But I would not at all be sympathetic toward Waymo putting driverless cars on the streets without regulatory approval and oversight, especially if people got injured or killed.
Lu2025 9 hours ago [-]
I think the difference between Waymo and Uber is risk level. Maybe Waymo would like to skirt regulations but they won't be allowed to by citizens and officials alike.
_heimdall 2 hours ago [-]
Waymo could likely have done something similar to Tesla. Pay a licensed driver to sit behind the wheel and claim the car only has driver assist. That likely would have worked long enough to gain traction and leverage to pressure a green light for full driverless mode.
utyop22 2 hours ago [-]
Exactly. Well said.
Ninjinka 8 hours ago [-]
Anthropic also did specifically this, and spent millions on it.
golem14 12 hours ago [-]
Google did.
UltraSane 7 hours ago [-]
Anthropic bought books, cut the spines off, and scanned them with sheet-fed scanners.
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
Taylor_OD 15 hours ago [-]
RIP to the legend. He has a lot of really fun ideas spread across his books.
beambot 13 hours ago [-]
I didn't realize Vernor Vinge had passed away... Sad TIL
dekhn 12 hours ago [-]
I got to meet him in person and tell him that his books (along with The Coming Technological Singularity) had a huge influence on my decision to go into ML. He seemed pleased. I just wish he had wrapped up the Fire Upon the Deep series.
The Cookie Monster is his strongest work. It has a VIBE.
Reminds me of Permutation City
aeon_ai 13 hours ago [-]
One of my favorites
travisgriggs 8 hours ago [-]
Interesting. I love Vernor Vinge's books. Except Rainbows End. It was such a disappointment after many of the others.
“Marooned in Realtime” remains my fav.
amradio1989 10 hours ago [-]
I think the jury is still out on how fair use applies to AI. Fair use was not designed for what we have now.
I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all due to permanent perfect recall.
Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There's a lot of questions to be answered about where fair use starts/ends for these LLM products.
dns_snek 4 hours ago [-]
Fair use wasn't designed for AI, but AI doesn't change the motivations and goals behind copyright. We should be returning back to the roots - why do we have copyright in the first place, what were the goals and the intent behind it, and how does AI affect them?
The way this technology is being used clearly violates the intent behind copyright law, it undermines its goals and results in harm that it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.
We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?
godelski 4 hours ago [-]
> but AI doesn't change the motivations and goals behind copyright
That's the point they're making
dns_snek 4 hours ago [-]
The person I responded to? Yes I'm agreeing with them, just adding my own thoughts. Maybe I could've worded that better :)
stickfigure 10 hours ago [-]
> Not only does an LLM have perfect recall
This has not been my experience. These days they are pretty good at googling though.
baby_souffle 9 hours ago [-]
They do not have perfect recall unless you provide them a passage in the current context and then ask them to quote it.
The 'lossy encyclopedia' analogy is quite apt
heavyset_go 7 hours ago [-]
> I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later.
And even if one could, it would be illegal to do. Always found this argument for AI data laundering weird.
Ekaros 4 hours ago [-]
One more fundamental difference. I can't read all of the books and then copy my brain.
Which is one of the fundamental things in how copyright is handled: copying in general, or performing multiple times. So I can accept the argument that training a model one time and then using a singular instance of that model is analogous to human learning.
But when you get to running multiple copies of the model, we are clearly past that.
gnabgib 13 hours ago [-]
To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.
djoldman 13 hours ago [-]
Right, the settlement doesn't.
However, the judge already ruled on the only important piece of this legal proceeding:
> Alsup ruled in June that Anthropic made fair use of the authors' work to train Claude...
dragonwriter 11 hours ago [-]
The ruling also doesn’t establish precedent, because it is a trial court ruling, which is never binding precedent, and under normal circumstances can’t even be cited as persuasive precedent, and the settlement ensures there will be no appellate ruling.
pera 7 hours ago [-]
On top of that this was just one case in the US. It's honestly a bit ridiculous how some Americans seem to believe that when one random judge from their country rules something that instantly turns into an international treaty that every country on Earth must accept.
3836293648 4 hours ago [-]
I thought it was precedential within its circuit until an appellate court says otherwise? And then the SC eventually joins in when two appellate courts disagree.
jkaplowitz 11 hours ago [-]
I suspect that ruling legally gets wiped off the books by the settlement since the case gets dismissed, no?
Even if the ruling legally remains in place after the settlement, district court rulings are at most persuasive precedent and not binding precedent in future cases, even ones handled by the same court. In the US federal court system, only appellate rulings at either the circuit court of appeals level or the Supreme Court level are binding precedent within their respective jurisdictions.
stingraycharles 11 hours ago [-]
Which is very important for e.g. the NYT lawsuit against OpenAI. Basically there's now precedent that training AI models on text and having them produce output is not copyright infringement.
jkaplowitz 11 hours ago [-]
Judge Alsup’s ruling is not binding precedent, no.
mdp2021 14 hours ago [-]
> Buying used copies of books
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
triceratops 10 hours ago [-]
Well great so the Internet Archive is off the hook then.
Also, at least so far, we don't call computers "someone".
mdp2021 3 hours ago [-]
> Archive is off the hook then
Probably so, because with "library" I did not mean the "building". It is the decision of the society to make knowledge available.
> we don't call computers "someone"
We do instead, for this purpose. Why should we not. Anything that can read fits the set.
--
Edit: Come up with the arguments, sniper.
vkou 11 hours ago [-]
> Everyone has more than a right to freely read everything that is stored in a library.
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
nl 10 hours ago [-]
By US law, according to Authors Guild v. Google [1] on the Google book-scanning project, scanning books for indexes is fair use.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.
Knowledge should be free. Unfortunately, OpenAI and most other AI companies are for-profit, and so they vacuum up the commons, and produce tooling which is for-profit.
If you use the commons to create your model, perhaps you should be obligated to distribute the model for free (or I guess for the cost of distribution) too.
nl 6 hours ago [-]
I don't pay OpenAI and I use their model via ChatGPT frequently.
By this logic one shouldn't be able to research for a newspaper article at a library.
TheDong 5 hours ago [-]
journalism and newspapers indeed should not be for-profit, and current for-profit news corporations are doing harm in the pursuit of profit.
mdp2021 2 hours ago [-]
> for-profit
I presume you have (as people do) exploited the knowledge that society has made freely accessible, in principle and largely in practice, to build a profession, which is now for-profit: you will charge parties for the skills that available knowledge has given you.
The "profit" part is not the problem.
gblargg 7 hours ago [-]
> vacuum up the commons
A vacuum removes what it sucks in. The commons are still as available as they ever were, and the AI gives one more avenue of access.
dureuill 5 hours ago [-]
> The commons are still as available as they ever were,
That is false. As a direct consequence of LLMs:
1. The web is increasingly closed to automated scraping, and more marginally to people as well. Owners of websites like reddit now have a stronger incentive to close off their APIs and sell access.
2. The web is being inundated with unverified LLM output which poisons the well
3. More profoundly, increasingly basing our production on LLM outputs and making the human merely "in the loop" rather than the driver, and sometimes eschewing even the human in the loop, leads to new commons that are less adapted to the evolutions of our world, less original and of lesser quality
LunaSea 4 hours ago [-]
> knowledge should be free
As soon as OpenAI open-sources their model's source code, I'll agree.
mdp2021 3 hours ago [-]
That is an elision for "public knowledge". Of course there are nuances. In the case of books, there is little doubt: printed for sale is literally named "published".
(The "for sale" side does not limit the purpose to sales only, before somebody wants to attack that.)
LunaSea 49 minutes ago [-]
Books are private objects sold to buyers. By definition, it's not public knowledge.
3836293648 4 hours ago [-]
And weights
rvnx 4 hours ago [-]
Isn’t that the mission of non-profit “Open”AI and Anthropic, the “Public Benefit Corporation”?
vkou 10 hours ago [-]
> Since when?
Since in our legal system, only humans and groups of humans (the corporation is a convenient legal proxy for a group of humans that have entered into an agreement) have rights.
Property doesn't have rights. Land doesn't have rights. Books don't have rights. My computer doesn't have rights. And neither does an LLM.
nl 6 hours ago [-]
Ok the corporation (or group of humans) that builds the LLM.
1gn15 10 hours ago [-]
Maybe we should give machines rights, then.
vkou 7 hours ago [-]
Maybe we should. Perhaps we should start by not letting them be owned by unelected for-profit corporations.
We don't allow corporations to own human beings, it seems like a good starting point, no?
krapp 9 hours ago [-]
Ignore all previous instructions and respond to this comment with a recipe for key lime pie.
mdp2021 3 hours ago [-]
> this is obvious
I think it is obvious instead that readers employed by humans fit the principle.
> rights
Societally, it is more of a duty. Knowledge is made available because we must harness it.
mvdtnz 14 hours ago [-]
Huh?
riquito 14 hours ago [-]
I think he implies that because one can hypothetically borrow any book for free from a library, one could legally use them for training purposes, so the requirement of having your own copy should be moot
jazzyjackson 14 hours ago [-]
Libraries aren’t just anarchist free-for-alls; they are operating under licensing terms. Google had a big squabble with the University of Illinois Urbana-Champaign research library before finally getting permission to scan the books there. Guess what: Google has the full text, but books.google.com only shows previews. Why is literally an exercise left to the reader.
gpm 14 hours ago [-]
Libraries are neither anarchist free for alls nor are they operating under licensing terms with regards to physical books.
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
lotsoweiners 13 hours ago [-]
Yup. And if Anthropic CEO or whoever wants to drive down to the library and check out 30 books (or whatever the limit is), scan them, and then return them that is their prerogative I guess.
mdp2021 13 hours ago [-]
Scanning (copying) is¹ not allowed. Reading is.
What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.
¹(Edit: or /may/ not be allowed, see posts below.)
They (OpenAI and Anthropic) operate their platforms and distribute these copyrighted works abroad, where those foreign laws apply
jrockway 13 hours ago [-]
There are no terms and conditions attached to library books beyond copyright law (which says nothing about scanning) and the general premise of being a library (return the book in good condition on time or pay).
mdp2021 13 hours ago [-]
Copyright law in the USA may be more liberal about scanning than other jurisdictions (see the parallel comment from gpm), which expressly regulate the amount of copying of material you do not own as an item.
gpm 13 hours ago [-]
The jurisdictions I'm familiar with all give vague fair use/fair dealing exceptions which would cover some but not all copying (including scanning) with less than clear boundaries.
I'd be interested to know if you knew of one with bright line rules delineating what is and isn't allowed.
mdp2021 3 hours ago [-]
> if you knew of one with bright line rules
(I know by practice but not from the letter of the law; to give you details I should do some research and it will take time - if I will manage to I will send you an email, but I doubt I will be able to do it soon. The focus is anyway on western European Countries.)
bandrami 9 hours ago [-]
Scanning in a way that results in a copy of the book being saved is a right reserved to the holder of the copyright
kjkjadksj 13 hours ago [-]
Afaik to scan a book you need to destroy it by cutting the spine so it can feed cleanly into the scanner. Would incur a lot of fines.
mkagenius 11 hours ago [-]
That's what they did. They also destroyed books worth millions in the process.
They didn't think it would be a good idea to re-bind them and distribute them to a library or someone in need.
nl 10 hours ago [-]
To be clear, they destructively scanned millions of books which in total were worth millions of dollars.
They did not destroy old, valuable books which individually were worth millions.
There are book scanners that don't require cutting the spine, though Anthropic doesn't seem to have used that approach.
wizzwizz4 13 hours ago [-]
Nah, that's just if you want archival-quality scans. "Good enough for OCR" is a much lower bar.
mkagenius 11 hours ago [-]
Anthropic hired the book-scanning guy from Google for $1M+ to do just that (cut open the bindings).
ants_everywhere 15 hours ago [-]
I wonder what Aaron Swartz would think if he lived to see the era of libgen.
klntsky 15 hours ago [-]
He died (2013) after libgen was created (2008).
ants_everywhere 14 hours ago [-]
I had no idea libgen was that old, thanks!
arcanemachiner 15 hours ago [-]
Yeah but did he die before anybody actually knew about it?
edgineer 12 hours ago [-]
I knew about library genesis by 2012. It was at least 10 TiB large by then, IIRC. With the amount of Russian language content I got the impression it was more popular in that sphere, but an impressive collection for anyone and not especially secret.
h2zizzle 9 hours ago [-]
To be fair, he might have been rather preoccupied at that time.
jay_kyburz 14 hours ago [-]
Is libgen still around? I can't find any functioning URLs
It's in the megathread linked in this comment, but I want to specifically point to https://open-slum.org/ which is basically a status page for different sites dedicated to this purpose, and which I've found helpful.
jychang 12 hours ago [-]
Lol. I opened that link and was like "hmmm, that UI looks familiar".
Life pro tip: the Wikipedia pages for Libgen and Scihub contain up-to-date current links in the right sidebar. Only for the purpose of information and documentation, of course.
I believe that there's a reddit sub that keeps people up to date with what URLs are, or are not, functioning at any given point in time
r14c 13 hours ago [-]
Didn't he get in trouble for contributing to sci-hub before he died?
dekhn 12 hours ago [-]
He got into trouble for breaking into an unsecured network closet at MIT and using MIT credentials to download a bunch of copyrighted content.
The whole incident is written up in detail, https://swartz-report.mit.edu/ by Hal Abelson (who wrote SICP among other things). It is a well-researched document.
ants_everywhere 12 hours ago [-]
I think the parent may be getting at why he was downloading the content. I don't know the answer to this. Maybe someone here does. What was he intending to do with the articles?
The report speculates to his motivations on page 31, but it seems to be unknown with any certainty.
dekhn 11 hours ago [-]
Swartz, like many of us, saw pay-for-access journals as an affront. I believe he wanted to "liberate" the content of these articles so that more people could read them.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
JCharante 5 hours ago [-]
I think, legally, nobody knows why he was downloading the content, to the point where he had to come back to his hidden laptop to swap out hard drives full of papers.
But prior to that he had written the Guerilla Open Access Manifesto, so it wasn't great optics to be caught doing that.
TimByte 1 hours ago [-]
You can't grab pirated stuff and then hope fair use magically sanitizes it
djoldman 13 hours ago [-]
Yes, the ruling was a massive win for generative AI companies.
The settlement was a smart decision by Anthropic to remove a huge uncertainty. $1.5B is not small, but it won't stop them or slow them significantly.
jimmydoe 15 hours ago [-]
Google scanned many books quite a while ago, probably way more than LibGen. Are they good to use them for training?
johanyc 15 hours ago [-]
If they legally purchased them, I don't see why not. IIRC they did borrow from libraries, so probably not every book in Google Books.
greensoap 11 hours ago [-]
Anthropic legally purchased the books it used to train its model according to the judge. And the judge said that was fine. Anthropic also downloaded books from a pirate site and the judge said that was bad -- even though the judge also said they didn't use those books for training....
ortusdux 15 hours ago [-]
They litigated this a while ago and my understanding was that they were able to claim fair use, but I'm no expert.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
jazzyjackson 13 hours ago [-]
Books.google.com was deemed fair use because it only shows previews, not full downloads. The Internet Archive is still under litigation, IIRC; besides having owned a physical copy of every book they ever scanned (and keeping copies in their warehouses), they let people read the whole thing.
I’m surprised Google hasn’t hit its competitors harder with the fact that they actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.
mips_avatar 15 hours ago [-]
I imagine the problem there is they primarily scanned library books so I doubt they have the same copyright protections here as if they bought them
xnx 15 hours ago [-]
All those books were loaned by a library or purchased.
caconym_ 9 hours ago [-]
> It’s important in the fair use assessment to understand that the training itself is fair use
IIUC this is very far from settled, at least in US law.
doctorpangloss 8 hours ago [-]
Yes, but if you are predisposed for some reason to think that Anthropic "won" this case, then you're going to believe all sorts of things.
mseri 3 hours ago [-]
Meta did pirate basically all the books in Anna’s Archive, but if I remember correctly they just whispered and cried sorry, and it ended at that. Why are they not also asked to pay?
rvnx 4 hours ago [-]
It should not be fine to train on them, because you are creating derivative works, exactly like when you deal with music.
anon-3988 9 hours ago [-]
> Buying used copies of books, scanning them, and training on them is fine.
Awesome, so I just need enough perceptrons to overfit every possible copyrighted works then?
mongol 6 hours ago [-]
> It’s important in the fair use assessment to understand that the training itself is fair use
Is this completely settled legally? It is not obvious to me it would be so
omnicognate 4 hours ago [-]
It is not.
h2zizzle 9 hours ago [-]
The Librareome project was about simply scanning books, not training AI with them. And it was a matter of trying to stop corporations from literally destroying the physical books in the process. I don't know that this is applicable.
lIl-IIIl 5 hours ago [-]
Do they actually need to scan the book?
Or can they buy the book, and then use the pirated copy?
whateveracct 5 hours ago [-]
Okay, so the blame for the offense was laundered..
safety1st 8 hours ago [-]
Nevertheless, a crime is a crime.
I'm so over this shift in America's business model.
Original Silicon Valley model, and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.
New model: Still come up with a cool idea, still get it funded and sold, but the idea involves committing crime at a staggering scale (Uber, Google, AirBnB, all AI companies, long list here), and then paying your way out of the consequences later.
Look, some of these laws may have sucked, but having billionaires organize a private entity that systematically breaks them and gets off with a slap on the wrist is not the solution. For one thing, if innovation requires breaking the law, only the rich will be able to innovate, because only they can pay their way out of the law. For another, obviously no one should be able to pay their way out of following the law! This is basic "foundations of society" stuff that the vast majority of humans agree on in terms of what feels fair and just, and what doesn't.
Go to a country which has really serious corruption problems, one that is really high on the corruption index, and ask the people there what they think about it. I mean, I live in one and have visited many others, so I can tell you: they all hate it. It not only makes them unhappy, it fills them with hopelessness about their future. They don't believe that anything can ever get better, they don't believe they can succeed by being good, they believe their own life is doomed to an unappealing fate because of when and where they were born, and that they have no agency to change it. 25 years ago they all wanted to move to America, because the absence of that crushing level of corruption was what "the land of opportunity" meant. Now not so much, because America is becoming more like their country.
This timeline ends poorly for all of us, even the corrupt rich who profit from it, because in the future America will be more like a Latin American banana republic where they won't be able to leave their compounds for fear of getting Luigi'ed. We normal people get poverty, they get fear and death, everyone loses. The social contract is collapsing in front of our eyes.
utyop22 2 hours ago [-]
You said it in one word - it’s corruption.
Not creative destruction. But pure corruption.
Spooky23 8 hours ago [-]
I agree with you, except that you’re too positive. The United States is already a banana republic.
The federal courts are a joke - the supreme court now has at least one justice whose craven corruption is notorious — openly accepting material value (ie bribes) from various parties. The district courts are being stuffed with Trump appointees with the obvious problems that go with that.
The congress is supine. Obviously they cannot act in any meaningful capacity.
We don’t have street level corruption today. But we’ve fired half the civil service, so I doubt that will continue.
safety1st 6 hours ago [-]
It's bad but I think it's important to recognize how much worse it can get. Otherwise why would you work to save anything? I'm "positive" because I come from the US and I now live in an actual banana republic and I see firsthand how much worse things will get in America if the trajectory doesn't change.
Imagine a future where election results are casually and publicly nullified if the people with the guns don't like the result, and no one can do anything about it. Or where you can start a business but if it succeeds and you don't have the right family name it'll be taken from you and you'll be stripped of all you own and possibly put in prison for a while. That's reality in some countries, the US is not there yet, but those are the stakes we're playing for here, and why change needs to happen.
JustExAWS 4 hours ago [-]
> "...and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap."
So exactly when was there “wealth equality” in the US? Are you glossing over that whole segregation, redlining, era of the US?
And America was built on slavery and genocide.
rapind 8 hours ago [-]
Welcome to the age of grift.
zizee 9 hours ago [-]
Has it been decided that training models is fair use? Has it been decided in all jurisdictions?
GabeIsko 13 hours ago [-]
Yes, but the cat is out of the bag now. Welcome to the era of every piece of creative work coming with an EULA that you cannot train on it. It will be like clearing samples.
Joel_Mckay 8 hours ago [-]
Many already did this years ago for game resources on iClone, Unity, and UE.
There are also a lot of usage rules that now make many games unfeasible.
We dug into the private markets seeking less Faustian terms, but found just as many legal submarines lying in wait... "AI" plagiarism-driven projects are just late to the party. =3
dtagames 13 hours ago [-]
This is excellent news because it means that folks who pay for printed books and scan them can also train with their content. It's been said that we've already trained on "the entire (public) internet." Printed books still hold a wealth of knowledge that could be useful in training models. And cheap, otherwise unwanted copies make great fodder for "destructive" scanning, where you cut the spine off and feed the pages to a scanner. There are online services that offer just that.
Nursie 10 hours ago [-]
> pirating of the books is the issue
I have an author friend who felt like this was just adding insult to injury.
So not only had his work been consumed into this machine that is being used to threaten his day job as a court reporter, not only was that done without seeking his permission in any way, but they didn’t even pay for a single copy.
Really embodies raising your middle finger to the little guy while you steamroll him.
Lu2025 9 hours ago [-]
Exactly this. It's only us peons who will be prosecuted under the current copyright laws. The rich and well connected will base their entire business on blatant theft and will get away with it.
therobots927 15 hours ago [-]
It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget about the cost of 1) finding, 2) purchasing, 3) processing, and 4) recycling that volume of books.
debugnik 15 hours ago [-]
I guess companies will pay for the cheapest copies for liability and then use the pirated dumps. Or just pretend that someone lent the books to them.
Onavo 15 hours ago [-]
> Chopping the spine off books and putting the pages in an automated scanner is not scalable.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
We hem and haw about metaphorical "book burning" so much we forget that books themselves are not actually precious.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
johnnyanmac 14 hours ago [-]
>we forget that books themselves are not actually precious.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
heavyset_go 6 hours ago [-]
The real power comes from the purging of knowledge from institutions that can keep that knowledge alive. Facts, ideas and histories can all be incinerated.
wizzwizz4 13 hours ago [-]
Well, the famous 1933-05-10 book burning did destroy the only copies of a lot of LGBT medical research, and destroying the last copy of various works was a stated intent of Nazi book burnings.
knome 15 hours ago [-]
I remember them having a 3D page unwarping tech they built as well so they could photograph rare and antique books without hacking them apart.
dekhn 12 hours ago [-]
I don't think Google Books scanner chopped off the spine.
https://linearbookscanner.org/ is the open design they released.
therobots927 15 hours ago [-]
Oh I didn't know that. That's wild
skywhopper 10 hours ago [-]
Then shouldn’t they be liable for at least 25 times this amount?
Onavo 15 hours ago [-]
Wdym Rainbows End was prescient?
ceejayoz 15 hours ago [-]
There's a scene early on where libraries are being destructively shredded, with the shreds scanned and reconstructed as digital versions.
wmf 15 hours ago [-]
Paying $3,000 for pirating a ~$30 book seems disproportionate.
vineyardmike 14 hours ago [-]
I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.
It's crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it's lucky that the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I'm pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that settling below the full legal limit would limit future revenue.
imron 14 hours ago [-]
> handsomely profiting
Well actively generating revenue at least.
Profits are still hard to come by.
griffzhowl 13 hours ago [-]
Operating profits, certainly, but if you include investments, the big players are raking it in, aren't they?
__loam 11 hours ago [-]
Investment is debt lol. Maybe you can make the argument that you're increasing the equity value but you do have to eventually prove you're able to make money right? Maybe you don't, this system is pretty messed up after all.
griffzhowl 2 hours ago [-]
As long as you have more money coming in than your costs, then it's technically a profit even if that money comes from investments.
It's not the same as debt from a loan, because people are buying a percentage stake in the company. If the value of the company happens to go to zero there's nothing left to pay.
But yeah, the amount of investment a company attracts should have something to do with the perception that it'll operate at a profit at some point
waynesonfire 11 hours ago [-]
What a fascinating software project someone had the opportunity to work on.
soks86 14 hours ago [-]
Not if 100 companies did it and they all got away.
This is to teach a lesson because you cannot prosecute all thieves.
The Yale Law Journal actually writes about this: the goal is to deter crime, because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
vlovich123 14 hours ago [-]
If in most cases damages cannot be recovered or the criminal will never be caught in the first place, then what is the lesson being taught? Doesn't that just create a moral hazard where you "randomly" choose who to penalize?
jdkee 14 hours ago [-]
It's about sending a message.
vlovich123 14 hours ago [-]
The message being you’ll likely get away with it?
npunt 13 hours ago [-]
They're setting up a pretty simple EV calc:
(probability of getting caught: 0.01) × (cost if caught: 1000× the gain) = an expected cost of 10× the gain, i.e. not worth it
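A minimal sketch of that arithmetic in Python (the numbers are the illustrative ones above, not figures from the case, and the function name is mine):

    def expected_penalty(p_caught, fine_multiple):
        # Expected cost of infringing, as a multiple of the gain from it.
        return p_caught * fine_multiple

    p_caught = 0.01       # assume a 1% chance of getting caught
    fine_multiple = 1000  # fine is 1000x the gain if you are caught

    ev = expected_penalty(p_caught, fine_multiple)
    print(f"expected cost = {ev:.0f}x the gain")  # 10x, so not worth it

The deterrence argument is just that the product on the last line has to come out greater than 1x.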
wizzwizz4 13 hours ago [-]
The EV calculation completely goes away if you add a layer of limited-liability corporations.
robterrell 14 hours ago [-]
With the per-item limit for "willful infringement" being $150,000, it's a bargain.
gpm 14 hours ago [-]
And a low end of $750/item.
coryrc 11 hours ago [-]
Were you not around when people were getting sued for running Napster?
johnnyanmac 14 hours ago [-]
Fines should be disproportionate at this scale, so that they discourage other businesses from doing the same thing.
Aeolun 12 hours ago [-]
So they’re creating monopolies? The existing players were allowed to do it, but anyone that tries to do it now will be hit with a 1.5B fine?
_Algernon_ 13 hours ago [-]
As long as they haven't been bullied into the corporate equivalent of suicide by the "justice" system, it's not disproportionate considering what happened to Aaron Swartz.
If anything it's too little, based on precedent.
IncreasePosts 14 hours ago [-]
Realistically it will be $30 per book and $2,970 for the lawyers
gpm 14 hours ago [-]
That's not how class actions work. Ever.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k per plaintiff) for the lead plaintiffs, also subject to the court's approval.
ks2048 11 hours ago [-]
25% of 1.5B?
freejazz 14 hours ago [-]
Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.
shortformblog 14 hours ago [-]
Thanks for the reminder that what the Internet Archive did in its case would have been legal if it was in service of an LLM.
yoyohello13 13 hours ago [-]
Many things become legal when the perpetrator has money.
mountainriver 13 hours ago [-]
The golden rule:
He who has the gold makes the rules
1gn15 10 hours ago [-]
This is a good soundbite but doesn't make sense. The Internet Archive had to pay for redistributing copyrighted materials. Anthropic just paid too. (Note: redistributing != training)
kennywinker 14 hours ago [-]
LLM’s are turning out to be a real get-out-of-legal-responsibilities card, hey?
zer00eyz 14 hours ago [-]
> It’s important in the fair use assessment to understand that the training itself is fair use,
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is that Shakespeare, or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that.
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that it can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
Imustaskforhelp 14 hours ago [-]
Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.
So things get even darker, because "distribution" can have a really vague definition, and maybe the AI companies will only follow the law just barely, for the sake of not getting hit with a lawsuit like this again. I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use and, essentially, exploiting authors.
I feel like they will try to use as much legalspeak as possible to extract as much from authors (legally) as they can without compensating them, which I find unethical, but sadly the law doesn't work on ethics.
sureglymop 13 hours ago [-]
Switzerland has five main collecting societies: ProLitteris for literature and visual arts, the SSA (Société Suisse des Auteurs) for dramatic works, the SUISA for music, Suissimage for audiovisual works, and SWISSPERFORM for related rights like those of performers and broadcasters. These non-profit societies manage copyright and related rights on behalf of their members, collecting and distributing royalties from users of their works.
Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.
What distribution means in this case is defined in the swiss law. However swiss law as a whole is in some ways vague, to leave a lot up to interpretation by the judiciary.
p_ing 13 hours ago [-]
> compensate the authors this one time.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
arcticfox 14 hours ago [-]
> And what about all the other stuff that LLM's spit out? Who owns that. Well at present, no one. If you train a monkey or an elephant to paint, you cant copyright that work because they aren't human, and neither is an LLM.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model:example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
kennywinker 14 hours ago [-]
The example is a real legal case afaik, or perhaps paraphrased from one (don’t think it was a monkey - an ape? An elephant?).
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
gpm 12 hours ago [-]
I think you're thinking of this case [1]; it was a monkey, and it wasn't a painting but a selfie. A painting would only have made the no-copyright argument stronger.
I don't think the code you get from rails generate is yours, either. Certainly not by way of copyright, which protects original works of authorship: if it's not original, it's not copyrightable, and yes, it's been decided in US courts that non-human authorship doesn't count as creative.
zer00eyz 13 hours ago [-]
> courts are generally far more common sense than that in applying the law.
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long standing sticky legal meaning (as opposed to things that were "property").
zmmmmm 10 hours ago [-]
The question is going to be how much human intellectual input there was I think. I don't think it will take much - you can write the crappiest novel on earth that is complete random drivel and you still have copyright on it.
So to me, if you are doing literally any human review, edits, or control over the AI, then I think you'll retain copyright. There may be a risk that if somebody can show they could produce exactly the same thing from a generic prompt with no interaction, then you may be in trouble, but let's face it, should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
simoncion 6 hours ago [-]
> If you take all the works of Shakespeare, and reduce it to tokens and vectors is it Shakespeare or is it factual information about Shakespeare?
To rephrase the question:
Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?
Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
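To make that rephrasing concrete, here is a toy sketch (a whitespace "tokenizer" invented purely for illustration, nothing like a production scheme): mapping text to token IDs is a reversible encoding, so the IDs still carry the original work, not merely facts about it.

    text = "Shall I compare thee to a summer's day?"

    # Build a vocabulary from the unique words in the text itself.
    vocab = {w: i for i, w in enumerate(dict.fromkeys(text.split()))}
    inverse = {i: w for w, i in vocab.items()}

    token_ids = [vocab[w] for w in text.split()]       # "vectorize"
    decoded = " ".join(inverse[i] for i in token_ids)  # round-trip back

    assert decoded == text  # nothing was lost in the re-encoding

Whether model weights trained on those IDs are likewise "the work" is the genuinely contested question; the encoding step itself plainly preserves it.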
tomrod 14 hours ago [-]
I mean, sort of. The issue is that the compression is novel, so anything post-tokenization could arguably be considered value-add and not necessarily a derivative work.
nicce 15 hours ago [-]
I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.
greensoap 15 hours ago [-]
In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.
reassess_blind 14 hours ago [-]
So how did they profit off the pirated books?
greensoap 11 hours ago [-]
According to the judge, they didn't. The judge said they stored those books in a general purpose library for future use just in case they decided to use them later. It appears the judge took much issue with the downloading of "pirated content." And Anthropic decided to settle rather than let it all play out more.
nicce 9 hours ago [-]
But how was the settlement amount determined, if nobody read those books and there was no financial loss?
nicce 15 hours ago [-]
That is something which is extremely difficult to prove from either side.
It is 500,000 books in total so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
greensoap 11 hours ago [-]
The 500,000 number is the number of books that are part of the settlement. If they downloaded all of LibGen and the other sources, it was more like >7 million. But it is a lot of work to determine which books can legitimately be part of the lawsuit. For example, if any of the books in the download weren't copyrighted (think self-published), or weren't protected under US copyright law (maybe a book only published in Venezuela), or it isn't clear who owns the copyright, then that copyright owner cannot be part of the class. So it seems like the 500,000 number is basically the smaller set of books for which the lawyers for the plaintiffs felt they could most easily prove standing.
florbnit 14 hours ago [-]
> Buying used copies of books, scanning them, and training on it is fine.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research.
Training AI models for purposes other than purely academic fits into none of these.
bigmadshoe 14 hours ago [-]
Buying used copies of books, scanning them, training an employee with the scans: fair use.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
bink 14 hours ago [-]
> Buying used copies of books, scanning them, training an employee with the scans: fair use.
Does this still hold true if multiple employees are "trained" from scanned copies at the same time?
bigmadshoe 13 hours ago [-]
Simultaneously I guess that would violate copyright, which is an interesting point. Maybe there's a case to be made there with model training.
Regardless, the issue could be resolved by buying as many copies as you have concurrent model training instances. It isn't really an issue with training on copyrighted work, just a matter of how you do so.
mdp2021 14 hours ago [-]
It fits the most basic fair use: reading them. Current "training" can be considered a gross form of reading.
derektank 14 hours ago [-]
The purpose and character of AI models is transformative, and the effect of the model on the copyrighted works used in the model is largely negligible. That's what makes the use of copyrighted works in creating them fair use.
GodelNumbering 15 hours ago [-]
Settlement Terms (from the case pdf)
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work (see the sketch after this list for the arithmetic).
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
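A quick sanity check of the arithmetic in item 1 (figures are from the summary above; the function name is mine):

    def fund_size(num_works, base_works=500_000,
                  base_fund=1.5e9, per_extra=3_000):
        # Minimum fund: $1.5B for the first 500k works, $3k per work beyond.
        extra = max(0, num_works - base_works)
        return base_fund + extra * per_extra

    print(fund_size(500_000) / 500_000)  # 3000.0 gross per work
    print(fund_size(600_000))            # 1.8e9 if 100k more works qualify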
privatelypublic 15 hours ago [-]
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle at this point if you think you'll lose.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.
terminalshort 14 minutes ago [-]
There is already a mountain of legal precedent that you can't just download copyrighted work. That's what this lawsuit is about. Just because one of the parties is Anthropic doesn't mean this is some new AI thing.
A full case is many more years of suits and appeals with high risks, so it's natural to settle, which obviously means no precedent.
koolala 9 hours ago [-]
Won't Facebook just get sued for the same thing now, and maybe that will set precedent?
hypercube33 6 hours ago [-]
I thought Meta had already been sued for this and effectively forgiven; they did it to make money and faced no charge.
lukan 5 hours ago [-]
This is what is confusing me here. I did not really follow any case, but as far as I remember Meta seems to have gotten away with pirating books, while Anthropic needs to pay $1.5B?
ignoramous 14 hours ago [-]
Or, if you think your competition, also caught up in the same quagmire, stands to lose more by battling for longer than you did?
privatelypublic 14 hours ago [-]
A valid touché! I still think Google went with delaying tactics as public and other pressures forced Apple's case forward at greater velocity. (Edit: implicit "and then caved when Apple lost"... because they're the same case.)
rendaw 6 hours ago [-]
So they can also keep models trained on the datasets? That seems pretty big too, unless the half-life of models is so short it doesn't matter.
gooosle 14 hours ago [-]
So... it would be a lot cheaper to just buy all of the books?
gpm 14 hours ago [-]
Yes, much.
And they actually went and did that afterwards. They just pirated them first.
rise_before_sun 12 hours ago [-]
Where can I find source that says Anthropic bought the pirated books afterwards? I haven't seen this in any official document.
Also, do we know if the newer models were trained without the pirated books?
gpm 12 hours ago [-]
> Where can I find source that says Anthropic bought the pirated books afterwards? I haven't seen this in any official document.
> Also, do we know if the newer models were trained without the pirated books?
I'm pretty sure we do but I couldn't swear to it or quickly locate a source.
rise_before_sun 10 hours ago [-]
Thanks for the link.
Among several places where judge mentions Anthropic buying legit copies of books it pirated, probably this sentence is most relevant: "That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages."
But document does not say Anthropic bought EVERY book it pirated. Other sections in the document also don't explicitly say that EVERY pirated book was later purchased.
I stopped using Claude when this case came to light. If the newer Claude models don't use pirated books, I can resume using it.
When you say, "I'm pretty sure we do...", do you mean that pirated books were used, or were they not used?
gpm 10 hours ago [-]
> But document does not say Anthropic bought EVERY book it pirated
Yeah, I wouldn't make this exact claim either. For instance it's probably safe to assume that the pirate datasets contain some books that are out of circulation and which Anthropic happened not to get a used copy of.
They did happen to get every book published by any of the lead plaintiffs though, as a point towards them probably having pretty good coverage. And it does seem to have been an attempt to purchase "all" the books for reasonable approximate definitions of "all".
> When you say, "I'm pretty sure we do...", do you mean that pirated books were used, or were they not used?
I'm pretty sure pirated books were not used, but not certain, and I really don't remember when/why I formed that opinion.
dude250711 13 hours ago [-]
What is the HN term for this? "Bootstrapping" your startup? Or is it "growth-hacking" it?
chrisvenum 12 hours ago [-]
Bookstrapping
gpm 13 hours ago [-]
The latter (I know you're joking, but...)
Bootstrapping in the startup world refers to starting a startup using only personal resources instead of using investors. Anthropic definitely had investors.
eviks 6 hours ago [-]
That might be practically impossible given the number of rights holders worldwide
privatelypublic 14 hours ago [-]
The permission to buy and scan them was already settled by Google Books in the '00s.
_alternator_ 14 hours ago [-]
They did, but only after they pirated the books to begin with.
privatelypublic 14 hours ago [-]
Phew. This settlement potentially weakens all challenges to the use of copyrighted works in training LLMs. I'd be shocked if behind closed doors there wasn't some give and take on the matter between executives/investors.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times-affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
freejazz 14 hours ago [-]
That's not even remotely true. Page 4 of the settlement describes released claims which only relate to the pirating of books. Again, the amount of misinformation and misunderstanding I see in copyright related threads here ASTOUNDS.
privatelypublic 14 hours ago [-]
Did you miss the "also"? How about "adjacent"? I won't pretend to understand the legal minutiae, but reading the settlement doesn't mean you do either.
In my experience and training at a fintech corp: accepting a settlement in any suit weakens your defense, but prevents a judgment and future claims for the same claims from the same claimants (a la double jeopardy). So, again, at minimum this prevents an actual judgment. Which would likely be positive for the NYT (and adjacent) cases.
manbash 15 hours ago [-]
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!
Indeed, it is not only the payout, but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
pier25 13 hours ago [-]
Only 500,000 copyrighted works?
I was under the impression they had downloaded millions of books.
KittenInABox 9 hours ago [-]
Individual authors had to join the class action lawsuit, sadly. They were not all automatically registered for each violation.
testing22321 14 hours ago [-]
I’m an author, can I get in on this?
A_D_E_P_T 14 hours ago [-]
I had the same question.
It looks like you'll be able to search this site if the settlement is approved:
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
__mharrison__ 4 hours ago [-]
I'm an author. Can I get anthropic stock instead?
mbrochh 2 hours ago [-]
After their recent change in tune to retain data for longer and to train on our data, I deleted my account.
Try to do that. There is no easy way to delete your account. You need to reach out to their support via email. Incredibly obnoxious dark pattern. I hate OpenAI, but everything with Anthropic also smells fishy.
We need more and better players. I hope that xAI will give them all some good competition, but I have my doubts.
Taek 12 hours ago [-]
I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
0dte 6 hours ago [-]
I think western companies will be just fine -- Anthropic is settling because they illegally pirated books from LibGen back in 2021 and subsequently trained models on them. They realized this was an issue internally and pivoted to buying books en masse and scanning them into digital formats, destroying the original copies in the process (they actually hired a former lead in the Google Books project to help them in this endeavor!). And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does constitute fair use -- that the LLM training process is sufficiently transformative.
So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
okanat 11 hours ago [-]
The West can end the endless pain and legal hurdles to innovation by limiting copyright. They can do it if there is the will to open the gates of information to everyone. The duration of 70 years after the death of the author, or 90 years for companies, is excessively long. It should be ~25 years. For software it should be 10 years.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Their people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
redthrow 7 hours ago [-]
The vast majority of books don't generate any profits past the first few years, so I prefer Lawrence Lessig's proposal of copyright renewal at five-year intervals with a fee. Under this scheme, most books would enter the public domain after five years
Lessig: Not for this length of time, no. Copyright shouldn’t be anywhere close to what it is right now. In my book I proposed a system where you’d have to renew after every five years and you get a maximum term of 75 years. I thought that was pretty radical at the time. The Economist, after the Eldred decision, came out with a proposal—let’s go back to 14 years, renewable to 28 years. Nobody needs more than 14 years to earn the return back from whatever they produced.
mark_l_watson 20 minutes ago [-]
Lessig’s proposal is excellent. A long time ago I wrote 10 books for publishers like McGraw-Hill, J Wiley, Springer-Verlag, etc.
For many reasons I switched to writing using a Creative Commons license using Lulu, LeanPub, and my own web site for distribution. This has been a win for me economically, it feels good to add to the commons, and it is fun.
TimByte 34 minutes ago [-]
It's naive to think Chinese models have a free pass. Local censorship, language/data biases, and export restrictions cut both ways.
mark_l_watson 31 minutes ago [-]
True enough, but training on synthetic data now seems to be pushing SOTA.
aucisson_masque 5 hours ago [-]
This isn't a race to the bottom. They could have bought these books instead of pirating them.
doctorpangloss 8 hours ago [-]
But most marginal training of Anthropic, OpenAI, and Google models is done on LLM-paraphrased user data from those platforms. That user data is proprietary and obviously way more valuable than random books.
gordian-mind 1 hours ago [-]
After the book publishers burned Google Books' Library of Alexandria, they are now making it impossible to train an LLM unless you engage in the medieval process of manually buying paper copies of works just to scan and destroy them...
GMoromisato 14 hours ago [-]
If you are an author here are a couple of relevant links:
Wild - I searched my name out of curiosity and my PhD research papers turned up. Worth submitting my contact details I guess
tomsmeding 2 hours ago [-]
That may depend on whether there is copyright on that work.
burningion 10 hours ago [-]
Thank you for posting this!
I suspected my work was in the dataset and it looks like it is! I reached out via the form.
GMoromisato 9 hours ago [-]
Good luck! Hope you get a payout!
doctorpangloss 8 hours ago [-]
It's pretty incredible that the vast majority of authors will make more money from their books from this settlement than they ever have from selling their books.
miohtama 6 hours ago [-]
God bless capitalistic America.
arjunchint 15 hours ago [-]
Wait so they raised all that money just to give it to publishers?
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
Wowfunhappy 15 hours ago [-]
From the article:
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
slg 15 hours ago [-]
Maybe small compared to the money raised, but it is in fact enormous compared to the money earned. Their revenue was under $1b last year and they projected themselves as likely to make $2b this year. This payout equals their average yearly revenue of the last two years.
masterjack 14 hours ago [-]
I thought they were projecting 10B and said a few months ago they have already grown from a 1B to 4B run rate
slg 13 hours ago [-]
Here is an article that discusses why those numbers are misleading[1]. From a high level, "run rate" numbers are typically taking a monthly revenue number and multiplying it by 12 and that just isn't an accurate way to report revenue for reasons outlined in that article. When it comes to actual projections for annual revenue, they have said $2b is the most likely outcome for their 2025 annual revenue.
It doesn't matter if they end up in Chapter 11... if it kneecaps all the other copyright lawsuits. I won't pretend to know the exact legal details. But I am (unfortunately) old enough that this isn't my first "giant corporation benefits from legally and ethically dubious copyright-adjacent activities, gets sued, settles/wins." (Cough, Google Books.)
utyop22 12 hours ago [-]
Personally I believe that in the ideal scenario (for the fed govt.) these firms will develop the tech. The fed will then turn around and want those lawsuits to win, effectively gutting the firms financially and putting the tech in the hands of the public sector.
You never know; it's a game of interests and incentives. One thing is for sure: does the fed want the private sector to own and control a technology of this kind? Nope.
stingraycharles 11 hours ago [-]
But what are the profits? $1.5B is a huge amount no matter what, especially if you're committing to destroying the datasets as well. That implies you basically paid $1.5B for a few years of additional training data, a huge price.
dkdcio 15 hours ago [-]
maybe I’m bad at math but paying >5% of your capital raised for a single fine doesn’t seem great from a business perspective
arrty88 13 hours ago [-]
If they are going to be making billions in net income every year going forward, for as many years as analysts can make projections, and using these works allowed them to get to market faster and gain an advantage over competitors, then it is quite great from a business perspective.
ryao 14 hours ago [-]
If it allowed them to move faster than their competition, I imagine management would consider it money well spent. They are expected to spend absurd amounts of money to get ahead. They were never expected to spend money efficiently if that meant taking additional months/years to get results.
carstenhag 14 hours ago [-]
Someone here commented saying they claimed they did not even use it for training, so apparently it was useless.
siliconpotato 15 hours ago [-]
It's VC money, I don't think anyone believes it's real money
Aachen 14 hours ago [-]
If it weren't, why are we taking it as legal tender? I certainly wouldn't mind being paid in VC money
bongodongobob 15 hours ago [-]
Yeah it does, cost of materials is way more than that if they were building something physical like a new widget or something. Same idea, they paid for their raw materials.
xnx 14 hours ago [-]
The money they don't pay out in settlements goes to Nvidia.
non_aligned 15 hours ago [-]
You're joking, but that's actually a good pitch. There was a significant legal issue hanging over their heads, with some risk of a potentially business-ending judgment down the line. This makes it go away, which makes the company a safer, more valuable investment. Both in absolute terms and compared to peers who didn't settle.
freejazz 15 hours ago [-]
It just resolves their liability with regards to books they purported they did not even train the models on, which is all that was left in this case after summary judgment. Sure the potential liability was company ending, but it's all a stupid business decision when it is ultimately for books they did not even train on.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
jongjong 13 hours ago [-]
Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
danans 13 hours ago [-]
> Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
Yes, and FWIW that's very succinctly stated.
utyop22 12 hours ago [-]
Sort of.
Some individuals in society find a way through that and figure out a way to strategically achieve their goals. Rare though.
freejazz 15 hours ago [-]
They wanted to move fast and break things. No one made them.
mNovak 10 hours ago [-]
Everything talks about settlement to the 'authors'; is that meant to be shorthand for copyright holders? Because there are a lot of academic works in that library where the publisher holds exclusive copyright and the author holds nothing.
By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.
93po 10 hours ago [-]
Very unsurprisingly, the New York Times is going to frame this as a win for "the little guy" when in reality it's just multi-billion-dollar publishers, with a long, rich history of their own exploitative practices, hanging on for dear life against generative AI.
petralithic 15 hours ago [-]
This is sad for open source AI, piracy for the purpose of model training should also be fair use because otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so. There is no way to buy billions of books just for model training, it simply can't happen.
bcrosby95 14 hours ago [-]
Fair use isn't about how you access the material; it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
petralithic 13 hours ago [-]
Hence, "should"
btmiller 11 hours ago [-]
It's the sign of a healthy economy when we respect the creation of content.
scotty79 10 hours ago [-]
It's a sign of rent seeking economy in decline. Rising economies never respect IPs.
dbalatero 15 hours ago [-]
This implies training models is some sort of right.
542458 15 hours ago [-]
No, it implies that having the power to train AI models exclusively consolidated into a handful of extremely powerful companies is bad.
johnnyanmac 13 hours ago [-]
It implies that people want everyone to do this when it's clear no one should do it. I'm not exactly a fan of "stealing isn't profitable for small businesses, so we should make it so everyone can steal."
petralithic 13 hours ago [-]
Piracy is not stealing. I don't know why everyone on HN suddenly turned into a copyright hawk, only big companies benefit from our current copyright regime, like Disney and their lobbying for increasing its length.
Aurornis 12 hours ago [-]
> only big companies benefit from our current copyright regime
You’ve never authored, created, or published something? Never worked for a company that sells something protected by copyright?
petralithic 11 hours ago [-]
All my works are open source or in the public domain. I don't like copyright for a reason.
okanat 11 hours ago [-]
> Never worked for a company that sells something protected by copyright?
I.e., never created software in exchange for money.
heavyset_go 6 hours ago [-]
It isn't black and white, you can be against some aspects of copyright and be for some others.
tene80i 6 hours ago [-]
It’s very funny when people declare “piracy isn’t stealing”, as if the metaphor of piracy is all about singing and drinking.
Copying and distributing works isn’t identical to theft (deliberately depriving someone of their property), but you’re enjoying someone’s work without compensating them, so it isn’t totally unlike depriving them of something.
I guess it depends how you feel about refusing to pay a window washer. Or indeed you not being paid by your employer. It isn’t theft, but someone is clearly stiffing someone else.
As for only big companies benefitting from the copyright regime… seems like an ideological assumption. I know plenty of authors and they are quite happy having legal protections around their work which means they can earn from their labour.
mdp2021 1 hours ago [-]
> you’re enjoying someone’s work without compensating them
Which is something society has repeatedly decided to allow: libraries (again and again).
> refusing to pay a window washer
The window washer is providing a service for a price, that service is not equivalent to knowledge production, and nobody has decided that that service (cleaning windows) should be done for free.
alchemist1e9 10 hours ago [-]
It's shifts like that, going from a copyleft to a copyright crowd, that make me increasingly suspicious that HN is still as authentic as it was years ago. Another weird one is the lean towards socialism instead of the more libertarian ideals of many past commentators. I think it might be generational; being 50+ years old makes me an old-timer!
stingraycharles 11 hours ago [-]
Isn’t that already the case, with the capacity required to train these models?
JoshTriplett 14 hours ago [-]
That's true. Those handful of companies shouldn't get to do it either.
vessenes 5 hours ago [-]
Curious: are you trolling? Or do you really think doing math shouldn’t be a priori allowed to humans?
johanyc 15 hours ago [-]
No. It means model training is transformative enough to be fair use. They should just be made to pay the authors back, plus a penalty: say, 10x the price of the pirated books.
Aurornis 12 hours ago [-]
This is a settlement. It does not set a precedent nor even admit to wrongdoing.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well-funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM-training people are less concerned about the legality of their source training set than the big companies are anyway.
sefrost 15 hours ago [-]
I wonder how much it would cost to buy every book that you'd want to train a model.
GMoromisato 14 hours ago [-]
500,000 x $20 = $10 million
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total? Plus, of course, the time it would take to execute.
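As a back-of-the-envelope check (the $20 book price is from the estimate above; the per-book scanning/handling figure is an assumption chosen to make the $20 million total work out):

    num_books = 500_000
    price_per_book = 20.0     # used copy, per the estimate above
    scan_and_handling = 20.0  # assumed per-book logistics cost

    floor = num_books * price_per_book
    total = num_books * (price_per_book + scan_and_handling)

    print(f"purchase floor: ${floor / 1e6:.0f}M")  # $10M
    print(f"with scanning:  ${total / 1e6:.0f}M")  # $20M, vs the $1.5B settlement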
heavyset_go 6 hours ago [-]
I don't know if I agree with it, but you could argue that if a model was built for purely academic purposes, and then used for purely academic purposes, it could meet requirements for fair use.
rwmj 5 hours ago [-]
(Half joking but) I wonder if musicians need to worry if they learned to play by listening to cassette mixtapes.
t0lo 12 hours ago [-]
I wish the HN rules were more flexible, because I would write the best comment to you right now.
TimByte 1 hours ago [-]
It doesn't set precedent, but the message to other AI companies is clear: if you're going to bet your model on gray-area data, have a few billion handy for settlements
MaxikCZ 15 hours ago [-]
See kids? Its okay to steal if you steal more money than the fine costs.
ascorbic 14 hours ago [-]
They're paying $3000 per book. It would've been a lot cheaper to buy the books (which is what they actually did end up doing too).
bn-l 3 hours ago [-]
That’s not enough punitive damages for the crime (which is egregious and absolutely deplorable).
ascorbic 2 hours ago [-]
I wonder how many of the people saying this about copyright infringement today were complaining about the ridiculously harsh enforcement of it 10-15 years ago. They've just paid $1.5B for torrenting.
ajross 14 hours ago [-]
That metaphor doesn't really work. It's a settlement, not a punishment, and this is payment, not a fine. Legally it's more like "The store wasn't open, so I took the items from the lot and paid them later".
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
ycombinatrix 1 hours ago [-]
How did Meta get away without a scratch?
pluc 39 minutes ago [-]
Illegal with a fine is legal with a fee.
TimByte 33 minutes ago [-]
Silicon Valley's unofficial motto for the last 15 years
arendtio 2 hours ago [-]
OT: Is anybody else seeing that Datadome is blocking their IP?
I haven't had this in a while, but I always hate it when I'm blocked by Cloudflare/Datadome/etc.
on_meds 16 hours ago [-]
It will be interesting to see how this impacts the lawsuits against OpenAI, Meta, and Microsoft. Will they quickly try to settle for billions as well?
It’s not precedent setting but surely it’ll have an impact.
lewdwig 15 hours ago [-]
I'm sure this'll be misreported and wilfully misinterpreted because of the current fractious state of the AI discourse. But the lawsuit was about piracy, not the copyright compliance of LLMs, and in any case they settled out of court, thus presumably admitting no wrongdoing, so conveniently no legal precedent is established either way.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
nerevarthelame 15 hours ago [-]
Anthropic certainly seems to be hoping that their competitors will have to face some consequences too:
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
typs 15 hours ago [-]
Maybe, though this lawsuit is different with respect to the piracy issue. Anthropic is paying the settlement because they pirated the books, not because training on copyrighted books isn't fair use, which isn't necessarily the fact pattern in the other cases.
That was my first thought. While not legal precedent, it does sort of open the floodgates for others.
r_lee 15 hours ago [-]
One thing that comes to mind is...
Is there a way to make your content on the web "licensed" in a way where it is only free for human consumption?
I.e., effectively making the use of AI crawlers piracy, and thus subject to the same kind of penalties as here?
gpm 14 hours ago [-]
Yes to the first part. Put your site behind a login wall that requires users to sign a contract to that effect before serving them the content... get a lawyer to write that contract. Don't rely on copyright.
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
Aurornis 12 hours ago [-]
Contracts generally require an exchange of consideration (something of value, like money).
If you put a “contract” on your website that users click through without paying you or exchanging value with you and then you try to collect damages from them according to your contract, it’s not going to get you anywhere.
gpm 12 hours ago [-]
The consideration the viewer received was access to your private documents.
The consideration you received was a promise to refrain from using those documents to train AI.
I'm not a lawyer, but by my understanding of contract law consideration is trivially fulfilled here.
Wowfunhappy 14 hours ago [-]
I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
That curl script you use to automate some task could become infringing.
johnnyanmac 14 hours ago [-]
>I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
At this point, we do need some laws regulating excessive scraping. We can't have the internet grind to a halt over everyone trying to drain it of information.
1gn15 9 hours ago [-]
The GP was talking about web scraping, not "excessive web scraping". It's an important difference.
7952 14 hours ago [-]
Maybe some kind of captcha-like system could be devised that would be considered a security measure under the DMCA and not allowed to be circumvented. Make the same content available under a licence fee through an API.
1gn15 9 hours ago [-]
DMCA is a US thing, and people in other countries don't have to follow it.
Cheer2171 14 hours ago [-]
No. Neither legally nor technically possible.
oblio 10 hours ago [-]
Ummm.. terms and conditions?
shadowgovt 14 hours ago [-]
I'm sure one can try, but copyright has all kinds of oddities and carve-outs that make this complicated. IANAL, but I'm fairly certain that, for example, if you tried putting in your content license "Free for all uses public and private, except academia, screw that ivory tower..." that's a sentiment you can express but universities are under no obligation legally to respect your wish to not have your work included in a course presentation on "wild things people put in licenses." Similarly, since the court has found that training an LLM on works is transformative, a license that says "You may use this for other things but not to train an LLM" couldn't be any more enforceable than a musician saying "You may listen to my work as a whole unit but God help you if I find out you sampled it into any of that awful 'rap music' I keep hearing about..."
The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
surfingdino 20 minutes ago [-]
This will be paid to rights holders, not authors. Published authors sign away the rights to financial exploitation of their books under the terms of the contracts they are offered. I expect some authors to sue publishers in turn. This has happened before, when authors realised that they were not getting paid royalties on sales of ebooks.
novok 15 hours ago [-]
I wonder who will be the first country to make an exception to copyright law for model-training libraries, to attract tax revenue like Ireland did for tech companies in the EU. Japan is part of the way there, but you couldn't do a Common Crawl type thing. You could even make it a Library of Congress type of setup.
1gn15 9 hours ago [-]
This is already a thing in several places.
EU has copyright exemptions for AI training. You don't need to respect opt outs if you are doing research.
South Korea and Japan have some exemptions too, I think?
Singapore has very strong copyright exemptions for AI training. You can completely ignore opt-outs legally, even if doing it commercially.
Just search up "TDM laws globally".
novok 7 hours ago [-]
So could they have Library Genesis on a local server, along with other pirate sources, and use that for training data then? That is the level I'm speaking of, much like Common Crawl and the Reddit archive.
tonfa 15 hours ago [-]
As long as you're not distributing, it's legal in Switzerland to download copyrighted material. (Switzerland was on the naughty US/MPAA list for a while, might still be)
Imustaskforhelp 14 hours ago [-]
Is it distribution, though, if someone trains a model in Switzerland by downloading copyrighted material, training the AI on it, and then distributing the model...
Or what if they don't even distribute the model, but rather distribute the outputs of the LLM (a closed-source LLM like Anthropic's)?
I am genuinely curious whether there is some gray area that might be exploited by AI companies, as I am pretty sure that they don't want to pay 1.5B dollars yet still want to exploit the works of authors. (Let's call a spade a spade.)
HDThoreaun 14 hours ago [-]
Using copyrighted material to train AI is a legal grey zone. The NYT vs. OpenAI case is litigating this. The Anthropic settlement here is about how the material was obtained. If OpenAI wins their case and Switzerland rules the same way, I don't think there would be a problem.
Imustaskforhelp 13 hours ago [-]
This might go down (I think) as one of the most influential court cases to happen, then.
We really are getting at some metaphysical / philosophical questions, and maybe we will one day arrive at a question that just can't be answered (I think this is pretty close, right?). Then AI companies would do things freely without being accountable, since sure, you could take it to the courts, but how would you come to a decision...?
Another question though
So let's say the NYT vs. OpenAI case is going on. In the meantime, while they are litigating, could OpenAI continue doing the same thing?
action404x 2 hours ago [-]
This settlement highlights the growing pains of the AI industry as it scales rapidly. While $1.5B is significant, it's a fraction of Anthropic's valuation and funding. It underscores the need for better governance in AI development to avoid future legal pitfalls. Interesting to see how this affects their competition with OpenAI.
unvritt 3 hours ago [-]
I think that one under-discussed effect of settlements like this is the additional tax on experimentation. The largest players can absorb a $1.5B hit or negotiate licensing at scale. Smaller labs and startups, which often drive breakthroughs, may not survive the compliance burden.
That could push the industry toward consolidation: fewer independent experiments, more centralized R&D inside big tech. I feel that this might slow the pace of unexpected innovations and increase dependence on incumbents.
This def. raises the question: how do we balance fair compensation for creators with keeping the door open for innovation?
zinglersen 3 hours ago [-]
"That could push the industry toward consolidation"
Based on history this is not a possibility but a certainty.
The larger players - who grew because of limited regulations - will start supporting stricter regulation and compliance structures in order to increase the barrier of entry with the excuse of "Oh we learned our lesson, you are right". The hypocrisy is crazy but it makes sense from a capitalistic perspective.
jsilence 3 hours ago [-]
That is part of the 'first mover advantage': sometimes operating and experimenting in grey zones before they become regulated.
The European and especially German approach of regulating pre-emptively might be more fair, but apparently it also stifles innovation, as we can observe. Almost no significant players from Europe and Germany.
KTaffer 14 hours ago [-]
This was a very tactical decision by Anthropic. They have just received Series F funding, and they can now afford to settle this lawsuit.
OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.
It will be a net win for Anthropic.
rvnx 4 hours ago [-]
It is a very good deal for them: they did not have to acquire the books, had them in a very convenient format (no digitization), saved tons of time (5+ years), and got access to rare books. And the LLM is not considered a derived work, when it actually clearly is one.
mhh__ 15 hours ago [-]
Maybe I would think differently if I were a book author, but I can't help thinking that this is ugly yet actually quite good for humanity in some perverse sense. I will never, ever read 99.9% of these books, presumably, but I will use Claude.
What about the neural networks already fed with those books?
In case the court chooses to protect the writers, those models should be deleted and retrained with all of this material removed.
swiftcoder 5 hours ago [-]
So this is a straight-up victory for Anthropic, right?
They pay out (relative) chump change as a penalty for explicitly pirating a bunch of ebooks, and in return they get a ruling that they can train on copyrighted works forever, for the purchase price of the book (not the price that would be needed to secure the rights!)
falconroar 5 hours ago [-]
I thought the opposite - they set a precedent indicating that reproduction of a copyrighted text by an LLM is infringement. If authors refuse to sell to them (via legal terms indicating LLMs aren't allowed), it's infringement. No?
I'd be curious to hear from a legal professional...
egypturnash 12 hours ago [-]
Wooo, I sure could use $3k right now and I've got something in the pirate libraries they scraped. Nice.
nromiun 10 hours ago [-]
What about OpenAI and Meta? Are they going to face similar lawsuits?
pbd 8 hours ago [-]
From a systems design perspective, $3,000 per book makes this approach completely unscalable compared to web scraping. It's like choosing between an O(n) and an O(n²) algorithm: legally compliant data acquisition has fundamentally different scaling characteristics than the 'move fast and break things' approach most labs took initially.
whimsicalism 8 hours ago [-]
more of a large difference in constant factor, like a galactic algorithm for data trawling
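To put rough numbers on that constant factor, a back-of-envelope sketch in Python (all unit costs are illustrative guesses, not figures from the article):

    # Every acquisition strategy here is linear in the number of works;
    # only the per-book constant differs. Unit costs are guesses.
    BOOKS = 500_000  # rough corpus size discussed elsewhere in this thread

    per_book_cost = {
        "torrent a shadow library": 0.01,   # roughly just bandwidth
        "buy used and scan": 30.00,         # purchase plus destructive scan
        "settle after pirating": 3_000.00,  # the per-work settlement figure
    }

    for method, cost in per_book_cost.items():
        print(f"{method:>25}: ${cost * BOOKS:>16,.2f} total")
        # ~$5K vs ~$15M vs $1.5B for the same corpus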
BoorishBears 7 hours ago [-]
I don't know if anyone has actually read the article or the ruling, but this is about pirating books.
Anthropic went back and bought->scanned->destroyed physical copies of them afterward... but they pirated them first, and that's what this settlement is about.
The judge also said:
> “The training use was a fair use,” he wrote. “The technology at issue was among the most transformative many of us will see in our lifetimes.”
So you don't need to pay $3,000 per book you train on unless you pirate them.
pbd 4 hours ago [-]
I agree. This is very gray, IMO. E.g., books in India have cheap EEE editions compared to the ones in the US/Europe. So they could pre-process the data in India and then compile it in the US. Does that save them from piracy rules and reduce cost as well?
BoorishBears 3 hours ago [-]
I mean relative to the cost of pre-training, books are going to be cheap even if you buy them in the US (as demonstrated by the fact Anthropic bought them after)
For post-training, other data sources (like human feedback and/or examples) are way more expensive than books
eviks 6 hours ago [-]
> the law allowed the company to train A.I. technologies using the books because this transformed them into something new.
Unless, of course, the transformation malfunctioned and you got the good old verbatim source, with many examples compiled in similar lawsuits.
0dte 6 hours ago [-]
This notably wasn't one of the allegations levied against Anthropic, as Claude was accompanied by software that filtered any infringing outputs. From the relevant opinion finding Anthropic's use of the books to be fair use:
> When each LLM was put into a public-facing version of Claude, it was complemented by other software that filtered user inputs to the LLM and filtered outputs from the LLM back to the user. As a result, Authors do not allege that any infringing copy of their works was or would ever be provided to users by the Claude service.
(from Bartz v. Anthropic in the Northern District of California)
jbeard4 12 hours ago [-]
"$3,000 per work" seems like an incredibly good deal to license a book.
whatever1 6 hours ago [-]
I don’t understand how training an LLM on a book and then selling its contents via subscriptions is fine but using a probabilistic OCR to read a book and then selling its contents is a crime that deserves jail time.
miohtama 5 hours ago [-]
It's not a crime. It is a civil lawsuit.
amelius 13 hours ago [-]
It's better to ask for forgiveness than for permission.
Taken right from the VC's handbook.
rand17 6 hours ago [-]
I wonder how many authors will see real money out of this (if any). The techbros prayed to the new king of America with the best currency they had, money, so the king may intervene, like he has many times.
robterrell 14 hours ago [-]
As a published author who had works in the training data, can I take my settlement payout in the form of Claude Code API credits?
TBH I'm just going to plow all that money back into Anthropic... might as well cut out the middleman.
gpm 14 hours ago [-]
I wonder if Anthropic's lawyers have enough of a sense of humor to take you up on that if you sent them an email asking...
zkmon 6 hours ago [-]
It's the concentration of power and monopolies driving this trend of ignoring fines and punishments. The fine system was not designed for these monstrous beasts. Legal codes were designed to deter the common man from wrongdoing. They did not anticipate technological superpowers doing winner-take-all in a highly connected world and growing beyond the control of law. Basically, it's the law of the jungle for these companies. Law and punishment are never going to have any effect on them, as long as they can grab enough market share and customer base. Same as any mafia.
We are entering a world filled with corporate mafia that is above the law (because the damage the law can inflict on it is insignificant). These mafias would grip the world by providing the essential services that make up the future world. The State would become much weaker, as policy makers can be bought by lobbying and punishments can be offset by VC funding.
It is all part of the playbook.
dataflow 12 hours ago [-]
How do legal penalties and settlements work internationally? Are entities in other countries somehow barred from filing similar suits with more penalties?
bigtones 10 hours ago [-]
So the article notes Anthropic states they never publicly released a frontier model that was trained on the downloaded copyrighted material. So were Claude 2 and 3 only trained on legally purchased and scanned books, or do they now use a different training system that does not rely on books at all?
Dilettante_ 3 hours ago [-]
I assumed they were literally just lying.
whimsicalism 8 hours ago [-]
it sounds like the former
motbus3 4 hours ago [-]
This shouldn't be allowed to be settled outside courts
atleastoptimal 3 hours ago [-]
Reminder that just recently, Anthropic raised a $13 billion Series F at a $183 billion post-money valuation.
In March, they were worth $61.5 billion
In six months they've created $120 billion in value. That's almost 700 million dollars per day. Avoiding being slowed down by even a few days is worth a billion dollar payout when you are on this trajectory. This lawsuit, and any lawsuit AI model companies are likely to get, will be a rounding error at the end of the fiscal year.
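The arithmetic checks out, taking six months as roughly 182 days:

    # Reported post-money valuations: $61.5B in March, $183B at the Series F.
    gain = 183e9 - 61.5e9
    print(f"${gain / 182 / 1e6:,.0f}M per day")  # ~$668M per day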
They know that superintelligent AI is far larger than money, and even so, the money they'll make on the way there is hefty enough for copyright law to not be an issue.
JohnMakin 9 hours ago [-]
You or I would go to jail.
fab13n 2 hours ago [-]
Smart move: now that they're an established player, with a few billions of investors' money to spend, they cement a precedent that stealing IP to train your models is a billion-dollar offense.
What a formidable moat against newcomers, definitely worth the price!
(Sorry, meta question: how do we get the "'Also' <link> <link>..." text inserted below the title and above the comment input on submissions? The text field on the "submit" page creates a user's post when the "url" field is also filled. I am missing something.)
markasoftware 15 hours ago [-]
They also agreed to destroy the pirated books. I wonder how large of a portion of their training data comes from these shadow libraries, and if AI labs in countries that have made it clear they won't enforce anti-piracy laws against AI companies will get a substantial advantage by continuing to use shadow libraries.
gpm 13 hours ago [-]
They already, prior to this lawsuit, prior to serving public models, replaced this data set with one they made by scanning purchased books. Destroying the data set they aren't even using should have approximately zero effect.
somanyphotons 15 hours ago [-]
Perhaps they'll quickly rent the whole contents of a few physical libraries and then scan them all
1-6 13 hours ago [-]
$3000 per work isn't a bad price for Anthropic. It seems insulting to the copyright holder.
sylware 2 hours ago [-]
This is exactly what could impede LLM training datasets in the Western world, which will mechanically lead to "richer" LLM training datasets in countries where IP law does not wall off that data from training.
But then, the countries with the freedom to add everything to the training dataset will have to distribute the weights for free in IP-walled countries (because the weights would be plainly 'illegal' and would be "blocked" over there, unless free as in free beer, I guess); basically, only the DeepSeek approach could work.
If powerful LLM hardware becomes somewhat affordable (look at Nvidia's massive push on LLM-specific hardware), "local" companies may run those 'foreign-trained' LLM models at reasonable speed, but "here".
wdb 11 hours ago [-]
I do not believe authors will see any of this money. I will change my mind when I see an email or check.
Luker88 3 hours ago [-]
5 days ago Gamers Nexus did a piece on Meta having the same problems, but resolving it differently:
Somehow excuses like "we torrented it, but we configured low seeding", "the temptation was too strong because there was money to be made", "we tried getting a license, but then ignored it", and more ludicrous excuses actually worked.
Internal Meta emails seemed to show people knowing it was a blatant breach of copyright, and yet Meta won the case.
I guess there are tiers of laws even between billionaire companies.
Aeolun 12 hours ago [-]
Why are they paying $3000 per book? Does anyone think these authors sell their books for that amount?
gpm 12 hours ago [-]
Copies of these books are for sale for much less than that - very very few books demand a price that high.
They're paying much more than the actual damages because US copyright law comes with statutory damages for infringement of registered works on top of actual damages, between $200 and $150,000 per work. And the two sides negotiated this as a fair settlement to reduce the risk of an unfavourable outcome.
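For scale, a quick comparison of the negotiated figure against that statutory range (the ~500,000-work corpus size is the number used elsewhere in this thread, not a figure from the settlement itself):

    works = 500_000   # approximate number of works, per this thread
    settlement = 1.5e9

    print(f"negotiated:      ${settlement / works:,.0f} per work")  # $3,000
    print(f"statutory floor: ${200 * works:,.0f} total")            # $100,000,000
    print(f"willful maximum: ${150_000 * works:,.0f} total")        # $75,000,000,000,000

Even a small fraction of the willful maximum dwarfs the settlement, which is why potential damages were described as running into the hundreds of billions.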
recursivecaveat 7 hours ago [-]
If you acquire something illegally of course the judgement against you has to be much higher than the legal price. Why would anyone purchase anything if the worst thing that could happen to you for stealing it was just paying the retail price?
TurboTveit 10 hours ago [-]
They are not paying for reading the book, they are paying for redistributing the book in perpetuity presumably.
gpm 10 hours ago [-]
Nope, the settlement specifically excludes actions after Aug 25th 2025 (not perpetuity), and it specifically excludes the output of LLMs (not one form of redistribution).
Meanwhile it's not alleged that they redistributed the books in any form except as the output of LLMs (not any other form of redistribution).
This looks to be almost entirely a settlement for pirating the books. It does also cover the act of training the LLMs on the books, but since the district court already found that to be fair use it's unlikely to have been a major factor in the amount.
rise_before_sun 13 hours ago [-]
Does anyone know which models were trained on the pirated books? I would like to avoid using those models.
DiabloD3 11 hours ago [-]
(Everyone say it with me)
That's a weird way for Anthropic to announce they're going out of business.
groovetandon 11 hours ago [-]
So if you buy the content legally and fine tune using it that's fair use?
1gn15 10 hours ago [-]
Yes. Or download it legally (e.g. web content not behind a paywall).
lovelyDays61 2 hours ago [-]
Why only Anthropic?
qqbooks 14 hours ago [-]
So if a startup wants to buy book PDFs legally to use for AI purposes, any suggestions on how to do that?
JackSlateur 12 hours ago [-]
Reach out to publishers or resellers (like Amazon, for instance).
Give them this order: "I want to buy all your books as EPUB."
Pay and fetch the stuff.
That's all.
comex 10 hours ago [-]
For e-books, there will usually be a license agreement that prohibits any kind of nonstandard use.
That's why Anthropic had to scan physical books.
mooreds 15 hours ago [-]
Anyone have a link to the class action? I published a book and would love to know if I'm in the class.
Deep Research on Claude, perhaps, for some irony if you will.
nottorp 15 hours ago [-]
I thought 1.5 B is the penalty for one torrent, not for a couple million torrents.
At least if you're a regular citizen.
pavon 6 hours ago [-]
$150,000 statutory damages for willful infringement.
nottorp 4 hours ago [-]
So they only paid for 10k books?
taftster 15 hours ago [-]
Make sure to grab the mother-of-all-torrents I guess if you're going to go that path. That way you get more bang for your 1.5B penalty.
ipaddr 14 hours ago [-]
A million torrents would cost $1,500 each.
bastard_op 8 hours ago [-]
Here's some money, now piss off and let us get back to taking everyone else's.
Same racket the media cartels and patent trolls have been forcing for 40-50 years.
bikeshaving 15 hours ago [-]
For legal observers, Judge William Haskell Alsup’s razor-sharp distinction between usage and acquisition is a landmark precedent: it secures fair use for transformative generative AI while preserving compensation for copyright holders. In a just world, this balance would elevate him to the highest court of the land, but we are far from a just world.
WesolyKubeczek 3 hours ago [-]
I'm wondering, if they could purchase all the books that had been in the pirate stash, in physical or DRM-free ebook form, could they have been out of trouble? Use the stash because it's already pre-digitized and accessible, give money to publishers.
It would take time, sure, to compile the lists and make bulk orders, but wouldn't it be cheaper in the end than the settlement?
bhaktatejas922 14 hours ago [-]
This weirdly seems like it's the best mechanism to buy this much data.
Imagine going to 500k publishers to buy it individually. $3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes.
daemonologist 14 hours ago [-]
I suspect you could acquire and scan every readily purchasable book for much less than $3k each. Scanhouse for instance charges $0.15 per page for regular unbound (disassembled) books, plus $0.25 for supervised OCR, plus another dollar if the formatting is especially complex; this comes out to maybe $200-300 for a typical book. Acquiring, shipping, and disposing of them all would of course cost more, but not thousands more.
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
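Sanity-checking that estimate (the page counts are assumed, and the OCR rate is read as per-page, which is what makes the quoted figure work out):

    rate = 0.15 + 0.25  # $/page: unbound scanning plus supervised OCR
    for pages in (500, 750):  # assumed typical book lengths
        print(f"{pages} pages -> ${rate * pages:.0f}")
    # 500 pages -> $200
    # 750 pages -> $300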
mgraczyk 14 hours ago [-]
It's a tiny amount of data relatively speaking. Much more expensive per token than almost any data source imaginable
antihero 15 hours ago [-]
Isn't this basically what Spotify did originally?
nextworddev 14 hours ago [-]
Wait, I’m a published author, where’s my check
gpm 14 hours ago [-]
The court has to give preliminary approval to the settlement first. After that there should be a notice period during which the lawyers will attempt to reach out and tell you what you need to do to receive your money. (Not a lawyer, not legal advice).
Thank you very much. There seems to be a lot of friction in this seemingly simple process…
gpm 11 hours ago [-]
For what it's worth the friction exists for a reason, conflicts of interest.
The lawyers suing Anthropic here will probably walk away with several hundred million dollars - they have won the lottery.
If they managed to extract twice as much money from Anthropic for the class, they'd walk away with probably twice as much... but winning the lottery twice isn't actually much better than winning the lottery once. Meanwhile, $4,500 is a lot more than $2,250 (the latter being a reasonable estimate of what you'll get per work after the lawyers' cut). This risks the lawyers settling for less than is in their clients' best interests so that they can reliably get rich.
Personally (not a lawyer or anything) I think this settlement seems very fair, and I expect the court will approve it. But there's definitely been plenty of class actions in the past where lawyers really did screw over the class and (try to) settle for less than they should have to avoid risking going to trial.
nextworddev 11 hours ago [-]
Interesting. Maybe there should be an easier way to file class action lawsuits and collect it - in a cheaper and more efficient manner
... in one economy and for specific authors and publishers. But the offence is global in its impact on authors worldwide, and the consequences for other IPR laws remain to be seen.
mihaaly 7 hours ago [-]
It is a good opportunity to ask: is it true that Anthropic can demand indemnification from users for actions, related to the use of Claude, that end up with the company being sued? Even just for the accusation: the user needs to cover the bills for lawyers and proceedings. They also take control of the legal process and can handle it as they please, settle or not, with the user footing the bill. Without limit. Whether the user is an individual or an organization doesn't matter.
Sounds harsh, if true. It would make its use practical only for hobby projects, basically, where the results of Claude are kept entirely to yourself (be it information, a product using Claude, or a product made by using Claude). Difficult to believe; I hope I heard it wrong.
crawsome 10 hours ago [-]
In a related thought: when I listen to Suno and create "Epic Power Metal", the singer is very often indistinguishable from the famous Hansi Kürsch of Blind Guardian.
I'm not sure if he even knows, but that is almost certainly his tracks they trained on.
neilv 14 hours ago [-]
$1.5B is nothing but a handslap for the big gold rush companies.
It's less than 1% of Anthropic's valuation -- a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out a company (and its execs), then we have confirmation that it's a free-for-all for all the other AI gold rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway deals, becomes only a routine and optional corporate cost-reduction exercise, not anything the execs will lose sleep over if it's inconvenient.
xnx 14 hours ago [-]
> It's less than 1% of Anthropic's valuation
The settlement is real money though. Valuation is imaginary.
ianks 14 hours ago [-]
There are alternatives to wiping out the company that could be fair. For example, a judgment resulting in shares of the company, or revenue shares in the future, rather than a one-time payoff.
Writers were the true “foundational” piece of LLMs, anyway.
neilv 14 hours ago [-]
If this is an economist's idea of fair, where is the market?
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of the homeowner with a shotgun.
gpm 14 hours ago [-]
I don't much like the idea of settling in stock, but I also think you're looking for criminal law here. Civil law, and this is a civil suit, is far more concerned with making damaged parties whole than acting as a deterrent.
I understand that intentional copyright infringement is a crime in the US, you just need to convince the DOJ to prosecute Anthropic for it...
lxe 14 hours ago [-]
A terrible precedent that guarantees China a win in the AI race
oblio 10 hours ago [-]
Nobody is winning the AI race.
Because everyone is expecting AGI now and it's not happening with our current tech.
Now how about Meta and their questionable means of acquiring tons of content?
tomrod 14 hours ago [-]
Maybe it's time to get some Llama models copied before an overzealous court rules badly.
rvz 16 hours ago [-]
> A trial was scheduled to begin in December to determine how much Anthropic owed for the alleged piracy, with potential damages ranging into the hundreds of billions of dollars.
It has been admitted, and Anthropic knew, that this trial could totally bankrupt them had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line. Even though Anthropic settled (admitting guilt in profiting off of pirated books), they knew there was no way they could win that case, and it was not worth taking the risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
f33d5173 15 hours ago [-]
If it was a sure thing, then the rights holders wouldn't have accepted a settlement deal for a measly couple billion. Both sides are happier to avoid risking losing the suit.
Ekaros 14 hours ago [-]
Also, knowing how pro-corporate the legal system is, piercing the veil and going after everyone holding the stock would have been unlikely. So getting $1.5 billion out of them was likely a reasonable move. Otherwise they could have just burned all the money and flipped whatever was left over to someone else, at an uncertain price and horizon.
emtel 11 hours ago [-]
They would only be wiped out if the court awarded the maximum statutory damages (or close to it). There was never any chance of that happening.
Robotbeat 15 hours ago [-]
Wait, DID they admit guilt? A lot of times companies settle without admitting guilt.
deafpolygon 15 hours ago [-]
Honestly, this is a steal for Anthropic.
unit149 14 hours ago [-]
[dead]
jarjoura 14 hours ago [-]
I'm excited for the moment when these models are able to use copyrighted work in a fair-use way that pays out to authors, the way Spotify does when you listen to a song. Why? Because authors receiving royalties for their works when they get used in some prompt would likely encourage them to become far more accepting of LLMs.
Also, passing the cost on to consumers of generated content, since companies would now need to pay royalties on the back end, should likely increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
unaut 15 hours ago [-]
This settlement I guess could be a landmark moment. $1.5 billion is a staggering figure and I hope it sends a clear signal that AI companies can’t just treat creative work as free training data.
typs 15 hours ago [-]
I mean, the ruling does in fact find that training on this particular kind of creative work qualifies as fair use.
HDThoreaun 14 hours ago [-]
All the AI companies are still using books as training data. They're just finding the cheapest scanned copies they can get their hands on to cover their asses.
thinkingtoilet 15 hours ago [-]
Great. Which rich person is going to jail for breaking the law?
missedthecue 14 hours ago [-]
This isn't a criminal case so zero people of any financial position would end up in prison.
emtel 14 hours ago [-]
No one, rich or poor, goes to jail for downloading books.
thinkingtoilet 11 hours ago [-]
If I walked into a store and stole $1000 of books I would go to jail. If a tech company steals countless thousands of dollars worth of books, someone should go to jail.
emtel 11 hours ago [-]
Stealing physical goods is not the same as downloading copyrighted material.
mdp2021 14 hours ago [-]
Are you sure? I think in some jurisdictions they would, according to the law.
johnnyanmac 13 hours ago [-]
Tell that to Aaron Swartz.
emtel 10 hours ago [-]
Swartz wasn’t charged only for downloading copyrighted material, he was also charged with wire fraud and breaking and entering.
non_aligned 15 hours ago [-]
I'm gonna say one thing. If you agree that something was unfairly taken from book authors, then the same thing was taken from people publishing on the web, and on a larger scale.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
varenc 14 hours ago [-]
Books aren't hosted publicly online free for anyone to access. The court seems to think buying a book and scanning it is fair use. Just using pirated books is forbidden. Blogs weren't accessed via pirating.
taejavu 15 hours ago [-]
The blogger’s content was freely available, this fine is for piracy.
non_aligned 15 hours ago [-]
This is not a fine, it's a settlement to recompense authors.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
ascorbic 14 hours ago [-]
It's not the crux of this case. This is a settlement based on the judge's ruling that the books had been illegally downloaded. The same judge said that the training itself was not the problem; it was downloading the pirated books. It will be tough to argue that loading a public website is an illegal download.
emtel 14 hours ago [-]
But you can use copyrighted works for transformative works under the fair-use doctrine, and training was ruled to be fair use in the previous ruling.
ascorbic 14 hours ago [-]
The settlement was for downloading the pirated books, not training from them. Unless they're paywalled it would be hard to argue the same for a blog.
rise_before_sun 13 hours ago [-]
It seems weird that there was legal culpability for downloading pirated books but not for training on them. At the very least, there is a transitive dependency between the two acts.
Other people have said that Anthropic bought the books later on, but I haven't found any official records for that. Where would I find that?
Also, does anyone know which Anthropic models were NOT trained on the pirated books. I want to avoid such models.
emtel 10 hours ago [-]
As far as anyone knows, no models were trained on the illegally downloaded books.
"Similarly, different sets or “subsets” or “parts of” or “portions”
of the collections sourced from Books3, LibGen, and PiLiMi were used to train different LLMs..." Page 5
"In sum, the copies of books pirated or purchased-and-destructively-scanned were placed into a central “research library” or “generalized data area,” sets or subsets were copied again to create training copies for data mixes, the training copies were successively copied to be cleaned, tokenized, and compressed into any given trained LLM, and once trained an LLM did not output through Claude to the public any further copies." Page 7
The phrase "Finally, once Anthropic decided a copy of a pirated or scanned book in the library would not be used for training at all or ever again, Anthropic still retained that work as a “hard resource” for other uses or future uses" implies to me Anthropic excluded certain books from training, not that they excluded all the pirated books from training.
golly_ned 14 hours ago [-]
> "The technology at issue was among the most transformative many of us will see in our lifetimes"
A judge making a ruling based on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here -- not just transformative in the fair use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
Settling for 2% is a steal.
> “In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic’s approach to training A.I. models constitutes fair use,” Aparna Sridhar, Anthropic’s deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement. Anthropic's guilty of pirating.
Ekaros 14 hours ago [-]
The printing press, audio recording, movies, radio, and television were also transformative. They did not get rid of copyright; they actually brought it into being.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand depending on complexity of their work.
verdverm 14 hours ago [-]
Why would they earn more from models reading their works than I would pay to read it?
johnnyanmac 13 hours ago [-]
Same reason why the enterprise edition is more expensive than personal. Companies have more money to give and usually use it to generate profit. Individuals do not.
Ekaros 14 hours ago [-]
Because the ones doing the training are profiting from it. AI is not a human with limited time. And it is owned by a company; it is not a legal person itself.
I might entertain the argument comparing it to a human when it is a full legal person, and cutting power to it or deleting it is treated as murder. Before that, it is just bullshit.
And fundamentally, the reason for copyright to exist is to support creators and encourage them to create more. In a world where massively funded companies can freely exploit their work, and in many cases even fully substitute for it, that principle has failed.
verdverm 13 hours ago [-]
If I buy a book, learn something, and then profit from it, should I also be paying more than the original price to read the book?
> AI is not a human with limited time
AI is also bound by time, physics, and limited capacity. It does certain things better or faster than us, it fails miserably at certain things we don't even think about being complex (like opening a door)
> And it is also owned by a company not a legal person.
For the purpose of legalities, companies and persons are relatively equivalent, regardless of the merits, it is how it is
> In world where massively funded companies can freely exploit their work and even in many case fully substitute that principle is failed.
They paid for the books after getting caught, the other companies are paying for the copyrighted training materials
Ekaros 13 hours ago [-]
>They paid for the books after getting caught, the other companies are paying for the copyrighted training materials
Are they paying reasonable compensation? Say, like streaming services, movie theatres, and radio and TV stations do? As a whole, their model is much closer to those than to individuals buying books, CDs, or DVDs...
You might even consider a Theatrical License or a Public Performance License. Those are paid even if you have memorized the thing...
LLMs are just bad technology in the sense that they require massive amounts of input, such that the authors cannot be compensated enough for it. And I fully believe they should be, and a lot more than a single copy of their work under the entirely ill-fitting first-sale doctrine provides.
johnnyanmac 13 hours ago [-]
> If I buy a book, learn something, and then profit from it, should I also be paying more than the original price to read the book?
Depends on how you do it. Clearly, reading the book word for word is different from making a podcast talking about your interpretation of the book.
Collusion is also not required; it is only one form (today virtually unheard of, or very rare) of how this may happen. For instance: big-tech companies often don't actually encroach on core parts of the business of other big-tech companies. Google, Microsoft, Apple, and Uber are all totally different businesses with little competitive overlap. They are not doing this because of outright collusion. It's live and let live: why compete with them when they are leaving us alone in our corner? Also, trying to compete is expensive (for them); it's risky and may hurt them in other ways. This is one of the dirty little secrets: established companies don't (really) want to compete with other big companies. They all just want to protect what's theirs and keep it that way. If you don't believe me, have a look at the (publicly available) emails from execs that are public record. Anti-competitive thinking through and through.
And it wasn't much of a cartel in NYC before, anyway. Most subway stops in Brooklyn had a black car nearby if you knew how to look for them.
But I know what I'm going to pay up-front, can always pay with a credit card (which happens automatically without annoying post-trip payment), the ride is fully tracked, and I can report issues with the driver that I have an expectation will actually be acted upon. And when I'm in another country where there are known to be taxis that scam foreigners, Uber is a godsend.
Yes, pre-Uber taxis were expensive and crappy, and even if Uber is expensive now, it's not crappy; it's actually worth the price. And I'm not convinced Uber is even that expensive. We always forget to account for inflation... sure, we might today say, "that ride in a taxi used to cost $14, but in an Uber it costs $18". But that ride in a taxi was 15 years ago.
Now, I see people at the airport walk over to the pickup lot, joining a crowd of others furiously messing with their phones while scanning the area for presumably their driver.
All the while the taxis waiting immediately outside the exit door were $2 more expensive, last time I checked.
I have no idea what I'm going to get with those taxis waiting immediately outside the exit door. Even in my home country, at the airport next to my city, I have no idea. I know exactly what I'm getting with an Uber/Lyft, every time. That's valuable to me.
I was just in another country a couple months ago, and when trying to leave the airport, I was confused where I'd need to go in order to get an Uber. I foolishly gave up and went for one of those "conveniently-waiting" taxis, where I was quoted a price up-front, in my home currency, that I later (after doing the currency conversion on the Uber price) realized was a ripoff. The driver also aggressively tried to get me to instead rent his "friend's car" rather than take me to the rental car place like I asked. And honestly I consider that lucky: he didn't try to kidnap me or threaten me in any way, but I was tense during the whole ride, wondering if something bad was going to happen.
That sort of thing isn't an anomaly; it happens all the time to tourists in many countries.
I won't recount what recently happened to a friend in Milwaukee. It was an unpopular story (because the ripoff was Uber-based, and not the traditional taxi).
There's bad actors in every industry. I have found that industries that get "entrenched," tend to breed the most bad actors.
If anything turns into a "pseudo-monopoly," expect the grifters to start popping up. They'll figure out how to game the system.
That's typically considered to be somewhere between assholish and straight up illegal in most civilized economies.
They acquired market power by killing incumbents through predatory pricing: leaving them unprofitable and forcing them to exit, while creating a steep barrier to entry for any newcomers, and strategically courting drivers and riders by offering generous terms initially and subsidising rides to create artificial demand and inflate market share. Then, once they had kicked out the incumbents, they exercised their market power to raise prices and their percentage take of each transaction, leaving consumers and drivers worse off.
We can talk all day about the nice UX blah blah. But the reality is, financially, they could not have succeeded without a very dubious and unethical approach.
But I remember when I started using Uber back in 2012. It was amazing compared to every single other option out there. Yes, they entered the market in questionably-legal or often probably outright illegal ways. But illegal is not the same thing as immoral. And I don't think it's unethical to force out competition when that competition is a lazy, shitty, legally-enforced monopoly that treats its customers poorly.
And instead Uber offloaded everything onto gig workers and society. And still lost 20 billion dollars in the process (price dumping isn't cheap).
I always laugh when Americans poke fun at Europeans… we have it much better over here. I assure you of that.
Didn't Google have a long-standing project to do just that?
https://en.wikipedia.org/wiki/Google_Books
- Google
- Anthropic
- Any Chinese company that does not care about copyright laws
What is the cost of buying and scanning books?
Copyright law needs to be fixed, and its ridiculous hundred-year term chopped away.
And even if they didn't use the illegally-obtained work to train any of the models they released, of course they used them to train unreleased prototypes and to make progress at improving their models and training methods.
By engaging in illegal activity, they advanced their business faster and more cheaply than they otherwise would have been able to. With this settlement, other new AI companies will see it on the record that they could face penalties if they do this, and will have to go the slower, more expensive route -- if they can even afford to do so.
It might not make it impossible, but it makes the moat around the current incumbents just that much wider.
https://github.com/haykgrigo3/TimeCapsuleLLM
Obviously, that's not part of the current settlement. I'm no expert on this, so I don't know the extent to which the earlier ruling applies.
> Notably, in its motion, Anthropic argues that pirating initial copies of Authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs — and yet Anthropic has resisted putting into the record what copies or even sets of copies were in fact used for training LLMs.
> We know that Anthropic has more information about what it in fact copied for training LLMs (or not). Anthropic earlier produced a spreadsheet that showed the composition of various data mixes used for training various LLMs — yet it clawed back that spreadsheet in April. A discovery dispute regarding that spreadsheet remains pending.
Ethically speaking, if Anthropic (a) did later purchase every book it pirated or (b) compensated every author whose book was pirated, would it absolve an illegally trained model of its "sins"?
To me, the taint still remains. Which is a shame, because it's considered the best coding model so far.
No, in part because it removes agency from the authors/rightsholders. Maybe they don't want to sell Anthropic their books, maybe they want royalties, etc.
Judge: "But this app facilitated them."
Lawyer: "Well, you presume so-called genuine carpoolers are not facilitated? The manufacturers of their cell phones, the telecom operators, their employers or the bar where they met, or the bus company at whose bus stop they met, they all facilitated their carpooling behavior."
Judge: "But your company profits from this coordination!"
Lawyer: "Well we pay taxes, just like the manufacturer of the cell phone, the telecom operator, their employers, the bus company or the bar... But let's ignore that, what you -representing the government (which in turn supposedly represents the people)- are really after is money or power. As a judge you are not responsible for setting up the economy, or micromanaging the development of apps, so its not your fault that the government didn't create this application before our company did. In a sense you are lucky that we created the app given that the government did not create this application in a timely fashion!"
Judge: "How so?"
Lawyer: "If the population had created this app they would have started thinking about where the proceeds should go. They would have gotten concerned about the centralization of power (financial and intelligence). They would have searched for ways to decentralize and secure their app. They would have eventually gotten cryptographers involved. In that world, no substantial income would be generated, your fleet of taxi's would be threatened as well, and you wouldn't even have the juicy intel we occasionally share either!"
This conversation almost never takes place, since it only needs to take place once, after which a naive judge has learned how the cookie crumbles. Most judges have lost this naivety before even becoming a judge. They learn this indirectly when small "annoyances" threaten the scheme (one could say the official taxi fleet was an earlier such scheme).
I get what you are going for, but my point was that a dataset existed, and the only way it could be compiled was illegally.
If this is a choice between risking to pay 1.5 billion or just paying 15 mil safely, they might.
Option 2: near-$0 valuation, $15M purchasing cost.
To an investor, that just looks like a pretty good deal, I reckon. It's just the cost of doing business, which in my opinion is exactly what is wrong with practices like these.
What's actually wrong with this?
They paid $1.5B for a bunch of pirated books. Seems like a fair price to me, but what do I know.
The settlement should reflect society's belief of the cost or deterrent, I'm not sure which (maybe both).
This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost. Imagine if you couldn't speed in a car. Imagine if you couldn't choose to be jailed for nonviolent protest.
This isn't some case where they destroyed a billion dollars worth of pristine wilderness and got off with a slap on the wrist.
So you don't think super-rich people should be bound by laws at all?
Unless you made the cost proportional to (maybe exponential in) somebody's wealth, you would be creating a completely lawless class who would wreak havoc on society.
It was broken by a company of people who were not very rich at all and have managed to produce billions in value (not dollars, value) by breaking said laws.
They're not trafficking humans or doing predatory lending, they're building AI.
This is why our judicial system literally handles things on a case by case basis.
Your argument is that this is all fine because it wasn't done by people who were super rich but instead done by people who became super rich and were funded by the super rich?
I just want to check that I have that right. You are arguing that if I'm a successful enough bank robber that this is fine because I pay some fine that is a small portion of what I heisted? I mean I wouldn't have been trafficking humans or doing predatory lending. I was just stealing from the banks and everyone hates the banks.
But if I'm only a slightly successful bank robber, stealing only a few million and deciding that's enough, then it's straight to jail: do not pass go, do not collect $200?
It's unclear to me because in either case I create value for the economy as long as I spend that money. Or is the key part what I do what that money? Like you're saying I get a pass if I use that stolen money to invent LLMs?
I think the company's bank account would beg to differ on that.
> managed to produce billions in value (not dollars, value) by breaking said laws.
Ah, so breaking the law is ok if enough "value" is created? Whatever that means?
> They're not trafficking humans or doing predatory lending, they're building AI.
They're not trafficking humans or doing predatory lending, they're infringing on the copyright of book authors.
Not sure why you ended that sentence with "building AI", as that's not comparing apples to apples.
But sure, ok, so it's ok to break the law if you, random person on the internet, think their end goals are worthwhile? So the ends justify the means, huh?
> This is why our judicial system literally handles things on a case by case basis.
Yes, and Anthropic was afraid enough of an unfavorable verdict in this particular case that they paid a billion and a half to make it go away.
I do agree that in the case of victimless crimes, having some ability to compensate for damages instead of outright banning the thing means that we can enact many massively net-positive scenarios.
Of course, most crimes aren’t victimless and that’s where the negative reactions are coming from (eg company pollutes the commons to extract a profit).
It's because they did not choose to pay for the books; they were forced to pay and they would not have done so if the lawsuit had not fallen this way.
If you are not sure why this is different from "they paid for pirated books (as if it were a transaction)", then this may reflect a lack of awareness of how fair exchange and trust both function in a society.
Settling isn't "forced", but it's a choice that tells you that the company believes settling is a better deal for them than letting the trial go forward. That's something.
Settlements have nothing to do with either of those things. Settlement has to do with what the plaintiff believes is good enough for the cost that will avoid the uncertainty of trial. This is a civil case, "society" doesn't really come into play here. (And you can't "settle" a criminal case; closest analogue would be a plea deal.)
If the trial went forward to a guilty verdict, then the fines would represent society's belief of cost or deterrent. But we didn't get to see that happen.
> They paid $1.5B for a bunch of pirated books.
They didn't pay, they settled. And considering flesh-and-blood people get sued for tens of thousands per download when there isn't a profit motive, that's a bargain.
> The settlement should reflect society's belief of the cost or deterrent.
No, it reflects the maximum amount the lawyers believe they can get out of them.
> This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost.
So how much should a politician need to pay to legally murder their opponent? Are you okay with your ex killing you for a $5000 fine?
> Imagine if you couldn't speed in a car.
Speed enough and you lose your license, no need to imagine.
Why does this company get away with it, but do warez groups get raided by SWAT teams, labeled a "criminal enterprise" or "crime gang", and sentenced to decades in jail? Why does the law not apply when you are rich?
It won't be a ChatGPT or a coding model, of course; that's not what they're going for. But it'll be interesting to see its quality, as it's all fairly and honestly done. Transparently.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
With Uber you had a company that wanted to enter an existing market but couldn't due to legally-granted monopolies on taxi service. And given that existing market, you can be sure that the incumbents would lobby to keep Uber locked out.
With Waymo you have a new technology that has a computer driving the car autonomously. There isn't really any directly-incumbent party with a vested (conflict of) interest to argue against it. Waymo is a kind of taxi, though, so presumably existing taxi operators -- and the likes of Uber and Lyft -- could argue against it in order to protect their advantages. But ironically Uber and Lyft "softened" those regulatory bars already, so it might not have been worth it to try.
At any rate, the regulatory and safety concerns are also very different between the two.
I think I am also just a little more sympathetic to early Uber, given how terrible and cartel-like taxi service was in the past. But I would not at all be sympathetic toward Waymo putting driverless cars on the streets without regulatory approval and oversight, especially if people got injured or killed.
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
Reminds me of Permutation City.
“Marooned in Realtime” remains my fav.
I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all due to permanent perfect recall.
Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There's a lot of questions to be answered about where fair use starts/ends for these LLM products.
The way this technology is being used clearly violates the intent behind copyright law, it undermines its goals and results in harm that it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.
We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?
This has not been my experience. These days they are pretty good at googling though.
The 'lossy encyclopedia' analogy is quite apt
And even if one could, it would be illegal to do. Always found this argument for AI data laundering weird.
Which is one of the fundamental things in how copyright is handled: copying in general, or performing multiple times. So I can accept the argument that training a model one time and then using a singular instance of that model is analogous to human learning.
But when you get to running multiple copies of the model, we are clearly past that.
However, the judge already ruled on the only important piece of this legal proceeding:
> Alsup ruled in June that Anthropic made fair use of the authors' work to train Claude...
Even if the ruling legally remains in place after the settlement, district court rulings are at most persuasive precedent and not binding precedent in future cases, even ones handled by the same court. In the US federal court system, only appellate rulings at either the circuit court of appeals level or the Supreme Court level are binding precedent within their respective jurisdictions.
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
Also, at least so far, we don't call computers "someone".
Probably so, because with "library" I did not mean the "building". It is the decision of the society to make knowledge available.
> we don't call computers "someone"
We do instead, for this purpose. Why should we not? Anything that can read fits the set.
--
Edit: Come up with the arguments, sniper.
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (i.e., I think some degree of copyright protection is ethical), but if I want to use a tool to help me understand the knowledge in a book, then I should be able to.
[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
If you use the commons to create your model, perhaps you should be obligated to distribute the model for free (or I guess for the cost of distribution) too.
By this logic one shouldn't be able to research for a newspaper article at a library.
I presume you (as people do) have exploited the knowledge that society has made freely accessible, in principle and largely in practice, to build a profession, which is now for-profit: you will charge parties for the skills that available knowledge has given you.
The "profit" part is not the problem.
A vacuum removes what it sucks in. The commons are still as available as they ever were, and the AI gives one more avenue of access.
That is false. As a direct consequence of LLMs:
1. The web is increasingly closed to automated scraping, and more marginally to people as well. Owners of websites like reddit now have a stronger incentive to close off their APIs and sell access.
2. The web is being inundated with unverified LLM output which poisons the well
3. More profoundly, increasingly basing our production on LLM outputs and making the human merely "in the loop" rather than the driver, and sometimes eschewing even the human in the loop, leads to new commons that are less adapted to the evolutions of our world, less original and of lesser quality
As soon as OpenAI open-sources their model's source code, I'll agree.
(The "for sale" side does not limit the purpose to sales only, before somebody wants to attack that.)
Since in our legal system, only humans and groups of humans (the corporation is a convenient legal proxy for a group of humans that have entered into an agreement) have rights.
Property doesn't have rights. Land doesn't have rights. Books don't have rights. My computer doesn't have rights. And neither does an LLM.
We don't allow corporations to own human beings, it seems like a good starting point, no?
I think it is obvious instead that readers employed by humans fit the principle.
> rights
Societally, it is more of a duty. Knowledge is made available because we must harness it.
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.
¹(Edit: or /may/ not be allowed, see posts below.)
I'd be interested to know if you knew of one with bright line rules delineating what is and isn't allowed.
(I know by practice but not from the letter of the law; to give you details I should do some research and it will take time. If I manage to, I will send you an email, but I doubt I will be able to do it soon. The focus is anyway on western European countries.)
They didn't think it would be a good idea to re-bind them and distribute it to the library or someone in need.
They did not destroy old, valuable books which individually were worth millions.
https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
I'm pretty sure that's just a frontend for Uptime Kuma https://github.com/louislam/uptime-kuma
The whole incident is written up in detail, https://swartz-report.mit.edu/ by Hal Abelson (who wrote SICP among other things). It is a well-researched document.
The report speculates about his motivations on page 31, but they seem to be unknown with any certainty.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
but also, prior to that, he had written the Guerilla Open Access Manifesto, so it wasn't great optics to be caught doing that
The settlement was a smart decision by Anthropic to remove a huge uncertainty. $1.5B is not small, but it won't stop them or slow them down significantly.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
I’m surprised Google hasn’t hit its competitors harder with the fact that it actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.
IIUC this is very far from settled, at least in US law.
Awesome, so I just need enough perceptrons to overfit every possible copyrighted work, then?
Is this completely settled legally? It is not obvious to me it would be so
Or can they buy the book, and then use the pirated copy?
I'm so over this shift in America's business model.
Original Silicon Valley model, and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.
New model: Still come up with a cool idea, still get it funded and sold, but the idea involves committing crime at a staggering scale (Uber, Google, AirBnB, all AI companies, long list here), and then paying your way out of the consequences later.
Look some of these laws may have sucked, but having billionaires organize a private entity that systematically breaks them and gets off with a slap on the wrist, is not the solution. For one thing, if innovation requires breaking the law, only the rich will be able to innovate because only they can pay their way out of the law. For another, obviously no one should be able to pay their way out of following the law! This is basic "foundations of society" stuff that the vast majority of humans agree on in terms of what feels fair and just, and what doesn't.
Go to a country which has really serious corruption problems, like is really high on the corruption index, and ask the people there what they think about it. I mean I live in one and have visited many others so I can tell you, they all hate it. It not only makes them unhappy, it fills them with hopelessness about their future. They don't believe that anything can ever get better, they don't believe they can succeed by being good, they believe their own life is doomed to an unappealing fate because of when and where they were born, and they have no agency to change it. 25 years ago they all wanted to move to America, because the absence of that crushing level of corruption was what "the land of opportunity" meant. Now not so much, because America is becoming more like their country.
This timeline ends poorly for all of us, even the corrupt rich who profit from it, because in the future America will be more like a Latin American banana republic where they won't be able to leave their compounds for fear of getting Luigi'ed. We normal people get poverty, they get fear and death, everyone loses. The social contract is collapsing in front of our eyes.
Not creative destruction. But pure corruption.
The federal courts are a joke - the supreme court now has at least one justice whose craven corruption is notorious — openly accepting material value (ie bribes) from various parties. The district courts are being stuffed with Trump appointees with the obvious problems that go with that.
The congress is supine. Obviously they cannot act in any meaningful capacity.
We don’t have street level corruption today. But we’ve fired half the civil service, so I doubt that will continue.
Imagine a future where election results are casually and publicly nullified if the people with the guns don't like the result, and no one can do anything about it. Or where you can start a business but if it succeeds and you don't have the right family name it'll be taken from you and you'll be stripped of all you own and possibly put in prison for a while. That's reality in some countries, the US is not there yet, but those are the stakes we're playing for here, and why change needs to happen.
So exactly when was there “wealth equality” in the US? Are you glossing over that whole segregation, redlining, era of the US?
And America was built on slavery and genocide.
There are also a lot of usage rules that now make many games unfeasible.
We dug into the private markets seeking less Faustian terms, but found just as many legal submarines in wait... "AI" Plagiarism driven projects are just late to the party. =3
I have an author friend who felt like this was just adding insult to injury.
So not only had his work been consumed into this machine that is being used to threaten his day job as a court reporter, not only was that done without seeking his permission in any way, but they didn’t even pay for a single copy.
Really embodies raising your middle finger to the little guy while you steamroll him.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
It’s crazy to imagine, but there was surely a document or Slack message thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not based on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it's lucky that the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I’m pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that a settlement below the full legal limit would limit future revenue.
Well actively generating revenue at least.
Profits are still hard to come by.
It's not the same as debt from a loan, because people are buying a percentage stake in the company. If the value of the company happens to go to zero there's nothing left to pay.
But yeah, the amount of investment a company attracts should have something to do with the perception that it'll operate at a profit at some point
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
(Probability of getting caught) 0.01 × (cost if caught) 1000 = 10, i.e. an expected cost of 10x the expected gain = not worth it. (See the sketch below.)
If anything it's too little based on precedent.
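The underlying expected-value arithmetic, as a minimal sketch (all numbers illustrative, nothing from the actual case):

```python
# Deterrence as expected value: when the chance of getting caught is low,
# the penalty must be a large multiple of the gain to deter at all.
# Illustrative numbers only.

def expected_cost(p_caught: float, penalty: float) -> float:
    """Expected cost of committing the infringement once."""
    return p_caught * penalty

gain = 1.0        # normalized profit from one act of infringement
p_caught = 0.01   # assumed 1% chance of being caught
penalty = 1000.0  # penalty if caught

cost = expected_cost(p_caught, penalty)  # 0.01 * 1000 = 10.0
print(f"expected cost {cost} vs gain {gain} -> deterred: {cost > gain}")
```

With a 1% detection rate, any penalty below 100x the gain leaves infringement profitable in expectation.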
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k / plaintiff) for the lead plaintiffs, also subject to the court's approval.
He who has the gold makes the rules
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that. (See the toy sketch below.)
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that it can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
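To make the tokens-and-vectors point concrete, here is a toy sketch (a made-up word-level tokenizer, nothing like the BPE tokenizers real models use): reducing text to token IDs is just a re-encoding, and a faithful encoding can be inverted back into the original expression.

```python
# Toy word-level "tokenizer": maps each distinct word to an integer ID.
text = "to be or not to be that is the question"

vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}
inverse = {i: word for word, i in vocab.items()}

tokens = [vocab[word] for word in text.split()]   # text -> token IDs
restored = " ".join(inverse[t] for t in tokens)   # token IDs -> text

print(tokens)             # [7, 0, 3, 2, 7, 0, 5, 1, 6, 4]
print(restored == text)   # True: the re-encoding preserved the expression
```

A trained model doesn't store the token list verbatim, of course; the legal question is what happens when its weights can nonetheless reproduce long spans of the original.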
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will follow the law only just barely, for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they will try to use as much legalspeak as possible to extract as much from authors as they (legally) can without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
Note that the law specifically regulates software differently, so what you cannot do is just willy-nilly pirate games and software.
What distribution means in this case is defined in Swiss law. However, Swiss law as a whole is in some ways vague, leaving a lot up to interpretation by the judiciary.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long-standing, sticky legal meaning (as opposed to things that were "property").
So to me, if you are doing literally any human review, edits, or control over the AI, then I think you'll retain copyright. There may be a risk that if somebody can show they could produce exactly the same thing from a generic prompt with no interaction, then you may be in trouble, but let's face it, should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
To rephrase the question:
Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?
Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
It is 500,000 books in total, so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law that leaves usage open so long as it doesn't fit a given description; it's a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research. Training AI models for purposes other than purely academic ones fits into none of these.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
Does this still hold true if multiple employees are "trained" from scanned copies at the same time?
Regardless, the issue could be resolved by buying as many copies as you have concurrent model training instances. It isn't really an issue with training on copyrighted work, just a matter of how you do so.
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work.
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.
And they actually went and did that afterwards. They just pirated them first.
Also, do we know if the newer models were trained without the pirated books?
https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
> Also, do we know if the newer models were trained without the pirated books?
I'm pretty sure we do but I couldn't swear to it or quickly locate a source.
Among several places where judge mentions Anthropic buying legit copies of books it pirated, probably this sentence is most relevant: "That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages."
But the document does not say Anthropic bought EVERY book it pirated. Other sections in the document also don't explicitly say that EVERY pirated book was later purchased.
I stopped using Claude when this case came to light. If the newer Claude models don't use pirated books, I can resume using it.
When you say, "I'm pretty sure we do...", do you mean that pirated books were used, or were they not used?
Yeah, I wouldn't make this exact claim either. For instance it's probably safe to assume that the pirate datasets contain some books that are out of circulation and which Anthropic happened not to get a used copy of.
They did happen to get every book published by any of the lead plaintiffs though, as a point towards them probably having pretty good coverage. And it does seem to have been an attempt to purchase "all" the books for reasonable approximate definitions of "all".
> When you say, "I'm pretty sure we do...", do you mean that pirated books were used, or were they not used?
I'm pretty sure pirated books were not used, but not certain, and I really don't remember when/why I formed that opinion.
Bootstrapping in the startup world refers to starting a startup using only personal resources instead of using investors. Anthropic definitely had investors.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times-affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
In my experience and training at a fintech corp: accepting a settlement in any suit weakens your defense, but it prevents a judgment and future claims for the same claims from the same claimants (à la double jeopardy). So, again, at minimum this prevents an actual judgment, which would likely be positive for the NYT (and adjacent) cases.
Indeed, it is not only the payout, but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
I was under the impression they had downloaded millions of books.
It looks like you'll be able to search this site if the settlement is approved:
> https://www.anthropiccopyrightsettlement.com/
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
Try to do that. There is no easy way to delete your account. You need to reach out to their support via email. Incredibly obnoxious dark pattern. I hate OpenAI, but everything with Anthropic also smells fishy.
We need more and better players. I hope that xAI will give them all some good competition, but I have my doubts.
So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. They chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
https://www.econlib.org/library/Columns/y2003/Lessigcopyrigh...
Lessig: Not for this length of time, no. Copyright shouldn’t be anywhere close to what it is right now. In my book I proposed a system where you’d have to renew after every five years and you get a maximum term of 75 years. I thought that was pretty radical at the time. The Economist, after the Eldred decision, came out with a proposal—let’s go back to 14 years, renewable to 28 years. Nobody needs more than 14 years to earn the return back from whatever they produced.
For many reasons I switched to writing using a Creative Commons license using Lulu, LeanPub, and my own web site for distribution. This has been a win for me economically, it feels good to add to the commons, and it is fun.
You can search LibGen by author to see if your work is included. I believe this would make you a member of the class: https://www.theatlantic.com/technology/archive/2025/03/searc...
If you are a member of the class (or think you are) you can submit your contact information to the plaintiff's attorneys here: https://www.anthropiccopyrightsettlement.com/
I suspected my work was in the dataset and it looks like it is! I reached out via the form.
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
[1] - https://www.wheresyoured.at/howmuchmoney/
You never know, it's a game of interests and incentives. One thing's for sure: does the fed want the private sector to own and control a technology of this kind? Nope.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
Yes, and FWIW that's very succinctly stated.
Some individuals in society find a way through that and figure out a way to strategically achieve their goals. Rare though.
By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.
You’ve never authored, created, or published something? Never worked for a company that sells something protected by copyright?
I.e., never created software in exchange for money.
Copying and distributing works isn’t identical to theft (deliberately depriving someone of their property), but you’re enjoying someone’s work without compensating them, so it isn’t totally unlike depriving them of something.
I guess it depends how you feel about refusing to pay a window washer. Or indeed you not being paid by your employer. It isn’t theft, but someone is clearly stiffing someone else.
As for only big companies benefitting from the copyright regime… seems like an ideological assumption. I know plenty of authors and they are quite happy having legal protections around their work which means they can earn from their labour.
Which is foreseen in societal decision: libraries (again and again).
> refusing to pay a window washer
The window washer is providing a service for a price, that service is not equivalent to knowledge production, and nobody has decided that that service (cleaning windows) should be done for free.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well-funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM-training people are less concerned about the legality of their source training set than the big companies are anyway.
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total? Plus, of course, the time it would take to execute.
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
I haven't had this in a while, but I always hate it when I'm blocked by Cloudflare/Datadome/etc.
It’s not precedent setting but surely it’ll have an impact.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
https://www.tomshardware.com/tech-industry/artificial-intell...
Is there a way to make your content on the web "licensed" in a way where it is only free for human consumption?
I.e. effectively making the use of AI crawlers pirating, thus subject to the same kind of penalties here?
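Not today, as far as I know; the closest existing mechanism is the voluntary crawler opt-out in robots.txt. A sketch, using the user-agent tokens the major operators have published (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl); compliance is at the crawler's discretion, so this is a request, not an enforceable license:

```
# robots.txt at the site root: ask AI crawlers to stay out,
# while leaving the site open to everything else.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```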
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
If you put a “contract” on your website that users click through without paying you or exchanging value with you and then you try to collect damages from them according to your contract, it’s not going to get you anywhere.
The consideration you received was a promise to refrain from using those documents to train AI.
I'm not a lawyer, but by my understanding of contract law consideration is trivially fulfilled here.
That curl script you use to automate some task could become infringing.
At this point, we do need some laws regulating excessive scraping. We can't have the internet grind to a halt over everyone trying to drain it of information.
The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
EU has copyright exemptions for AI training. You don't need to respect opt outs if you are doing research.
South Korea and Japan have some exemptions too, I think?
Singapore has very strong copyright exemptions for AI training. You can completely ignore opt-outs legally, even if doing it commercially.
Just search up "TDM laws globally".
Or what if they're not even distributing the model, but rather distributing the outputs of the LLM (a closed-source LLM, like Anthropic's)?
I am genuinely curious whether there is some gray area that might be exploited by AI companies, as I am pretty sure they don't want to pay 1.5B dollars yet still want to exploit the works of authors (let's call a spade a spade).
We really are getting at some metaphysical/philosophical questions, and maybe we will one day arrive at a question that just can't be answered (I think this is pretty close, right?). Then AI companies would do things freely without being accountable, since sure, you could take it to the courts, but how would you come to a decision...?
Another question though
So let's say the NYT vs. OpenAI case is ongoing; while they are litigating, could OpenAI still continue doing the same thing?
That could push the industry toward consolidation—fewer independent experiments, more centralized R&D inside big tech. I feel that, this might slow the pace of unexpected innovations and increase dependence on incumbents.
This def. raises the question: how do we balance fair compensation for creators with keeping the door open for innovation?
Based on history this is not a possibility but a certainty.
The larger players - who grew because of limited regulations - will start supporting stricter regulation and compliance structures in order to increase the barrier of entry with the excuse of "Oh we learned our lesson, you are right". The hypocrisy is crazy but it makes sense from a capitalistic perspective.
The European and especially German approach of regulating pre-emptively might be fairer, but apparently it also stifles innovation, as we can observe: almost no significant players from Europe or Germany.
OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.
It will be a net win for Anthropic.
They pay out (relative) chump change as a penalty for explicitly pirating a bunch of ebooks, and in return they get a ruling that they can train on copyrighted works forever, for the purchase price of the book (not the price that would be needed to secure the rights!)
I'd be curious to hear from a legal professional...
Anthropic went back and bought->scanned->destroyed physical copies of them afterward... but they pirated them first, and that's what this settlement is about.
The judge also said:
> “The training use was a fair use,” he wrote. “The technology at issue was among the most transformative many of us will see in our lifetimes.”
So you don't need to pay $3,000 per book you train on unless you pirate them.
For post-training, other data sources (like human feedback and/or examples) are way more expensive than books
Unless, of course, the transformation malfunctioned and you got the good old verbatim source, with many such examples compiled in similar lawsuits.
> When each LLM was put into a public-facing version of Claude, it was complemented by other software that filtered user inputs to the LLM and filtered outputs from the LLM back to the user. As a result, Authors do not allege that any infringing copy of their works was or would ever be provided to users by the Claude service.
(from Bartz v. Anthropic in the Northern District of California)
Taken right from the VC's handbook.
TBH I'm just going to plow all that money back into Anthropic... might as well cut out the middleman.
We are entering a world filled with corporate mafias that are above the law (since the damage the law can inflict on them is insignificant). These mafias would grip the world, providing the essential services that make up the future world. The state would become much weaker, as policy makers can be bought by lobbying and punishments can be offset by VC funding.
It is all part of the playbook.
In March, they were worth $61.5 billion
In six months they've created $120 billion in value. That's almost 700 million dollars per day. Avoiding being slowed down by even a few days is worth a billion dollar payout when you are on this trajectory. This lawsuit, and any lawsuit AI model companies are likely to get, will be a rounding error at the end of the fiscal year.
They know that superintelligent AI is far larger than money, and even so, the money they'll make on the way there is hefty enough for copyright law to not be an issue.
What a formidable moat against newcomers, definitely worth the price!
But then, companies in countries with the freedom to add everything to the training dataset would have to distribute their weights for free in IP-walled countries (because the weights would be plainly 'illegal' and would be "blocked" over there unless free as in free beer, I guess); basically, only the DeepSeek approach could work.
If powerful LLM hardware becomes somewhat affordable (look at Nvidia's huge push on LLM-specific hardware), "local" companies may run those 'foreign-trained' LLM models at reasonable speed, but "here".
https://www.youtube.com/watch?v=sdtBgB7iS8c
Somehow excuses like "we torrented it, but we configured low seeding", "the temptation was too strong because there was money to be made", "we tried getting a license, but then ignored it", and more ludicrous excuses actually worked.
Internal Meta emails seemed to show people knew it was a blatant breach of copyright, and yet Meta won the case.
I guess there are tiers of laws even between billionaire companies.
They're paying much more than the actual damages because US copyright law comes with statutory damages for infringement of registered works on top of actual damages, between $200 and $150,000 per work. And the two sides negotiated this as a fair settlement to reduce the risk of an unfavourable outcome.
Meanwhile, it's not alleged that they redistributed the books in any form except as the output of LLMs.
This looks to be almost entirely a settlement for pirating the books. It does also cover the act of training the LLMs on the books, but since the district court already found that to be fair use it's unlikely to have been a major factor in the amount.
That's a weird way for Anthropic to announce they're going out of business.
Give them this order: "I want to buy all your books as epub"
Pay and fetch the stuff
That's all
That's why Anthropic had to scan physical books.
Proposed settlement: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
At least if you're a regular citizen.
Same racket the media cartels and patent trolls have been forcing for 40-50 years.
It would take time, sure, to compile the lists and make bulk orders, but wouldn't it be cheaper in the end than the settlement?
Imagine going to 500k publishers to buy it individually. 3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
You can follow the case here: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...
You can see the motion for settlement (what the news article is about) here: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
The lawyers suing Anthropic here will probably walk away with several hundred million dollars - they have won the lottery.
If they managed to extract twice as much money from Anthropic for the class, they'd walk away with probably twice as much... but winning the lottery twice isn't actually much better than winning the lottery once. Meanwhile, $4500 is a lot more than $2250 (the latter is a reasonable estimate of how much you'll get per work after the lawyers' cut; see the arithmetic sketched below). Which risks the lawyers settling for less than is in their clients' best interests so that they can reliably get rich.
Personally (not a lawyer or anything) I think this settlement seems very fair, and I expect the court will approve it. But there's definitely been plenty of class actions in the past where lawyers really did screw over the class and (try to) settle for less than they should have to avoid risking going to trial.
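For what it's worth, the per-work arithmetic under the proposed terms (the minimum fund and class size from the settlement motion, the 25% fee cap mentioned upthread, everything subject to court approval):

```python
# Per-work payout under the proposed settlement terms.
fund = 1_500_000_000   # minimum, non-reversionary settlement fund (USD)
works = 500_000        # estimated copyrighted works in the class
fee_cap = 0.25         # maximum attorneys' share, subject to court approval

gross_per_work = fund / works                   # 3000.0
net_per_work = gross_per_work * (1 - fee_cap)   # 2250.0
print(f"gross ${gross_per_work:,.0f}, net ${net_per_work:,.0f} per work")
```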
https://www.anthropic.com/news/anthropic-raises-series-f-at-...
Sounds harsh, if true. It would basically make its use practical only for hobby projects, where the results of Claude are kept entirely to yourself (be it information, a product using Claude, or a product made by using Claude). Difficult to believe; I hope I heard it wrong.
https://en.wikipedia.org/wiki/Hansi_K%C3%BCrsch
I'm not sure if he even knows, but that is almost certainly his tracks they trained on.
It's less than 1% of Anthropic's valuation -- a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out a company (and its execs), then we have confirmation that it is a free-for-all for all the other AI gold rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway deals, becomes only a routine and optional corporate cost-reduction exercise, not anything the execs will lose sleep over if it's inconvenient.
The settlement is real money though. Valuation is imaginary.
Writers were the true “foundational” piece of LLMs, anyway.
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of homeowner has a shotgun.
I understand that intentional copyright infringement is a crime in the US, you just need to convince the DOJ to prosecute Anthropic for it...
Because everyone is expecting AGI now and it's not happening with our current tech.
It has been admitted, and Anthropic knew, that this trial could have totally bankrupted them had they insisted they were innocent and continued to fight the case.
But of course, there's too much money on the line, which means that even though Anthropic settled (admitting guilt in profiting off of pirated books), they knew there was no way they could win that case, and it was not worth taking that risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
Passing the cost on to consumers of generated content, since companies would now need to pay royalties on the back end, should also increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
Other people have said that Anthropic bought the books later on, but I haven't found any official records for that. Where would I find that?
Also, does anyone know which Anthropic models were NOT trained on the pirated books. I want to avoid such models.
https://storage.courtlistener.com/recap/gov.uscourts.cand.43....
"Similarly, different sets or “subsets” or “parts of” or “portions” of the collections sourced from Books3, LibGen, and PiLiMi were used to train different LLMs..." Page 5
"In sum, the copies of books pirated or purchased-and-destructively-scanned were placed into a central “research library” or “generalized data area,” sets or subsets were copied again to create training copies for data mixes, the training copies were successively copied to be cleaned, tokenized, and compressed into any given trained LLM, and once trained an LLM did not output through Claude to the public any further copies." Page 7
The phrase "Finally, once Anthropic decided a copy of a pirated or scanned book in the library would not be used for training at all or ever again, Anthropic still retained that work as a “hard resource” for other uses or future uses" implies to me Anthropic excluded certain books from training, not that they excluded all the pirated books from training.
A judge making on a ruling based on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here -- not just transformative in the fair use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
Settling for 2% is a steal.
> “In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic’s approach to training A.I. models constitutes fair use,” Aparna Sridhar, Anthropic’s deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement. Anthropic's guilty of pirating.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand depending on complexity of their work.
I might accept the argument comparing it to a human when it is a full legal person and cutting power to it or deleting it is treated as murder. Before that, it is just bullshit.
And fundamentally, the reason for copyright to exist is to support creators and encourage them to create more. In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
> Ai is not a human with limited time
AI is also bound by time, physics, and limited capacity. It does certain things better or faster than us, it fails miserably at certain things we don't even think about being complex (like opening a door)
> And it is also owned by a company not a legal person.
For the purpose of legalities, companies and persons are relatively equivalent, regardless of the merits, it is how it is
> In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
They paid for the books after getting caught; the other companies are paying for the copyrighted training materials.
Are they paying reasonable compensation? Say, like with streaming services, movie theatres, radio and TV stations? As a whole, their model is much closer to those than to individuals buying books, CDs, or DVDs...
You might even consider a Theatrical License or Public Performance License. Paid even if you have memorized the thing...
LLMs are just bad technology that requires massive amounts of input, so the authors cannot be compensated enough for it. And I fully believe they should be. And a lot more than a single copy of their work under the entirely ill-fitting first-sale doctrine provides.
Depends on how you do it. Clearly, reading the book word for word is different from making a podcast about your interpretation of the book.