AI models can legally be trained on books | Top 10 emerging technologies of 2025
#193 - A US court decides 'Fair Use' of books in model training | AI is stale news, but getting AI to be trustworthy is not
A landmark judgement in the US says that AI can learn from books…
…as long as you do not use pirated versions
How do you win the AI race? 🧱 By building models that are better than the competition.
How do you build models that are better than the competition? 🚊 Train them on data that the competition does not have.
How do you get better data than your competitors? ⬇️ Download e-books from piracy sites.
That’s what Anthropic did. (So did Meta for Llama 3.)
Some authors found this to be a violation of copyright. They sued these companies. This week, a federal court in California ruled that training AI on books is just fine, but training it on pirated books is not.
If an author reads many books and builds up her writing style, skills and knowledge from them, you would not consider that copyright infringement. So, the court reasoned, when an AI model is trained on books and ‘learns’ from them, it’s about the same.
If you read the official judgement [link to pdf], you will find that the authors who filed the case did not allege that the output of Anthropic’s LLM (Claude) infringed their copyright. Instead, they contested the inputs: downloading pirated copies, digitizing books, storing them in a sort of central library, and using them to train the LLM.
The court ruled that, apart from the piracy, all of Anthropic’s other uses were fair use.
Take Action:
Copyright Lawyers 🧑🏻‍💼 - If you are advising any AI model developer, take this ruling into consideration. You understand this case better than information security professionals do. Advise the poor CISOs.
Cybersecurity/Data Privacy Professionals 🕵🏼‍♀️ - Copyright violation is a real risk. When doing an AI system impact assessment, document the risk of copyright violation and validate the data used for training any AI model with a copyright lawyer (a minimal triage sketch follows this list).
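Most of that validation is manual legal review, but flagging which training items even have documented provenance can be scripted before you get counsel involved. Here is a minimal sketch in Python, assuming a hypothetical manifest file with item, source and license columns; the file name, column names and licence allowlist are all illustrative, not any standard.

```python
import csv

# Hypothetical allowlist of licences considered safe to train on without
# further review; everything else goes to the copyright lawyer.
APPROVED_LICENSES = {"CC0", "CC-BY", "public-domain", "licensed-purchase"}

def triage_manifest(path: str):
    """Read a training-data manifest (illustrative columns: item, source,
    license) and split entries into cleared vs. needs-legal-review."""
    cleared, review = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lic = (row.get("license") or "").strip()
            if lic in APPROVED_LICENSES and row.get("source"):
                cleared.append(row["item"])
            else:
                review.append((row.get("item"), lic or "UNKNOWN"))
    return cleared, review

if __name__ == "__main__":
    # "training_manifest.csv" is an assumed file name for illustration.
    cleared, review = triage_manifest("training_manifest.csv")
    print(f"{len(cleared)} items cleared; {len(review)} need legal review")
    for item, lic in review:
        print(f"  review: {item} (license: {lic})")
```

The output is not a legal opinion; it is simply the list of entries you take to the copyright lawyer.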
The World Economic Forum releases the top 10 emerging technologies for 2025
There’s AI in it, but only to flag AI-generated content
Every year, the World Economic Forum (WEF) releases its list of the top 10 emerging technologies. The 2025 edition was released recently. The 2024 edition led with AI for scientific discovery, but the 2025 edition talks about improving trust in AI with invisible watermarks.
From using AI for scientific discovery to watermarking AI content, we seem to have made quite the leap in a year of emerging technology. 😈
The emerging technologies make for interesting reading each year. They give a sense of what is ‘in’ and what is not. This year, AI is not all the rage.
A key emerging technology for 2025 is ‘Generative Watermarking’, a technology that:
… embed invisible markers in AI-generated content – including text, images, audio and video – to verify authenticity and help trace content origins. As AI-generated content becomes increasingly hard to differentiate from that created without AI, there has been a surge in innovative watermarking technologies designed to help combat misinformation, protect intellectual property, counter academic dishonesty and promote trust in digital content.
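To make the idea concrete, here is a toy Python sketch of one published family of text-watermarking schemes (a “green list” approach in the spirit of Kirchenbauer et al., 2023). Real systems bias the model’s logits during sampling, not finished text, and the vocabulary, key and threshold below are illustrative assumptions, not the WEF’s or any vendor’s method.

```python
import hashlib

# Toy vocabulary and shared secret; both are hypothetical.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran", "fast", "slow"]
SECRET_KEY = "demo-key"

def green_list(prev_token: str, vocab=VOCAB, fraction=0.5):
    """Deterministically partition the vocabulary using the previous token
    and a secret key; a watermarking generator favours the 'green' half."""
    scored = sorted(
        vocab,
        key=lambda tok: hashlib.sha256(
            f"{SECRET_KEY}|{prev_token}|{tok}".encode()
        ).hexdigest(),
    )
    return set(scored[: int(len(vocab) * fraction)])

def detect(tokens, threshold=0.7):
    """Estimate whether a token sequence is watermarked by counting how
    often each token falls in the green list seeded by its predecessor."""
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev)
    )
    rate = hits / max(len(tokens) - 1, 1)
    return rate, rate >= threshold

# Unwatermarked text should score near the 0.5 base rate; a generator that
# consistently picks green-listed tokens pushes the rate well above it.
print(detect("the cat sat on a mat".split()))
```

The key point the WEF description glosses over: because only the key holder can recompute the green lists, the marker is invisible to readers but statistically detectable to whoever runs the check.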
Regular readers might remember my post “The battle for attention is lost. The battle for trust begins.”
In that post, I argued that we no longer know what to trust, and I anticipated the copyright questions that were bound to be raised. I seem to have been proved right.
Take Action:
No specific action here - the WEF emerging technology report gives a general direction and makes for interesting reading for us corporate folks.