What is a “corpus”? And why is everyone in AI suddenly talking about it? Here’s what you need to know.
Writer Michael Grothaus
Bill Gates, Reddit’s CEO, and other tech leaders are increasingly talking about their “corpus.” Now is the time to learn what that means.
Thanks to
ChatGPT and similar platforms, the rise of artificial intelligence has been one
of the most headline-grabbing subjects of 2023. Not a day goes by without a new
article coming out about some way AI tech spells either doom or salvation for
the creative fields, your job, or humanity.
And if
you’ve been reading these articles, you might have noticed one particular word
being thrown around by tech executives recently: “corpus.” Reddit’s CEO has
mentioned it; so has Wikipedia’s founder Jimmy Wales; and so has Microsoft
founder Bill Gates.
Here’s what
it means, and why it’s critical to understanding how artificial intelligence
platforms like ChatGPT and Midjourney operate.
WHAT IS AN AI CORPUS?
Those who
studied Latin in school will immediately know that corpus means “body.” (The
modern word for a dead body—“corpse”—is derived from corpus.) Others might
recognize the word corpus because of its use in a legal mechanism still in
place today: habeas corpus. This phrase literally means “you should have the
body” and it ensures that anyone arrested has the right to appear before a
judge (thus, the judge “has the body” of the person arrested) to determine if
that arrest is lawful.
But when
used in the artificial intelligence realm, the term “corpus” doesn’t refer to a
physical body at all. Instead, it refers to the metaphorical “body,” or
collection, of data that was used to train the AI. This corpus is the material
the AI reviews to become intelligent in whatever it was designed for.
Every AI’s
corpus will be different, because it is humans who decide what kind of data
they want to train an AI on. And the corpus the humans decide to train the AI
on will depend on what they want the AI to be proficient in.
TYPES OF CORPORA
There is no
limit to the types of corpora (the plural of corpus) that can exist. What makes
up an AI’s corpus simply depends upon what the human creator of the AI intends
for it to do.
Take
Midjourney, for example. Midjourney is a popular generative art platform for
creating images with AI. Since Midjourney lets a user create images using
nothing but text prompts, its AI needed to be trained on both a series of
images and associated text descriptions. For example, in order for Midjourney to
generate an image of a waterfall, its corpus must have included images of
waterfalls and the accompanying text that labeled a wall of falling water as a
“waterfall.”
Then there
are AI platforms such as ChatGPT, a type of AI known as a large language model,
or LLM. Robust LLMs can hold conversational, text-based chats
with a person—provided their corpus is large and rich enough. And depending on
what its corpus contains, an LLM can also answer complex questions or even
generate original creative works, like short stories or the code to create a
space shooter game. Its abilities simply depend on the data contained in the
corpus that was used to train the AI.
In ChatGPT’s
case, I wanted to know what made up its corpus, so I just asked it. “[The
ChatGPT corpus] consists of a wide range of text from the internet, including
websites, books, articles, and other publicly available sources,” it replied.
Not content with the rather vague answer, I asked ChatGPT to elaborate on the
types of data in its corpus. This time ChatGPT was more detailed:
• Websites: Text from websites across different domains and topics.
• Books: Text from a wide range of books covering various genres and subjects.
• Articles: Text from news articles, magazine features, and blog posts.
• Research Papers: Text from scientific papers and publications.
• Conversational Data: Text from dialogues, conversations, and interactions.
• Social Media: Text from platforms like Twitter, Reddit, and online forums.
• Wikipedia: Text from Wikipedia articles spanning numerous topics.
Notice one
big omission from ChatGPT’s corpus: images. That’s because ChatGPT is a
text-based AI generator. It can’t generate images because its corpus never
contained any to train on.
The data
funneled into Midjourney and ChatGPT are just two examples of what can make up
a corpus. But a corpus can be made of any kind of data. For example, if you
wanted to make an AI that could create music, you would simply include audio
songs in its corpus. Or if you wanted an AI that could write a novel in the
sparse style of Hemingway, you would use a corpus containing only Hemingway’s
written works.
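To make the idea concrete, here is a minimal Python sketch of how a corpus shapes what a model can do. Everything in it is an assumption for illustration: the two tiny corpora are made up, and the "model" is only a table of which word follows which, nothing like the systems behind ChatGPT or Midjourney. It shows the same principle on a small scale: a model can only draw on what its corpus contained.

```python
from collections import defaultdict


def train_bigram_model(corpus):
    """Record, for every word in the corpus, the words that follow it."""
    followers = defaultdict(list)
    for document in corpus:
        words = document.lower().split()
        # Pair each word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word].append(next_word)
    return dict(followers)


# Two tiny, made-up corpora (hypothetical data, for illustration only).
nature_corpus = [
    "the waterfall drops into the river",
    "the river runs to the sea",
]
music_corpus = [
    "the chorus follows the verse",
    "the verse sets up the chorus",
]

nature_model = train_bigram_model(nature_corpus)
music_model = train_bigram_model(music_corpus)

# Each model only "knows" continuations that appeared in its own corpus.
print(sorted(set(nature_model["the"])))  # ['river', 'sea', 'waterfall']
print(sorted(set(music_model["the"])))   # ['chorus', 'verse']
print("waterfall" in music_model)        # False
```

The model trained on the music corpus has never seen the word "waterfall," so it has nothing to say about one, which is the same reason a text-only corpus cannot teach an AI to draw.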
THE LEGALITY OF CORPORA
If you don’t
have a corpus to feed an AI, the AI cannot learn. And the larger your corpus
is, the more proficient, or intelligent, the AI can become. But the actual data
that makes up an AI’s corpus opens up a whole new can of worms when it comes to
copyright and intellectual property law.
Have the
owners of AI that trained on a corpus of copyrighted material violated the law?
For example, if I create an AI that can generate Banksy-like artwork, and I
trained the AI on a corpus of Banksy’s works, have I violated Banksy’s
copyrights or intellectual property? My AI doesn’t reproduce his artwork, just
his style, so is it still a violation of copyright or intellectual property?
Or, say I create an AI with a corpus containing Rihanna’s songs. The AI can
then generate completely new, original songs, but with Rihanna’s voice, or
something close to it. Is that legal?
Universal Music
Group already answered with a hard “no” after AI-generated songs by Drake and
The Weeknd made the rounds on streaming services earlier this year. But
creators who use AI tools might say otherwise. Ultimately, whether it’s in
regard to AI-generated audio, visual, or text-based media, it’s a question that
is likely to tie up courts around the world for years to come as generative AI
programs like ChatGPT and Midjourney become more commonplace.
At the same
time, governments are already planning legislation that would place regulations
on generative AI models. The European Union, for example, is proposing a law
that would require the owner of an AI to divulge whether the AI’s corpus
contained copyrighted material. That transparency would make it easier for
copyright holders to identify which corpora their work has been used in—and
thus seek compensation.
In the
United States, the Congressional Research Service recently advised Congress
that it may wish to “adopt a wait-and-see approach” before updating copyright
legislation, suggesting that it monitor how the courts react in the years ahead
to AI-generated copyright cases.
AI CORPORA AS A REVENUE STREAM
Of course,
some content creators will choose to embrace the revenue-generating
opportunities that AI stands to offer—those that have large enough bodies of
work, anyway. Let’s say a living painter did want to make some extra cash. She
could simply package her collection of works in a corpus and sell access to it
to generative AI companies. Authors could sell a corpus of their novels;
magazine publishers could sell a corpus of their back-issues; and singers could
sell a corpus of their vocals—or demand a part of the cut earned by any
AI-generated work their corpus fueled, as Grimes has already proposed.
Heck, if
Elon Musk wanted a new revenue stream for his flailing Twitter, he might
consider packaging all the tweets on the platform into a corpus to sell to AI
startups. Meta’s Facebook would also find a new revenue stream in this
(provided Twitter and Meta can claim ownership of users’ posts, that is).
Indeed, Reddit’s corpus of users’ posts has been used to help train ChatGPT,
and in a recent interview with The New York Times, Reddit CEO Steve Huffman
said he knew the value of that corpus. “The Reddit corpus of data is really
valuable. But we don’t need to give all of that value to some of the largest
companies in the world for free.”
In this
sense, as more companies expand into the AI space, robust, pre-packaged corpora
may become as important in the tech world as pickaxes were to the miners of
the gold rush, and a whole new cottage industry of corpora sellers may appear.
If that’s
the case, in the months and years ahead, “corpus” is set to become a regular
part of the vernacular when we talk about, and debate, AI.
