
What is “corpus”? And why is everyone in AI suddenly talking about it? Here’s what you need to know.

 


Writer Michael Grothaus

Bill Gates, Reddit’s CEO, and other tech leaders are increasingly talking about their "corpus." Now is the time to learn what that means.

Thanks to ChatGPT and similar platforms, the rise of artificial intelligence has been one of the most headline-grabbing subjects of 2023. Not a day goes by without a new article coming out about some way AI tech spells either doom or salvation for the creative fields, your job, or humanity.

And if you’ve been reading these articles, you might have noticed one particular word being thrown around by tech executives recently: “corpus.” Reddit’s CEO has mentioned it; so has Wikipedia’s founder Jimmy Wales; and so has Microsoft founder Bill Gates.

Here’s what it means, and why it’s critical to understanding how artificial intelligence platforms like ChatGPT and Midjourney operate.

WHAT IS AN AI CORPUS?

Those who studied Latin in school will immediately know that corpus means “body.” (The modern word for a dead body—“corpse”—is derived from corpus.) Others might recognize the word corpus because of its use in a legal mechanism still in place today: habeas corpus. This phrase literally means “you should have the body” and it ensures that anyone arrested has the right to appear before a judge (thus, the judge “has the body” of the person arrested) to determine if that arrest is lawful.

But when used in the artificial intelligence realm, the term “corpus” doesn’t refer to a physical body at all. Instead, it refers to the metaphorical “body,” or collection, of data that was used to train the AI. This corpus is the material the AI reviews to become intelligent in whatever it was designed for.

Every AI’s corpus will be different, because it is humans who decide what kind of data they want to train an AI on. And the corpus they choose will depend on what they want the AI to be proficient in.

TYPES OF CORPORA

There is no limit to the types of corpora (the plural of corpus) that can exist. What makes up an AI’s corpus simply depends upon what the human creator of the AI intends for it to do.

Take Midjourney, for example. Midjourney is a popular generative art platform for creating images with AI. Since Midjourney lets a user create images using nothing but text prompts, its AI needed to be trained on both a series of images and associated text descriptions. For example, in order for Midjourney to generate an image of a waterfall, its corpus must have included images of waterfalls and the accompanying text that labeled a wall of falling water as a “waterfall.”
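To make the idea of paired images and descriptions concrete, here is a minimal sketch of what such a corpus could look like as data. It is only an illustration: the `CaptionedImage` type, the file names, the captions, and the `images_labeled` helper are all hypothetical, and nothing here reflects how Midjourney actually stores or uses its training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionedImage:
    """One corpus entry: an image plus the text that describes it."""
    image_path: str  # hypothetical file name, for illustration only
    caption: str


# A tiny, made-up corpus of image/description pairs.
corpus = [
    CaptionedImage("img_001.jpg", "a waterfall in a forest"),
    CaptionedImage("img_002.jpg", "a city street at night"),
    CaptionedImage("img_003.jpg", "a frozen waterfall in winter"),
]


def images_labeled(corpus, word):
    """Return the paths of images whose caption contains the given word."""
    word = word.lower()
    return [
        item.image_path
        for item in corpus
        if word in item.caption.lower().split()
    ]


# The corpus has examples to learn "waterfall" from, but none for "volcano".
waterfall_examples = images_labeled(corpus, "waterfall")
volcano_examples = images_labeled(corpus, "volcano")
```

In this toy corpus, a request for a waterfall has labeled examples behind it, while a request for a volcano has none, which is the same dependency on the corpus that the waterfall example above describes.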

Then there are AI platforms such as ChatGPT, a type of AI known as a large language model, or LLM. Robust LLMs have the ability to have conversational text-based chats with a person, provided their corpus is large and rich enough. And depending on what its corpus contains, an LLM can also answer complex questions or even generate original creative works, like short stories or the code to create a space shooter game. Its abilities simply depend on the data contained in the corpus that was used to train the AI.
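A toy example can show why a model's abilities depend on its corpus. The sketch below is not how ChatGPT or any real LLM works; it is a far simpler stand-in that only records which word follows which in the training text. The function name `train_bigram_model` and the two sample sentences are made up for illustration.

```python
from collections import defaultdict


def train_bigram_model(corpus):
    """Record, for every word in the corpus, the set of words seen right after it."""
    model = defaultdict(set)
    for document in corpus:
        words = document.lower().split()
        for current, following in zip(words, words[1:]):
            model[current].add(following)
    return model


# A made-up, two-sentence corpus.
corpus = [
    "the old man fished alone",
    "the sea was calm",
]

model = train_bigram_model(corpus)

# The model can continue words that appeared in its corpus...
known_next = sorted(model["the"])
# ...but it knows nothing about words the corpus never contained.
unknown_next = sorted(model.get("waterfall", set()))
```

Here `known_next` holds the two words that followed "the" in the training text, while `unknown_next` is empty because "waterfall" never appeared in the corpus. Real language models are vastly more sophisticated, but the same limit applies: they can only draw on what their corpus gave them.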

In ChatGPT’s case, I wanted to know what made up its corpus, so I just asked it. “[The ChatGPT corpus] consists of a wide range of text from the internet, including websites, books, articles, and other publicly available sources,” it replied. Not content with the rather vague answer, I asked ChatGPT to elaborate on the types of data in its corpus. This time ChatGPT was more detailed:

• Websites: Text from websites across different domains and topics.

• Books: Text from a wide range of books covering various genres and subjects.

• Articles: Text from news articles, magazine features, and blog posts.

• Research Papers: Text from scientific papers and publications.

• Conversational Data: Text from dialogues, conversations, and interactions.

• Social Media: Text from platforms like Twitter, Reddit, and online forums.

• Wikipedia: Text from Wikipedia articles spanning numerous topics.


Notice one big omission from ChatGPT’s corpus: images. That’s because ChatGPT is a text-based AI generator. It can’t generate images because its corpus never contained any to train on.

The data funneled into Midjourney and ChatGPT are just two examples of what can make up a corpus. But a corpus can be made of any kind of data. For example, if you wanted to make an AI that could create music, you would simply include audio songs in its corpus. Or if you wanted an AI that could write a novel in the sparse style of Hemingway, you would use a corpus containing only Hemingway’s written works.

THE LEGALITY OF CORPORA

If you don’t have a corpus to feed an AI, the AI cannot learn. And the larger your corpus is, the more proficient, or intelligent, the AI can become. But the actual data that makes up an AI’s corpus opens up a whole new can of worms when it comes to copyright and intellectual property law.


Have the owners of AI that trained on a corpus of copyrighted material violated the law? For example, if I create an AI that can generate Banksy-like artwork, and I trained the AI on a corpus of Banksy’s works, have I violated Banksy’s copyrights or intellectual property? My AI doesn’t reproduce his artwork, just his style, so is it still a violation of copyright or intellectual property? Or, say I create an AI with a corpus containing Rihanna’s songs. The AI can then generate completely new, original songs, but with Rihanna’s voice, or something close to it. Is that legal?

Universal Music Group already answered with a hard “no” after AI-generated songs imitating Drake and The Weeknd made the rounds on streaming services earlier this year. But creators who use AI tools might say otherwise. Ultimately, whether it’s in regard to AI-generated audio, visual, or text-based media, it’s a question that is likely to tie up courts around the world for years to come as generative AI programs like ChatGPT and Midjourney become more commonplace.

At the same time, governments are already planning legislation that would place regulations on generative AI models. The European Union, for example, is proposing a law that would require the owner of an AI to divulge whether the AI’s corpus contained copyrighted material. That transparency would make it easier for copyright holders to identify which corpora their work has been used in—and thus seek compensation.

In the United States, the Congressional Research Service recently advised Congress that it may wish to “adopt a wait-and-see approach” before updating copyright legislation, suggesting that it monitor how the courts react in the years ahead to AI-generated copyright cases.

AI CORPORA AS A REVENUE STREAM

Of course, some content creators will choose to embrace the revenue-generating opportunities that AI stands to offer—those that have large enough bodies of work, anyway. Let’s say a living painter did want to make some extra cash. She could simply package her collection of works in a corpus and sell access to it to generative AI companies. Authors could sell a corpus of their novels; magazine publishers could sell a corpus of their back-issues; and singers could sell a corpus of their vocals—or demand a part of the cut earned by any AI-generated work their corpus fueled, as Grimes has already proposed.

Heck, if Elon Musk wanted a new revenue stream for his flailing Twitter, he might consider packaging all the tweets on the platform into a corpus to sell to AI startups. Meta’s Facebook would also find a new revenue stream in this (provided Twitter and Meta can claim ownership of users’ posts, that is). Indeed, Reddit’s corpus of users’ posts has been used to help train ChatGPT, and in a recent interview with The New York Times, Reddit CEO Steve Huffman said he knew the value of that corpus. “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.”

In this sense, as more companies expand into the AI space, robust, pre-packaged corpora may become as important in the tech world as pickaxes were to the miners of the gold rush, and a whole new cottage industry of corpora sellers may appear.

If that’s the case, in the months and years ahead, “corpus” is set to become a regular part of the vernacular when we talk about, and debate, AI.

About the author

Michael Grothaus is a novelist and author. His new novel, the speculative fiction 'BEAUTIFUL SHINING PEOPLE', is out now.

