What is “corpus”? And why is everyone in AI suddenly talking about it? Here’s what you need to know.

 


Writer Michael Grothaus

Bill Gates, Reddit’s CEO, and other tech leaders are increasingly talking about their “corpus.” Now is the time to learn what that means.

Thanks to ChatGPT and similar platforms, the rise of artificial intelligence has been one of the most headline-grabbing subjects of 2023. Not a day goes by without a new article coming out about some way AI tech spells either doom or salvation for the creative fields, your job, or humanity.

And if you’ve been reading these articles, you might have noticed one particular word being thrown around by tech executives recently: “corpus.” Reddit’s CEO has mentioned it; so has Wikipedia’s founder Jimmy Wales; and so has Microsoft founder Bill Gates.

Here’s what it means, and why it’s critical to understanding how artificial intelligence platforms like ChatGPT and Midjourney operate.

WHAT IS AN AI CORPUS?

Those who studied Latin in school will immediately know that corpus means “body.” (The modern word for a dead body—“corpse”—is derived from corpus.) Others might recognize the word corpus because of its use in a legal mechanism still in place today: habeas corpus. This phrase literally means “you should have the body” and it ensures that anyone arrested has the right to appear before a judge (thus, the judge “has the body” of the person arrested) to determine if that arrest is lawful.

But when used in the artificial intelligence realm, the term “corpus” doesn’t refer to a physical body at all. Instead, it refers to the metaphorical “body,” or collection, of data that was used to train the AI. This corpus is the material the AI reviews to become intelligent in whatever it was designed for.

Every AI’s corpus will be different, because it is humans who decide what kind of data they want to train an AI on. And the corpus the humans choose will depend on what they want the AI to be proficient in.

TYPES OF CORPORA

There is no limit to the types of corpora (the plural of corpus) that can exist. What makes up an AI’s corpus simply depends upon what the human creator of the AI intends for it to do.

Take Midjourney, for example. Midjourney is a popular generative art platform for creating images with AI. Since Midjourney lets a user create images using nothing but text prompts, its AI needed to be trained on both a series of images and associated text descriptions. For example, in order for Midjourney to generate an image of a waterfall, its corpus must have included images of waterfalls and the accompanying text that labeled a wall of falling water as a “waterfall.”

Then there are AI platforms such as ChatGPT, a type of AI known as a large language model, or LLM. Robust LLMs can hold conversational, text-based chats with a person, provided their corpus is large and rich enough. And depending on what its corpus contains, an LLM can also answer complex questions or even generate original creative works, like short stories or the code to create a space shooter game. Its abilities simply depend on the data contained in the corpus that was used to train the AI.

In ChatGPT’s case, I wanted to know what made up its corpus, so I just asked it. “[The ChatGPT corpus] consists of a wide range of text from the internet, including websites, books, articles, and other publicly available sources,” it replied. Not content with the rather vague answer, I asked ChatGPT to elaborate on the types of data in its corpus. This time ChatGPT was more detailed:

• Websites: Text from websites across different domains and topics.

• Books: Text from a wide range of books covering various genres and subjects.

• Articles: Text from news articles, magazine features, and blog posts.

• Research Papers: Text from scientific papers and publications.

• Conversational Data: Text from dialogues, conversations, and interactions.

• Social Media: Text from platforms like Twitter, Reddit, and online forums.

• Wikipedia: Text from Wikipedia articles spanning numerous topics.

ChatGPT

Notice one big omission from ChatGPT’s corpus: images. That’s because ChatGPT is a text-based AI generator. It can’t generate images because its corpus never contained any to train on.

The data funneled into Midjourney and ChatGPT are just two examples of what can make up a corpus. But a corpus can be made of any kind of data. For example, if you wanted to make an AI that could create music, you would simply include audio songs in its corpus. Or if you wanted an AI that could write a novel in the sparse style of Hemingway, you would use a corpus containing only Hemingway’s written works.
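To make the idea concrete, here is a minimal sketch in Python of a toy text generator that learns only from the corpus it is given. It is an illustration, not how ChatGPT or Midjourney actually work: real systems are vastly more sophisticated, and the sample sentences and the function names (train_bigram_model, generate) are invented for this example. The toy simply records which word follows which in its corpus, so it can never produce a word its corpus did not contain.

```python
import random
from collections import defaultdict


def train_bigram_model(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    model = defaultdict(list)
    for document in corpus:
        words = document.lower().split()
        # Pair every word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            model[current_word].append(next_word)
    return dict(model)


def generate(model, start_word, max_words=8, seed=0):
    """Build a sentence by repeatedly picking a word seen after the last one."""
    rng = random.Random(seed)
    words = [start_word]
    # Stop when the length limit is hit or the last word has no known follower.
    while len(words) < max_words and words[-1] in model:
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)


# A tiny, made-up corpus (hypothetical sample sentences, not real training data).
toy_corpus = [
    "the old man looked at the sea",
    "the sea was calm and the fish was strong",
]

model = train_bigram_model(toy_corpus)
vocabulary = {word for document in toy_corpus for word in document.split()}

# Every generated word comes from the corpus.
print(generate(model, "the"))

# "waterfall" never appeared in the corpus, so the model has nothing to say about it.
print(generate(model, "waterfall"))
```

Swap in a different corpus and the same code produces different text, which is the point: the training data, not the program, determines what this toy can say.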

THE LEGALITY OF CORPORA

If you don’t have a corpus to feed an AI, the AI cannot learn. And the larger your corpus is, the more proficient, or intelligent, the AI can become. But the actual data that makes up an AI’s corpus opens up a whole new can of worms when it comes to copyright and intellectual property law.

Have the owners of AI that trained on a corpus of copyrighted material violated the law? For example, if I create an AI that can generate Banksy-like artwork, and I trained the AI on a corpus of Banksy’s works, have I violated Banksy’s copyrights or intellectual property? My AI doesn’t reproduce his artwork, just his style, so is it still a violation of copyright or intellectual property? Or, say I create an AI with a corpus containing Rihanna’s songs. The AI can then generate completely new, original songs, but with Rihanna’s voice, or something close to it. Is that legal?

Universal Music Group already answered with a hard “no” after AI-generated songs imitating Drake and The Weeknd made the rounds on streaming services earlier this year. But creators who use AI tools might say otherwise. Ultimately, whether it’s in regard to AI-generated audio, visual, or text-based media, it’s a question that is likely to tie up courts around the world for years to come as generative AI programs like ChatGPT and Midjourney become more commonplace.

At the same time, governments are already planning legislation that would place regulations on generative AI models. The European Union, for example, is proposing a law that would require the owner of an AI to divulge whether the AI’s corpus contained copyrighted material. That transparency would make it easier for copyright holders to identify which corpora their work has been used in—and thus seek compensation.

In the United States, the Congressional Research Service recently advised Congress that it may wish to “adopt a wait-and-see approach” before updating copyright legislation, suggesting that it monitor how the courts react in the years ahead to AI-generated copyright cases.

AI CORPORA AS A REVENUE STREAM

Of course, some content creators, at least those who have large enough bodies of work, will choose to embrace the revenue-generating opportunities that AI stands to offer. Let’s say a living painter did want to make some extra cash. She could simply package her collection of works in a corpus and sell access to it to generative AI companies. Authors could sell a corpus of their novels; magazine publishers could sell a corpus of their back issues; and singers could sell a corpus of their vocals, or demand a part of the cut earned by any AI-generated work their corpus fueled, as Grimes has already proposed.

Heck, if Elon Musk wanted a new revenue stream for his flailing Twitter, he might consider packaging all the tweets on the platform into a corpus to sell to AI startups. Meta’s Facebook would also find a new revenue stream in this (provided Twitter and Meta can claim ownership of users’ posts, that is). Indeed, Reddit’s corpus of users’ posts has been used to help train ChatGPT, and in a recent interview with The New York Times, Reddit CEO Steve Huffman said he knew the value of that corpus. “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.”

In this sense, as more companies expand into the AI space, robust, pre-packaged corpora may become as important in the tech world as pickaxes were to the miners of the gold rush, and a whole new cottage industry of corpora sellers may appear.

If that’s the case, in the months and years ahead, “corpus” is set to become a regular part of the vernacular when we talk about, and debate, AI.

About the author

Michael Grothaus is a novelist and author. His new novel, the speculative fiction 'BEAUTIFUL SHINING PEOPLE', is out now.

