Top AI dataset pulls data from BitcoinTalk, Steemit, and U.S. SEC

exchange68, September 11, 2024


The Colossal Clean Crawled Corpus (C4), an AI dataset used by major tech companies, contains data from various crypto-related websites.

C4 dataset draws from crypto sites

The Washington Post and the Allen Institute for AI recently analyzed the C4 dataset, ranking websites by the number of “tokens” or text snippets taken from each source.
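As a rough illustration of this kind of ranking, the sketch below totals tokens per domain and sorts the domains by count. It is not the method used in the analysis: the function name `rank_domains`, the sample documents, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(documents):
    """Return (domain, token_count, share) tuples, largest token count first.

    `documents` is an iterable of (url, text) pairs. Whitespace splitting
    is a stand-in for a real tokenizer (assumption for this sketch).
    """
    counts = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same source.
        if domain.startswith("www."):
            domain = domain[4:]
        counts[domain] += len(text.split())
    total = sum(counts.values())
    return [(domain, n, n / total) for domain, n in counts.most_common()]


# Hypothetical sample data, not taken from C4.
sample = [
    ("https://www.sec.gov/a", "one two three four"),
    ("https://bitcointalk.org/b", "one two"),
    ("https://sec.gov/c", "one two"),
]
ranking = rank_domains(sample)
for rank, (domain, tokens, share) in enumerate(ranking, start=1):
    print(f"#{rank} {domain}: {tokens} tokens ({share:.2%})")
```

In this toy sample, sec.gov ranks first because its two pages are merged into one domain before counting.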

The website of the U.S. Securities and Exchange Commission (sec.gov), which in part contains content on cryptocurrency regulation, was among the dataset’s largest sources. It ranked #39 and accounted for 36 million, or 0.02%, of C4’s tokens.

Bitcointalk.org, a blockchain discussion board created by Satoshi Nakamoto, ranked at #780. It accounted for 6.1 million, or 0.004%, of C4’s tokens.
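The token counts and percentages quoted for sec.gov and Bitcointalk.org can be cross-checked with simple arithmetic. The sketch below divides each token count by its reported share to estimate the dataset's overall size. The helper name `implied_total` is an assumption for the example, and because the published percentages are rounded, the two estimates only roughly agree.

```python
def implied_total(tokens, share_percent):
    """Estimate total dataset tokens from one site's count and its percentage share."""
    return tokens / (share_percent / 100)


# Figures as reported in the article; the percentages are rounded.
sec_total = implied_total(36_000_000, 0.02)
bitcointalk_total = implied_total(6_100_000, 0.004)

print(f"Implied total from sec.gov:         {sec_total / 1e9:.1f} billion tokens")
print(f"Implied total from bitcointalk.org: {bitcointalk_total / 1e9:.1f} billion tokens")
```

Both figures point to a dataset on the order of 150 to 180 billion tokens, with the gap between them explained by rounding in the reported shares.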

Cryptocurrency news and aggregation sites such as Cointelegraph and Coinmarketcap.com were also represented. Eight such sites collectively accounted for at least 0.008% of C4’s tokens, though other sites likely increase the true total.

Websites related to specific cryptocurrencies and exchanges were also represented in the dataset but accounted for a negligible number of tokens.

Two crypto-adjacent sites also ranked highly. IPFS (ipfs.io) ranked at #16 while Steemit (steemit.com) ranked at #594. The first site is a distributed network from the blockchain firm Protocol Labs, while the second makes direct use of blockchain. However, these sites do not necessarily contain content related to cryptocurrency.

Mainstream sites topped the list

The C4 dataset is used in AI language models from major tech companies including Google’s T5 and Facebook’s LLaMA, according to the Washington Post.

Though the above sites are among C4’s most significant crypto-related sources, they are outranked by mainstream websites and news outlets, which often cover cryptocurrency topics and are likely the primary source of crypto-related data in the dataset.

C4 has also been criticized for containing hate speech and pirated data. Though the dataset’s name suggests that it has been “cleaned,” its assemblers only used a list of 400 words to censor specific content, meaning that controversial content remains intact.
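A minimal sketch can show why a fixed word list leaves much content untouched. It assumes a filter that drops any document containing an exact blocklisted word, which may differ from the actual C4 pipeline. The function `passes_blocklist`, the placeholder words, and the sample documents are all hypothetical.

```python
import re


def passes_blocklist(text, blocklist):
    """Return True if no word in `text` exactly matches a blocklisted word."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return words.isdisjoint(blocklist)


# Hypothetical placeholder terms standing in for a real word list.
BLOCKLIST = {"badword", "slurword"}

docs = [
    "This page contains a badword and is dropped.",
    "This page is hostile but avoids every listed term.",
    "Badwords in plural form slip through an exact-match filter.",
]
kept = [doc for doc in docs if passes_blocklist(doc, BLOCKLIST)]
print(f"Kept {len(kept)} of {len(docs)} documents")
```

Only the first document is removed. Text that is objectionable without using a listed word, or that uses a variant spelling, passes this kind of filter unchanged.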

The presence of crypto sites, as well as the presence of controversial data, could affect the level of bias seen in content produced by AI chatbots.


Author

Mike Dalton

Journalist at CryptoSlate

Before transitioning to crypto writing in 2018, Mike studied library and information sciences. Currently, he resides on Canada’s West Coast.
