Top AI dataset pulls data from BitcoinTalk, Steemit, and U.S. SEC
The Colossal Clean Crawled Corpus (C4), an AI dataset used by major tech companies, contains data from various crypto-related websites.
C4 dataset draws from crypto sites
The Washington Post and the Allen Institute for AI recently analyzed the C4 dataset, ranking websites by the number of “tokens,” the small units of text such as words or word fragments that language models process, taken from each source.
The U.S. Securities and Exchange Commission, whose website includes content on cryptocurrency regulation, was among the dataset’s largest sources. Its website (sec.gov) ranked #39 and accounted for 36 million, or 0.02%, of C4’s tokens.
Bitcointalk.org, a blockchain discussion board created by Satoshi Nakamoto, ranked at #780. It accounted for 6.1 million, or 0.004%, of C4’s tokens.
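As a rough consistency check on these figures, the token counts and percentages reported for sec.gov and Bitcointalk.org can be used to back out the total dataset size that each pair implies. The sketch below is illustrative only: the helper name `implied_total_range` is hypothetical, and it assumes that each percentage was rounded to the precision shown and that the token counts are exact, neither of which the analysis states.

```python
def implied_total_range(site_tokens, share_percent, rounding_step):
    """Return (low, high) bounds on the total token count implied by a
    site's token count and its reported share.

    Assumption: share_percent was rounded to the nearest rounding_step,
    so the true share lies within half a step on either side.
    """
    low_share = (share_percent - rounding_step / 2) / 100
    high_share = (share_percent + rounding_step / 2) / 100
    # A larger share implies a smaller total, so the bounds swap.
    return site_tokens / high_share, site_tokens / low_share


# Figures as reported in the article.
sec_low, sec_high = implied_total_range(36e6, 0.02, 0.01)
btt_low, btt_high = implied_total_range(6.1e6, 0.004, 0.001)

# The two estimates agree wherever their ranges overlap.
overlap_low = max(sec_low, btt_low)
overlap_high = min(sec_high, btt_high)

print(f"sec.gov implies {sec_low / 1e9:.0f}B to {sec_high / 1e9:.0f}B tokens")
print(f"Bitcointalk implies {btt_low / 1e9:.0f}B to {btt_high / 1e9:.0f}B tokens")
print(f"Overlap: {overlap_low / 1e9:.0f}B to {overlap_high / 1e9:.0f}B tokens")
```

Under those assumptions the two ranges overlap, so the reported counts and percentages are consistent with a single dataset total on the order of 150 billion tokens.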
Cryptocurrency news and aggregation sites such as Cointelegraph and Coinmarketcap.com were also represented. Eight such sites collectively accounted for at least 0.008% of C4’s tokens, though other sites likely increase the true total.
Websites related to specific cryptocurrencies and exchanges were also represented in the dataset but accounted for a negligible number of tokens.
Two crypto-adjacent sites also ranked highly. IPFS (ipfs.io) ranked #16, while Steemit (steemit.com) ranked #594. The first is the site of a distributed network from the blockchain firm Protocol Labs, while the second makes direct use of blockchain. However, these sites do not necessarily contain content related to cryptocurrency.
Mainstream sites topped the list
The C4 dataset is used in AI language models from major tech companies including Google’s T5 and Facebook’s LLaMA, according to the Washington Post.
Though the above sites are among C4’s most significant crypto-related websites, they are outranked by mainstream websites and news sources, which often cover cryptocurrency topics and are likely the primary source of the dataset’s crypto-related text.
C4 has also been criticized for containing hate speech and pirated data. Though the dataset’s name suggests that it has been “cleaned,” its assemblers used only a list of 400 words to filter out specific content, meaning that much controversial content remains intact.
The presence of crypto sites, as well as the presence of controversial data, could affect the level of bias seen in content produced by AI chatbots.