BeInCrypto was included in a dataset to train and improve artificial intelligence (AI) tools such as ChatGPT, according to a recent analysis.
BeInCrypto has been included in a huge dataset for training AI called C4. The Washington Post and the Allen Institute for AI recently studied Google’s C4 dataset to determine what sites were feeding into AI tools.
Many large language models have used C4 (which stands for Colossal Clean Crawled Corpus) as an instructional tool. However, Open AI‘s ChatGPT does not make use of this dataset.
Helping AI Replicate Human Speech
Large language models like C4, and that employed by ChatGPT, “scrape” the internet for content to include in their model. The vastness of the dataset allows AI to mimic human speech.
The Washington Post sorted C4’s websites using data from the web analytics company, Similarweb. Then, they ranked the top 10 million websites by the number of “tokens” they contributed.
Tokens refer to short chunks of text utilized to make sense of unstructured data, usually consisting of a word or a phrase.
The three largest contributors to the dataset were patents.google.com, wikipedia.org, and scribd.com, a subscription-based digital library. And news organizations dominated the top ranks, with the Guardian, New York Times, Forbes, LA Times, and Huffington Post crowding the top 10.
Data for C4 was First Scraped in 2019
Other websites to feature heavily include Instructables, an online platform for sharing DIY instructions and how-tos. And the researchers also found at least 27 other sites identified by the U.S. government as markets for piracy and counterfeits.
C4 began life as a single scrape by the non-profit CommonCrawl in 2019. They told the Washington Post that it does not try to avoid licensed or copyrighted material. However, it does try to prioritize high-quality and trustworthy websites where data is free to use and analyze.
As AI technology continues to threaten various industries, scraping content for large language models has become increasingly controversial, particularly in sectors most at risk from AI.
AI training companies do not compensate content creators for the use of their work. Moreover, artists have recently hit AI image tools Midjourney and Stable Diffusion with a copyright lawsuit. And the suit claims generative AI art tools violate the copyright law by scraping artists’ work without their consent.
Disclaimer
In adherence to the Trust Project guidelines, BeInCrypto is committed to unbiased, transparent reporting. This news article aims to provide accurate, timely information. However, readers are advised to verify facts independently and consult with a professional before making any decisions based on this content. Please note that our Terms and Conditions, Privacy Policy, and Disclaimers have been updated.