Share

BeInCrypto Among Websites That Helped AI Like ChatGPT Elevate Intellectual Appeal

Prefer us on Google

Written by

Josh Adams

Edited by

Ali Martinez

20 April 2023, 18:31 UTC

•

Updated 20 April 2023, 21:51 UTC

BeInCrypto was included in the C4 dataset for training artificial intelligence (AI).
Language models and those used by ChatGPT "scrape" the internet to mimic human syntax.
CommonCrawl includes trustworthy websites and non-licensed and copyrighted materials.

#AI News

BeInCrypto was included in a dataset to train and improve artificial intelligence (AI) tools such as ChatGPT, according to a recent analysis.

BeInCrypto has been included in a huge dataset for training AI called C4. The Washington Post and the Allen Institute for AI recently studied Google’s C4 dataset to determine what sites were feeding into AI tools.

Many large language models have used C4 (which stands for Colossal Clean Crawled Corpus) as an instructional tool. However, Open AI‘s ChatGPT does not make use of this dataset.

Sponsored

Helping AI Replicate Human Speech

Large language models like C4, and that employed by ChatGPT, “scrape” the internet for content to include in their model. The vastness of the dataset allows AI to mimic human speech.

The Washington Post sorted C4’s websites using data from the web analytics company, Similarweb. Then, they ranked the top 10 million websites by the number of “tokens” they contributed.

Tokens refer to short chunks of text utilized to make sense of unstructured data, usually consisting of a word or a phrase.

BeInCrypto Contributed to AI Artificial Intelligence ChatGPT — AI Categorized Websites. Source: The Washington Post

The three largest contributors to the dataset were patents.google.com, wikipedia.org, and scribd.com, a subscription-based digital library. And news organizations dominated the top ranks, with the Guardian, New York Times, Forbes, LA Times, and Huffington Post crowding the top 10.

Data for C4 was First Scraped in 2019

Other websites to feature heavily include Instructables, an online platform for sharing DIY instructions and how-tos. And the researchers also found at least 27 other sites identified by the U.S. government as markets for piracy and counterfeits.

C4 began life as a single scrape by the non-profit CommonCrawl in 2019. They told the Washington Post that it does not try to avoid licensed or copyrighted material. However, it does try to prioritize high-quality and trustworthy websites where data is free to use and analyze.

As AI technology continues to threaten various industries, scraping content for large language models has become increasingly controversial, particularly in sectors most at risk from AI.

AI training companies do not compensate content creators for the use of their work. Moreover, artists have recently hit AI image tools Midjourney and Stable Diffusion with a copyright lawsuit. And the suit claims generative AI art tools violate the copyright law by scraping artists’ work without their consent.

To read the latest cryptocurrency market analysis from BeInCrypto, click here .

Disclaimer

BeInCrypto is committed to unbiased, transparent reporting. This news article aims to provide accurate, timely information. However, readers are advised to verify facts independently and consult with a professional before making any decisions based on this content. Please note that our Terms and Conditions, Privacy Policy, and Disclaimers have been updated.

Sponsored