
BeInCrypto Among Websites That Helped AI Like ChatGPT Elevate Intellectual Appeal

Updated by Ali M.

In Brief

  • BeInCrypto was included in the C4 dataset for training artificial intelligence (AI).
  • Language models, including the one behind ChatGPT, are trained on text "scraped" from the internet so they can mimic human syntax.
  • CommonCrawl prioritizes trustworthy websites but does not try to avoid licensed or copyrighted material.

BeInCrypto was included in a dataset to train and improve artificial intelligence (AI) tools such as ChatGPT, according to a recent analysis.

The dataset in question is C4, a huge corpus used for training AI. The Washington Post and the Allen Institute for AI recently studied Google’s C4 dataset to determine which sites were feeding into AI tools.

Many large language models have been trained on C4 (which stands for Colossal Clean Crawled Corpus). However, OpenAI’s ChatGPT does not make use of this dataset.

Helping AI Replicate Human Speech

Datasets like C4, and the one used to train ChatGPT, are built by “scraping” the internet for content that is then fed into a large language model. The vastness of the dataset allows AI to mimic human speech.

The Washington Post categorized C4’s websites using data from the web analytics company Similarweb. It then ranked the top 10 million websites by the number of “tokens” each contributed.

Tokens are short chunks of text, usually a word or a phrase, used to break unstructured data into pieces a model can process.
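To illustrate the idea of ranking websites by token count, here is a minimal Python sketch. It is not the method the Washington Post or the Allen Institute used: splitting on whitespace is an assumption standing in for a real tokenizer, and the function names and sample pages are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting approximates tokenization.
    # Real language-model tokenizers split text differently.
    return len(text.split())


def rank_sites_by_tokens(pages):
    """Sum token counts per domain and return (domain, tokens) pairs,
    largest contributor first. `pages` is an iterable of (url, text)."""
    totals = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc
        totals[domain] += count_tokens(text)
    return totals.most_common()


# Hypothetical sample data, for illustration only.
sample_pages = [
    ("https://example.org/a", "one two three four"),
    ("https://example.com/b", "five six"),
    ("https://example.org/c", "seven eight nine"),
]

ranking = rank_sites_by_tokens(sample_pages)
for domain, tokens in ranking:
    print(domain, tokens)
```

In this toy example, the two pages from the same domain are combined before ranking, which mirrors how a site's total contribution to a dataset would be tallied.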

AI Categorized Websites. Source: The Washington Post

The three largest contributors to the dataset were patents.google.com, wikipedia.org, and scribd.com, a subscription-based digital library. News organizations also ranked highly, with the Guardian, New York Times, Forbes, LA Times, and Huffington Post all in the top 10.

Data for C4 Was First Scraped in 2019

Other websites that feature heavily include Instructables, an online platform for sharing DIY instructions and how-tos. The researchers also found at least 27 other sites identified by the U.S. government as markets for piracy and counterfeits.

C4 began life as a single scrape by the non-profit CommonCrawl in 2019. The organization told the Washington Post that it does not try to avoid licensed or copyrighted material. However, it does try to prioritize high-quality and trustworthy websites where data is free to use and analyze.

As AI technology continues to threaten various industries, scraping content for large language models has become increasingly controversial, particularly in sectors most at risk from AI.

AI training companies do not compensate content creators for the use of their work. Moreover, artists have recently hit AI image tools Midjourney and Stable Diffusion with a copyright lawsuit. The suit claims generative AI art tools violate copyright law by scraping artists’ work without their consent.


Josh Adams
Josh is a reporter at BeInCrypto. He first worked as a journalist over a decade ago, initially covering music before moving into politics and current affairs. Josh first owned Bitcoin in 2014 and has followed the space ever since. He is particularly interested in Web3 adoption, policy and regulation, CBDCs, privacy, and the future of the metaverse.