Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers'

1 month 1 week ago
For more than a decade, the nonprofit Common Crawl "has been scraping billions of webpages to build a massive archive of the internet," notes the Atlantic, making it freely available for research. "In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..."

Common Crawl's website states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free.

Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not. I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, "Generative AI in its current form would probably not be possible without Common Crawl."

In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.

Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said.

Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles.
Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls. "In the past year, Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites," the article points out...
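The client-side paywall behavior described in the excerpt is easy to picture in code. The sketch below is illustrative only (the URL and user-agent string are placeholders, and this is not Common Crawl's actual crawler): it fetches a page's raw HTML with Python's requests library and extracts the paragraph text, which is whatever the server sent before any paywall JavaScript would have had a chance to run in a browser.

    # Minimal sketch: a scraper that never executes JavaScript sees whatever
    # text the server includes in the initial HTML, even if a client-side
    # paywall script would later hide it in a real browser.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://example.com/2024/01/01/some-article.html",  # placeholder URL
        headers={"User-Agent": "example-research-bot/0.1"},   # placeholder UA
        timeout=30,
    )
    resp.raise_for_status()

    # Any article text present in the static HTML is recoverable here,
    # because the paywall JavaScript is simply never run.
    soup = BeautifulSoup(resp.text, "html.parser")
    print("\n".join(p.get_text(strip=True) for p in soup.find_all("p")))

The "no captures" claim can likewise be checked against Common Crawl's public URL index. The query below is a hedged example (the crawl label CC-MAIN-2022-05 is just one of many; the list of available crawls is published at index.commoncrawl.org/collinfo.json, and the exact parameters should be confirmed against Common Crawl's index documentation):

    # Ask the public Common Crawl CDX index what it reports for a domain.
    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2022-05-index"  # example crawl
    resp = requests.get(
        INDEX,
        params={"url": "nytimes.com/*", "output": "json", "limit": 5},
        timeout=60,
    )
    if resp.status_code == 404:
        # The index reports "no captures" for queries with no matching records.
        print("no captures reported for this query")
    else:
        resp.raise_for_status()
        for line in resp.text.splitlines():
            print(line)  # one JSON record per captured URL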

Read more of this story at Slashdot.

EditorDavid

Scientists Edit Gene in 15 Patients That May Permanently Reduce High Cholesterol

1 month 1 week ago
A CRISPR-based drug given to study participants by infusion is raising hopes for a much easier way to lower cholesterol, reports CNN:

With a snip of a gene, doctors may one day permanently lower dangerously high cholesterol, possibly removing the need for medication, according to a new pilot study published Saturday in the New England Journal of Medicine. The study was extremely small — only 15 patients with severe disease — and was meant to test the safety of a new medication delivered by CRISPR-Cas9, a biological sort of scissors that cuts a targeted gene to modify it or turn it on or off.

Preliminary results, however, showed nearly a 50% reduction in low-density lipoprotein, or LDL, the "bad" cholesterol which plays a major role in heart disease — the No. 1 killer of adults in the United States and worldwide. The study, which will be presented Saturday at the American Heart Association Scientific Sessions in New Orleans, also found an average 55% reduction in triglycerides, a different type of fat in the blood that is also linked to an increased risk of cardiovascular disease.

"We hope this is a permanent solution, where younger people with severe disease can undergo a 'one and done' gene therapy and have reduced LDL and triglycerides for the rest of their lives," said senior study author Dr. Steven Nissen, chief academic officer of the Sydell and Arnold Miller Family Heart, Vascular & Thoracic Institute at Cleveland Clinic in Ohio....

Today, cardiologists want people with existing heart disease or those born with a predisposition for hard-to-control cholesterol to lower their LDL well below 100, which is the average in the US, said Dr. Pradeep Natarajan, director of preventive cardiology at Massachusetts General Hospital and associate professor of medicine at Harvard Medical School in Boston... People with a nonfunctioning ANGPTL3 gene — which Natarajan says applies to about 1 in 250 people in the US — have lifelong levels of low LDL cholesterol and triglycerides without any apparent negative consequences. They also have exceedingly low or no risk for cardiovascular disease.

"It's a naturally occurring mutation that's protective against cardiovascular disease," said Nissen, who holds the Lewis and Patricia Dickey Chair in Cardiovascular Medicine at Cleveland Clinic. "And now that CRISPR is here, we have the ability to change other people's genes so they too can have this protection."

Phase 2 clinical trials will begin soon, quickly followed by Phase 3 trials, which are designed to show the effect of the drug on a larger population, Nissen said. And CNN quotes Nissen as saying "We hope to do all this by the end of next year. We're moving very fast because this is a huge unmet medical need — millions of people have these disorders and many of them are not on treatment or have stopped treatment for whatever reason."

Read more of this story at Slashdot.

EditorDavid

Bank of America Faces Lawsuit Over Alleged Unpaid Time for Windows Bootup, Logins, and Security Token Requests

1 month 1 week ago
A former business analyst reportedly filed a class action lawsuit claiming that for years, hundreds of remote employees at Bank of America first had to boot up complex computer systems before their paid work began, reports Human Resources Director magazine:

Tava Martin, who worked both remotely and at the company's Jacksonville facility, says the financial institution required her and fellow hourly workers to log into multiple security systems, download spreadsheets, and connect to virtual private networks — all before the clock started ticking on their workday. The process wasn't quick. According to the filing in the United States District Court for the Western District of North Carolina, employees needed 15 to 30 minutes each morning just to get their systems running. When technical problems occurred, it took even longer...

Workers turned on their computers, waited for Windows to load, grabbed their cell phones to request a security token for the company's VPN, waited for that token to arrive, logged into the network, opened required web applications with separate passwords, and downloaded the Excel files they needed for the day. Only then could they start taking calls from business customers about regulatory reporting requirements...

The unpaid work didn't stop at startup. During unpaid lunch breaks, many systems would automatically disconnect or otherwise lose connection, forcing employees to repeat portions of the login process — approximately three to five minutes of uncompensated time on most days, sometimes longer when a complete reboot was required. After shifts ended, workers had to log out of all programs and shut down their computers securely, adding another two to three minutes.

Thanks to Slashdot reader Joe_Dragon for sharing the article.
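As a rough sketch of what those figures add up to over time (the per-day minutes are the ranges reported in the filing; the five-day week and 48 working weeks per year are assumptions made only for this illustration):

    # Back-of-the-envelope estimate of the uncompensated time alleged in the
    # complaint. Per-day minutes are the ranges cited in the article; the
    # 5-day week and 48-week year are assumptions for illustration only.
    DAYS_PER_WEEK = 5
    WEEKS_PER_YEAR = 48

    low_per_day = 15 + 3 + 2    # minutes: boot/login + lunch re-login + shutdown
    high_per_day = 30 + 5 + 3

    for label, per_day in (("low end", low_per_day), ("high end", high_per_day)):
        per_week = per_day * DAYS_PER_WEEK
        hours_per_year = per_week * WEEKS_PER_YEAR / 60
        print(f"{label}: {per_day} min/day, {per_week} min/week, "
              f"about {hours_per_year:.0f} hours/year")

Under those assumptions, the low end works out to roughly 80 unpaid hours a year per employee, and the high end to roughly 150.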

Read more of this story at Slashdot.

EditorDavid