
Reddit Wants 'Deeper Integration' with Google in Exchange for Licensed AI Training Data

1 month ago
Reddit's content became AI training data last year when Google signed a $60 million-per-year licensing agreement. But now Reddit is "in early talks" about a new deal seeking "deeper integration with Google's AI products," reports Bloomberg (citing executives familiar with the discussions). And Reddit also wants "a deal structure that could allow for dynamic pricing, where the social platform can be paid more" — with both Google and OpenAI — to "adequately reflect how valuable their data has been to these platforms..."

Such licensing agreements are becoming more common as AI companies seek legal ways to train their models. OpenAI has also struck a series of partnership agreements with major media publishers such as Axel Springer SE, Time and Condé Nast to use their content in ChatGPT... Reddit remains among the most cited sources across AI platforms, according to analytics company Profound AI. However, Reddit executives have noticed that traffic coming from Google has limited value, as users seeking answers to a specific question often don't convert into becoming active Redditors, the people said.

Now, Reddit is engaging with product teams at Google in hopes of finding ways to send more of its users deeper into its ecosystem of community forums, according to the executives. In return, Reddit is looking for ways to provide more high-quality data to its AI partners. Discussions between Reddit and Google have been productive, the people said.

"We're midflight in our data licensing deals and still learning, but what we have seen is that Reddit data is highly cited and valued," Reddit Chief Operating Officer Jen Wong said on July 31 during a call with investors. "We'll continue to evaluate as we go."

Read more of this story at Slashdot.

EditorDavid

Bored developers accidentally turned their watercooler into a bootleg brewery

1 month ago
Revenge on managers who slow things down is a drink best served with floating fungus

Who, Me?  The world of work can sometimes drive IT pros to drink, leaving them more likely to make the sort of mistakes that The Register celebrates each week in Who, Me? It’s our reader-contributed column in which you share stories of making a mess at work, and cleaning up afterwards to the best of your ability.…

Simon Sharwood

CodeSOD: Identify a Nap

1 month ago

Guy picked up a bug ticket. There was a Heisenbug: sometimes, saving a new entry in the application resulted in a duplicate primary key error, which should never happen.

The error was in the message-bus implementation someone else at the company had inner-platformed together, and it didn't take long to understand why it failed.

/**
 * This generator is used to generate message ids.
 * This implementation merely returns the current timestamp as long.
 *
 * We are, thus, limited to insert 1000 new messages per second.
 * That throughput seems reasonable in regard with the overall
 * processing of a ticket.
 *
 * Might have to re-consider that if needed.
 *
 */
public class IdGenerator implements IdentifierGenerator {

    long previousId;

    @Override
    public synchronized Long generate(SessionImplementor session, Object parent)
            throws HibernateException {
        long newId = new Date().getTime();
        if (newId == previousId) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException ignore) {}
            newId = new Date().getTime();
        }
        return newId;
    }
}

This generates IDs based on the current timestamp. If requests come in fast enough that we see a repeated ID, we sleep for a millisecond and then try again.

This… this is just an autoincrementing counter with extra steps. Which most, though I suppose not all, databases supply natively. It does save you the trouble of storing the current counter value outside of a running program, I guess, but at the cost of having your application take a break whenever it's under heavier than average load.

One thing you might note is absent here: generate doesn't update previousId. Which does, at least, mean we won't ever actually sleep, since the current timestamp can never equal a previousId that's still stuck at zero. But it also means we're doing nothing to avoid collisions here. But that, as it turns out, isn't really that much of a problem. Why?

Because this application doesn't just run on a single server. It's distributed across a handful of nodes, both for load balancing and resiliency. Which means even if the code properly updated previousId, this still wouldn't prevent collisions across multiple nodes, unless they suddenly start syncing previousId amongst each other.

I guess the fix might be to combine a timestamp with something unique to each machine, like… I don't know… hmmm… maybe the MAC address on one of their network interfaces? Oh! Or maybe you could use a sufficiently large random number, like really large. 128 bits or something. Or, if you're getting really fancy, combine the timestamp with some randomness. I dunno, something like that really sounds like it could get you to some kind of universally unique value.
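For the record, the JDK has shipped exactly this wheel since Java 1.5. A minimal sketch, using only java.util.UUID (the wrapper class here is my own invention, not anything from the application):

import java.util.UUID;

public class UuidIdGenerator {
    // UUID.randomUUID() returns a version-4 UUID: 122 random bits,
    // which makes collisions astronomically unlikely even across
    // many nodes with zero coordination between them.
    public UUID generate() {
        return UUID.randomUUID();
    }
}

No sleeping, no shared state, and nothing to sync between nodes.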

Then again, since the throughput is well under 1,000 messages per second, you could probably also just let your database handle it, and maybe not generate the IDs in code.
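In JPA/Hibernate terms, that could be a single annotation. A sketch, assuming the underlying database supports an autoincrement column type (the Message entity is hypothetical):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class Message {
    // Delegate key generation to the database's native
    // autoincrement support; no IdGenerator code at all.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
}

The database hands out keys atomically on insert, so both the 1,000-messages-per-second ceiling and the hand-rolled sleep simply disappear.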

Remy Porter