alt.hn

4/24/2025 at 7:13:43 PM

Discord Indexes Trillions of Messages

https://discord.com/blog/how-discord-indexes-trillions-of-messages

by todsacerdoti

4/25/2025 at 1:02:48 PM

Does anyone know a low-memory, disk-based search solution for home usage? By low memory I mean in the range of 256MB, maybe 1GB. I am almost convinced by Meilisearch but still looking. Elasticsearch would have been perfect, but its memory usage is too high for the server I have.

Or should I go with a custom solution? FYI, the search is highly dynamic, ranging from full books, medical records, and chat messages to movie/shopping/recipe catalogues.

by cosmosgenius

4/25/2025 at 8:09:37 AM

Whoa, that is definitely expensive to run.

by rkwasny

4/24/2025 at 9:08:31 PM

Indexes trillions of messages and refuses to delete any of them or comply with GDPR. Deleting your account DOES NOT delete your messages; they remain on the platform for an indefinite period of time. The only way to get them to delete your messages is to go through this process (https://discordomicon.github.io/removal/overview/), and you're still likely to go through several support tickets before they comply with your request.

by coldblues

4/24/2025 at 7:29:50 PM

LLM training input treasure? Signal-to-noise ratio is going to be lowish though.

by lysace

4/24/2025 at 8:49:09 PM

I think Discord's data has insane value because it has real-time reasoning steps, much more than any big social media. Obviously you might want to filter out low-information messages like "Hello", but model-based filtering for signal is already more or less solved.

Discord has a lot of very technical channels, like the one which solved BB(5) after decades of research.

by YetAnotherNick

4/24/2025 at 9:05:27 PM

Agreed. Discord has a unique structure with threaded conversations, context carryover, and back-and-forth reasoning that you don't get from places like Twitter or even Reddit. It's especially useful for training models on collaborative problem solving or exploratory dialogue. Filtering is a challenge but definitely solvable with current tools.

by leo-notte

4/24/2025 at 9:56:38 PM

[flagged]

by throwaway42167

4/24/2025 at 8:16:24 PM

If you want to train a groomer then Discord messages are the way to do it.

by DrillShopper

4/24/2025 at 8:48:52 PM

If we're acting like Redditors: "you can just block Minecraft Discords, and remove most of the grooming"

by lithos

4/24/2025 at 9:47:45 PM

TL;DR: they used a single massive Elasticsearch cluster. Now they are using multiple Elasticsearch clusters, running in k8s. And Redis has been replaced by Pub/Sub (GCP).

This sounds like an expensive design, but ok, I guess.

by JackSlateur