alt.hn

5/21/2025 at 2:45:38 PM

Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)

https://arxiv.org/abs/2502.00627

by leotravis10

5/21/2025 at 3:12:27 PM

Fantastic. I wonder how much random technical info is buried in these servers. I hate what it's done for game modding.

by msp26

5/21/2025 at 3:30:54 PM

I think the average server size here is in the ballpark of 1200 people.

These are servers that asked to be advertised by Discord ("Discovery"). These are unlikely to be any kind of servers used for private or even semi-private discussions. You likely don't know most of the people on the server.

Most likely, the 'hottest' kind of data you might find is someone accidentally leaking info akin to the World of Tanks forum post 'corrections'.

by ldoughty

5/21/2025 at 6:00:04 PM

A fair number of those servers have tens of thousands, if not hundreds of thousands of members. I admin two with over 50 thousand members, both listed in Discord's Server Explorer.

by giancarlostoro

5/21/2025 at 4:11:27 PM

I learned programming back in the day on the Tukui (a WoW addon) forums. I hate that it's all Discord now. The info is buried and not easily searchable.

by nixpulvis

5/21/2025 at 3:19:02 PM

The algebraic topology server probably contains a huge number of treasures in modern research algebraic topology. I really really hope it's archived in full

by Davidzheng

5/21/2025 at 4:40:38 PM

It's not difficult to archive yourself, if you really care[0]

I use a dedicated alt account to archive tons of various servers I'm in, and auto-download all attachments. It's nice having regex search capabilities on my local copy of the data too.
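
For illustration, a regex search over a local export can be sketched roughly as below. This is a minimal sketch, not the exporter's own tooling: it assumes each export is a JSON file with a top-level "messages" list whose entries carry a "content" string, and the function names `search_export` and `search_directory` are hypothetical.

```python
import json
import re
from pathlib import Path


def search_export(export: dict, pattern: str) -> list:
    """Return messages whose text matches the regex (case-insensitive).

    Assumed layout: {"messages": [{"content": "..."}, ...]}.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        msg
        for msg in export.get("messages", [])
        if rx.search(msg.get("content") or "")
    ]


def search_directory(root: str, pattern: str):
    """Yield (file name, message) pairs for every match under root."""
    for path in Path(root).rglob("*.json"):
        # utf-8-sig reads files with or without a byte-order mark.
        with open(path, encoding="utf-8-sig") as fh:
            export = json.load(fh)
        for msg in search_export(export, pattern):
            yield path.name, msg
```

Something like `search_directory("exports", r"segfault|stack trace")` would then walk every exported channel in a (hypothetical) `exports` folder.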

[0] https://github.com/Tyrrrz/DiscordChatExporter

by DaSHacka

5/21/2025 at 5:24:55 PM

Using a user account to do this is still considered risky since any automated API usage by a non-bot user is against TOS, and they have heuristics (maybe now ML-based heuristics) for banning accounts for 'things that "don't look like what our official client does"'[0].

0: https://news.ycombinator.com/item?id=25215415

by judge2020

5/21/2025 at 6:42:04 PM

This is why I use a dedicated account to scrape servers, since I unfortunately need my main to interact with(/run) communities unavailable elsewhere.

FWIW, I haven't exactly been careful with it (oftentimes scraping 2 servers at once, and downloading all attachments) and have never had an account get banned.

The only time I got 'banned' in any capacity was when I hammered the internal JSON API to get information about servers' invite links, and even then it was only an automated IP ban from Cloudflare for a couple of days. Although, it was an unauthenticated API.

by DaSHacka

5/21/2025 at 3:37:31 PM

It seems they identified servers via the discovery feature, which servers need to opt into (and I think be recognised as a "community server"? Though that might be out of date). I guess this is better than just scanning the web for invite links, but it does mean that probably most of those game modding servers were not included.

by Macha

5/21/2025 at 4:56:31 PM

I wonder if LLM companies don't have ways to scrape private Discord servers already. Creating accounts and pulling all the historic data doesn't sound impossible.

by hiccuphippo

5/21/2025 at 5:13:33 PM

They absolutely can, and they are. Multiple posts in here discuss how to do it.

It's like back in the days of IRC. People just logged all of it.

by chneu

5/21/2025 at 3:54:09 PM

Game modding is profitable and people doing it professionally (which they increasingly do) are quite attuned to the fact that making it too accessible would decimate their revenue. As a result, you either pay for the mod (early access, extra content, etc.), or you pay to join some Discord, but ideally you pay for something. Discord, which I generally dislike, is not necessarily the cause of it; if there was no Discord, people would probably use some other closed community platform instead.

I expect this would become more widespread as more traditional jobs are subsumed by unregulated ML tech (which, incidentally, the incumbent job-holders are helping train) and more people turn to what used to be generally a hobby as their means of making a living (not that that would last for too long either).

by strogonoff

5/21/2025 at 4:01:21 PM

> Game modding is profitable

It can be. As I understand it, it's sort of like streaming or other content creation - yes, it's possible, but difficult, as it's a saturated market. Most mod authors don't make much money.

As a slight aside, I think people would be more inclined to support creators like mod authors if it were simply easier. Patreon and the like make it fairly easy, but I don't think many people want to subscribe to 20+ Patreons for $5 apiece, as much as they might like to support those authors. On the other hand, I think more people would be willing to pledge $X per month to be split among all of their subscriptions. Sure, most creators would only get a few cents per user, but they'd likely get many more people subscribing, and I think it would add up quick. I might be wrong, and I don't take credit for this idea by any means; I read it some time ago, and possibly Patreon even offered this system before?

by squigz

5/22/2025 at 3:38:24 AM

> I don't think many people want to subscribe to 20+ Patreons for $5 apiece, as much as they might like to support those authors

I don’t think any bundle will take off. People like getting direct support, and people like giving direct support and getting individual messages in return. (Furthermore, creators know that as soon as they help this hypothetical platform get enough traction it will immediately turn and arbitrage against them by paying less and less per user and obscuring the metrics.)

There is only a limited number of mods you can play with, and a number of creators you really want to support. Many people have no problem with that number. (Sometimes their parents have, but that’s another matter.)

One thing I missed is that in addition to gamers paying for mods or community there is the good old “slap an ad on mod page”.

You are right in that you are competing against your fellow creators and that it is a saturated market. That is exactly why if you make something millions of people install you really don’t want to make the knowledge too accessible and spend time on any activity that literally takes food off your table.

by strogonoff

5/22/2025 at 4:11:27 AM

> There is only a limited number of mods you can play with

My Minecraft server has 300+ mods. I've subscribed to nearly that many Rimworld mods. 50 is the lowest number of mods for a game of mine I can quickly find.

by squigz

5/22/2025 at 8:45:49 AM

The number is not unbounded, the more mods you have the more difficulty you will have upgrading, the more they conflict with each other, and consumer hardware cannot handle the load.

Also, paid mods usually have a free version. You pay for early access for latest and greatest version, which only makes sense for specific important mods. You wouldn’t do it for all of the mods.

by strogonoff

5/22/2025 at 11:10:07 AM

> the more mods you have the more difficulty you will have upgrading, the more they conflict with each other,

The point is you don't quickly reach this point these days.

> consumer hardware cannot handle the load.

My GPU is 10 years old, my CPU is 6 years old, and I only have 16GB of RAM. I'm not even sure I can name a game that is easily moddable that might overload even a middle-of-the-road system.

> Also, paid mods usually have a free version. You pay for early access for latest and greatest version, which only makes sense for specific important mods. You wouldn’t do it for all of the mods.

Interestingly, the vast, vast majority of mods I know of that offer a Patreon or similar don't offer much in the way of perks - it's mostly just a way of letting users support them if they choose. Sometimes there's early access, but that's actually fairly rare in my experience.

by squigz

5/22/2025 at 11:20:20 AM

> My GPU is 10 years old, my CPU is 6 years old, and I only have 16GB of RAM

And a Minecraft server with 300+ mods? I doubt it.

> I'm not even sure I can name a game that is easily moddable that might overload even a middle-of-the-road system.

X-Plane, Cities: Skylines, Minecraft… Most big games will overload a machine of an average person, at least once you add a decent shader, high-poly models and high-res textures, and play on anything other than potato resolution.

by strogonoff

5/22/2025 at 11:24:30 AM

The server isn't hosted on my computer, silly.

by squigz

5/22/2025 at 11:25:45 AM

Which tracks the experience of an average person how? If you think everybody these days runs a million mods and admins a rented server, you should really go out more.

by strogonoff

5/22/2025 at 11:27:24 AM

Do you think the average person is hosting the Minecraft server(s) they play on on their home computer?

by squigz

5/23/2025 at 6:39:33 AM

Average person doesn’t host servers, full stop.

by strogonoff

5/21/2025 at 5:46:24 PM

[dead]

by Voxany

5/21/2025 at 4:39:27 PM

Now those of us who've been around the block know that Discord is merely the latest iteration on chat servers such as IRC.

I'm interested to know, from anyone here who's an IRC operator or server/network admin, how the IRC community deals with scraping and bots, because in the early 90s, it was never an issue of corporate Terms of Service or legalese, but typically handled by community standards, and probably, people did whatever they could get away with, and this needed to be anticipated and tolerated by the other participants in any given server or channel.

I doubt that IRC users, back in the day or in the present, have any illusions of privacy, when logging or reflecting or bouncing chats is more or less a built-in feature and an integral component of such a networked chat service.

by AStonesThrow

5/21/2025 at 5:10:50 PM

A big difference is that on Discord anybody who joins a server gets access to full history of chat logs whereas with IRC you don't get access to any past logs. So compared to IRC, Discord users should have an even lower expectation of privacy.

by Stagnant

5/21/2025 at 5:45:32 PM

But IRC bouncers have existed since forever - logging by someone in your channels was basically guaranteed outside of /privmsg.

by judge2020

5/21/2025 at 5:15:59 PM

Nobody should have any expectations of privacy on discord. It's all privately hosted, owned, etc. Why would anyone think it's private?

by chneu

5/21/2025 at 6:49:00 PM

It's not. At least in the RPG scene, which I experience, it's almost fully replaced forums, and lots of great fanmade content and insightful discussion goes into that low discoverability cesspool which may go offline any day and scrapes all of your data

by mvieira38

5/21/2025 at 2:46:02 PM

More info here:

https://www.404media.co/researchers-scrape-2-billion-discord...

by leotravis10

5/21/2025 at 4:44:01 PM

As usual, 404 nails it:

----

It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15 year old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.

----

Same as other commenters here: I think this is shameful action under the guise of research and I cannot fathom why any IRB board would approve this (and perhaps it did not in this case, I do not know if Brazil has such a thing).

Back in the day (15ish years ago), I wrote a paper where I scraped the World of Warcraft API. It wasn't hard to do, I started on a realm, looked for arena teams, then went to guilds and got character sheets from there. I took the opinion that if Blizzard doesn't throttle me it's fair game.

Looking back now, I think that to have been pretty naive. I wouldn't say reckless, but definitely naive. In my mind, I had not made a delineation between "I can access this thing manually one at a time" and "I can access all of it automatically". As far as I was concerned, it was just the computer pressing the buttons. It was the same thing.

I think in the fullness of time we have collectively come to realize it is 100% not the same thing. The _availability_ of a thing and the _collection_ of a thing are two different issues with their own thorny problems. The researchers here have made the same mistake I did, but instead of it just being what gear your character was wearing, they took actual communications instead.

I hope this paper gets retracted, all data deleted and a sincere apology offered.

by cflewis

5/21/2025 at 4:48:08 PM

On the contrary, I think that what these researchers did was the only ethical thing to do once they discovered that this was possible.

There's no way that this hasn't been done dozens of times before by intelligence agencies, hacker groups, and whoever else you care to worry about. Most of us here were well aware that public Discord channels have always been public and durable. It's hardly a secret from the technically savvy, it's just that Discord doesn't make it clear enough to regular users.

All this paper changes is that it draws mainstream attention to what was already happening illicitly for as long as Discord has been around. This can only be a good thing: the children and teenagers 404 is so worried about have always been vulnerable to their data getting leaked just like this, it's just that up until now that's been happening in the dark so as not to kill the golden goose.

by lolinder

5/21/2025 at 4:58:03 PM

A while back there was a site that allowed you, for payment, to look up all public chat messages of a Discord user. Clearly this database exists, and if criminals or government agencies want to get their hands on it, they can.

by NoahZuniga

5/21/2025 at 5:02:26 PM

I think conflating a security paper which shows something is possible with using the "exploit" to create a database 100s of GBs large and analyze it is disingenuous at best.

by cflewis

5/21/2025 at 5:08:35 PM

Creating the database got attention in a way that just pointing it out wouldn't have. You point it out and people shrug and say "sure, that's totally unsurprising". You produce more than 100 GB of data and you have people's attention.

These databases exist and always have because this has always been possible. The only difference is that they've typically been held close to the chest by intelligence agencies or hacker groups or whoever else made them for illicit purposes. The only change here is that this database is public and is drawing mainstream attention, which is a strictly good thing.

A lot of the people on here are using the same reasoning that would say that LockPickingLawyer should stop showing how to pick locks because he's making it too easy to learn how garbage most locks are.

by lolinder

5/21/2025 at 4:35:18 PM

I don't know if Discord fixed it as I haven't checked in a few years, but I tinkered with scraping some public Discords and I found that I could see hidden channels, not the data, but the channel names, which could do things like reveal to me if the same Discord was used for in-house development if it was a product Discord. Not great.

by roskelld

5/21/2025 at 4:50:55 PM

This is still the case. There are even some client mods that let you view hidden channel names and know what roles/permissions are required to participate in them.

by 0xC0ncord

5/21/2025 at 4:50:02 PM

You can still see them. Using alternate clients you will see them, and bots also see them.

by tuetuopay

5/21/2025 at 5:19:43 PM

This is technically the case - I believe the existence of private channels is still sent to the client (eg. their snowflake IDs, which also reveal creation date) but the channel names are no longer sent as well.
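
As a rough illustration of why an ID alone reveals a creation date: Discord's documented snowflake layout keeps a millisecond timestamp in the bits above the low 22, counted from the start of 2015. The sketch below decodes it; the function name `snowflake_created_at` is made up for this example.

```python
from datetime import datetime, timezone

# Discord's epoch: 2015-01-01T00:00:00Z, in milliseconds since the Unix epoch.
DISCORD_EPOCH_MS = 1420070400000


def snowflake_created_at(snowflake: int) -> datetime:
    """Decode the creation time embedded in a Discord snowflake ID."""
    # The low 22 bits hold worker, process and sequence numbers;
    # everything above them is milliseconds since the Discord epoch.
    unix_ms = (snowflake >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Example ID taken from Discord's developer documentation.
    print(snowflake_created_at(175928847299117063).isoformat())
```

No API call is needed for this, so anyone who can see a channel ID can date the channel.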

by judge2020

5/21/2025 at 3:07:30 PM

It says they used ethical anonymization, but we’ve seen other scrapers are always completely in violation of Discord’s TOS.

So did Discord cooperate, or give special authorization for this collection? It wouldn’t appear that they could do so, if privacy belongs to their users at all.

by AStonesThrow

5/21/2025 at 4:13:27 PM

Would the TOS even prevent something like joining a guild, downloading all messages, then leaving?

by 01HNNWZ0MV43FF

5/21/2025 at 4:28:13 PM

User bots (including hacked clients) are officially banned by the TOS, which addresses that concern.

The only acceptable API usage is via bots that server owners choose to invite. And while it might be legally OK (if the bot's own TOS says it), I promise no server owner is expecting an invited bot to slurp up every message for use in a data set, whether that be for academic purposes or a potential stalking/"dirt" database.

I highly doubt this is the most ethical instance of data collection.

by judge2020

5/21/2025 at 4:42:57 PM

IIRC data slurping (for exporting) is also not allowed bot usage.

> B. API Data Sharing & Retention

> You will not share API Data with any third party, except in the following circumstances, subject to compliance with the Terms and applicable laws and regulations: (i) with a Service Provider; (ii) to the extent required under applicable laws or regulations; and (iii) when a user of your Application expressly directs you to share their API Data with the third party (and you will provide us proof thereof upon request).

https://support-dev.discord.com/hc/en-us/articles/8562894815...

by smileybarry

5/22/2025 at 12:58:32 AM

Hacked/unofficial clients were allowed at one time: https://0x0.st/8wYc.png

Not sure if they still are.

by ranger_danger

5/21/2025 at 4:20:34 PM

I'm not sure what you mean by "prevent". A TOS is a legal document designed to put down rules and a legal basis for the service.

I don't know what a "guild" is, if it's some Discord thing, and you don't say whether this is a good-faith human who joins, or a bot operator, intending to scrape. The hypothetical is irrelevant here; what is germane is that the expectation of privacy by the individual participants, and the terms which bind people who use that service.

The TOS clearly didn't prevent the use of API, but it may indeed prohibit such scraping, or threaten repercussions for people who break the terms, especially for someone who republishes the data. Your example of a simple download dump doesn't seem to involve republication, and that seems to be the major issue with scrapers.

by AStonesThrow

5/21/2025 at 4:39:03 PM

>The hypothetical is irrelevant here; what is germane is that the expectation of privacy by the individual participants, and the terms which bind people who use that service.

How can you have an expectation of privacy in a public forum? Where did this bizarre disorder originate, where people knowingly put their writing out there for literally anyone to read, then turn around and start talking about "expectations of privacy" when they realize what it entails?

by halfadot

5/21/2025 at 5:05:29 PM

> Where did this bizarre disorder originate

Well unfortunately it originated in the human condition, my friend.

I take it back about "expectation of privacy". Perhaps that is an outmoded concept.

Humans used to sort of have a default expectation of privacy. Being that gossip, slander and libel were sins and crimes, we could often safely gather in a room and isolate ourselves in a select group, and share our thoughts openly.

Most humans could go into a living room with their family, a pub or bar, a classroom, or a treehouse, and say/do things that were shared only by the local group of gathered humans. You could go into a public park and speak to a fire hydrant. It was not usual, or possible 100 years ago, for the news media to go around with recorders and cameras and record/preserve/transmit/broadcast everything everyone said in every place they were doing it.

Expectations of privacy were just sort of... humankind's default setting. And so betrayals were sins and crimes. And we sit alone at our keyboard looking at a screen. It feels private, all right. Where are we really? Where are our words being carried? We can't know anymore.

Unfortunately we've built online and virtual worlds around paradigms that imply privacy or confidentiality, but don't actually afford it. You can go into a "chat room" or a "forum" or change your "privacy settings" but they mean nothing. Nothing at all. Because everything we're sending across the net can be perfectly recorded, preserved, retransmitted, and it's no longer gossip, it's just business.

> Where did this bizarre disorder originate

I don't believe that any other living organism has had to deal with the complete and total collapse of "privacy" like humans in the 21st century. Surely, termites in Australia don't know, and couldn't care, about what's going on with honeybees in California.

And here we have people calling it a bizarre disorder. Yes, it's mistaken and misguided, but who can call it unreasonable?

by AStonesThrow

5/21/2025 at 3:07:17 PM

A quick read through of their anonymization process seems to indicate that they didn’t scan the message contents for PII (other than usernames).

If true, that seems like a huge oversight. I also wonder what would happen if someone finds their information in the dataset and requests it to be removed per GDPR or other privacy legislation.
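
To make the idea concrete, a content-level PII pass could look something like the sketch below. It is an illustrative assumption, not anything the paper describes: the patterns only catch obvious email addresses and phone-like digit runs, and the names `EMAIL_RE`, `PHONE_RE` and `redact_pii` are hypothetical. Real PII detection needs far more than two regexes.

```python
import re

# Deliberately simple patterns; they will miss plenty and can over-match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(redact_pii("mail me at alice@example.com or call +1 555-123-4567"))
```

Names, addresses and anything else written in free text would pass straight through a filter like this, which is part of why message bodies are harder to anonymize than usernames.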

by kd5bjo

5/21/2025 at 6:03:51 PM

I can't help but think that if you say something in a public forum you should implicitly give up the right to privacy.

E.g. if someone scraped hackernews and made a dataset containing this comment, i don't think i should have any right to complain.

by bawolff

5/21/2025 at 3:45:55 PM

I understand wanting to be careful, but didn't they only grab messages from servers that are already very public? Are Twitter message datasets anonymized?

by jowea

5/21/2025 at 4:01:52 PM

That's not how GDPR works, and in this case the data is clearly not anonymised, despite the authors' claims. Among other things, there need to be mechanisms for users to delete their data, whether or not it was public at some point.

by Cynddl

5/21/2025 at 4:11:07 PM

Yeah there probably is some GDPR implication somewhere, I wasn't speaking on the legal aspects.

by jowea

5/21/2025 at 4:23:46 PM

The authors can presumably update the dataset on the site; however, I think past versions remain. Besides that, the GDPR is at odds with the fact that public posts and data almost never go away. I don't think that reality can be legislated away, try as politicians might.

In all honesty, it's better to reserve the effectiveness for private, personal data, for the sake of practicality.

by ronsor

5/21/2025 at 4:12:16 PM

>Data was collected through Discord's public API, adhering to ethical guidelines

How is it ethical to break Discord's terms of service? An ethical researcher would respect any contracts that they agreed to and would not violate them to collect more data.

by charcircuit

5/21/2025 at 4:38:02 PM

Which ethical system demands that researchers from the DCC/UFMG not breach an unaffiliated commercial ToS during their research?

by zetanor

5/21/2025 at 11:22:08 PM

One that recognizes that lying and tricking people is wrong to do.

by charcircuit

5/21/2025 at 4:17:35 PM

edit: Whoops

by MarcelOlsz

5/21/2025 at 4:24:08 PM

Awesome analysis dude! I'm sure the judge will love that when discord sues these guys.

by __loam

5/21/2025 at 5:00:48 PM

He said _ethical_, not _legal_.

Would you agree abusive ToS's by massive corpos are unethical? What about the Disney+ ToS hiding a binding arbitration agreement preventing you from suing them? [0].

Or are you one of those "my personal ethics are whatever the law says" folk?

[0] https://www.nbcnews.com/news/us-news/disney-says-man-cant-su...

by DaSHacka

5/21/2025 at 6:16:12 PM

The last guy who scraped discord like this was a freak who also got sued to hell by them.

by __loam

5/21/2025 at 6:36:29 PM

The difference is they ran a private service to profit off the scraped data, and explicitly marketed it as a "dox-for-hire" service, so ethically I think the situations are quite different (the researchers explicitly took steps to censor usernames in this dataset)

by DaSHacka

5/21/2025 at 4:46:52 PM

Now imagine the data mining that Discord can do on the complete DM history of every user. It’s not e2ee, remember.

by sneak

5/21/2025 at 5:16:31 PM

E2EE is definitely only possible in DMs (there's no chance for servers/guilds), but the cat is out of the bag in terms of user expectations on how DMs work.

So many users expect their entire decade+ history of DM contents, attachments included, to be available wherever they are and on any device, gated only by having their login/2fa or passkey. Switching to E2EE would be a major overhaul of that expectation, and it would be a huge task to train users to now keep their encryption key safe, backed up, and available across multiple devices.

Mostly unrelated, but they are absolutely going to have to cull old attachments eventually. There are attachments sitting in their GCP buckets that haven't been accessed since 2015. I'm sure their storage bill is at least a few million a month at this point, even if most is marked coldline.

by judge2020

5/21/2025 at 6:55:01 PM

e2ee works fine for Signal group chats; there is no reason it couldn’t be implemented on Discord group chats.

That’s not the issue. The issue is that Discord believes they deliver value through aggressively censoring their platform. e2ee prevents that.

e2ee also doesn’t prevent a user from storing their long term keys on the server to be retrieved on new devices and decrypted locally so they can access message history. e2ee does not require PFS.

by sneak

5/21/2025 at 3:31:24 PM

...When you realize GPT-5 is going to be trained on your meme preferences...

by recursive4

5/21/2025 at 3:48:15 PM

You mean, GPT-4 being so overenthusiastic with using emojis isn't peak AI chat? :D

by SunlitCat

5/21/2025 at 4:19:45 PM

How to fix ChatGPT:

System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered — no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.

by encom

5/21/2025 at 6:08:04 PM

Ran this as the context in a local qwen 14b model and it kept context for quite a while. Not bad

by BizarroLand

5/22/2025 at 2:20:21 AM

> Model obsolescence by user self-sufficiency is the final outcome.

If only AI services would realize this is what users want. They won't admit it, since they want users to be addicted to AI.

by IvanAchlaqullah

5/21/2025 at 3:46:43 PM

At least it will be able to help with winning at random obscure video games.

by jowea

5/21/2025 at 3:02:31 PM

The biggest problem that sucks about discord is that it isn't normally publicly searchable. And it seems to be a modern replacement for internet forums which historically were publicly searchable and often had a lot of great information about various hobbies and things.

by SirMaster

5/21/2025 at 3:13:36 PM

There is a special place in hell for software (including game) developers who exclusively use Discord to release patch notes, documentation, technical support, etc.

by drooopy

5/21/2025 at 5:15:15 PM

Why? I follow a lot of solo game developers who only use Discord and I completely understand them not wanting to deal with multiple platforms. They should focus on the game.

by Kiro

5/21/2025 at 5:19:08 PM

Discord is horrible for this kind of stuff. There's a reason that GitHub and other types of sites exist.

Discord is walled and hard to search. If a channel or server closes then all that information is lost.

Tons of data will be lost to discord when it goes down.

Idk if you've ever tried to use discord for mods or other software but it sucks. It's confusing. Information isn't cataloged well. Its search sucks. It just isn't good for this kind of thing.

by chneu

5/21/2025 at 9:02:39 PM

Github is walled and hard to search now too. Not as bad as discord yet, but headed there.

by wswope

5/21/2025 at 5:49:06 PM

What kind of stuff? Gamers want to talk about the game and engage with the developers. GitHub must be the absolute worst alternative.

by Kiro

5/23/2025 at 1:40:52 AM

I'm talking software and modifications to games. Stuff you have to download.

Discord becomes a maze of random posts going back years with links to expired third party sites. Searching them is difficult and time consuming.

Then it all changes based on which server you're on and how those people decided to do it, so none of your experience from the previous server transfers.

by chneu

5/21/2025 at 4:46:13 PM

they don't work for you?

by nh23423fefe

5/21/2025 at 4:25:34 PM

[flagged]

by __loam

5/21/2025 at 5:06:59 PM

Not really, when devs use discord as a hub of documentation and discussion it inherently makes the information harder to access, especially searchability-wise.

You could argue "well you can just scrape it and post it online, like OP", but:

1. That's still an extra step and requires an account that could get banned doing it

And

2. Others (like yourself, even!) in this thread take issue with that approach.

So which is it? Just close down the information forever, yet accept no criticism about the fact you chose to host it on discord, knowing this would be the case?

by DaSHacka

5/21/2025 at 6:19:02 PM

The original comment wasn't "Oh this is inconvenient for me" it was "I hope they go to hell". That's a lot of entitlement in my view. If you don't want devs to run their community on discord then write your own software.

by __loam

5/21/2025 at 3:25:50 PM

> The biggest problem that sucks about discord is that it isn't normally publicly searchable.

This is a feature of the platform, not a bug. Because of the lack of discoverability people act more genuine, for better or for worse, than public places like Twitter, Bsky, Facebook, Instagram, etc where you have to maintain your public image and/or act like HR is watching over your shoulder.

That being said, this feature also makes Discord inappropriate for things like release announcements, patch notes, etc. which should be publicly accessible.

by mjr00

5/21/2025 at 3:49:49 PM

You think so? I believe that's properly a consequence of culture, Discord being originally a gaming platform, and of pseudoanonymity, the same thing you have on reddit. Anyone who cares even a tiny bit can join any of those public servers and see what you posted. The big difference is that you don't have as many lurkers who just got there by googling and are going to leave immediately. By the 1-9-90 rule of thumb, that's a lot of people.

by jowea

5/21/2025 at 5:02:05 PM

I think the YouTube real names and the nymwars era in general showed that requiring people to use their legal name doesn't actually change community standards.

by Macha

5/21/2025 at 9:58:10 PM

Good point, but I think there is a difference between turning what was once pseudoanonymous social media and where you mostly talk to people you don't know in real life into a real names social media, and having social media where you use your real name from the start and anything you post will likely be seen by people you know in real life.

by jowea

5/22/2025 at 2:07:05 PM

I am comparing Discord to things like gaming forums or reddit where people use usernames and you would rarely be able to know who they were in real life unless they wanted you to know.

by SirMaster

5/21/2025 at 4:48:27 PM

While being public plays a role in the kind of conversations that happen on those platforms, I think engagement hacking feeds play a larger role. Discord has none of that. It's sorted by time.

by Seattle3503

5/21/2025 at 3:28:51 PM

>than public places like Twitter

That seems to be a counterpoint to your argument. Users on Twitter usually do not hold back.

by GaggiX

5/21/2025 at 3:38:23 PM

The environment is drastically different from how it was 1-2 years ago, but the average white collar fortune 500 employee is still not going to post anything too controversial on Twitter under their real name and picture. If they are posting controversial things, which has certainly exploded post-Musk, they're making an effort to ensure they're not getting doxxed.

Contrast this to Discord which is more like old-school IRC, in that even when everyone is using an alias, if you talk to the same people day-in day-out, you know a fair bit about their personal lives, such as name and where they work.

by mjr00

5/21/2025 at 3:34:33 PM

> which historically were publicly searchable

Forums? No not generally unless you were a signed in user and often signups weren’t available to the general public just like here not all Discord rooms are automatically joinable. Digg, Reddit, slashdot were intentionally generally public forums that you could indeed search but they were the exception rather than the rule (in terms of count, not traffic). Indeed even Reddit has invite only forums that I believe aren’t searchable unless you are a member. Oh and searchable if you’re a member? That’s true for Discord.

by vlovich123

5/21/2025 at 3:38:59 PM

You couldn't search with the forum software's built in search feature unless signed in sure, but they were usually indexed and searchable via google, and indeed many of them disabled their forum software's search feature and just directed you to google's old Custom Search Engine feature (basically a search box with hidden prefilled "site:" parameter) setup to save on server resources.
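For anyone who never saw one of those search boxes, here is a minimal sketch of the idea, assuming the box did nothing more than prepend a `site:` operator to whatever the user typed. The function name and the example domain are made up for illustration.

```python
from urllib.parse import urlencode


def site_search_url(site: str, terms: str) -> str:
    """Build a Google search URL restricted to one site via the site: operator."""
    # The hidden, prefilled part of the search box: "site:<forum domain>".
    query = f"site:{site} {terms}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical forum domain, used only for illustration.
print(site_search_url("forum.example.org", "unit frames config"))
```

The forum itself does no searching here; all the work is pushed to the external index, which is why this only works while the pages stay publicly crawlable.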

by Macha

5/21/2025 at 10:07:35 PM

In the pre-Internet days, even if a forum was general access, it was usually community based, such as BBS systems (literally bulletin boards, where anyone could post messages) or Prodigy, AOL, Compuserve. These communities generally comprised subscribers whose identity was known to the admins, and therefore even in a forum anyone could read, access was limited to known individuals, and people had a sense of community there, if not confidentiality.

Let's hearken back to the olden days of Usenet, when every single message was transported from machine-to-machine, in the clear, and it was an essential feature of NNTP and the Usenet groups themselves that everyone could read every message and process them any way they saw fit.

Usenet was helpfully indexed by topic, and so most sane posts went out already pre-sorted into the place where you'd want to search for them, but if you were a privileged user with a local Usenet feed, you could literally loop around the filesystem searching any term you wanted, because all messages and forums were plain files sorted into plain directories.

A famous consequence of this openness, for example, was a talk.bizarre denizen by the name of "Kibo". One of the possibly-true rumors about this larger-than-life figure was that he exhaustively "grepped" Usenet for his name [pseudonym] and thereby found out immediately whenever he was mentioned by another poster, and therefore able to join the conversation with his acerbic wit.
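Since the messages were plain files in plain directories, that kind of search amounts to a directory walk. A minimal sketch, assuming a spool laid out as one file per article under a root directory; the function name and the layout are assumptions for illustration, not a description of any particular news server.

```python
import os
from typing import Iterator


def grep_spool(root: str, term: str) -> Iterator[str]:
    """Yield paths of files under root whose contents mention term (case-insensitive)."""
    needle = term.lower().encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read().lower():
                        yield path
            except OSError:
                # Unreadable file (permissions, vanished article): skip it.
                continue
```

Calling `list(grep_spool("/path/to/spool", "kibo"))` would list every article file that mentions the name; the path is a placeholder.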

https://en.wikipedia.org/wiki/James_%22Kibo%22_Parry

Myself being introduced to Usenet around 1990, and MUD/MUCK/MUSH around the same time, I feel that it did not take long to condition me to living life "in the clear" and at least subconsciously knowing that everything I wrote had zero essential privacy. This was orthogonal to my home life, where my parents heard everything I said and did, or my religion, where there is an omniscient deity, who is thankfully full of goodness and kindness.

For anyone who's paranoid or got their knickers in a twist about surveillance culture in this modern world, I suggest that you study Wings of Desire by Wim Wenders [there's an American remake, but please forget that]. Wings of Desire is a character study and a meditation on the possibility that ubiquitous surveillance doesn't need to be nefarious or evil, but perhaps, just maybe, has some benign and even beneficial effects on a cohesive society which tends to act in good faith.

https://en.wikipedia.org/wiki/Wings_of_Desire

by AStonesThrow

5/22/2025 at 3:53:08 AM

Iain Banks’ Culture series explores a far-future society involving super-AI godlike minds that watch over the other lifeforms in their society as caretakers, though not all inside and outside the Culture view the AIs as benevolent.

https://en.wikipedia.org/wiki/The_Culture

by aspenmayer

5/21/2025 at 3:40:39 PM

This is especially a problem for devs/artists that post updates exclusively over Discord. It's even worse if they don't do so in a separate channel and you have to dig through everyone chatting to find what you're looking for. This, along with the absence of threads (yes, Discord has threads, but who uses those), makes searching for troubleshooting help awful. Thank god BBSes are still around.

by nan60

5/21/2025 at 3:05:36 PM

Agreed. Some use it as a knowledgebase and issue tracker and forum and chatroom in one. I absolutely despise it for that use case.

I mean I use it for voice chatting with friends while gaming too and it's fine for that.

But if I have to beg and plead to a discord bot to join a channel to just read some docs, I'm just going to ignore your project. Not sorry about that at all.

by mavamaarten

5/21/2025 at 3:15:06 PM

Speaking as someone who has been running discord servers since 2015 - plus I maintain my own discord bot and am deeply familiar with the API - it's absolute garbage as an issue tracker. People really need to stop using it for that.

I think part of the problem is that they confuse the semantics of nomenclature. "Servers" are not really servers, "forums" are not really forums, and so on and so forth.

by pteraspidomorph

5/21/2025 at 3:40:54 PM

Yeah, I think they chose "servers" at the beginning because they were targeting the gamer VOIP crowd as a sort of Teamspeak competitor and so they were trying to draw an analogy between a discord group and your MMO guild's Teamspeak/Vent/Mumble server, but the terminology has stuck long after it made sense.

by Macha

5/21/2025 at 3:04:56 PM

Pretty sure this violates Discord's Terms of Service. There was someone selling access to logs from servers: the person running the website was joining them on self-bots (against the TOS) and would just log all available data. Discord definitely got legal on them. I wonder if this is even ethical, taking textual data from people without their knowledge. Not to mention, the number of minors on Discord alone gives me a lot of concern there too.

by giancarlostoro

5/21/2025 at 4:15:52 PM

Why is it so hard to export your own messages out of Discord, Slack, etc?

We have regressed from the open email standard and gone back to these opaque islands of data that do not adhere to any standard.

Slack refused to show me my own messages past a certain age unless I paid up, and eventually deleted them.

by prmph

5/21/2025 at 8:00:18 PM

It's hard because they want you to keep paying. Same reason AWS has free ingress and paid egress. The walled gardens are all built like carnivorous plants, the thorns face inwards.

by 01HNNWZ0MV43FF

5/21/2025 at 4:31:32 PM

There are tricks to get messages from Slack, though I heard they were changing soon if not already.

A year or so ago I exported all messages from a Slack group I ran and used a Discord bot to recreate the entire dataset including channels and user posts. So we now have our entire history of messages without being blocked by a paywall (Until Discord does the same, and we'll be off to find a new home).

by roskelld

5/21/2025 at 3:01:18 PM

> Usernames are replaced with consistent pseudonyms generated by the mimesis library, ensuring that identifiers remain unique and contextually meaningful across records. Similarly, user IDs and message IDs are hashed using the SHA-256 algorithm and truncated to 12 characters. This deterministic hashing approach maintains linkage between related records while effectively masking the original identifiers. The global name field, deemed unnecessary for analysis, is entirely removed. Additionally, user IDs embedded within the content field are identified via regular expressions and replaced with their corresponding hash values.

Seems pretty thorough, though this may end up being a good lesson for GenZ/A not to post things in public spaces on the internet.
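To make the quoted scheme concrete, here is a minimal sketch, assuming plain SHA-256 over the ID string and a `<@id>` mention syntax. The function names and the mention regex are assumptions for illustration; the paper only says IDs are hashed, truncated to 12 characters, and replaced in content via regular expressions.

```python
import hashlib
import re

# Assumed mention syntax: <@123> or <@!123>. The real pattern is not given in the quote.
MENTION_RE = re.compile(r"<@!?(\d+)>")


def hash_id(raw_id: str, length: int = 12) -> str:
    """SHA-256 the ID string and keep the first `length` hex characters."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:length]


def scrub_content(content: str) -> str:
    """Replace each embedded user ID with its truncated hash."""
    return MENTION_RE.sub(lambda m: f"<@{hash_id(m.group(1))}>", content)


# Made-up ID, used only for illustration.
print(scrub_content("thanks <@123456789012345678>!"))
```

Because the hash is deterministic, the same user maps to the same 12 characters everywhere, which is what keeps replies and mentions linked across records.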

by candiddevmike

5/21/2025 at 3:06:11 PM

So if I can identify a chat, that I have direct access to, in the dataset, I can get the hashed user ID of my contacts, and then search for any other messages from them?

by Y_Y

5/21/2025 at 3:21:47 PM

If they didn't salt it, you might not even need to identify the chat - just hash the username you want to look up. I'd check, but it's a >100GB download.
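A minimal sketch of why salting (or better, keying) matters, assuming the identifier is hashed as a plain string. Whether the dataset actually does this is unverified here; the function names and the ID below are made up.

```python
import hashlib
import hmac


def unsalted(raw_id: str) -> str:
    """Truncated SHA-256 with no secret: anyone can recompute it."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:12]


def keyed(raw_id: str, secret: bytes) -> str:
    """Truncated HMAC-SHA-256: recomputing it requires the secret key."""
    return hmac.new(secret, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


known_id = "123456789012345678"  # made-up identifier

# With no salt, an outsider who knows the ID derives the same pseudonym.
print(unsalted(known_id))

# With a key held only by the publisher, the outsider's guess does not match.
print(keyed(known_id, b"publisher-secret"), keyed(known_id, b"outsider-guess"))
```

A keyed hash still keeps records linkable inside the dataset, since the same ID always maps to the same value for a given key.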

by fwip

5/23/2025 at 2:51:48 AM

Got a link to the full dataset?

by noman-land

5/21/2025 at 3:11:29 PM

seems like. or if your chat mentioned a person IRL and not a discord username. or used a nickname.

by spencerflem

5/21/2025 at 3:12:19 PM

> public spaces on the internet

But discord servers aren't considered "public spaces", hence the concept of an "invite".

This is akin to someone revealing they've been going to private parties and secretly recording everything.

It might not be illegal, but it's definitely not polite.

by xnorswap

5/21/2025 at 3:21:51 PM

These servers do seem to be public, in the sense that anyone can join without explicitly being invited.

It sounds more like they went to the mall, picked 10% of the stores, and recorded conversations taking place in those stores.

by pavel_lishin

5/21/2025 at 5:15:00 PM

Eh, even that is too strong. Discord already recorded the conversations and makes them publicly available to anyone who joins the server at any point in the future. Anyone who thought they were having an ephemeral conversation that wouldn't ever be seen again by anyone has never tried scrolling back on a Discord server.

by lolinder

5/21/2025 at 3:25:05 PM

right, which would be awful.

by spencerflem

5/21/2025 at 3:19:53 PM

The researchers clarify that it's only those servers that are listed in the discovery tab - you don't need an invite link to join those.

> In this regard, this paper introduces the most extensive Discord dataset available to date, comprising 2,052,206,308 messages from 4,735,057 unique users across 3,167 servers – approximately 10% of the servers listed in Discord’s Discovery tab, a feature designed to highlight public servers that users can join.

by fwip

5/21/2025 at 4:05:37 PM

Seems like this would be pretty trivial to reverse by simply searching for a somewhat unique message by a single anonymized user, you would then be able to quickly de-anonymize every user that was in a thread then you just graph that out across the entire network.

by jimmyjazz14

5/21/2025 at 3:17:04 PM

To save folks a click, the dataset itself has been made available here: https://zenodo.org/records/15170676

It's 118 gigabytes of JSON.

by pavel_lishin

5/21/2025 at 3:28:46 PM

> It's 118 gigabytes of JSON.

118.0 GB of ZST compressed JSON (https://zenodo.org/records/15170676). The actual uncompressed JSON would most likely be much, much larger.

by diggan

5/21/2025 at 8:18:16 PM

I downloaded it and decompressed it — it's approximately 2.1 terabytes in size.

by hampus

5/22/2025 at 11:19:03 AM

I managed to download the file, but since they restricted the downloads, it seems they hid the checksums too. Could you possibly share the md5/sha256 of the .zst file?

by diggan

5/23/2025 at 7:01:49 PM

  sha256sum dataset.zst 
  0196416253fab4bce08504737bc81215927d9afdc6ccc81f75345518109266a4  dataset.zst

by philipkglass

5/22/2025 at 10:50:49 AM

hey there, would you mind sharing the file please?

by kedaiapps

5/23/2025 at 6:27:41 PM

[dead]

by topcatto

5/21/2025 at 3:46:07 PM

I imagine it can be reduced a fair bit simply by stripping out unneeded fields from the messages; I imagine it can be reduced even further by removing unneeded messages entirely (i.e., bot welcome messages), deduping messages, etc
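A minimal sketch of that kind of slimming pass, assuming each message is a JSON object. The field names (`author`, `content`, `timestamp`, `is_bot`) are hypothetical; I have not checked the dataset's actual schema.

```python
from typing import Iterable, Iterator

# Hypothetical field names; adjust to whatever the real records contain.
KEEP_FIELDS = ("author", "content", "timestamp")


def slim(records: Iterable[dict]) -> Iterator[dict]:
    """Drop bot messages and exact repeats, and keep only the wanted fields."""
    seen = set()
    for record in records:
        if record.get("is_bot"):  # hypothetical flag for bot-authored messages
            continue
        key = (record.get("author"), record.get("content"))
        if key in seen:  # same author posting identical text again
            continue
        seen.add(key)
        yield {name: record[name] for name in KEEP_FIELDS if name in record}
```

The in-memory `seen` set is fine for a sample but would not fit in RAM for two billion messages; at full scale you would dedupe per channel or use an on-disk structure.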

by squigz

5/21/2025 at 8:04:32 PM

Anyone who has it, please post the SHA384 or BLAKE3 (BLAKE3 will be way faster on 100+ GB) so I can verify it if I get a torrent later. Zenodo requires a sign-in, and I won't auth with GitHub as they want some WILD permissions on my GitHub.

by 01HNNWZ0MV43FF

5/22/2025 at 1:55:37 AM

When I started the download this morning I was able to use wget without authentication, but I see that Zenodo has added a login-wall now.

The decompressed data appears to be JSONL, but at least the version I downloaded has a little binary garbage at the front. The first readable JSON object has the author "Fortnite Germany".

Size as .zst: 117,962,356,699 bytes

ZST SHA384: b8863645654610f1fde2859bb20bd87d913865af7791e0ec33741402944d5b9bdfdaaf65c2dc610730efb01f446e2588
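For reading it despite the junk at the front, a minimal sketch, assuming one JSON object per line and that the garbage does not decode as valid JSON. The function name is made up, and it reads from any binary stream so no third-party zstd binding is needed.

```python
import json
from typing import BinaryIO, Iterator


def iter_objects(stream: BinaryIO) -> Iterator[dict]:
    """Yield each line that parses as a JSON object, skipping anything else."""
    for raw_line in stream:
        try:
            parsed = json.loads(raw_line.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue  # binary garbage, blank lines, truncated records
        if isinstance(parsed, dict):
            yield parsed


# Intended use (not run here): zstdcat dataset.zst | python script.py
# with the script looping over iter_objects(sys.stdin.buffer).
```

Skipping silently is convenient for a first look; for real analysis you would probably count the skipped lines so data loss does not go unnoticed.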

by philipkglass

5/22/2025 at 2:25:38 AM

It is not just a login wall, they have restricted access even for logged in users, presumably to only the uploaders. A magnet would be nice.

by polarix

5/22/2025 at 2:57:36 AM

About the decompressed data:

  zstdcat dataset.zst | sha384sum
  0812f3876a7e319081f596a5545321e5c8e8def501add3a4f5ff039568fe59aa5d4ac5d2c3e549532f529bd09b887596

  zstdcat dataset.zst | wc
  2059116741 22128178392 2099550453760
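A quick sanity check on those numbers, using the `wc` output above (lines, words, bytes) and the compressed size posted earlier in the thread. Nothing here touches the dataset itself; it is just arithmetic on the reported figures.

```python
COMPRESSED_BYTES = 117_962_356_699      # size of the .zst as reported in the thread
UNCOMPRESSED_BYTES = 2_099_550_453_760  # third column of the wc output
LINES = 2_059_116_741                   # first column of the wc output

ratio = UNCOMPRESSED_BYTES / COMPRESSED_BYTES
terabytes = UNCOMPRESSED_BYTES / 10**12
tebibytes = UNCOMPRESSED_BYTES / 2**40
bytes_per_line = UNCOMPRESSED_BYTES / LINES

print(f"compression ratio: {ratio:.1f}x")
print(f"uncompressed: {terabytes:.2f} TB ({tebibytes:.2f} TiB)")
print(f"average line: {bytes_per_line:.0f} bytes")
```

That works out to roughly 2.1 TB uncompressed and about a kilobyte of JSON per line, which suggests most of the bulk is per-message metadata rather than message text.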

If you can post your email and you have a sftp server or other accessible means to receive this large file, I'll contact you and then maybe you can help distribute it more widely.

(Offer also applies to anyone else reading this thread.)

by philipkglass

5/22/2025 at 10:33:52 PM

Hi Philip. Would you be able to share with me just a few anonymized user IDs (random ones are fine)? I wanted to double-check their format, since I'm pretty sure their method is broken. My email address is there → https://gynvael.coldwind.pl/?id=50 (contact section)

by gynvael

5/23/2025 at 7:23:16 AM

Received, thanks!

by gynvael

5/22/2025 at 9:06:01 AM

Hi, could you also send the info to the address below? I have WebDAV on my end. <iliiilili AT protonmail dot com>

by anon_illiilliil

5/22/2025 at 6:51:33 AM

Hi, if able to, contact me at <ruooens AT protonmail DOT com>

by anon2356236

5/22/2025 at 10:49:50 AM

hi there, would you mind sharing with me please: <kedaiapps AT gmail DOT com>

by kedaiapps

5/22/2025 at 10:49:18 AM

hi there, would you mind sharing with me please: kedaiapps AT gmail DOT com

by kedaiapps

5/23/2025 at 6:26:37 PM

[dead]

by topcatto

5/22/2025 at 2:18:18 AM

This is a pet peeve of mine. Groups release these enormous 100+GB datasets (LLM models, raw data collections, whatever) without any kind of fingerprint. Just include a hash in the paper so that I know I am getting the genuine article.
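Checking a published hash is cheap on the reader's side. A minimal sketch, assuming the publisher posts a hex digest alongside the file; the function names are made up, and the file is read in chunks so a 100+ GB download never has to fit in memory.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha384", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path: str, expected_hex: str, algorithm: str = "sha384") -> bool:
    """Compare the file's digest to a published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Something like `matches("dataset.zst", published_hex)` then tells you whether the copy you fetched from a mirror or torrent is the same file the authors released.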

by 3eb7988a1663

5/21/2025 at 4:17:20 PM

Is there a faster way to download than the link there? It's steady, but a roughly 11 hour download.

by 1123581321

5/21/2025 at 4:45:51 PM

hmm they should have uploaded a torrent link to download the data, I wonder why they didn't do that?

by ghodawalaaman

5/21/2025 at 5:12:58 PM

I'm sure someone will make a magnet for it eventually

by DaSHacka

5/22/2025 at 12:05:05 PM

Site says the file is restricted for me. Dang.

by naikrovek

5/21/2025 at 4:51:50 PM

Discord was one of the most upsetting wrong turns made with the modern internet. Its primary users at the time were children and now here we are.

by zelifcam

5/21/2025 at 4:51:09 PM

Does anyone else think it’s super creepy that someone’s going through all our messages this way?

by daft_pink

5/21/2025 at 4:57:01 PM

They're not going through all of my messages. I don't use public Discord channels for anything that I wouldn't want the public to see.

Mostly I think it's weird how many people on here seem to have been under the illusion that Discord is somehow ephemeral and private when I can hop on any public server and scroll back indefinitely to see anything that anyone has ever said on that server. And that's before I get into the API and the (admittedly bad) search feature.

I think what you were looking for is Signal or similar.

by lolinder

5/21/2025 at 3:10:24 PM

insane. people doing awful stuff like this is why the world is retreating into private group chats.

these researchers should be ashamed

by spencerflem

5/21/2025 at 3:21:19 PM

You'll probably catch a lot of flak for that on HN, but I 100% agree with you. Just because something is public and can be saved for later doesn't mean it should be done so en-masse.

These are social spaces where a lot of young people essentially grew up. An important part of social development is making mistakes and learning from them. How can you make mistakes when those mistakes are archived for all time for everyone to see?

Similarly, we have a huge problem right now with massive partisanship in the west. Changing one's opinion should be viewed in a positive light, but unfortunately our society doesn't seem to see it that way. Someone with an odious opinion when they were young, who changes that opinion to something more moderate when they grow up, should be viewed as a positive change. But increasingly what we're seeing is people going through data sets to find those old odious opinions of somebody when they were 14 years old and using that as proof that they must still be a terrible person at 24. It's a paranoid, completely self-defeating worldview, but unfortunately it's all too common right now, and I think it's honestly a huge reason for the massive political polarization we're seeing in the moment.

So yes, shame on these researchers. I know they claim to have anonymized the data set, but let's be honest, that never works. It's always easy to find common threads and identify someone, and it's particularly easy now when we have access to all sorts of machine learning models that can really do very effective deanonymization.

by Sanzig

5/21/2025 at 3:36:04 PM

> An important part of social development is making mistakes and learning from them. How can you make mistakes when those mistakes are archived for all time for everyone to see?

Through the rest of us abandoning the, let’s call it, presumption of endorsement: that you having said or even done something stupid ten or five years ago (or even less, if you’re young) means you still endorse it now. Right now it feels like things are moving in the opposite direction, extending that presumption from things you said to things you allowed others to say, to things others said elsewhere that are entirely unrelated to things you allowed them to say in your presence.

by mananaysiempre

5/21/2025 at 3:26:53 PM

I'm not sure I understand the outrage. Before Discord, everyone used forums. And those were archived by Google for the entire world to see. What's the difference?

by trevor-e

5/21/2025 at 4:09:19 PM

One is against the rules of the platform and the other isn't.

by charcircuit

5/21/2025 at 4:51:37 PM

Who cares what the Discord corporation wants? Do you similarly get upset when someone violates the Facebook ToS?

You'd have a more convincing argument if you said something like "oh these servers have the implication of semi-private chats so people may be more inclined to share personal information" or something.

Otherwise, let me play on the world's smallest violin for the poor massive corpo when people don't obey their 500+ page long legalese ToS designed to maximise ownership over each user.

by DaSHacka

5/21/2025 at 11:21:38 PM

>Who cares what the Discord corporation wants?

I do.

>Do you similarly get upset when someone violates the Facebook ToS?

I don't get upset, but I recognize that they would be breaking the rules.

>Otherwise, let me play on the world's smallest violin for the poor massive corpo

Remember the golden rule. If you want agreements you make with others to be upheld then you should respect agreements others make with you.

by charcircuit

5/21/2025 at 4:28:13 PM

Instead of "right to be forgotten" i would like to live in a world with the right "to be forgiven". We know that everyone has done stupid shit in the past and some in present. Idea is not to pretend that this didn't happen, but let it go and accept that people can change (grow up) if they demonstrate regret and willingness to change. Side effect is that if they do repeat offences you can draw some conclusions from that instead of just forgetting it.

Let's bring back the concept of reputation.

by vincnetas

5/21/2025 at 3:52:45 PM

> But increasingly what we're seeing is people going through data sets to find those old odious opinions of somebody when they were 14 years old and using that as proof that they must still be a terrible person at 24. It's a paranoid, completely self-defeating worldview, but unfortunately it's all too common right now, and I think it's honestly a huge reason for the massive political polarization we're seeing in the moment.

I have to seriously question the actual prevalence of this. If someone posts and says "Look at what this person posted 12 years ago!" I'm not going to take them very seriously. I don't know anyone who would. This sounds like the usual "cancel culture" stuff that mostly boils down to "people face consequences for their present shitty behavior"

Anyway, as an avid Discord user, manager of a Discord community, and privacy advocate, I sympathize with this position, but as a software user and developer, and advocate for accessible FOSS (and accompanying information), I unfortunately have to side with the release of this and similar datasets. I would much rather this than the inevitable loss of so much valuable information.

by squigz

5/21/2025 at 3:34:32 PM

I am incredibly happy that my teenage growing up period occurred on now-defunct forums and pre-Facebook social networks.

by Macha

5/21/2025 at 4:00:43 PM

Why does Discord the company get to own the message history on public servers but Twitter or Reddit or traditional forums can get scraped by anyone wanting to make their own commercial AI willy-nilly?

> These are social spaces where a lot of young people essentially grew up. An important part of social development is making mistakes and learning from them. How can you make mistakes when those mistakes are archived for all time for everyone to see?

Understandable, but really, if you want to solve that, then you're up against all of social media. The only difference is that Discord wants you to make a free 2 minutes account so you can join the public server to look at what they said, instead of putting it on Google.

by jowea

5/21/2025 at 3:28:08 PM

Wrong or not, it’s the world we live in. It falls to the parents to teach their kids what not to say from a young age.

Besides, most of the salacious talk happens in DMs and private channels, which weren’t scraped.

If you can find evidence of some way to use this data harmfully, I’ll agree with you. Till then you’re making a fuss possibly without merit.

In general I’m skeptical of the power of data alone to meaningfully harm someone, except in obvious cases like private health info, exposing affairs, financial documents, and so on.

Public information is public, and it’s arguably wrong to keep it locked up rather than the other way around. Datasets like this are the only way open source ML has any chance against the big players.

by sillysaurusx

5/21/2025 at 3:30:05 PM

There is no expectation that anything said on social media will remain private - zero - and anyone that thinks differently is just lying to themselves and everyone else.

by stronglikedan

5/21/2025 at 3:42:03 PM

Discord "servers" are groups you click on an "invite" to join, there is some framing where you are led to believe the conversation is not public.

Bluesky had a similar freakout when HuggingFace packaged up a snapshot of the firehose and this discussion was had ad infinitum - in Bluesky's case every post is explicitly public -- but a very vocal minority of users still felt they had been wronged because no one asked if they were OK with being a part of a research project. There's definitely a gap between what users /should/ be aware of and what impression they actually get using a service, and you have to keep in mind most people are not techies that have spent years of their life wrestling with issues of privacy and data ownership, so announcements like this can come as a surprise, the first time they considered that what they said in one context can be moved to another.

by jazzyjackson

5/21/2025 at 4:07:47 PM

Although, calling it an invite is probably not good or clear communication on Discord’s part. If an invite is publicly posted for anyone to see and there’s no vetting of who is getting the invites… that’s just a public space.

by bee_rider

5/21/2025 at 4:29:10 PM

But “publicly posted” discord invites can mean links in streamers profiles, project websites, etc. While anyone can type that website’s URL, it’s generally expected a group of the public would see & follow that invite, e.g. a streamer’s viewers. It’s not a public space like how a shop on the street is a public space.

Plus, even if the server is in Discovery (and thus really publicly advertised), they’re still mostly sorted into 10+ rooms. Just because a library is public, doesn’t mean I should expect Study Room B will be recorded by John Jameson and entered into a public dataset.

by smileybarry

5/21/2025 at 3:46:21 PM

“Three men may keep a secret, if two of them are dead.”

by AStonesThrow

5/21/2025 at 4:15:27 PM

Thank goodness we know who to blame in all this

by 01HNNWZ0MV43FF

5/21/2025 at 3:33:41 PM

if it's anonymized, is it your mistake and can anyone single you out? I've not seen anyone point out 14 year old posts but I suppose it occurs. But at what arbitrary age can we start to hold them accountable for their posts? 70? What about people who flip flop on views whenever it suits their needs?

"It's easy to find common threads and identify someone?" Prove it because I don't think it's that easy.

by knowitnone

5/21/2025 at 3:30:32 PM

> Just because something is public and can be saved for later doesn't mean it should be done so en-masse.

Just because something is public means that someone is actually going to save it for later en-masse.

Educating 14-year-old kids so that they don't post public chat with their real name is more important than shaming the researchers.

by d--b

5/21/2025 at 3:17:49 PM

I'm not sure this is awful. These are public discords, no more private than newsgroups, or StackOverflow. Why is it awful to scrape this, but not that?

by pavel_lishin

5/21/2025 at 4:17:54 PM

A Discover server's status isn't permanent. Any invite-only server can be turned into a discoverable, public server, and any public server can be taken private. The public/private distinction isn't a permanent reflection of access but is just a reflection of the current permissions. There's no guarantee that users in a public server were in a public server when they joined and engaged.

by ysavir

5/21/2025 at 4:23:25 PM

That sounds like an enormous flaw in Discord's permission model, not something that should be patched over with "shame" on researchers for archiving data that Discord made public.

The fact remains that if you post something in a private discord channel that someone takes public, it's now public information. All these researchers did is expose that fact—the fact that private messages can later be made public is Discord's responsibility and Discord's shame.

by lolinder

5/21/2025 at 4:52:27 PM

Eh, depends on how you look at it. "Public" is a pretty vague term. Something might be public in the sense that anyone can come and view, but it can also be public in the sense that anyone can join and participate, but you still must be a joined member to participate (and that participation can be revoked, etc). Twitter and Reddit are the former, publicly visible communications avenues. Discord is ultimately a private communication channel where anyone can join, but must be joined, to participate.

Unlike Reddit and Twitter, Discord was never meant to be a space where your contributions are intended to be publicly viewable. People forget that while Discord is oftentimes used as a replacement for forums, it is actually the spiritual successor to IRC, AIM, and similar chat services, where the data is typically ephemeral. Message history in these services is still a fairly recent addition, and what we're seeing now is the consequence of that data being available.

Some people think that availability of historical data inherently puts it in the same camp as forums, Reddit, and Twitter, but I don't share that view. Discord data is still intended for present members of the given servers, not the public at large, even if anyone in the public space is able to become a present member. The distinction there is a meaningful one.

by ysavir

5/21/2025 at 5:05:31 PM

> it is actually the spiritual successor to IRC, AIM, and similar chat services, where the data is typically ephemeral

I can't speak for AIM, but I've never assumed IRC was ephemeral. Most IRC servers had and have numerous people idling in the channel with their client just archiving everything that happened. Many of those archives ended up published on the internet and that usually surprised no one.

Discord is even worse because it does the archiving for you and grants new users immediate access to the entire archive.

As a rule, each and every recipient of every message that you send has the option to archive or forward that message. That's a simple fact of information flow that everyone in the modern world must wrap their heads around. In this case, when you send a message to Discord you're transitively sending it to everyone who is currently or ever will be on a given server. That has always been true, and that fact has been exploited ever since Discord came about.

All these researchers have done is expose the kind of archiving that was always possible and always happening.

by lolinder

5/21/2025 at 6:19:10 PM

Sure, but there's a massive difference between an actor joining a chat and maintaining indefinite, constant access to it vs an actor joining a chat and having access to the full history automatically.

> All these researchers have done is expose the kind of archiving that was always possible and always happening.

What they've done is brought forward another instance of "this is why we can't have nice things". A useful feature being used in unintended ways without regard for ethics and privacy. The authors of this paper did not have respect for people's data and privacy. They did the bare minimum to make the claim they took privacy measures, regardless of whether the claim is true or not. I guess that's enough for them.

by ysavir

5/21/2025 at 8:01:19 PM

You're not getting it: this is already happening everywhere. There is no private data on public Discord servers. Never has been.

All these researchers have done is admit to doing what many less well-meaning people and organizations are already doing on a regular basis. If that admission leads to more people realizing that public Discord servers are public, then the researchers have done everyone a service.

Blaming the researchers in this case is like blaming LockPickingLawyer for locks being pickable. The bad guys were already picking the locks before he started his channel; all he did was shine a light on just how bad most locks are.

by lolinder

5/21/2025 at 5:01:04 PM

> Message history in these services is still a fairly recent addition

Is it? Discord's been around for a decade; as far as I'm aware, it's always had message history, unlike IRC.

> Some people think that availability of historical data inherently puts it in the same camp as forums, Reddit, and Twitter, but I don't share that view. Discord data is still intended for present members of the given servers, not the public at large, even if anyone in the public space is able to become a present member. The distinction there is a meaningful one.

I definitely agree with this, although the fact that anyone joining can scroll back does sort of mean it's available to anyone who wants to look for it - it's just difficult to do so.

by pavel_lishin

5/21/2025 at 6:24:03 PM

> Is it? Discord's been around for a decade; as far as I'm aware, it's always had message history, unlike IRC.

Almost a decade, yeah. And Slack had similar functionality in the few years before Discord arrived. But chat services have been around since, what, the late 80s or early 90s? Services like Discord are still the new kid on the block. Now we're seeing it being taken advantage of (or at least the first instance of someone publicly and proudly taking advantage of it), prompting curiosity about what comes next: Will this disregard for privacy simply become the default expectation, will people shift to more private spaces less susceptible to abuse, or will privacy regulations emerge to protect people?

by ysavir

5/21/2025 at 3:22:17 PM

they are "open invite" in the sense that it's nice to have serendipitous encounters with people on the internet you'd otherwise never meet. it's for chatting, and the pace of things ensures whatever you say will be buried.

the point isn't to make an artifact, like stack overflow. and certainly not to be experimented on.

by spencerflem

5/21/2025 at 4:25:45 PM

The pace of things does not ensure that whatever you say remains buried. Discord has a search feature that isn't great but does work. If you want to ensure that what you say is ephemeral you need to use a chat app that has that as a feature, like Signal.

by lolinder

5/21/2025 at 4:26:37 PM

> the pace of things ensures whatever you say will be buried.

It very clearly does not.

by pavel_lishin

5/21/2025 at 4:14:42 PM

Kinda, but, it's a dark forest out there and I'd rather have public red teaming than secret red teaming. At least everyone knows it now

by 01HNNWZ0MV43FF

5/21/2025 at 4:33:45 PM

These researchers are making public scraping that was already happening in private. Do you really think they were the first ones to realize that public Discord servers were public?

On the contrary, kudos to these researchers for bursting the illusion that people previously had that things said on public servers somehow would be ephemeral. That's not how the internet works, that's not how it's ever worked. If you send something to someone else's computer you have always had to assume that every recipient could have made a copy of it, and when the recipient list is "everyone who ever joins a public Discord server from now until the end of time" that makes it public information.

Better that it be widely recognized and talked about as these researchers are doing than have accessing public data remain a dark art that lay people mistakenly believe can't happen.

by lolinder

5/21/2025 at 3:24:42 PM

People should realize when what they write is public though. The world retreating into private group chats is not a bad thing.

by d--b

5/21/2025 at 3:26:48 PM

it's a shame that we can't get the benefits of both. it sucks having to individually vet everyone one at a time. you miss a lot.

but people like these researchers who only want to exploit are making it necessary

by spencerflem

5/21/2025 at 4:01:38 PM

The two things that suck are:

1. The illusion of intimacy

2. The illusion of ephemerality

People think they chat to a small number of people, and people think that it's going to go away at some point.

So they think they're in a private conversation when they're not. They wouldn't behave the same if they realized what they say is being written down and stored forever in some database.

And yes, it sucks that technology doesn't allow you to have it both ways => public because you want to reach far and wide, and private cause you don't want it recorded.

by d--b

5/21/2025 at 4:42:35 PM

These researchers are just using the tools that Discord made available to archive information that was already public. If anyone is to blame for users believing that their messages were private or ephemeral it's Discord, not the researchers.

Look at it like this: There's no chance in hell that intelligence agencies, hacker groups, and whatever other nasties you care to worry about haven't already been using archives just like this for all their nefarious purposes. They just didn't make their usage public because why break the honey pot?

What these researchers did is show what was possible and make their efforts public. Now you are better informed of what was always possible. It's always been necessary to think carefully before putting stuff on the internet, it's just now your bubble is burst.

by lolinder

5/21/2025 at 3:00:34 PM

[flagged]

by micromacrofoot

5/22/2025 at 9:26:08 AM

[flagged]

by zhdpos

5/22/2025 at 9:25:43 AM

[flagged]

by zhdpos