
Researchers openly publish 2 billion Discord messages obtained through unauthorized scraping.

Messages you post in public Discord servers could end up being used for scientific research.

Unchecked Release of 2 Billion Discord Conversations by Researchers Online

A recent revelation has sparked concerns over privacy and data compliance on the popular communication platform Discord. According to reporting by 404 Media, researchers at the Federal University of Minas Gerais in Brazil scraped a dataset of over 2 billion messages using Discord's public API [1][2][3][4][5].

However, whether the project complies with Discord's Terms of Service remains unclear. Generally, Discord's policies prohibit unauthorized mass data collection and require that data use respect user privacy and platform rules. While scraping public messages via the official API might be technically allowed under certain conditions, the scale of the collection and the publication of the resulting database could breach Discord's guidelines on user data privacy and acceptable use.

Discord's Terms of Service explicitly state that scraping data is not allowed, a rule that has been in place since at least 2020 [6]. A Discord spokesperson confirmed that scraping its services without consent violates the Terms of Service and Community Guidelines [7].

The researchers published the dataset to give other researchers a sizable sample of human activity to study [8]. They anonymized the data by replacing usernames with pseudonyms, hashing and truncating identifiers, and removing potentially identifying features [9]. However, details in the conversations could still identify users, especially when conversations are pieced together [10].
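
The anonymization steps described above can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: the function names, the salt, the 12-character truncation length, and the message fields (`username`, `user_id`, `content`) are all assumptions made for illustration.

```python
import hashlib


def truncated_hash(identifier, salt, length=12):
    """Hash an identifier with SHA-256 and keep only the first `length` hex characters.

    The salt and the truncation length are illustrative assumptions.
    """
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:length]


class Pseudonymizer:
    """Assigns each username a stable pseudonym in order of first appearance."""

    def __init__(self):
        self._names = {}

    def pseudonym(self, username):
        if username not in self._names:
            self._names[username] = f"user_{len(self._names) + 1}"
        return self._names[username]


def anonymize(message, names, salt):
    """Return a copy of a message with the author fields replaced.

    `message` is a hypothetical dict with `username`, `user_id`, and `content` keys.
    """
    return {
        "author": names.pseudonym(message["username"]),
        "author_id": truncated_hash(message["user_id"], salt),
        "content": message["content"],  # message text is left untouched here
    }


if __name__ == "__main__":
    names = Pseudonymizer()
    sample = {"username": "alice", "user_id": "1001", "content": "hello"}
    print(anonymize(sample, names, salt="example-salt"))
```

In this sketch the message text passes through unchanged, which is where the re-identification risk comes from: a pseudonymous author can still be recognized from what they wrote.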

Discord's initial investigation determined that the user accounts involved accessed Discord servers that were discoverable and widely accessible, and scraped data without permission [11]. The dataset was collected from Discord servers between 2015 and 2024 and covers about 10% of the platform's open servers [12].

The researchers hope that the data will help explore the impact of digital platforms on political discourse, the propagation of misinformation, and the development of effective moderation and regulation strategies [13]. Potential applications of the data include discourse analysis, studying the relationship between social media and mental health, and training AI chatbots [14].

Discord is currently investigating the matter and will take appropriate enforcement actions [15]. The researchers' project may be in violation of Discord's rules as they scraped data without written consent [16]. This incident serves as a reminder to be cautious about what is shared on digital platforms, as it may be read or used in the future.

[1] [URL for the dataset publication]
[2] [URL for the research paper]
[3] [URL for the Discord API documentation]
[4] [URL for Discord's Terms of Service]
[5] [URL for Discord's Community Guidelines]
[6] [URL for Discord's Terms of Service, section outlining the prohibition on scraping]
[7] [URL for Discord's official statement on the matter]
[8] [URL for the researchers' statement on the purpose of the dataset]
[9] [URL for the researchers' description of the data anonymization process]
[10] [URL for the researchers' acknowledgement of potential user identification]
[11] [URL for Discord's initial investigation findings]
[12] [URL for information on the timeframe of the data collection]
[13] [URL for the researchers' goals for the data]
[14] [URL for potential applications of the data]
[15] [URL for Discord's statement on the ongoing investigation]
[16] [URL for Discord's statement on the potential violation of their rules]

The enormous scraped dataset of Discord messages raises questions about data compliance and adherence to tech companies' terms of service, as mass data collection can breach privacy rules. Discord's Terms of Service have forbidden scraping data without consent since at least 2020, making the project by researchers at the Federal University of Minas Gerais, as reported by 404 Media, a possible violation. Coverage by Gizmodo and other tech publications highlights the importance of understanding and respecting platform rules when collecting data at scale.
