dhfbk/MuLTa-Telegram

MuLTa-Telegram

A Multilingual and Multi-Target Dataset for Hate Speech and Target Detection on Telegram

Description

MuLTa-Telegram is a curated dataset comprising about 4,000 manually annotated messages collected from public Telegram channels in Italian and Polish. It is designed to support research in hate speech detection, with a particular focus on:

  • The presence or absence of hate speech
  • Identification of the specific groups targeted by hate speech
  • Identification of the groups mentioned in each message, whether or not the message is hateful (mentioned group)

Paper

The dataset is described in the following paper:

MuLTa-Telegram: A Fine-Grained Italian and Polish Dataset for Hate Speech and Target Detection

Contents

The repository includes:

  • The full annotated dataset (JSON format)
  • Annotation guidelines
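
As a starting point for working with the JSON release, the sketch below loads a file and counts records per value of one field. It assumes the file holds a single JSON array of message objects; the file name `multa_telegram.json` and the field name `language` are hypothetical placeholders, so check the actual file and the annotation guidelines for the real names.

```python
import json
from collections import Counter
from pathlib import Path


def load_messages(path):
    """Load a JSON file that is assumed to hold a list of message records."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of message records")
    return data


def count_by(records, field):
    """Count records per value of `field`; records lacking it fall under None."""
    return Counter(record.get(field) for record in records)


# Example usage (file and field names are assumptions, not the real schema):
# messages = load_messages("multa_telegram.json")
# print(count_by(messages, "language"))
```

If the release turns out to be split across several files or to use JSON Lines, only `load_messages` would need to change.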

Dataset Summary

| Language | Total Messages | Hate Speech | Non-Hate |
|----------|----------------|-------------|----------|
| Italian  | 2,002          | 411 (20.5%) | 1,591    |
| Polish   | 1,934          | 257 (12.9%) | 1,684    |
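
The hate speech share in the summary is the hate count divided by the total for that language. The sketch below recomputes the shares from the counts reported above and checks that each row adds up; the dictionary layout and function names are illustrative only. With the figures as printed, the Italian row reconciles (411 + 1,591 = 2,002, about 20.5%), while the Polish row does not (257 + 1,684 = 1,941 rather than 1,934), so counts derived directly from the released JSON should take precedence.

```python
# Counts copied from the summary above; the structure is illustrative.
REPORTED = {
    "Italian": {"total": 2002, "hate": 411, "non_hate": 1591},
    "Polish": {"total": 1934, "hate": 257, "non_hate": 1684},
}


def hate_share(row):
    """Percentage of messages labeled as hate speech."""
    return 100 * row["hate"] / row["total"]


def reconciles(row):
    """True when the hate and non-hate counts sum to the reported total."""
    return row["hate"] + row["non_hate"] == row["total"]


for language, row in REPORTED.items():
    print(f"{language}: {hate_share(row):.1f}% hate, consistent={reconciles(row)}")
```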

Target Categories

Annotations cover nine primary identity groups, organized under the following categories:

  • Ethnicity/Origin (People of Color, Romani, Other)
  • LGBT+
  • Women
  • Religious groups (Jewish, Muslim, Christian)
  • People with disabilities
  • Other or No Target (as applicable)

Each message is also labeled with the target group it mentions, independently of whether the message is hateful.
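
Because the mentioned group is annotated separately from the hate label, a message can mention a group without being hateful. The sketch below models that with a small record type and cross-tabulates mentioned groups against the hate label. The class, its field names, and the use of tuples (which allows several groups per message) are assumptions made for illustration, not the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical view of one annotated message."""
    hate: bool                     # is the message hate speech?
    hate_targets: tuple = ()       # groups targeted by the hate, if any
    mentioned_targets: tuple = ()  # groups mentioned, hateful or not


def mention_vs_hate(annotations):
    """Count (mentioned group, is_hateful) pairs across annotations."""
    table = Counter()
    for ann in annotations:
        for group in ann.mentioned_targets:
            table[(group, ann.hate)] += 1
    return table


examples = [
    Annotation(hate=False, mentioned_targets=("Women",)),
    Annotation(hate=True, hate_targets=("Women",), mentioned_targets=("Women",)),
    Annotation(hate=False),  # no group mentioned at all
]
print(mention_vs_hate(examples))
```

A table of this kind makes it easy to see how often each group is discussed in non-hateful messages, which is the distinction the separate label is meant to capture.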

Methodology

  • Messages were retrieved using a snowball sampling approach applied to public Telegram channels known for high toxicity or disinformation.
  • Pre-selection of messages used a keyword-based filtering matrix covering all target categories.
  • Annotations were performed by expert native speakers following a structured guideline developed in collaboration with civil society organizations.
  • All data were anonymized in accordance with applicable privacy regulations.
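
The first two steps above can be sketched as follows. This is a minimal illustration, not the project's pipeline: it assumes snowball sampling means breadth-first expansion from seed channels along some link relation (for example forwards or mentions, supplied here as a plain dictionary), and that the filtering matrix maps each target category to a keyword list. All channel names, category names, and keywords are neutral placeholders.

```python
import re
from collections import deque


def snowball(seeds, neighbours, max_rounds=2):
    """Expand breadth-first from seed channels for at most `max_rounds` hops."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        channel, depth = frontier.popleft()
        if depth == max_rounds:
            continue  # do not expand beyond the last round
        for nxt in neighbours.get(channel, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def build_matchers(keyword_matrix):
    """Compile one case-insensitive whole-word pattern per category."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        for category, words in keyword_matrix.items()
        if words
    }


def preselect(text, matchers):
    """Return the sorted categories whose keywords occur in `text`."""
    return sorted(c for c, pattern in matchers.items() if pattern.search(text))


links = {"seed_channel": ["channel_b"], "channel_b": ["channel_c"]}
matchers = build_matchers({"category_x": ["alpha", "beta"], "category_y": ["gamma"]})
print(snowball(["seed_channel"], links))
print(preselect("An ALPHA and a gamma.", matchers))
```

Python's `re` module treats `\b` as Unicode-aware for string patterns, so whole-word matching also works with accented Italian and Polish characters. Inflected word forms would need stemming or wildcard patterns, which this sketch leaves out.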

Project Resources

Additional models, data, and documentation developed during the HATEDEMICS project are available in the HATEDEMICS GitHub repository.

Access and License

The dataset is publicly available under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.
You are free to share and adapt the material for research purposes, provided appropriate credit is given.

Acknowledgements

This work was supported by the European Union’s CERV fund under Grant Agreement No. 101143249 (HATEDEMICS) and by the Horizon Europe research and innovation programme under Grant Agreement No. 101135437 (AI-CODE).
