dhfbk/hatedemics

HATEDEMICS Models and Data

This repository contains resources developed within the HATEDEMICS project, including models and annotated datasets for hate speech detection in English, Spanish, Polish, and Italian.

Contents

  • Final versions of hate speech detection models
  • Human-annotated datasets
  • LLM-annotated datasets
  • Documentation on labels, data format, and model usage

Languages

The repository currently includes resources for:

  • English
  • Spanish
  • Polish
  • Italian

Tasks

Hate Speech Detection

Models and annotations for identifying hate speech in online content.

Models

The released models were developed using MaChAmp. We provide final models for hate speech detection in English, Spanish, Polish, and Italian.

The released models are fine-tuned versions of existing pre-trained models. Each model was fine-tuned on the corresponding annotated data released in the data/ directory. The MaChAmp configuration files used for training are provided in the configs/ directory.
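As a rough illustration of what a dataset configuration for this kind of fine-tuning might look like, the sketch below builds a MaChAmp-style configuration for sentence-level classification and prints it as JSON. The files in configs/ remain the authoritative reference. The dataset name, file paths, task name, and column indices here are hypothetical. The key names (train_data_path, dev_data_path, sent_idxs, tasks, task_type, column_idx) follow MaChAmp's documented dataset-config format and should be checked against the MaChAmp version in use.

```python
import json


def build_dataset_config(name, train_path, dev_path, text_col=0, label_col=1):
    """Return a MaChAmp-style dataset config for sentence classification.

    All argument values are supplied by the caller; nothing here is read
    from the repository itself.
    """
    return {
        name: {
            "train_data_path": train_path,
            "dev_data_path": dev_path,
            # Column(s) of the tab-separated file that hold the input text.
            "sent_idxs": [text_col],
            "tasks": {
                # Hypothetical task name; the label sits in column `label_col`.
                "hate_speech": {
                    "task_type": "classification",
                    "column_idx": label_col,
                }
            },
        }
    }


# Hypothetical file names, chosen only to match the data/ layout.
config = build_dataset_config(
    "hate_speech_it",
    "data/human_annotated/it_train.tsv",
    "data/human_annotated/it_dev.tsv",
)
print(json.dumps(config, indent=4))
```

Writing the printed JSON to a file would give a dataset configuration of the same general shape as those stored under configs/datasets/, assuming the data is tab-separated with text in the first column and the label in the second.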

Released Models

Task        | Language | Base model                                        | Training data
Hate speech | Italian  | MilaNLProc/hate-ita                               | Human-annotated Italian Telegram data
Hate speech | Polish   | ptaszynski/bert-base-polish-cyberbullying         | Human-annotated Polish Telegram data
Hate speech | English  | facebook/roberta-hate-speech-dynabench-r4-target  | LLM-annotated English Telegram data
Hate speech | Spanish  | dccuchile/bert-base-spanish-wwm-uncased           | LLM-annotated Spanish Telegram data
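For scripts that need to pick a base model by language, the table above can be mirrored in a small lookup. This is only a convenience sketch: the names RELEASED_MODELS and base_model_for are hypothetical and are not part of the repository.

```python
# Mirror of the "Released Models" table: language -> (base model, annotation source).
RELEASED_MODELS = {
    "Italian": ("MilaNLProc/hate-ita", "human"),
    "Polish": ("ptaszynski/bert-base-polish-cyberbullying", "human"),
    "English": ("facebook/roberta-hate-speech-dynabench-r4-target", "llm"),
    "Spanish": ("dccuchile/bert-base-spanish-wwm-uncased", "llm"),
}


def base_model_for(language):
    """Return the base model identifier listed for `language`.

    Raises KeyError with the supported languages if `language` is not listed.
    """
    try:
        return RELEASED_MODELS[language][0]
    except KeyError:
        supported = ", ".join(sorted(RELEASED_MODELS))
        raise KeyError(f"No released model for {language!r}; supported: {supported}")


print(base_model_for("Polish"))
```

The identifiers look like Hugging Face Hub model names. If that is the case, they refer to the base checkpoints only, not to the fine-tuned HATEDEMICS models stored under models/.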

Repository Organization

The repository is organized around the three main types of released resources: annotated data, fine-tuned models, and MaChAmp configuration files.

.
├── data/
│   ├── human_annotated/
│   └── llm_annotated/
├── models/
│   └── hate_speech/
├── configs/
│   ├── datasets/
│   └── parameters/
├── docs/
└── README.md
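After cloning, a quick check that the directories shown above are present can catch an incomplete checkout. The sketch below assumes only the layout in the tree. The names EXPECTED_DIRS and missing_dirs are hypothetical, and the demo runs against a temporary directory rather than a real clone.

```python
from pathlib import Path
import tempfile

# Directories taken from the tree above.
EXPECTED_DIRS = [
    "data/human_annotated",
    "data/llm_annotated",
    "models/hate_speech",
    "configs/datasets",
    "configs/parameters",
    "docs",
]


def missing_dirs(root):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demo: build every expected directory except docs/ in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for d in EXPECTED_DIRS[:-1]:
        (Path(tmp) / d).mkdir(parents=True)
    demo_missing = missing_dirs(tmp)

print(demo_missing)  # ['docs']
```

For a real clone, call missing_dirs with the path to the repository root. An empty list means every directory in the tree was found.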

Related Resources

This repository is connected to the MuLTa-Telegram resource, a publicly available Italian and Polish Telegram dataset for hate speech and target detection.

The MuLTa-Telegram dataset is described in the following paper:

Acknowledgements

This work was supported by the European Union’s CERV fund under Grant Agreement No. 101143249 (HATEDEMICS).
