Skip to content

my1e2/Documenting_Hate

Repository files navigation

Documenting Hate — Hate Crime Relational Database

A normalized, 7-table MySQL relational database built to document, query, and analyze hate crime trends in the United States using ProPublica's Documenting Hate dataset.


Overview

This project was developed for INST327 (Section 0104) at the University of Maryland, College Park (August–December 2024). The database was designed to transform a raw, flat ProPublica dataset of hate crime news articles into a structured, queryable relational system — enabling journalists, researchers, advocates, and legislators to efficiently surface geographic hotspots, recurring crime types, high-frequency keywords, and publishing trends across organizations covering hate incidents.

The original ProPublica dataset documents hate incidents reported between February 13, 2017, and August 12, 2017. The team scoped the sample data to articles published between March 24–27, 2017, selecting records with complete attribute coverage across all seven entities. The result is a database containing 45 articles from 45 organizations across 35 U.S. locations, with 400 keywords and 42 article summaries — all normalized to the Third Normal Form (3NF) and optimized for JOIN-heavy analytical queries.


Motivation

Hate crimes remain an underreported and underanalyzed facet of American life. As members of minority communities personally affected by the climate of targeted violence, the team designed this database to shed light on these disparities — not as raw statistics, but as a structured, reusable tool. By documenting incidents with granular detail (who reported it, where it happened, what kind of crime it was), the database helps uncover systemic patterns that individual headlines cannot convey on their own.


Database Schema

The physical database consists of seven tables with no one-to-one relationships — all are either one-to-many or many-to-many — minimizing redundancy and maximizing flexibility for future inserts and updates.

Core Tables

article — Central table of the database. Stores all article-level metadata.

Column Type Constraints
Article_id INT PRIMARY KEY, AUTO_INCREMENT
Article_date DATETIME NOT NULL
Article_title VARCHAR(300) NOT NULL
Article_url_prefix VARCHAR(300)
Article_url_suffix VARCHAR(300)
Location_id INT NOT NULL, FK → location
Organization_id INT NOT NULL, FK → organization

The URL was deliberately split into prefix and suffix components to avoid redundancy across articles sharing the same domain origin (e.g., multiple articles from https://www.wsws.org).


location — Stores city/state combinations. Decoupled from article to enable location-level aggregation queries.

Column Type Constraints
Location_id INT PRIMARY KEY, NOT NULL
City VARCHAR(45) NOT NULL
State VARCHAR(45) NOT NULL

Sample locations span 35 cities including New York (NY), Washington (DC), Wichita (KS), Denver (CO), Seattle (WA), Baltimore (MD), and San Diego (CA).


organization — Stores the names of the 45 news outlets that published the articles.

Column Type Constraints
Organization_id INT PRIMARY KEY, NOT NULL
Organization_name VARCHAR(45) NOT NULL

Organizations include BuzzFeed News, The Washington Post, Baltimore Sun, Wichita Eagle, San Diego Gay & Lesbian News, India New England, KCUR, World Socialist Web Site, and many others — spanning national, regional, and community-specific outlets.


keyword — Stores individual keywords extracted from article content, supporting full-text pattern analysis.

Column Type Constraints
Keyword_id INT PRIMARY KEY
Keyword VARCHAR(75)

The keyword vocabulary spans 400 terms including crime types (stabbing, murder, attack, threats, vandalism), identity/group markers (jewish, islamic, sikh, black, asian), locations (manhattan, baltimore, colorado), and political/media figures (trump, spicer, blasio).


summary — Stores article summaries up to 8,000 characters. Kept as a separate entity to avoid duplication when multiple articles share the same summary (e.g., articles 42, 43, 44, and 45 all reference the same mosque vandalism summary).

Column Type Constraints
Summary_id INT PRIMARY KEY
Summary VARCHAR(8000)

Junction (Linking) Tables

article_keyword — Resolves the many-to-many relationship between articles and keywords. An article can have dozens of keywords; a keyword can appear across many articles.

Column Type Constraints
Article_id INT FK → article
Keyword_id INT FK → keyword

article_summary — Resolves the many-to-many relationship between articles and summaries, enabling summary reuse and reducing data redundancy.

Column Type Constraints
Article_id INT FK → article
Summary_id INT FK → summary

Entity-Relationship Diagram

The ERD was designed in MySQL Workbench and captures the following relationship structure:

  • articlelocation: many-to-one (many articles can share a location)
  • articleorganization: many-to-one (many articles can be published by one organization)
  • articlesummary: many-to-many via article_summary
  • articlekeyword: many-to-many via article_keyword

All IDs were set to NOT NULL and auto-incremented where applicable. The absence of one-to-one relationships across the schema reflects a design principle favoring reusable, independently scalable entities over tightly coupled attribute bundles.


SQL Views & Queries

Five analytical views were created to enable common query patterns without requiring end users to write complex SQL from scratch.


View 1 — high_frequency_keywords

Purpose: Identifies keywords appearing in more than two articles, ranked by frequency. Useful for spotting recurring crime types, location names, or targeted groups across the dataset.

CREATE VIEW High_Frequency_Keywords AS
    SELECT K.Keyword, COUNT(AK.Keyword_id) AS Keyword_Count
    FROM Keyword AS K
    JOIN Article_Keyword AS AK USING(Keyword_id)
    GROUP BY K.Keyword
    HAVING COUNT(AK.Keyword_id) > 2
    ORDER BY Keyword_COUNT DESC;

Techniques used: JOIN, GROUP BY, HAVING, ORDER BY


View 2 — city_article_count

Purpose: Counts how many articles are associated with each city-state combination, enabling geographic hotspot identification. New York, Washington DC, Wichita, and San Diego appear most frequently in the sample data.

CREATE VIEW city_article_count AS
SELECT
    l.City AS 'City Name',
    l.State AS 'State',
    COUNT(a.Location_id) AS 'Article Count'
FROM location l
JOIN article a ON l.Location_ID = a.Location_id
GROUP BY l.City, l.State
ORDER BY COUNT(a.Location_id) DESC;

Techniques used: JOIN, GROUP BY, ORDER BY, column aliasing


View 3 — organization_most_recent_article

Purpose: Lists each publishing organization alongside the date of its most recently published article, ordered from newest to oldest. Useful for identifying which outlets are most actively covering hate incidents.

CREATE VIEW organization_most_recent_article AS
SELECT
    o.Organization_name AS 'Organization',
    a.Article_date AS 'Article Date'
FROM documenting_hatedb.article a
JOIN documenting_hatedb.organization o USING(Organization_id)
ORDER BY Article_date DESC;

Techniques used: JOIN, ORDER BY


View 4 — article_crime_type

Purpose: Surfaces article titles and summaries filtered to specific violent crime keywords: stabbing, murder, attack, and threats. Enables rapid identification of the most severe incidents in the dataset.

CREATE VIEW article_crime_type AS
SELECT
   k.Keyword,
   a.Article_title,
   s.Summary
FROM Documenting_Hatedb.article a
JOIN Documenting_Hatedb.article_keyword USING(Article_id)
JOIN Documenting_Hatedb.keyword k USING(Keyword_id)
JOIN Documenting_Hatedb.article_summary USING(Article_id)
JOIN Documenting_Hatedb.summary s USING(Summary_id)
WHERE k.Keyword = 'stabbing' OR k.Keyword = 'murder'
   OR k.Keyword = 'attack' OR k.Keyword = 'threats'
ORDER BY k.Keyword;

Techniques used: Five-table JOIN chain, WHERE with multiple conditions, ORDER BY


View 5 — article_location_details

Purpose: Returns article IDs, titles, and their city/state for any location associated with more than one article — revealing geographic clusters where hate incidents were reported repeatedly within the data window.

CREATE VIEW article_location_details AS
SELECT a.Article_id, a.Article_title, l.City, l.State
FROM article a
JOIN location l ON a.Location_id = l.Location_id
WHERE l.Location_id IN (
    SELECT Location_id
    FROM article
    GROUP BY Location_id
    HAVING COUNT(Location_id) > 1
);

Techniques used: JOIN, correlated subquery with GROUP BY and HAVING


View Coverage Matrix

View Req A (x4) Req B (x3) Req C (x2) Req D (x1) Req E (x1)
high_frequency_keywords X X X
city_article_count X X
organization_most_recent_article X
article_crime_type X X X
article_location_details X X X X X

Normalization & Design Decisions

The database was normalized to Third Normal Form (3NF), eliminating transitive dependencies and ensuring every non-key attribute depends only on the primary key of its table.

Key design decisions include:

  • URL splitting: Article_url_prefix and Article_url_suffix were separated to eliminate redundancy across articles sharing the same domain.
  • Summary deduplication: Multiple articles referencing identical AP wire summaries (e.g., four articles covering the same Colorado mosque vandalism story all link to one Summary_id) are resolved cleanly through article_summary rather than duplicating full summary text.
  • Keyword table isolation: Keeping keywords in a dedicated table allows keyword-level analytics (frequency counts, trend analysis) without scanning full article text.
  • NOT NULL enforcement on IDs: All Location_id and Organization_id foreign keys in the article table are set to NOT NULL, ensuring only articles with complete provenance data are admitted into the database.
  • No one-to-one relationships: Every relationship in the schema is one-to-many or many-to-many, supporting data reuse, lower redundancy, and clean scalability as new articles are added.

The original proposal included attributes like zip codes, email addresses, phone numbers, and crime type enumerations — none of which existed in the source data. These were removed entirely in favor of a schema strictly grounded in available data, with the linking-table architecture providing equivalent analytical power.


ETL Process

Extract: Raw data sourced from ProPublica's Documenting Hate News Index. Records were filtered to those with complete attribute coverage (article date, title, organization, city, state, URL, keywords, summary).

Transform: Data was cleaned and reorganized from a flat tabular structure into seven normalized entities. Keywords were tokenized and enumerated. URLs were split at the domain boundary. Summaries were deduplicated and assigned shared IDs where applicable.

Load: Data was imported into MySQL using INSERT statements generated via forward engineering from the MySQL Workbench ERD. Reverse engineering was used to verify consistency between the logical ERD model and the physical database schema throughout development.


Ethical & Privacy Considerations

  • All source data is publicly available through ProPublica; no legal data privacy concerns apply to the database contents.
  • Some summaries contain names of individuals involved in hate incidents. These were retained because the crimes are matters of public record and victim families had, in many cases, consented to public disclosure through law enforcement reporting.
  • Keywords were selected to describe crime types and circumstances; terms directly encoding race or ethnicity were deliberately excluded to avoid introducing analytical bias into keyword frequency queries.
  • Data sources span news outlets, social media, public tips, and law enforcement reports — and include both urban and rural locations — to avoid overrepresentation of well-covered metro areas at the expense of underreported rural incidents.

Diversity, Equity & Inclusion

The database was built with explicit DEI intent. By drawing from a wide range of publishing organizations — community papers, LGBTQ+ outlets, South Asian diaspora media, and national wire services — the dataset captures hate incidents affecting groups that mainstream coverage routinely underrepresents. The architecture is designed to scale: as new articles are added from broader time windows and more sources, the views and query structure support increasingly comprehensive trend analysis without redesign.


Future Work

  • Expand the dataset to cover the full ProPublica date range (February–August 2017) and beyond, enabling longitudinal trend analysis
  • Integrate additional attributes such as incident type categorization or victim demographic data if and when they become available in updated source datasets
  • Build a front-end reporting interface allowing non-technical users (journalists, advocates, legislators) to query geographic hotspots and crime-type summaries without writing SQL
  • Connect to live or periodically updated data feeds to keep the database current as new hate incident reports are published

Tools & Technologies

  • Database: MySQL 9.0 (InnoDB engine, utf8mb3 character set)
  • Modeling: MySQL Workbench (ERD, forward and reverse engineering)
  • Query Language: SQL (views, subqueries, multi-table JOINs, GROUP BY / HAVING aggregations)
  • Data Source: ProPublica Documenting Hate News Index (2017)
  • Version Control: GitHub

Team

Myles Sartor, Tanushree Sharma, Ramya Nataraj, Ayana Khan, Gabrielle Waugh University of Maryland, College Park — INST327, Section 0104 (Fall 2024)


References

"Documenting Hate News Index." ProPublica, 2017. projects.propublica.org/hate-news-index/

German, Michael, and Emmanuel Mauleón. "Fighting Far-Right Violence and Hate Crimes." Brennan Center for Justice, July 1, 2019.

ProPublica. "Documenting Hate." ProPublica, September 8, 2016. projects.propublica.org/graphics/hatecrimes

About

Documenting Hate FInalized Deliverables

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors