Skip to content

Mejorarsim/Geo-Localization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Geo-Localization Analysis of Tweets

This project analyzes the spatial distribution of tweets in London using a geo-tagged dataset. It implements a newsworthiness scoring mechanism to filter and examine tweets for their relevance. The project includes data preprocessing, grid-based spatial analysis, and the application of a newsworthiness scoring model to geo-tagged tweets.

Table of Contents

Overview

This project:

  1. Organizes tweets into 1km x 1km grids in London.
  2. Develops a newsworthiness scoring method based on high-quality, low-quality, and background tweets.
  3. Analyzes geo-tagged data using the newsworthiness scores to assess the spatial distribution of newsworthy tweets.

Dataset

The dataset comprises geo-tagged tweets from London, organized into several JSON files. It includes separate datasets for background tweets and for high and low-quality tweets used for the newsworthiness model.

  • Geo-tagged tweets: geoLondonSep2022_*.json
  • Background tweets: bgQuality.json
  • High-quality tweets: highQuality.json
  • Low-quality tweets: lowQuality.json

Installation

To run this project, you need to install the required Python libraries:

pip install -r requirements.txt

requirements.txt:

nltk
geopandas
matplotlib
numpy
pandas
seaborn
shapely

Usage

  1. Navigate to the data directory containing the JSON files.

    %cd C:\Users\Simran\Desktop\neccchv\Simran\data\datajson
  2. Run the main script:

    python geo_localization_analysis.py

This script will:

  • Combine and preprocess tweet data.
  • Calculate tweet density in 1km x 1km grids.
  • Apply newsworthiness scoring to the tweets.
  • Visualize the results.

Methods

1. Grid-Based Spatial Analysis

  • Compute Haversine Distance: Calculate the distance between two geo-locations.
  • Grid Dimensions: Determine the number of rows and columns for the grid covering the London area.
  • Tweet Distribution: Count the number of tweets in each grid cell.

2. Newsworthiness Scoring

  • Data Preprocessing: Tokenize and remove stopwords from tweets.
  • Term Frequency Calculation: Compute term and document frequencies.
  • Likelihood Ratios: Calculate likelihood ratios for terms based on term frequencies in high-quality, low-quality, and background tweets.
  • Scoring Tweets: Assign newsworthiness scores to tweets based on term likelihood ratios.

3. Analysis and Visualization

  • Distribution Visualization: Create histograms and heatmaps to visualize tweet distribution and newsworthiness scores.
  • Statistical Analysis: Compute and visualize statistics of tweet distribution across grid cells.

Visualizations

  • Distribution of Tweets in Grid Cells: Tweets Distribution
  • Heatmap of Tweet Distribution: Heatmap
  • Newsworthiness Score Distribution: Newsworthiness Scores

Discussion

  • Spatial Distribution: The tweet density varies significantly across London, with certain areas having a higher concentration of tweets.
  • Newsworthiness: The newsworthiness score helps filter tweets, identifying those more relevant for analysis. The chosen threshold effectively separates high and low newsworthy tweets, with a reasonable balance between sensitivity and specificity.

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Commit your changes (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

License

This project is licensed under my name - Simran Garg, GIT-https://github.com/Mejorarsim.

About

Geo-Localization Analysis of Tweets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published