This project analyzes the spatial distribution of tweets in London using a geo-tagged dataset. It implements a newsworthiness scoring mechanism to filter and examine tweets for their relevance. The project includes data preprocessing, grid-based spatial analysis, and the application of a newsworthiness scoring model to geo-tagged tweets.
This project:
- Organizes tweets into 1km x 1km grids in London.
- Develops a newsworthiness scoring method based on high-quality, low-quality, and background tweets.
- Analyzes geo-tagged data using the newsworthiness scores to assess the spatial distribution of newsworthy tweets.
The dataset comprises geo-tagged tweets from London, stored in several JSON files. Separate files hold the background tweets and the high- and low-quality tweets used to build the newsworthiness model.
- Geo-tagged tweets: geoLondonSep2022_*.json
- Background tweets: bgQuality.json
- High-quality tweets: highQuality.json
- Low-quality tweets: lowQuality.json
To run this project, you need to install the required Python libraries:
pip install -r requirements.txt
requirements.txt:
nltk
geopandas
matplotlib
numpy
pandas
seaborn
shapely
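- Navigate to the data directory containing the JSON files. The path below is the author's; substitute your own. %cd is IPython/Jupyter syntax; use cd in a regular shell.
%cd C:\Users\Simran\Desktop\neccchv\Simran\data\datajson
- Run the main script:
python geo_localization_analysis.py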
This script will:
- Combine and preprocess tweet data.
- Calculate tweet density in 1km x 1km grids.
- Apply newsworthiness scoring to the tweets.
- Visualize the results.
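The grid step listed above can be sketched as follows. This is a minimal illustration, not the code in geo_localization_analysis.py: the function names, the bounding box coordinates, and the choice to split the box into equal slices of latitude and longitude (so cells are only approximately 1 km x 1 km) are all assumptions made for the example.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def grid_dimensions(bbox, cell_km=1.0):
    """Return (rows, cols) needed to cover bbox with cells of about cell_km."""
    min_lon, min_lat, max_lon, max_lat = bbox
    height_km = haversine_km(min_lat, min_lon, max_lat, min_lon)
    width_km = haversine_km(min_lat, min_lon, min_lat, max_lon)
    return math.ceil(height_km / cell_km), math.ceil(width_km / cell_km)


def count_tweets_per_cell(points, bbox, cell_km=1.0):
    """Count (lat, lon) points per (row, col) cell; points outside bbox are skipped."""
    min_lon, min_lat, max_lon, max_lat = bbox
    n_rows, n_cols = grid_dimensions(bbox, cell_km)
    counts = Counter()
    for lat, lon in points:
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue
        # Clamp so points exactly on the upper edge fall in the last cell.
        row = min(int((lat - min_lat) / (max_lat - min_lat) * n_rows), n_rows - 1)
        col = min(int((lon - min_lon) / (max_lon - min_lon) * n_cols), n_cols - 1)
        counts[(row, col)] += 1
    return counts


# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat); not taken from the project.
example_bbox = (-0.563, 51.261, 0.280, 51.686)
example_points = [(51.5074, -0.1278), (51.5080, -0.1281), (51.4545, 0.0500)]
print(grid_dimensions(example_bbox))
print(count_tweets_per_cell(example_points, example_bbox))
```

Cells with no tweets are simply absent from the returned Counter, which keeps the result sparse; a dense array can be built from it when a heatmap is needed.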
- Compute Haversine Distance: Calculate the distance between two geo-locations.
- Grid Dimensions: Determine the number of rows and columns for the grid covering the London area.
- Tweet Distribution: Count the number of tweets in each grid cell.
- Data Preprocessing: Tokenize and remove stopwords from tweets.
- Term Frequency Calculation: Compute term and document frequencies.
- Likelihood Ratios: Calculate likelihood ratios for terms based on term frequencies in high-quality, low-quality, and background tweets.
- Scoring Tweets: Assign newsworthiness scores to tweets based on term likelihood ratios.
- Distribution Visualization: Create histograms and heatmaps to visualize tweet distribution and newsworthiness scores.
- Statistical Analysis: Compute and visualize statistics of tweet distribution across grid cells.
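The preprocessing, likelihood-ratio, and scoring steps above might look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the regex tokenizer and tiny stopword set (stand-ins for the NLTK tokenizer and stopword corpus the project depends on), the cutoff of 2.0 on the ratios, and the log-ratio form of the final score. The script's actual formulas may differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set; a stand-in for NLTK's stopword corpus.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to", "at", "for"}


def tokenize(text):
    """Lowercase, extract alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def term_frequencies(tweets):
    """Count how often each term occurs across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tokenize(text))
    return counts


def likelihood_ratios(model_tf, background_tf, threshold=2.0):
    """Relative frequency in the model set divided by that in the background set.

    Terms unseen in the background, or with a ratio below threshold, are dropped.
    """
    model_total = sum(model_tf.values())
    bg_total = sum(background_tf.values())
    ratios = {}
    for term, freq in model_tf.items():
        bg_freq = background_tf.get(term, 0)
        if bg_freq == 0:
            continue  # avoid division by zero
        ratio = (freq / model_total) / (bg_freq / bg_total)
        if ratio >= threshold:
            ratios[term] = ratio
    return ratios


def newsworthiness(text, hq_ratios, lq_ratios):
    """Assumed score: log2 of (1 + summed HQ ratios) over (1 + summed LQ ratios)."""
    tokens = tokenize(text)
    hq = sum(hq_ratios.get(t, 0.0) for t in tokens)
    lq = sum(lq_ratios.get(t, 0.0) for t in tokens)
    return math.log2((1 + hq) / (1 + lq))


# Toy data, invented for the example.
hq_tweets = ["police confirm fire in station", "fire crews at station"]
lq_tweets = ["lol lol so bored", "bored lol"]
bg_tweets = ["fire lol station bored police weather", "weather today lol"]

bg_tf = term_frequencies(bg_tweets)
hq_ratios = likelihood_ratios(term_frequencies(hq_tweets), bg_tf)
lq_ratios = likelihood_ratios(term_frequencies(lq_tweets), bg_tf)

for tweet in ["fire at the station", "lol bored", "hello world"]:
    print(tweet, newsworthiness(tweet, hq_ratios, lq_ratios))
```

With this form of the score, a tweet containing only terms absent from both models scores exactly 0, text dominated by high-quality terms scores above 0, and text dominated by low-quality terms scores below 0, so a threshold can be applied directly to the sign or magnitude.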
- Figure: Distribution of Tweets in Grid Cells
- Figure: Heatmap of Tweet Distribution
- Figure: Newsworthiness Score Distribution
- Spatial Distribution: The tweet density varies significantly across London, with certain areas having a higher concentration of tweets.
- Newsworthiness: The newsworthiness score helps filter tweets by identifying those more relevant for analysis. The chosen threshold separates tweets with high newsworthiness from those with low newsworthiness, with a reasonable balance between sensitivity and specificity.
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Create a new Pull Request.
This project is licensed under the author's name: Simran Garg (GitHub: https://github.com/Mejorarsim).