This project applies natural language processing (NLP) and network analysis techniques in R to analyze a corpus of documents from three distinct genres: Politics, Sports, and Reviews.
The workflow involves preprocessing raw text data, generating a document-term matrix, clustering documents, analyzing sentiment, and constructing three types of networks (document, token, and bipartite) to explore deeper relationships in the corpus.
- Total Documents: 24
- Genres: Politics (8), Sports (8), Reviews (8)
- Minimum Document Length: 200 words
- Format: Plain text files in the `corpus` folder
- Sources:
  - Politics: e.g., BBC News
  - Sports: e.g., ESPN
  - Reviews: e.g., IGN
- Naming Convention: `genre_id.txt` (e.g., `politics_1.txt`)
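As an illustration of the layout described above, the following sketch reads every `.txt` file in a folder and derives each document's genre from its file name. The temporary folder and the sample sentences are stand-ins invented for this example; the project itself reads from its `corpus` folder.

```r
# Create a throwaway folder that mimics the genre_id.txt layout
# (hypothetical sample files, not the project's real corpus).
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("Parliament debated the new budget.", file.path(corpus_dir, "politics_1.txt"))
writeLines("The home team won the final.", file.path(corpus_dir, "sports_1.txt"))
writeLines("The game has a gripping story.", file.path(corpus_dir, "reviews_1.txt"))

# List the text files and read each one into a single string.
files <- sort(list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE))
texts <- vapply(files,
                function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                character(1))
names(texts) <- basename(files)

# Recover the genre label from the part of the file name before "_<id>.txt".
genre <- sub("_[0-9]+\\.txt$", "", names(texts))

# Count whitespace-separated words, e.g. to check a minimum document length.
word_count <- lengths(strsplit(texts, "\\s+"))

print(data.frame(file = names(texts), genre = genre,
                 words = unname(word_count), row.names = NULL))
```

The `word_count` vector can then be compared against the 200-word minimum before a document is accepted into the corpus.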
- Language: R
- IDE: RStudio
- Packages: `tm`, `SnowballC`, `proxy`, `SentimentAnalysis`, `igraph`, `RColorBrewer`
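Before running the analysis, it can help to confirm that these packages are available. The snippet below is a small convenience sketch, not part of the original script; it only reports what is missing and leaves installation as a commented-out step.

```r
# Packages listed above.
required <- c("tm", "SnowballC", "proxy", "SentimentAnalysis", "igraph", "RColorBrewer")

# requireNamespace() returns TRUE when a package is installed, without attaching it.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```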
Performed using tm, SnowballC, and regular expressions:
- Removed numbers, punctuation, quotes.
- Converted to lowercase.
- Removed common and stop words.
- Applied stemming.
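The steps above could be sketched with `tm` roughly as follows. The two sample sentences, the extra "common word" list, and the exact order of the transformations are assumptions made for illustration; the original script may differ in detail.

```r
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument()

# Hypothetical sample documents (the second one contains curly quotes).
docs <- c("The Minister said 3 new policies were approved in 2024.",
          "The striker scored twice, \u201cwinning\u201d the final 2-1!")
corpus <- VCorpus(VectorSource(docs))

# Assumed list of extra common words to drop in addition to stop words.
common_words <- c("said")

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Regular expression for curly quotes that removePunctuation() may leave behind.
corpus <- tm_map(corpus, content_transformer(
  function(x) gsub("[\u2018\u2019\u201c\u201d]", "", x)))
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), common_words))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Extract the cleaned text of each document as a plain character vector.
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

Lowercasing comes first so that the stop-word list, which is lowercase, matches words at the start of sentences.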
- Created using `tm`:
  - Converted the preprocessed corpus into a DTM.
  - Removed sparse terms from the DTM (retaining terms in ≥ 30% of documents).
  - Manually selected 30 informative tokens to remain in the matrix.
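A minimal sketch of these three steps is shown below, using four made-up, already-preprocessed documents. The `sparse = 0.7` setting is an assumption chosen to approximate the 30% threshold, and the `selected` vector is a hypothetical stand-in for the 30 manually chosen tokens.

```r
library(tm)

# Hypothetical preprocessed documents (already lowercased and stemmed).
docs <- c("vote state elect vote",
          "vote state govern",
          "game team score game",
          "game team state")
corpus <- VCorpus(VectorSource(docs))

# Build the document-term matrix (rows = documents, columns = terms).
dtm <- DocumentTermMatrix(corpus)

# Drop terms that are absent from more than 70% of documents,
# which keeps terms found in roughly 30% of documents or more.
dtm_small <- removeSparseTerms(dtm, sparse = 0.7)

# Keep only a hand-picked subset of the surviving terms.
selected <- c("vote", "game", "team")
keep <- intersect(selected, Terms(dtm_small))
dtm_sel <- dtm_small[, keep]

m <- as.matrix(dtm_sel)
print(m)
```

With four documents, terms that occur in only one document (`elect`, `govern`, `score`) are removed by the sparsity filter, while `state` survives the filter but is dropped by the manual selection.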
- Distance: Cosine
- Linkage: Ward.D
- Clustering Accuracy: 23/24 (≈96%)
- Documents grouped almost perfectly by genre, indicating strong separability with minimal overlap.
- `sports_5.txt` was misclassified as Reviews due to overlapping analytical language.
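The clustering setup above can be illustrated on a toy matrix. The sketch below computes cosine distance by hand in base R to stay dependency-free (the project lists `proxy`, which offers the same distance), runs Ward.D hierarchical clustering, and scores the result against the genre encoded in each document name. The counts are invented for the example, and scoring by majority genre per cluster is an assumption about how accuracy was measured.

```r
# Toy document-term counts (made-up numbers; rows = documents, columns = tokens).
m <- matrix(c(5, 3, 0, 0, 1, 0,
              4, 4, 0, 1, 0, 0,
              0, 0, 5, 4, 0, 1,
              0, 1, 4, 5, 0, 0,
              0, 0, 1, 0, 5, 4,
              1, 0, 0, 0, 4, 5),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("politics_1", "politics_2", "sports_1",
                              "sports_2", "reviews_1", "reviews_2"),
                            c("vote", "state", "game", "team", "stori", "play")))

# Cosine distance = 1 - cosine similarity between document vectors.
cosine_dist <- function(x) {
  norms <- sqrt(rowSums(x^2))
  sim <- (x %*% t(x)) / (norms %o% norms)
  as.dist(1 - sim)
}

# Hierarchical clustering with Ward.D linkage, cut into three clusters.
hc <- hclust(cosine_dist(m), method = "ward.D")
clusters <- cutree(hc, k = 3)

# Score each cluster by its majority genre and sum the matches.
genre <- sub("_[0-9]+$", "", rownames(m))
tab <- table(cluster = clusters, genre = genre)
accuracy <- sum(apply(tab, 1, max)) / nrow(m)

print(tab)
print(accuracy)
```

Calling `plot(hc)` on the result draws the dendrogram, which makes a misplaced document easy to spot.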
- Tool: `SentimentAnalysis`
- Dictionary: QDAP
- Metrics:
  - `SentimentQDAP` (net polarity)
  - `PositivityQDAP` (positive word proportion)
- Findings (from descriptive statistics and hypothesis testing):
  - `SentimentQDAP`:
    - Sports had the most positive overall polarity.
    - Politics showed the lowest and most variable overall polarity.
  - `PositivityQDAP`:
    - Reviews had the highest median positivity and the smallest range, indicating the most consistent use of positive words.
    - Politics and Sports had similar median positivity to each other and showed greater variability than Reviews.
    - Reviews were significantly more positive than Politics.
    - No significant difference in positivity was found between Reviews and Sports, or between Sports and Politics.
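To show how the two metrics can be obtained and summarized by genre, here is a small sketch using `analyzeSentiment()` from the `SentimentAnalysis` package. The six sentences are invented, so the resulting numbers say nothing about the real corpus. The source does not name the hypothesis test that was used, so none is run here.

```r
library(SentimentAnalysis)

# Hypothetical mini-corpus; names follow the genre_id convention.
texts <- c(politics_1 = "The government faced harsh criticism over the failed policy.",
           politics_2 = "Lawmakers praised the new agreement as a great success.",
           sports_1   = "The team celebrated a brilliant and well deserved victory.",
           sports_2   = "A poor defense led to a painful defeat for the club.",
           reviews_1  = "The game offers beautiful visuals and an excellent story.",
           reviews_2  = "Fun combat and a wonderful soundtrack make this a good title.")

# Score every document against the built-in dictionaries.
scores <- analyzeSentiment(texts)

# Keep only the two QDAP-based metrics, alongside the genre label.
result <- data.frame(doc = names(texts),
                     genre = sub("_[0-9]+$", "", names(texts)),
                     SentimentQDAP = scores$SentimentQDAP,
                     PositivityQDAP = scores$PositivityQDAP)

# Descriptive statistics: median of each metric per genre.
print(aggregate(cbind(SentimentQDAP, PositivityQDAP) ~ genre,
                data = result, FUN = median))
```

With a full corpus, a box plot per genre (for example `boxplot(PositivityQDAP ~ genre, data = result)`) shows both the medians and the ranges discussed in the findings.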
- Nodes = Documents
- Edges = Number of shared tokens between documents
- Important Documents: `politics_5.txt`, `reviews_3.txt`, `reviews_7.txt`
- Communities:
  - Documents mainly grouped by genre.
  - Exceptions like `reviews_1.txt`, which grouped with Politics, reflected shared themes or vocabulary.
- Enhanced Network:
  - Node color = `SentimentQDAP`
  - Node size = Eigenvector centrality
  - Edge width = Shared token count
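The document network described above can be sketched with `igraph` as follows. The toy counts are invented, Louvain is used as one possible community detection method (the source does not say which algorithm was applied), and node color is left out because it would require the sentiment scores.

```r
library(igraph)

# Toy document-term counts (made-up; rows = documents, columns = tokens).
m <- matrix(c(5, 3, 0, 0, 1, 0,
              4, 4, 0, 1, 0, 0,
              0, 0, 5, 4, 0, 1,
              0, 1, 4, 5, 0, 0,
              0, 0, 1, 0, 5, 4,
              1, 0, 0, 0, 4, 5),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("politics_1", "politics_2", "sports_1",
                              "sports_2", "reviews_1", "reviews_2"),
                            c("vote", "state", "game", "team", "stori", "play")))

# Edge weight = number of distinct tokens two documents have in common.
present <- (m > 0) * 1
shared <- present %*% t(present)
diag(shared) <- 0  # a document is not linked to itself

g <- graph_from_adjacency_matrix(shared, mode = "undirected", weighted = TRUE)

# Eigenvector centrality (scaled so the most central document scores 1).
centrality <- eigen_centrality(g)$vector

# Community detection (Louvain is an assumption, see above).
communities <- cluster_louvain(g)

# Visual encodings: node size by centrality, edge width by shared tokens.
V(g)$size <- 10 + 20 * centrality
E(g)$width <- E(g)$weight
plot(g, vertex.label.cex = 0.8)

print(round(centrality, 2))
print(membership(communities))
```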
- Nodes = Tokens
- Edges = Co-occurrence frequency across documents
- Important Tokens: `world`, `fight`, `stori`, `state`
- Communities:
  - Tokens largely grouped by genre.
  - Exceptions like `futur` and `kill`, which grouped with Reviews instead of Politics, reflected overlap in usage across different genres.
- Enhanced Network:
  - Node color = Closeness centrality
  - Node size = Betweenness centrality
  - Edge width = Co-occurrence frequency
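A comparable sketch for the token network is given below. Co-occurrence is taken here to mean the number of documents in which two tokens both appear, which is one reading of the description above. Closeness and betweenness are computed on the unweighted graph, because `igraph` treats edge weights as distances and a high co-occurrence count would otherwise be read as "far apart". The data and the color palette are illustrative assumptions.

```r
library(igraph)

# Toy document-term counts (made-up; rows = documents, columns = tokens).
m <- matrix(c(5, 3, 0, 0, 1, 0,
              4, 4, 0, 1, 0, 0,
              0, 0, 5, 4, 0, 1,
              0, 1, 4, 5, 0, 0,
              0, 0, 1, 0, 5, 4,
              1, 0, 0, 0, 4, 5),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("politics_1", "politics_2", "sports_1",
                              "sports_2", "reviews_1", "reviews_2"),
                            c("vote", "state", "game", "team", "stori", "play")))

# Edge weight = number of documents in which both tokens occur.
present <- (m > 0) * 1
cooc <- t(present) %*% present
diag(cooc) <- 0

g <- graph_from_adjacency_matrix(cooc, mode = "undirected", weighted = TRUE)

# Centralities on the unweighted structure (weights = NA ignores edge weights).
close_c <- closeness(g, weights = NA)
betw_c <- betweenness(g, weights = NA)

# Visual encodings: color by closeness bin, size by betweenness, width by weight.
palette_fn <- colorRampPalette(c("lightyellow", "firebrick"))
V(g)$color <- palette_fn(5)[cut(close_c, breaks = 5, labels = FALSE)]
V(g)$size <- 10 + 3 * betw_c
E(g)$width <- E(g)$weight
plot(g, vertex.label.cex = 0.8)

print(round(betw_c, 2))
```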
- Documents linked to tokens they contain
- Nodes = Documents and tokens
- Edges = Token frequency in document
- Findings:
  - Documents and tokens generally grouped by genre.
  - Exceptions like `sports_5.txt` and `event`, which grouped with Reviews, reflected shared themes or vocabulary.
- Enhanced Network:
  - Node color = Genre
  - Node shape = Node type
  - Token node size = Degree
  - Edge width = Token frequency
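The bipartite network can be built from an edge list with one row per non-zero DTM entry, as in the sketch below. The counts are invented. Coloring documents by the genre in their name and leaving tokens gray is a simplification made for this example, since assigning a genre to a token depends on the community results.

```r
library(igraph)

# Toy document-term counts (made-up; rows = documents, columns = tokens).
m <- matrix(c(5, 3, 0, 0, 1, 0,
              4, 4, 0, 1, 0, 0,
              0, 0, 5, 4, 0, 1,
              0, 1, 4, 5, 0, 0,
              0, 0, 1, 0, 5, 4,
              1, 0, 0, 0, 4, 5),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("politics_1", "politics_2", "sports_1",
                              "sports_2", "reviews_1", "reviews_2"),
                            c("vote", "state", "game", "team", "stori", "play")))

# One edge per (document, token) pair with a non-zero count.
idx <- which(m > 0, arr.ind = TRUE)
edges <- data.frame(from = rownames(m)[idx[, "row"]],
                    to = colnames(m)[idx[, "col"]],
                    weight = m[idx])

# type = TRUE marks token nodes, FALSE marks document nodes.
nodes <- data.frame(name = c(rownames(m), colnames(m)),
                    type = c(rep(FALSE, nrow(m)), rep(TRUE, ncol(m))))
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Visual encodings: shape by node type, token size by degree, width by frequency.
deg <- degree(g)
genre_colors <- c(politics = "tomato", sports = "steelblue", reviews = "seagreen")
doc_genre <- sub("_[0-9]+$", "", V(g)$name)
V(g)$color <- ifelse(V(g)$type, "gray80", genre_colors[doc_genre])
V(g)$shape <- ifelse(V(g)$type, "circle", "square")
V(g)$size <- ifelse(V(g)$type, 5 + 3 * deg, 12)
E(g)$width <- E(g)$weight
plot(g, vertex.label.cex = 0.7)

print(deg[V(g)$type])
```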
Most documents and tokens were highly interconnected due to shared vocabulary. However, centrality analysis highlighted several important nodes:
- Documents: `politics_5.txt`, `reviews_3.txt`, and `reviews_7.txt` consistently showed high centrality, acting as bridges between genres.
- Tokens: `world`, `fight`, `stori`, and `state` frequently co-occurred and played key connective roles across the corpus.
Community detection and hierarchical clustering both grouped documents and tokens primarily by genre. Politics, Reviews, and Sports formed distinct clusters, with a few overlaps such as `sports_5.txt`, which occasionally grouped with Reviews due to thematic similarity.
- Clustering was highly accurate (≈96%) and effective for distinguishing major genre divisions.
- Network analysis offered deeper insight into overlapping language, node influence, and structural roles that clustering could not reveal.
- Used together, the two methods capture both the dominant groupings and the more nuanced relationships within the corpus.
To enhance accuracy and insight:
- Use lemmatization over stemming.
- Apply TF-IDF weighting.
- Include n-grams and named entities.
- Use contextual embeddings (e.g., BERT).
- Incorporate POS filtering, topic modeling, dimensionality reduction.
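As an example of one of these suggestions, TF-IDF weighting can be sketched in a few lines of base R. The counts below are made up, and the formula (term frequency normalized by document length, multiplied by a base-2 inverse document frequency) is one common variant; `tm` also provides `weightTfIdf()` for document-term matrices.

```r
# Toy document-term counts (made-up numbers).
m <- matrix(c(3, 1, 0,
              0, 1, 3,
              1, 2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("politics_1", "sports_1", "reviews_1"),
                            c("vote", "state", "game")))

# Term frequency: share of each document's tokens taken up by each term.
tf <- m / rowSums(m)

# Inverse document frequency: terms found in fewer documents score higher.
doc_freq <- colSums(m > 0)
idf <- log2(nrow(m) / doc_freq)

# Multiply every column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

Because `state` appears in every document, its weight drops to zero, which is the intended effect: terms shared by all genres stop influencing the distance between documents.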
- Clone the repository or download the ZIP file from GitHub.
- Open the project folder in RStudio.
- Run the R script (`nlp_network_analysis.r`) inside the RStudio environment.
Developed by Juan Nathan.