Glycoproteomic Data Analysis using R and Python

A comprehensive tutorial for bioinformaticians and proteomics researchers on glycoproteomic data analysis workflows in R and Python. Glycoproteomics is proteomics focused on glycosylation, a post-translational modification.

Target Audience

This tutorial is designed for:

Graduate students and researchers in proteomics/glycoproteomics
Bioinformaticians learning mass spectrometry data analysis
Scientists with basic R/Python knowledge who want to analyze tandem mass tag (TMT)-based quantitative proteomics data

Prerequisites:

Basic familiarity with R and/or Python
Understanding of proteomics concepts (proteins, peptides, mass spectrometry)
Familiarity with statistical concepts (p-values, fold changes, normalization)

Tutorial Structure

Chapter	Topic	Learning Outcomes
Chapter 1	R Basics	Install R/RStudio, understand tidyverse, perform basic statistical tests (K-S test, Wilcoxon), create publication-quality plots
Chapter 2	Data Normalization	Filter peptide-spectrum matches (PSMs), aggregate to protein level, apply sample loading and TMM (trimmed mean of M-values) normalization, visualize with UMAP
Chapter 3	Differential Analysis	Set up limma design matrices, perform differential expression analysis, interpret results
Chapter 4	Enrichment Analysis	Run GSEA with GO/KEGG, analyze protein complexes (CORUM) and domains (Pfam)
Chapter 5	Structure Analysis	Analyze peptide physicochemical properties, integrate AlphaFold structural data

Installation & Setup

R Environment

Install R (version 4.3+): https://cran.r-project.org/
Install RStudio: https://posit.co/download/rstudio-desktop/
Install required packages:

# CRAN packages
install.packages(c(
  "tidyverse", "readxl", "writexl", "readr",
  "showtext", "rstatix", "ggpubr", "reticulate"
))

# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c(
  "limma", "edgeR", "clusterProfiler",
  "org.Hs.eg.db", "AnnotationDbi", "ComplexHeatmap"
))

Python Environment (for Chapters 2 & 5)

# Create conda environment for UMAP (Chapter 2)
conda create -n UMAP_env python=3.9
conda activate UMAP_env
pip install umap-learn pandas numpy matplotlib seaborn

# Create conda environment for structure analysis (Chapter 5)
conda create -n structure_analysis python=3.9
conda activate structure_analysis
pip install pandas numpy matplotlib seaborn localcider structuremap
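
After creating the environments, it can help to confirm that the packages are importable before starting a chapter. The following is a minimal sketch, not part of the tutorial itself. It assumes the import names listed below: `umap-learn` is imported as `umap`, and the other import names are assumed to match their pip package names. The helper `missing_modules` is a hypothetical name introduced for illustration. Run it once inside each activated environment and read the line for that environment.

```python
from importlib.util import find_spec

# Import names expected in each conda environment (assumption: apart from
# "umap-learn", which is imported as "umap", import names match pip names).
EXPECTED = {
    "UMAP_env": ["umap", "pandas", "numpy", "matplotlib", "seaborn"],
    "structure_analysis": [
        "pandas", "numpy", "matplotlib", "seaborn",
        "localcider", "structuremap",
    ],
}


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Each line reports on one environment's package list, as seen from the
    # interpreter that is currently running.
    for env, modules in EXPECTED.items():
        missing = missing_modules(modules)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{env}: {status}")
```

`find_spec` only locates a module without importing it, so the check is fast and does not trigger side effects from heavy packages.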

Platform-Specific Notes

Windows: Ensure conda is added to PATH during installation
macOS: If using Apple Silicon, some packages may require Rosetta 2
Linux: May require additional system libraries for some R packages

Quick Start

Clone this repository:

git clone https://github.com/lfu46/Glycoproteomic-Data-Analysis-using-R-and-Python.git
cd Glycoproteomic-Data-Analysis-using-R-and-Python

Open the .Rproj file in RStudio
Start with Chapter 1 to verify your R setup works:

rmarkdown::render("Chapter_1_R_Basics/R_Basics.Rmd")

Each chapter builds on previous ones, so work through them sequentially.

Data Access

See DATA_ACCESS.md for detailed information about:

Sample datasets included for quick testing
Where to download full datasets
Expected data formats and column descriptions

Key R Packages Used

Category	Packages
Data manipulation	tidyverse (dplyr, tidyr, purrr, stringr)
Data I/O	readxl, writexl, readr
Visualization	ggplot2, ggpubr, ComplexHeatmap
Statistics	rstatix, limma, edgeR
Bioinformatics	clusterProfiler, org.Hs.eg.db, AnnotationDbi
R-Python bridge	reticulate

Code Style

This tutorial follows tidyverse conventions:

Pipe operator (|>) for chaining operations
Tidy data principles (each variable a column, each observation a row)
2-space indentation
UTF-8 encoding

Typical Data Flow

Raw MS Results (CSV/TSV from search software)
    ↓
PSM Filtering (XCorr, PPM thresholds)
    ↓
Protein-level Aggregation
    ↓
Normalization (Sample Loading → TMM)
    ↓
Statistical Analysis (limma)
    ↓
Visualization & Enrichment Analysis
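
The first steps of this flow can be sketched on toy data in plain Python. This is an illustrative sketch under stated assumptions, not the tutorial's implementation: the field names (`protein`, `xcorr`, `ppm`, `intensities`), the thresholds, and the function names are all hypothetical, and aggregation is shown as a simple sum of reporter-ion intensities. Sample loading normalization is shown in its common form, where each channel is scaled so that all channel totals equal the mean channel total. TMM is not reproduced here; in R it is commonly computed with `edgeR::calcNormFactors`.

```python
def filter_psms(psms, min_xcorr, max_abs_ppm):
    """Keep PSMs whose XCorr and absolute mass error pass both thresholds."""
    return [
        p for p in psms
        if p["xcorr"] >= min_xcorr and abs(p["ppm"]) <= max_abs_ppm
    ]


def aggregate_to_protein(psms):
    """Sum per-channel intensities of all PSMs mapping to the same protein."""
    proteins = {}
    for p in psms:
        current = proteins.get(p["protein"])
        if current is None:
            proteins[p["protein"]] = list(p["intensities"])
        else:
            proteins[p["protein"]] = [
                a + b for a, b in zip(current, p["intensities"])
            ]
    return proteins


def sample_loading_normalize(proteins):
    """Scale each channel so every channel total equals the mean total."""
    channel_totals = [sum(col) for col in zip(*proteins.values())]
    target = sum(channel_totals) / len(channel_totals)
    factors = [target / total for total in channel_totals]
    return {
        name: [value * f for value, f in zip(values, factors)]
        for name, values in proteins.items()
    }


# Toy PSM table with two TMT channels (all values are made up).
psms = [
    {"protein": "P1", "xcorr": 2.5, "ppm": 1.0, "intensities": [100.0, 200.0]},
    {"protein": "P1", "xcorr": 3.0, "ppm": -2.0, "intensities": [50.0, 100.0]},
    {"protein": "P2", "xcorr": 0.8, "ppm": 0.5, "intensities": [10.0, 10.0]},
    {"protein": "P2", "xcorr": 2.2, "ppm": 3.0, "intensities": [150.0, 300.0]},
    {"protein": "P3", "xcorr": 4.0, "ppm": 25.0, "intensities": [5.0, 5.0]},
]

# Hypothetical thresholds, chosen only to make the example filter something.
filtered = filter_psms(psms, min_xcorr=1.0, max_abs_ppm=10.0)
protein_table = aggregate_to_protein(filtered)
normalized = sample_loading_normalize(protein_table)

for name, values in normalized.items():
    print(name, values)
```

In this toy input the second channel carries twice the total signal of the first, so sample loading scales the first channel up and the second down until both totals match. A real analysis would typically operate on data frames (tidyverse or pandas) rather than dictionaries.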

Citation

If you use this tutorial in your research, please cite:

Fu, L. (2025). Glycoproteomic Data Analysis using R and Python: A Practical Tutorial.
GitHub: https://github.com/lfu46/Glycoproteomic-Data-Analysis-using-R-and-Python

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Issues: Report bugs or request features via GitHub Issues
Questions: For questions about the tutorial content, open a GitHub Discussion

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
