Everything you need when starting at a new lab or office: Linux commands, Python & R bioinformatics essentials, SQL for databases, and full environment setup (Git, VS Code, Conda, SSH).
- Day-One Checklist
- Part 1 — Linux Survival Guide
- Part 2 — Python for Bioinformatics
- Part 3 — R for Bioinformatics
- Part 4 — SQL Essentials
- Part 5 — Environment Setup
- Part 6 — Quick Reference Cards
When you arrive at a new office/lab, go through this list:
□ Can I open a terminal?
□ Do I have SSH access to the server?
□ What OS and version is running?
□ Do I have sudo access?
□ Is Git installed? What version?
□ Is Conda/Miniconda installed? Where?
□ Is Python installed? What version?
□ Where is the shared data stored?
□ What job scheduler is used? (SLURM, SGE, PBS?)
□ Do I have a GitHub/GitLab account connected?
□ Is VS Code installed or can I install it?
□ Who do I contact for server issues?
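Several checklist items can be answered in one go with a small shell loop. The tool list below is only an illustrative starting point — adjust it to your lab's stack:

```shell
# Report which common tools are on the PATH (tool list is an assumption; edit freely)
for tool in git python3 conda R samtools ssh; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK:      $tool -> $(command -v "$tool")"
    else
        echo "MISSING: $tool"
    fi
done
```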
# Who am I?
whoami
# Where am I?
pwd
# What machine am I on?
hostname
# What OS is running?
cat /etc/os-release
# What's the date and time?
date
# How long has the server been running?
uptime
# List files in current directory
ls
# List with details (permissions, size, date)
ls -la
# List with human-readable sizes
ls -lh
# List sorted by modification date (newest first)
ls -lt
# Go to a directory
cd /data/my_project
# Go back to home directory
cd ~
# or just
cd
# Go up one level
cd ..
# Go back to previous directory
cd -
# Show directory structure (2 levels deep)
tree -L 2
# If tree is not installed
find . -maxdepth 2 -type d
# How much disk space is left?
df -h
# How much space does this folder use?
du -sh /data/my_project/
# How much space does each subfolder use?
du -sh /data/my_project/*
# Sort folders by size (largest first)
du -sh /data/my_project/* | sort -rh
# Find the 10 largest files in a directory
find /data/my_project -type f -exec ls -lS {} + | head -10
# Check your disk quota (if applicable)
quota -s
# Create an empty file
touch my_file.txt
# Create a directory
mkdir my_folder
# Create nested directories
mkdir -p project/data/raw
# Copy a file
cp source.txt destination.txt
# Copy a directory (recursive)
cp -r source_folder/ destination_folder/
# Move or rename a file
mv old_name.txt new_name.txt
# Move a file to another directory
mv my_file.txt /data/my_project/
# Delete a file
rm my_file.txt
# Delete a directory and everything inside (⚠️ BE CAREFUL)
rm -rf my_folder/
# Create a symbolic link (shortcut)
ln -s /data/shared/big_file.bam ~/my_project/big_file.bam
⚠️ DANGER: rm -rf is irreversible. Always double-check the path before running it. Never run rm -rf / or rm -rf * without being 100% sure of your current directory.
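A few defensive habits make rm -rf less dangerous. This sketch uses placeholder paths:

```shell
# 1) Preview what you are about to delete
ls -la my_folder/
# 2) Delete interactively, confirming each file (slower but safe)
rm -ri my_folder/
# 3) Or move it aside instead of deleting, so you can recover it later
mkdir -p ~/.trash
mv my_folder ~/.trash/
```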
# Print entire file content
cat my_file.txt
# View first 10 lines
head my_file.txt
# View first 50 lines
head -n 50 my_file.txt
# View last 10 lines
tail my_file.txt
# View last 50 lines
tail -n 50 my_file.txt
# Scroll through a file (press q to quit)
less my_file.txt
# Count lines, words, characters
wc my_file.txt
# Count only lines
wc -l my_file.txt
# Count number of files in a directory
ls -1 | wc -l
# View a BED/TSV/CSV file nicely
column -t my_file.bed | head
# Check file type
file my_file.bam
# Search for a word in a file
grep "CEBPB" my_file.txt
# Search case-insensitive
grep -i "cebpb" my_file.txt
# Search recursively in all files in a directory
grep -r "CEBPB" /data/my_project/
# Count how many times a pattern appears
grep -c "chr1" peaks.bed
# Find a file by name
find /data/ -name "*.bam"
# Find files modified in the last 7 days
find /data/ -mtime -7
# Find files larger than 1GB
find /data/ -size +1G
# Extract specific columns from a tabular file (e.g., columns 1,2,3)
cut -f1,2,3 my_file.bed
# Sort a file
sort my_file.txt
# Sort numerically by column 2
sort -k2,2n my_file.bed
# Remove duplicate lines
sort my_file.txt | uniq
# Count occurrences of each unique line
sort my_file.txt | uniq -c | sort -rn
# Compress a file
gzip my_file.fastq
# Decompress
gunzip my_file.fastq.gz
# Read a gzipped file without decompressing
zcat my_file.fastq.gz | head
# Create a tar.gz archive
tar -czvf archive.tar.gz my_folder/
# Extract a tar.gz archive
tar -xzvf archive.tar.gz
# Extract a .zip file
unzip archive.zip
# Check permissions of a file
ls -la my_file.txt
# Permission format: -rwxr-xr-x
# r=read, w=write, x=execute
# [owner][group][others]
# Make a script executable
chmod +x my_script.sh
# Give read/write to owner only
chmod 600 my_file.txt
# Give read/write/execute to owner, read to others
chmod 755 my_script.sh
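Each octal digit is a sum of r=4, w=2, x=1 (so 7=rwx, 5=r-x, 6=rw-). A quick throwaway check, using a scratch file:

```shell
touch demo.txt
chmod 640 demo.txt    # 6 = rw- (owner), 4 = r-- (group), 0 = --- (others)
ls -l demo.txt        # first column reads -rw-r-----
rm demo.txt
```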
# Change owner (needs sudo)
chown user:group my_file.txt
# See running processes
top
# Better version (if installed)
htop
# See YOUR running processes
ps aux | grep $(whoami)
# Kill a process by PID
kill 12345
# Force kill
kill -9 12345
# Run a command in the background
my_script.sh &
# Run a command that survives logout
nohup my_script.sh &
# See background jobs
jobs
# Check SLURM queue (if using HPC)
squeue -u $(whoami)
# Submit a SLURM job
sbatch my_job.sh
# Cancel a SLURM job
scancel JOB_ID
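sbatch expects a job script like the sketch below; the job name, resource limits, and commands are placeholders to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=my_job_%j.log     # %j expands to the job ID
#SBATCH --time=02:00:00            # walltime limit (HH:MM:SS)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Placeholder commands — replace with your actual pipeline
module load samtools
samtools sort -@ 4 -o sorted.bam input.bam
```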
# See past SLURM jobs
sacct --format=JobID,JobName,State,Elapsed,MaxRSS
# Connect to a remote server
ssh username@server-address
# Connect with a specific port
ssh -p 2222 username@server-address
# Copy a file TO the server
scp my_file.txt username@server:/data/my_project/
# Copy a file FROM the server
scp username@server:/data/my_project/results.txt ./
# Copy a directory (recursive)
scp -r my_folder/ username@server:/data/my_project/
# Mount a remote folder locally (if sshfs installed)
sshfs username@server:/data/my_project/ ~/remote_mount/
# Pipe: send output of one command to another
cat my_file.bed | grep "chr1" | wc -l
# Save output to a file (overwrite)
grep "chr1" peaks.bed > chr1_peaks.bed
# Append to a file
echo "new line" >> my_file.txt
# Redirect errors to a file
my_command 2> errors.log
# Redirect both output and errors
my_command > output.log 2>&1
# Count reads in a FASTQ file
echo $(( $(wc -l < reads.fastq) / 4 ))
# Count reads in a gzipped FASTQ
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))
# Count reads in a BAM file
samtools view -c my_file.bam
# Count mapped reads only
samtools view -c -F 4 my_file.bam
# Extract chromosome names from a BAM
samtools idxstats my_file.bam | cut -f1
# Count peaks per chromosome in a BED file
cut -f1 peaks.bed | sort | uniq -c | sort -rn
# Get unique gene names from column 4 of a BED
cut -f4 annotated.bed | sort -u
# Quick look at a BAM header
samtools view -H my_file.bam | head -20
# Check if a BAM is sorted
samtools view -H my_file.bam | grep "@HD"
# Convert BAM to BED
bedtools bamtobed -i my_file.bam > my_file.bed
# ──────────────────────────────────
# Reading & writing text files
# ──────────────────────────────────
# Read entire file
with open("file.txt", "r") as f:
    content = f.read()
# Read line by line
with open("file.txt", "r") as f:
    for line in f:
        line = line.strip()  # remove newline
        print(line)
# Write to a file
with open("output.txt", "w") as f:
    f.write("Hello\n")
# Append to a file
with open("output.txt", "a") as f:
    f.write("New line\n")
# Read a gzipped file
import gzip
with gzip.open("file.txt.gz", "rt") as f:
    for line in f:
        print(line.strip())
import pandas as pd
import numpy as np
# ──────────────────────────────────
# Loading data
# ──────────────────────────────────
df = pd.read_csv("data.csv")
df = pd.read_csv("data.tsv", sep="\t")
df = pd.read_csv("data.bed", sep="\t", header=None,
                 names=["chr", "start", "end", "name", "score"])
# ──────────────────────────────────
# First look at data
# ──────────────────────────────────
df.head() # first 5 rows
df.tail() # last 5 rows
df.shape # (rows, columns)
df.columns # column names
df.dtypes # data types
df.info() # summary
df.describe() # statistics
df.nunique() # unique values per column
df.isnull().sum() # missing values per column
# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
chr1 = df[df["chr"] == "chr1"]
significant = df[df["pvalue"] < 0.05]
high_score = df[(df["score"] > 100) & (df["chr"] == "chr1")]
# ──────────────────────────────────
# Selecting columns
# ──────────────────────────────────
subset = df[["chr", "start", "end"]]
single_col = df["score"]
# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df_sorted = df.sort_values("score", ascending=False)
df_sorted = df.sort_values(["chr", "start"])
# ──────────────────────────────────
# Grouping & aggregation
# ──────────────────────────────────
counts = df.groupby("chr").size()
means = df.groupby("cancer_type")["score"].mean()
summary = df.groupby("cancer_type").agg({"score": ["mean", "std", "count"]})
# ──────────────────────────────────
# Adding / modifying columns
# ──────────────────────────────────
df["length"] = df["end"] - df["start"]
df["log_score"] = np.log10(df["score"] + 1)
# ──────────────────────────────────
# Saving
# ──────────────────────────────────
df.to_csv("output.csv", index=False)
df.to_csv("output.tsv", sep="\t", index=False)
# ──────────────────────────────────
# Using Biopython
# ──────────────────────────────────
from Bio import SeqIO
# Read a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")
# Load all sequences into a dictionary
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
my_seq = sequences["gene_name"].seq
# Write sequences to a FASTA file
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for name, seq in my_sequences.items():
    record = SeqRecord(Seq(seq), id=name, description="")
    records.append(record)
SeqIO.write(records, "output.fasta", "fasta")
# Count sequences in a FASTA
count = sum(1 for _ in SeqIO.parse("sequences.fasta", "fasta"))
# Filter sequences by length
long_seqs = [r for r in SeqIO.parse("sequences.fasta", "fasta") if len(r.seq) > 500]
import pandas as pd
# ──────────────────────────────────
# Load a BED file
# ──────────────────────────────────
bed = pd.read_csv("peaks.bed", sep="\t", header=None,
                  names=["chr", "start", "end", "name", "score", "strand"])
# ──────────────────────────────────
# Common operations
# ──────────────────────────────────
# Peak lengths
bed["length"] = bed["end"] - bed["start"]
# Filter by chromosome
chr1_peaks = bed[bed["chr"] == "chr1"]
# Filter by score
strong_peaks = bed[bed["score"] > 50]
# Count peaks per chromosome
bed["chr"].value_counts()
# Sort by genomic position
bed_sorted = bed.sort_values(["chr", "start"])
# Save as BED (no header, tab-separated)
bed_sorted.to_csv("sorted.bed", sep="\t", header=False, index=False)
import pysam
# ──────────────────────────────────
# Read a BAM file
# ──────────────────────────────────
bam = pysam.AlignmentFile("sample.bam", "rb")
# Count total reads
total = bam.count()
# Count mapped reads
mapped = bam.count(read_callback=lambda r: not r.is_unmapped)
# Iterate over reads in a region
for read in bam.fetch("chr1", 1000, 2000):
    print(read.query_name, read.mapping_quality)
# Get reads as a list
reads = list(bam.fetch("chr1", 1000, 2000))
bam.close()
from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast import NCBIXML
# ──────────────────────────────────
# Run local BLASTp
# ──────────────────────────────────
blastp = NcbiblastpCommandline(
    query="query.fasta",
    db="my_database",
    evalue=1e-5,
    outfmt=5,   # XML format
    out="blast_results.xml"
)
stdout, stderr = blastp()
# ──────────────────────────────────
# Parse BLAST results
# ──────────────────────────────────
with open("blast_results.xml") as f:
    records = NCBIXML.parse(f)
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-10:
                    print(f"Hit: {alignment.title}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identity: {hsp.identities}/{hsp.align_length}")
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# ──────────────────────────────────
# Histogram (e.g., peak length distribution)
# ──────────────────────────────────
plt.figure(figsize=(8, 5))
plt.hist(bed["length"], bins=50, color="steelblue", edgecolor="black")
plt.xlabel("Peak Length (bp)")
plt.ylabel("Count")
plt.title("Peak Length Distribution")
plt.savefig("peak_lengths.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Heatmap (e.g., gene presence/absence)
# ──────────────────────────────────
plt.figure(figsize=(12, 8))
sns.heatmap(matrix, cmap="YlOrRd", xticklabels=genes, yticklabels=species)
plt.title("Gene Presence/Absence")
plt.savefig("heatmap.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Volcano plot
# ──────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(df["log2FC"], -np.log10(df["pvalue"]),
            c=df["significant"].map({True: "red", False: "grey"}),
            alpha=0.5, s=10)
plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=0.5)
plt.title("Volcano Plot")
plt.savefig("volcano.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Save any figure
# ──────────────────────────────────
# Always use these options:
plt.savefig("figure.png", dpi=300, bbox_inches="tight")
# For publication: use PDF or SVG
plt.savefig("figure.pdf", bbox_inches="tight")
# Reverse complement a DNA sequence
def reverse_complement(seq):
    comp = str.maketrans("ATCG", "TAGC")
    return seq.translate(comp)[::-1]
# GC content
def gc_content(seq):
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) * 100
# Read a two-column file into a dictionary
with open("mapping.txt") as f:
    d = dict(line.strip().split("\t") for line in f)
# Flatten a list of lists
flat = [item for sublist in nested_list for item in sublist]
# Quick timer for a block of code
import time
start = time.time()
# ... your code ...
print(f"Elapsed: {time.time() - start:.2f}s")
# ──────────────────────────────────
# Variables & Types
# ──────────────────────────────────
x <- 5 # assignment (use <- not =)
name <- "CEBPB"
is_significant <- TRUE
my_vector <- c(1, 2, 3, 4, 5) # vector (R's basic data structure)
# Check type
class(x)
typeof(x)
length(my_vector)
# ──────────────────────────────────
# Vectors (everything in R is a vector)
# ──────────────────────────────────
nums <- c(10, 20, 30, 40, 50)
genes <- c("TP53", "BRCA1", "EGFR", "MYC")
logicals <- c(TRUE, FALSE, TRUE)
# Access elements (R is 1-indexed!)
nums[1] # 10 (not 0-indexed like Python!)
nums[2:4] # 20, 30, 40
genes[c(1, 3)] # "TP53", "EGFR"
# Operations are vectorized
nums * 2 # c(20, 40, 60, 80, 100)
nums > 25 # c(FALSE, FALSE, TRUE, TRUE, TRUE)
sum(nums > 25) # 3
# ──────────────────────────────────
# Useful vector functions
# ──────────────────────────────────
length(nums)
sum(nums)
mean(nums)
median(nums)
sd(nums)
min(nums)
max(nums)
range(nums)
sort(nums, decreasing = TRUE)
unique(c(1, 1, 2, 3, 3)) # c(1, 2, 3)
table(c("A", "B", "A", "C"))     # counts per value
# ──────────────────────────────────
# Creating a data frame
# ──────────────────────────────────
df <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  log2FC = c(2.5, -1.3, 4.1),
  pvalue = c(0.001, 0.03, 0.0001)
)
# ──────────────────────────────────
# Reading files
# ──────────────────────────────────
df <- read.csv("data.csv")
df <- read.csv("data.tsv", sep = "\t")
df <- read.delim("data.bed", header = FALSE,
                 col.names = c("chr", "start", "end", "name", "score"))
# Tidyverse alternative (faster for large files)
library(readr)
df <- read_tsv("data.tsv")
df <- read_csv("data.csv")
# ──────────────────────────────────
# First look
# ──────────────────────────────────
head(df) # first 6 rows
tail(df) # last 6 rows
str(df) # structure (types, dimensions)
summary(df) # statistics
dim(df) # rows × columns
nrow(df) # number of rows
ncol(df) # number of columns
colnames(df) # column names
# ──────────────────────────────────
# Accessing data
# ──────────────────────────────────
df$gene # one column (as vector)
df[, 1] # first column
df[1, ] # first row
df[1:3, ] # rows 1 to 3
df[, c("gene", "pvalue")] # select columns by name
# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
sig <- df[df$pvalue < 0.05, ]
up <- df[df$log2FC > 2 & df$pvalue < 0.05, ]
# ──────────────────────────────────
# Adding columns
# ──────────────────────────────────
df$neg_log_p <- -log10(df$pvalue)
df$significant <- df$pvalue < 0.05
# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df <- df[order(df$pvalue), ] # sort by pvalue (ascending)
df <- df[order(-df$log2FC), ] # sort by log2FC (descending)
# ──────────────────────────────────
# Saving
# ──────────────────────────────────
write.csv(df, "output.csv", row.names = FALSE)
write.table(df, "output.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
library(dplyr)
library(tidyr)
# ──────────────────────────────────
# The pipe operator: %>% (read as "then")
# ──────────────────────────────────
# Base R (nested, hard to read):
head(sort(df$pvalue))
# Tidyverse (piped, reads left-to-right):
df %>%
  arrange(pvalue) %>%
  head()
# ──────────────────────────────────
# dplyr verbs — the 6 essential functions
# ──────────────────────────────────
# filter() — select rows
df %>% filter(pvalue < 0.05)
df %>% filter(log2FC > 2, pvalue < 0.01)
df %>% filter(gene %in% c("TP53", "BRCA1"))
# select() — select columns
df %>% select(gene, pvalue)
df %>% select(-log2FC) # exclude a column
df %>% select(starts_with("log")) # by pattern
# mutate() — add/modify columns
df %>% mutate(
  neg_log_p = -log10(pvalue),
  significant = pvalue < 0.05,
  length = end - start
)
# arrange() — sort rows
df %>% arrange(pvalue) # ascending
df %>% arrange(desc(log2FC)) # descending
# summarise() + group_by() — aggregate
df %>%
  group_by(cancer_type) %>%
  summarise(
    n_genes = n(),
    mean_FC = mean(log2FC),
    median_p = median(pvalue)
  )
# count() — quick frequency table
df %>% count(cancer_type, sort = TRUE)
# ──────────────────────────────────
# Joining data frames (like SQL joins)
# ──────────────────────────────────
merged <- left_join(df1, df2, by = "gene")
merged <- inner_join(df1, df2, by = c("chr", "start"))
# ──────────────────────────────────
# Reshaping
# ──────────────────────────────────
# Wide to long
long <- df %>% pivot_longer(cols = c(sample1, sample2, sample3),
                            names_to = "sample",
                            values_to = "expression")
# Long to wide
wide <- long %>% pivot_wider(names_from = sample,
                             values_from = expression)
library(ggplot2)
# ──────────────────────────────────
# The ggplot2 formula:
# ggplot(data, aes(x, y)) + geom_*() + theme_*()
# ──────────────────────────────────
# Scatter plot
ggplot(df, aes(x = log2FC, y = -log10(pvalue))) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Volcano Plot", x = "log2 Fold Change", y = "-log10(p-value)")
# Volcano plot (colored by significance)
ggplot(df, aes(x = log2FC, y = -log10(pvalue), color = significant)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_color_manual(values = c("grey", "red")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_minimal()
# Bar plot
ggplot(df, aes(x = cancer_type, fill = cancer_type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot
ggplot(df, aes(x = cancer_type, y = expression, fill = cancer_type)) +
  geom_boxplot() +
  theme_minimal()
# Histogram
ggplot(df, aes(x = peak_length)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  theme_minimal()
# Heatmap
library(pheatmap)
pheatmap(matrix_data,
         scale = "row",
         clustering_distance_rows = "euclidean",
         color = colorRampPalette(c("blue", "white", "red"))(100))
# ──────────────────────────────────
# Save a plot
# ──────────────────────────────────
ggsave("my_plot.png", width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", width = 8, height = 6)
# ──────────────────────────────────
# Install Bioconductor (do this ONCE)
# ──────────────────────────────────
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Install packages
BiocManager::install("DESeq2")
BiocManager::install("edgeR")
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
BiocManager::install("clusterProfiler")
# ──────────────────────────────────
# DESeq2 — Differential Expression Analysis
# ──────────────────────────────────
library(DESeq2)
# Create DESeq dataset from count matrix
dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,   # genes × samples (integers)
  colData = sample_info,      # sample metadata
  design = ~ condition        # what to compare
)
# Run DESeq2
dds <- DESeq(dds)
res <- results(dds)   # avoid naming it "results" (masks the results() function)
# Get significant genes
sig_genes <- res[which(res$padj < 0.05), ]
sig_genes <- sig_genes[order(sig_genes$padj), ]
# MA plot from DESeq2 results
plotMA(res)
# PCA plot
vsd <- vst(dds)
plotPCA(vsd, intgroup = "condition")
# ──────────────────────────────────
# GenomicRanges — Working with genomic intervals
# ──────────────────────────────────
library(GenomicRanges)
# Create genomic ranges
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), end = c(150, 250, 350)),
  strand = c("+", "-", "+")
)
# Find overlaps between two sets of regions
overlaps <- findOverlaps(peaks_gr, promoters_gr)
# Reduce overlapping ranges to a single set
reduced <- reduce(gr)
# Get distance between ranges
dist <- distanceToNearest(peaks_gr, tss_gr)
# ──────────────────────────────────
# Read BED/BW/GFF files
# ──────────────────────────────────
library(rtracklayer)
bed <- import("peaks.bed")
bw <- import("signal.bw")
gff <- import("annotations.gff3")
# Export
export(gr, "output.bed", format = "BED")
# ──────────────────────────────────
# Gene Ontology / Pathway Analysis
# ──────────────────────────────────
library(clusterProfiler)
library(org.Hs.eg.db)
# GO enrichment
ego <- enrichGO(
  gene = significant_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",   # Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)
dotplot(ego, showCategory = 20)
# KEGG pathway enrichment
ekegg <- enrichKEGG(
  gene = entrez_ids,
  organism = "hsa",
  pvalueCutoff = 0.05
)
dotplot(ekegg)
# Check if a package is installed
"DESeq2" %in% rownames(installed.packages())
# Install a CRAN package
install.packages("ggplot2")
# Load multiple libraries at once
lapply(c("dplyr", "ggplot2", "readr"), library, character.only = TRUE)
# Apply a function to every column
sapply(df, function(x) sum(is.na(x))) # count NAs per column
# Quick summary of a BED file
bed <- read.delim("peaks.bed", header = FALSE)
cat("Peaks:", nrow(bed), "\n")
cat("Chromosomes:", length(unique(bed$V1)), "\n")
cat("Mean length:", mean(bed$V3 - bed$V2), "bp\n")
# Set working directory
setwd("/data/my_project")
getwd()
# List files matching a pattern
list.files(path = "results/", pattern = "\\.csv$")   # pattern is a regex, not a glob
Many biological databases (Ensembl, UCSC Genome Browser, GEO, clinical databases) use SQL. You'll encounter it when:
- Querying genomic annotation databases
- Working with clinical/patient metadata
- Accessing TCGA or GEO programmatically
- Managing lab sample databases (LIMS)
- Pulling data from REDCap or similar platforms
-- ──────────────────────────────────
-- SELECT — get data from a table
-- ──────────────────────────────────
SELECT * FROM patients; -- all columns
SELECT patient_id, cancer_type FROM patients; -- specific columns
SELECT DISTINCT cancer_type FROM patients; -- unique values only
-- ──────────────────────────────────
-- WHERE — filter rows
-- ──────────────────────────────────
SELECT * FROM patients WHERE cancer_type = 'BRCA';
SELECT * FROM patients WHERE age > 50;
SELECT * FROM patients WHERE cancer_type = 'BRCA' AND stage = 'III';
SELECT * FROM patients WHERE cancer_type IN ('BRCA', 'LUAD', 'COAD');
SELECT * FROM genes WHERE gene_name LIKE 'CEBP%'; -- starts with CEBP
SELECT * FROM results WHERE pvalue IS NOT NULL;
-- ──────────────────────────────────
-- ORDER BY — sort results
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue ASC; -- ascending
SELECT * FROM results ORDER BY log2fc DESC; -- descending
SELECT * FROM results ORDER BY cancer_type, pvalue; -- multi-column
-- ──────────────────────────────────
-- LIMIT — get only N rows
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue LIMIT 10;   -- top 10
-- ──────────────────────────────────
-- COUNT, SUM, AVG, MIN, MAX
-- ──────────────────────────────────
SELECT COUNT(*) FROM patients; -- total rows
SELECT COUNT(DISTINCT cancer_type) FROM patients; -- unique cancer types
SELECT AVG(age) FROM patients; -- average age
SELECT MIN(pvalue), MAX(pvalue) FROM results;
-- ──────────────────────────────────
-- GROUP BY — aggregate per group
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n_patients
FROM patients
GROUP BY cancer_type
ORDER BY n_patients DESC;
SELECT cancer_type,
       COUNT(*) AS n_samples,
       AVG(age) AS mean_age,
       MIN(age) AS min_age,
       MAX(age) AS max_age
FROM patients
GROUP BY cancer_type;
-- ──────────────────────────────────
-- HAVING — filter AFTER grouping
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n
FROM patients
GROUP BY cancer_type
HAVING COUNT(*) >= 10;   -- only cancer types with ≥10 samples
-- ──────────────────────────────────
-- JOIN types visualized:
--
-- INNER JOIN: only matching rows from BOTH tables
-- LEFT JOIN: all rows from LEFT + matching from RIGHT
-- RIGHT JOIN: all rows from RIGHT + matching from LEFT
-- FULL JOIN: all rows from BOTH tables
-- ──────────────────────────────────
-- INNER JOIN — patients with results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
INNER JOIN results r ON p.patient_id = r.patient_id;
-- LEFT JOIN — all patients, even without results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
LEFT JOIN results r ON p.patient_id = r.patient_id;
-- Join on multiple columns
SELECT *
FROM peaks p
JOIN annotations a ON p.chr = a.chr
                  AND p.start >= a.start
                  AND p.end <= a.end;
-- ──────────────────────────────────
-- CREATE TABLE
-- ──────────────────────────────────
CREATE TABLE patients (
    patient_id VARCHAR(50) PRIMARY KEY,
    cancer_type VARCHAR(20) NOT NULL,
    age INTEGER,
    sex VARCHAR(1),
    stage VARCHAR(10)
);
CREATE TABLE peaks (
    peak_id INTEGER PRIMARY KEY AUTOINCREMENT,
    chr VARCHAR(10),
    start_pos INTEGER,
    end_pos INTEGER,
    score FLOAT,
    sample_id VARCHAR(50),
    FOREIGN KEY (sample_id) REFERENCES patients(patient_id)
);
-- ──────────────────────────────────
-- INSERT data
-- ──────────────────────────────────
INSERT INTO patients (patient_id, cancer_type, age, sex, stage)
VALUES ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA');
-- Insert multiple rows
INSERT INTO patients VALUES
    ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA'),
    ('TCGA-BC-A3KF', 'LIHC', 62, 'M', 'III'),
    ('TCGA-OR-A5J2', 'ACC', 48, 'F', 'II');
-- ──────────────────────────────────
-- UPDATE data
-- ──────────────────────────────────
UPDATE patients SET stage = 'IIB' WHERE patient_id = 'TCGA-A1-A0SB';
-- ──────────────────────────────────
-- DELETE data (⚠️ careful!)
-- ──────────────────────────────────
DELETE FROM patients WHERE patient_id = 'TCGA-A1-A0SB';
-- ──────────────────────────────────
-- ALTER TABLE
-- ──────────────────────────────────
ALTER TABLE patients ADD COLUMN treatment VARCHAR(50);
ALTER TABLE patients DROP COLUMN treatment;
-- ──────────────────────────────────
-- Subquery — query inside a query
-- ──────────────────────────────────
-- Find patients with above-average age
SELECT * FROM patients
WHERE age > (SELECT AVG(age) FROM patients);
-- Find genes significant in BRCA
SELECT gene FROM results
WHERE patient_id IN (
    SELECT patient_id FROM patients WHERE cancer_type = 'BRCA'
)
AND pvalue < 0.05;
-- ──────────────────────────────────
-- CASE — conditional logic (like if/else)
-- ──────────────────────────────────
SELECT gene, pvalue,
       CASE
           WHEN pvalue < 0.001 THEN 'highly_significant'
           WHEN pvalue < 0.05 THEN 'significant'
           ELSE 'not_significant'
       END AS significance
FROM results;
-- ──────────────────────────────────
-- Window functions — rank, running totals
-- ──────────────────────────────────
-- Rank genes by pvalue within each cancer type
SELECT cancer_type, gene, pvalue,
       RANK() OVER (PARTITION BY cancer_type ORDER BY pvalue) AS gene_rank
FROM results;   -- "gene_rank" avoids the reserved word RANK
-- ──────────────────────────────────
-- Common Table Expressions (CTE) — readable subqueries
-- ──────────────────────────────────
WITH significant_genes AS (
    SELECT * FROM results WHERE pvalue < 0.05
)
SELECT cancer_type, COUNT(*) AS n_sig_genes
FROM significant_genes
GROUP BY cancer_type
ORDER BY n_sig_genes DESC;
import sqlite3
import pandas as pd
# ──────────────────────────────────
# SQLite — lightweight, no server needed
# ──────────────────────────────────
# Connect (creates file if it doesn't exist)
conn = sqlite3.connect("my_database.db")
cursor = conn.cursor()
# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS peaks (
        chr TEXT,
        start INTEGER,
        end INTEGER,
        score REAL,
        sample TEXT
    )
""")
# Insert data from a pandas DataFrame
df.to_sql("peaks", conn, if_exists="replace", index=False)
# Query into a DataFrame (most common pattern!)
results = pd.read_sql_query("SELECT * FROM peaks WHERE score > 50", conn)
# Run any SQL
cursor.execute("SELECT chr, COUNT(*) FROM peaks GROUP BY chr")
rows = cursor.fetchall()
# Close connection
conn.close()
# ──────────────────────────────────
# MySQL / PostgreSQL (for larger databases)
# ──────────────────────────────────
# pip install sqlalchemy pymysql
from sqlalchemy import create_engine
# Connect to MySQL
engine = create_engine("mysql+pymysql://user:password@host/database")
# Query directly into pandas
df = pd.read_sql("SELECT * FROM patients WHERE cancer_type = 'BRCA'", engine)
# ──────────────────────────────────
# Query UCSC Genome Browser MySQL (public, no password)
# ──────────────────────────────────
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("mysql+pymysql://genome@genome-mysql.soe.ucsc.edu/hg38")
# Get all RefSeq genes on chr1
genes = pd.read_sql("""
    SELECT name2 AS gene, chrom, txStart, txEnd, strand
    FROM refGene
    WHERE chrom = 'chr1'
    ORDER BY txStart
    LIMIT 100
""", engine)
# Get CpG islands
cpg = pd.read_sql("""
    SELECT chrom, chromStart, chromEnd, length, cpgNum, gcNum
    FROM cpgIslandExt
    WHERE chrom = 'chr1'
""", engine)
# ──────────────────────────────────
# Query Ensembl BioMart from R
# ──────────────────────────────────
library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Get gene coordinates
genes <- getBM(
  attributes = c("hgnc_symbol", "chromosome_name",
                 "start_position", "end_position", "strand"),
  filters = "hgnc_symbol",
  values = c("TP53", "BRCA1", "EGFR", "CEBPB"),
  mart = ensembl
)
| Command | What It Does |
|---|---|
| `SELECT` | Choose which columns to return |
| `FROM` | Which table to query |
| `WHERE` | Filter rows (before grouping) |
| `AND` / `OR` | Combine conditions |
| `IN (...)` | Match any value in a list |
| `LIKE '%pattern%'` | Pattern matching |
| `IS NULL` / `IS NOT NULL` | Check for missing values |
| `ORDER BY ... ASC/DESC` | Sort results |
| `LIMIT N` | Return only N rows |
| `GROUP BY` | Aggregate per group |
| `HAVING` | Filter after grouping |
| `COUNT()` / `SUM()` / `AVG()` | Aggregate functions |
| `INNER JOIN ... ON` | Combine tables (matching only) |
| `LEFT JOIN ... ON` | Combine tables (keep all from left) |
| `INSERT INTO` | Add rows |
| `UPDATE ... SET` | Modify rows |
| `DELETE FROM` | Remove rows |
| `CREATE TABLE` | Make a new table |
| `ALTER TABLE` | Modify table structure |
Run these commands first to understand your environment:
# ──────────────────────────────────
# System
# ──────────────────────────────────
cat /etc/os-release # OS version
uname -a # Kernel info
nproc # Number of CPU cores
free -h # Available RAM
df -h # Disk space
# ──────────────────────────────────
# Tools
# ──────────────────────────────────
git --version # Git
python3 --version # Python
pip --version # pip
conda --version # Conda
R --version # R
java -version # Java
# ──────────────────────────────────
# Bioinformatics tools
# ──────────────────────────────────
samtools --version # SAMtools
bedtools --version # BEDTools
blastn -version # BLAST
mafft --version # MAFFT
module avail            # Available modules (HPC)
# Install (if not present)
sudo apt install git # Ubuntu/Debian
# or
conda install git # via Conda
# Configure (do this ONCE)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"
git config --global core.editor "code --wait" # VS Code as editor
git config --global init.defaultBranch main
# Verify
git config --list
# Generate key
ssh-keygen -t ed25519 -C "your.email@example.com"
# Press Enter for all prompts (default location, no passphrase)
# Copy the public key
cat ~/.ssh/id_ed25519.pub
# → Go to GitHub → Settings → SSH Keys → New SSH Key → Paste
# Test connection
ssh -T git@github.com
# Expected: "Hi YourName! You've successfully authenticated"
# Download Miniconda (lightweight)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Follow prompts, say "yes" to initialize
# Restart terminal, then verify
conda --version
# ──────────────────────────────────
# Create a project environment
# ──────────────────────────────────
conda create -n bioinfo python=3.10
conda activate bioinfo
# Install common bioinformatics packages
conda install -c bioconda samtools bedtools blast mafft
conda install -c conda-forge pandas numpy matplotlib seaborn biopython pysam
# ──────────────────────────────────
# Managing environments
# ──────────────────────────────────
conda env list # List all environments
conda activate bioinfo # Activate
conda deactivate # Deactivate
conda env export > environment.yml # Export (for reproducibility)
conda env create -f environment.yml # Recreate from file
conda remove -n bioinfo --all       # Delete an environment

- Download from code.visualstudio.com
- Install and open

To work on a remote server:
- Install VS Code locally
- Install the Remote - SSH extension
- Press Ctrl+Shift+P → type "Remote-SSH: Connect to Host"
- Enter: username@server-address
- VS Code opens a window connected to the server
# Must-have extensions (install from Extensions panel):
- Python # Python support
- Remote - SSH # Work on remote servers
- GitLens # Enhanced Git integration
- Jupyter # Run notebooks in VS Code
# (Bracket Pair Colorizer is deprecated — VS Code now colorizes bracket pairs natively)
- indent-rainbow # Colored indentation
Press Ctrl+Shift+P → "Preferences: Open Settings (JSON)" and add:
{
"editor.fontSize": 14,
"editor.wordWrap": "on",
"files.autoSave": "afterDelay",
"python.defaultInterpreterPath": "~/miniconda3/envs/bioinfo/bin/python",
"terminal.integrated.defaultProfile.linux": "bash"
}
# 1. SSH into server
ssh username@server-address
# 2. Check what's available
git --version && python3 --version && conda --version
# 3. Configure Git (if first time)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"
# 4. Clone your project
git clone git@github.com:YourName/my-project.git
cd my-project
# 5. Create/activate conda environment
conda create -n my_project python=3.10
conda activate my_project
pip install -r requirements.txt
# 6. Open in VS Code (from local machine)
# VS Code → Remote-SSH → Connect → open the project folder
# 7. Start working!
git checkout -b feature/my-task
# ... code ...
git add .
git commit -m "Add new analysis"
git push origin feature/my-task
When starting a new bioinformatics project, create this structure:
mkdir -p my_project/{data,scripts,results,figures,logs,docs}
touch my_project/README.md
touch my_project/requirements.txt
touch my_project/.gitignore
cd my_project && git init
my_project/
│
├── data/ # Input data (add to .gitignore if large)
├── scripts/ # Analysis scripts
├── results/ # Output files
├── figures/ # Plots and visualizations
├── logs/ # Job logs, error logs
├── docs/ # Notes, reports
│
├── .gitignore # Files to ignore
├── requirements.txt # Python dependencies
└── README.md # Project description
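The same scaffold can also be created from Python — a standard-library-only sketch whose folder and file names mirror the tree above:

```python
from pathlib import Path

def scaffold(root: str) -> Path:
    """Create the project skeleton shown above under `root`."""
    base = Path(root)
    # Subdirectories for data, code, outputs, and documentation
    for sub in ("data", "scripts", "results", "figures", "logs", "docs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    # Empty top-level files to fill in later
    for name in ("README.md", "requirements.txt", ".gitignore"):
        (base / name).touch()
    return base

project = scaffold("my_project")
print(sorted(p.name for p in project.iterdir()))
```

Because `mkdir(exist_ok=True)` and `touch()` are idempotent, the script is safe to rerun on an existing project.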
# Data files (too large for Git)
*.bam
*.bam.bai
*.fastq
*.fastq.gz
*.bw
*.bigwig
*.bedgraph
*.h5
*.hdf5
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/
# Environment
.env
*.egg-info/
# OS
.DS_Store
Thumbs.db
# Logs
*.log
logs/
# Results (regenerable)
results/
figures/

| Command | What It Does |
|---|---|
| `pwd` | Print current directory |
| `ls -la` | List all files with details |
| `cd` | Change directory |
| `mkdir -p` | Create nested directories |
| `cp -r` | Copy files/directories |
| `mv` | Move or rename |
| `rm -rf` | Delete recursively (no undo — careful!) |
| `ln -s` | Create symbolic link |
| `cat` | Print file content |
| `head -n 20` | First 20 lines |
| `tail -n 20` | Last 20 lines |
| `less` | Scroll through a file |
| `wc -l` | Count lines |
| `grep` | Search for a pattern |
| `find` | Find files |
| `cut -f1,2` | Extract columns |
| `sort` | Sort lines |
| `uniq -c` | Count unique lines |
| `gzip` / `gunzip` | Compress / decompress |
| `tar -xzvf` | Extract archive |
| `chmod +x` | Make executable |
| `ssh` | Remote connection |
| `scp` | Copy files to/from server |
| `top` / `htop` | Monitor processes |
| `kill` | Stop a process |
| `nohup ... &` | Run in background |
| `df -h` | Disk space |
| `du -sh` | Folder size |
| `\|` (pipe) | Chain commands |
| `>` / `>>` | Redirect output |
| Package | Purpose |
|---|---|
| `pandas` | Tabular data manipulation |
| `numpy` | Numerical computing |
| `matplotlib` | Plotting |
| `seaborn` | Statistical visualization |
| `biopython` | Sequence analysis, BLAST, FASTA |
| `pysam` | BAM/SAM file handling |
| `pybedtools` | BED file operations |
| `scikit-learn` | Machine learning |
| `scipy` | Statistics, distance metrics |
| `gzip` | Read compressed files (built-in) |
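Most sequencing-adjacent text files arrive gzipped, and the built-in `gzip` module reads them without decompressing to disk. A standard-library-only sketch that writes and reads back a small gzipped TSV (the filename and values are invented for the example):

```python
import csv
import gzip

# Write a tiny gzipped TSV (hypothetical example data)
with gzip.open("peaks.tsv.gz", "wt", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerows([["chrom", "start", "end"],
                      ["chr1", "1000", "1500"],
                      ["chr2", "2000", "2600"]])

# Read it back: "rt" opens the compressed file as text
with gzip.open("peaks.tsv.gz", "rt", newline="") as fh:
    rows = list(csv.reader(fh, delimiter="\t"))

print(rows[0])        # → ['chrom', 'start', 'end']
print(len(rows) - 1)  # → 2 data rows
```

With pandas installed, the same file can be read in one line — `pd.read_csv("peaks.tsv.gz", sep="\t")` — since compression is inferred from the extension.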
# Python + bioinformatics tools
conda create -n bioinfo python=3.10
conda activate bioinfo
conda install -c bioconda -c conda-forge \
pandas numpy matplotlib seaborn biopython pysam \
samtools bedtools blast mafft
pip install pybedtools scikit-learn

| Package | Purpose |
|---|---|
| `dplyr` | Data manipulation (filter, mutate, summarise) |
| `ggplot2` | Plotting |
| `readr` | Fast file reading (CSV, TSV) |
| `tidyr` | Reshaping data (pivot) |
| `DESeq2` | Differential expression analysis |
| `edgeR` | Differential expression (alternative) |
| `GenomicRanges` | Genomic interval operations |
| `rtracklayer` | Read/write BED, BW, GFF files |
| `clusterProfiler` | GO / KEGG enrichment |
| `pheatmap` | Heatmaps |
| `biomaRt` | Query Ensembl databases |
| `Biostrings` | Sequence manipulation |
# CRAN packages
install.packages(c("dplyr", "ggplot2", "readr", "tidyr", "pheatmap"))
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "GenomicRanges",
"rtracklayer", "clusterProfiler", "biomaRt"))# SQLite (lightweight, no server)
sudo apt install sqlite3
# From Python (no install needed)
import sqlite3 # built into Python
# MySQL client
sudo apt install mysql-client
# or
conda install -c conda-forge mysqlclient
# PostgreSQL client
sudo apt install postgresql-client
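Because `sqlite3` ships with Python, a database is just a single file — handy for small lab inventories before reaching for MySQL/PostgreSQL. A minimal sketch (the filename and schema below are invented for the example):

```python
import sqlite3

# SQLite needs no server: the whole database lives in one file
con = sqlite3.connect("lab_samples.db")
con.execute("""CREATE TABLE IF NOT EXISTS samples (
    sample_id TEXT PRIMARY KEY,
    condition TEXT,
    replicate INTEGER
)""")
# INSERT OR REPLACE makes the script safe to rerun
con.execute("INSERT OR REPLACE INTO samples VALUES ('S1', 'control', 1)")
con.commit()

n = con.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(n)
con.close()
```

The resulting `lab_samples.db` file can be copied, versioned (if small), or inspected from the shell with `sqlite3 lab_samples.db`.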