FatineHic/First

🚀 First — Your Day-One Setup & Survival Guide

Everything you need when starting at a new lab or office: Linux commands, Python & R bioinformatics essentials, SQL for databases, and full environment setup (Git, VS Code, Conda, SSH).



✅ Day-One Checklist

When you arrive at a new office/lab, go through this list:

□  Can I open a terminal?
□  Do I have SSH access to the server?
□  What OS and version is running?
□  Do I have sudo access?
□  Is Git installed? What version?
□  Is Conda/Miniconda installed? Where?
□  Is Python installed? What version?
□  Where is the shared data stored?
□  What job scheduler is used? (SLURM, SGE, PBS?)
□  Do I have a GitHub/GitLab account connected?
□  Is VS Code installed or can I install it?
□  Who do I contact for server issues?
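Most of these checks take a few seconds in a terminal. A minimal probe script (a sketch; the tool list is just an example, adjust it for your site):

```shell
#!/usr/bin/env bash
# Day-one environment probe: prints one line per tool instead of failing.
for tool in git conda python3 ssh sbatch; do
    if command -v "$tool" >/dev/null 2>&1; then
        printf 'OK:      %s -> %s\n' "$tool" "$(command -v "$tool")"
    else
        printf 'MISSING: %s\n' "$tool"
    fi
done

# OS and version (falls back to uname if /etc/os-release is absent)
grep '^PRETTY_NAME' /etc/os-release 2>/dev/null || uname -a
```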

🐧 Part 1 — Linux Survival Guide

1.1 — First Things First: Where Am I?

# Who am I?
whoami

# Where am I?
pwd

# What machine am I on?
hostname

# What OS is running?
cat /etc/os-release

# What's the date and time?
date

# How long has the server been running?
uptime

1.2 — Navigating the File System

# List files in current directory
ls

# List with details (permissions, size, date)
ls -la

# List with human-readable sizes
ls -lh

# List sorted by modification date (newest first)
ls -lt

# Go to a directory
cd /data/my_project

# Go back to home directory
cd ~
# or just
cd

# Go up one level
cd ..

# Go back to previous directory
cd -

# Show directory structure (2 levels deep)
tree -L 2

# If tree is not installed
find . -maxdepth 2 -type d

1.3 — Checking Disk Space & File Sizes

# How much disk space is left?
df -h

# How much space does this folder use?
du -sh /data/my_project/

# How much space does each subfolder use?
du -sh /data/my_project/*

# Sort folders by size (largest first)
du -sh /data/my_project/* | sort -rh

# Find the 10 largest files in a directory (GNU find)
# (piping "-exec ls -lS" into head only sorts within each batch of files)
find /data/my_project -type f -printf '%s\t%p\n' | sort -rn | head -10

# Check your disk quota (if applicable)
quota -s

1.4 — Files: Create, Copy, Move, Delete

# Create an empty file
touch my_file.txt

# Create a directory
mkdir my_folder

# Create nested directories
mkdir -p project/data/raw

# Copy a file
cp source.txt destination.txt

# Copy a directory (recursive)
cp -r source_folder/ destination_folder/

# Move or rename a file
mv old_name.txt new_name.txt

# Move a file to another directory
mv my_file.txt /data/my_project/

# Delete a file
rm my_file.txt

# Delete a directory and everything inside (⚠️ BE CAREFUL)
rm -rf my_folder/

# Create a symbolic link (shortcut)
ln -s /data/shared/big_file.bam ~/my_project/big_file.bam

⚠️ DANGER: rm -rf is irreversible. Always double-check the path before running it. Never run rm -rf / or rm -rf * without being 100% sure of your current directory.
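One way to build safer deletion habits, shown here inside a throwaway directory so the demo is safe to run anywhere (the trash location is just an example):

```shell
# Work in a temporary directory so nothing real is at risk
demo=$(mktemp -d)
mkdir -p "$demo/my_folder/data"

# 1. Preview what a glob actually matches before deleting anything
echo "$demo"/my_folder/*

# 2. Move to a scratch "trash" directory instead of deleting outright;
#    purge it later once you are sure nothing was lost
mkdir -p "$demo/trash"
mv "$demo/my_folder" "$demo/trash/"

# 3. When you do need rm, GNU rm's -I flag prompts once before bulk deletion:
#    rm -rI some_folder/
```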


1.5 — Reading & Inspecting Files

# Print entire file content
cat my_file.txt

# View first 10 lines
head my_file.txt

# View first 50 lines
head -n 50 my_file.txt

# View last 10 lines
tail my_file.txt

# View last 50 lines
tail -n 50 my_file.txt

# Scroll through a file (press q to quit)
less my_file.txt

# Count lines, words, characters
wc my_file.txt

# Count only lines
wc -l my_file.txt

# Count number of files in a directory
ls -1 | wc -l

# View a BED/TSV/CSV file nicely
column -t my_file.bed | head

# Check file type
file my_file.bam

1.6 — Searching & Filtering

# Search for a word in a file
grep "CEBPB" my_file.txt

# Search case-insensitive
grep -i "cebpb" my_file.txt

# Search recursively in all files in a directory
grep -r "CEBPB" /data/my_project/

# Count how many times a pattern appears
grep -c "chr1" peaks.bed

# Find a file by name
find /data/ -name "*.bam"

# Find files modified in the last 7 days
find /data/ -mtime -7

# Find files larger than 1GB
find /data/ -size +1G

# Extract specific columns from a tabular file (e.g., columns 1,2,3)
cut -f1,2,3 my_file.bed

# Sort a file
sort my_file.txt

# Sort numerically by column 2
sort -k2,2n my_file.bed

# Remove duplicate lines
sort my_file.txt | uniq

# Count occurrences of each unique line
sort my_file.txt | uniq -c | sort -rn

1.7 — Compression & Archives

# Compress a file
gzip my_file.fastq

# Decompress
gunzip my_file.fastq.gz

# Read a gzipped file without decompressing
zcat my_file.fastq.gz | head

# Create a tar.gz archive
tar -czvf archive.tar.gz my_folder/

# Extract a tar.gz archive
tar -xzvf archive.tar.gz

# Extract a .zip file
unzip archive.zip

1.8 — Permissions & Ownership

# Check permissions of a file
ls -la my_file.txt

# Permission format: -rwxr-xr-x
# r=read, w=write, x=execute
# [owner][group][others]

# Make a script executable
chmod +x my_script.sh

# Give read/write to owner only
chmod 600 my_file.txt

# Give read/write/execute to owner, read/execute to group and others
chmod 755 my_script.sh

# Change owner (needs sudo)
chown user:group my_file.txt

1.9 — Processes & Jobs

# See running processes
top

# Better version (if installed)
htop

# See YOUR running processes
ps -u $(whoami) -f

# Kill a process by PID
kill 12345

# Force kill
kill -9 12345

# Run a command in the background
my_script.sh &

# Run a command that survives logout
nohup my_script.sh &

# See background jobs
jobs

# Check SLURM queue (if using HPC)
squeue -u $(whoami)

# Submit a SLURM job
sbatch my_job.sh

# Cancel a SLURM job
scancel JOB_ID

# See past SLURM jobs
sacct --format=JobID,JobName,State,Elapsed,MaxRSS
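sbatch expects a shell script whose #SBATCH comment lines carry the resource request. A minimal template (a sketch; partition names, paths, and resource numbers vary by cluster, so ask your admin):

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=%x_%j.out      # %x = job name, %j = job ID
#SBATCH --time=02:00:00         # walltime limit (hh:mm:ss)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# #SBATCH lines are plain comments to bash, so this also runs locally.
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} CPU(s)"
```

Submit with `sbatch my_job.sh`, then `squeue -u $(whoami)` to watch it in the queue.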

1.10 — SSH & Remote Servers

# Connect to a remote server
ssh username@server-address

# Connect with a specific port
ssh -p 2222 username@server-address

# Copy a file TO the server
scp my_file.txt username@server:/data/my_project/

# Copy a file FROM the server
scp username@server:/data/my_project/results.txt ./

# Copy a directory (recursive)
scp -r my_folder/ username@server:/data/my_project/

# Mount a remote folder locally (if sshfs installed)
sshfs username@server:/data/my_project/ ~/remote_mount/
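Typing the full username@server-address gets old fast. An entry in ~/.ssh/config (hostnames below are placeholders) creates a short alias that ssh, scp, and sshfs all understand:

```
# ~/.ssh/config
Host lab
    HostName server-address.example.org
    User username
    Port 22
    # ForwardAgent yes    # optional: forward your SSH agent to the server
```

Afterwards `ssh lab` and `scp file.txt lab:/data/my_project/` just work.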

1.11 — Pipes & Redirection

# Pipe: send output of one command to another
cat my_file.bed | grep "chr1" | wc -l

# Save output to a file (overwrite)
grep "chr1" peaks.bed > chr1_peaks.bed

# Append to a file
echo "new line" >> my_file.txt

# Redirect errors to a file
my_command 2> errors.log

# Redirect both output and errors
my_command > output.log 2>&1

1.12 — Useful One-Liners for Bioinformatics

# Count reads in a FASTQ file
echo $(( $(wc -l < reads.fastq) / 4 ))

# Count reads in a gzipped FASTQ
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))

# Count reads in a BAM file
samtools view -c my_file.bam

# Count mapped reads only
samtools view -c -F 4 my_file.bam

# Extract chromosome names from a BAM
samtools idxstats my_file.bam | cut -f1

# Count peaks per chromosome in a BED file
cut -f1 peaks.bed | sort | uniq -c | sort -rn

# Get unique gene names from column 4 of a BED
cut -f4 annotated.bed | sort -u

# Quick look at a BAM header
samtools view -H my_file.bam | head -20

# Check if a BAM is sorted
samtools view -H my_file.bam | grep "@HD"

# Convert BAM to BED
bedtools bamtobed -i my_file.bam > my_file.bed

🐍 Part 2 — Python for Bioinformatics

2.1 — File I/O (The Basics)

# ──────────────────────────────────
# Reading & writing text files
# ──────────────────────────────────

# Read entire file
with open("file.txt", "r") as f:
    content = f.read()

# Read line by line
with open("file.txt", "r") as f:
    for line in f:
        line = line.strip()  # remove newline
        print(line)

# Write to a file
with open("output.txt", "w") as f:
    f.write("Hello\n")

# Append to a file
with open("output.txt", "a") as f:
    f.write("New line\n")

# Read a gzipped file
import gzip
with gzip.open("file.txt.gz", "rt") as f:
    for line in f:
        print(line.strip())

2.2 — Pandas Essentials

import pandas as pd
import numpy as np   # used for np.log10 below

# ──────────────────────────────────
# Loading data
# ──────────────────────────────────
df = pd.read_csv("data.csv")
df = pd.read_csv("data.tsv", sep="\t")
df = pd.read_csv("data.bed", sep="\t", header=None,
                  names=["chr", "start", "end", "name", "score"])

# ──────────────────────────────────
# First look at data
# ──────────────────────────────────
df.head()              # first 5 rows
df.tail()              # last 5 rows
df.shape               # (rows, columns)
df.columns             # column names
df.dtypes              # data types
df.info()              # summary
df.describe()          # statistics
df.nunique()           # unique values per column
df.isnull().sum()      # missing values per column

# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
chr1 = df[df["chr"] == "chr1"]
significant = df[df["pvalue"] < 0.05]
high_score = df[(df["score"] > 100) & (df["chr"] == "chr1")]

# ──────────────────────────────────
# Selecting columns
# ──────────────────────────────────
subset = df[["chr", "start", "end"]]
single_col = df["score"]

# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df_sorted = df.sort_values("score", ascending=False)
df_sorted = df.sort_values(["chr", "start"])

# ──────────────────────────────────
# Grouping & aggregation
# ──────────────────────────────────
counts = df.groupby("chr").size()
means = df.groupby("cancer_type")["score"].mean()
summary = df.groupby("cancer_type").agg({"score": ["mean", "std", "count"]})

# ──────────────────────────────────
# Adding / modifying columns
# ──────────────────────────────────
df["length"] = df["end"] - df["start"]
df["log_score"] = np.log10(df["score"] + 1)

# ──────────────────────────────────
# Saving
# ──────────────────────────────────
df.to_csv("output.csv", index=False)
df.to_csv("output.tsv", sep="\t", index=False)

2.3 — Working with FASTA Files

# ──────────────────────────────────
# Using Biopython
# ──────────────────────────────────
from Bio import SeqIO

# Read a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")

# Load all sequences into a dictionary
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
my_seq = sequences["gene_name"].seq

# Write sequences to a FASTA file
# (my_sequences: a {name: sequence_string} dict)
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = []
for name, seq in my_sequences.items():
    records.append(SeqRecord(Seq(seq), id=name, description=""))
SeqIO.write(records, "output.fasta", "fasta")

# Count sequences in a FASTA
count = sum(1 for _ in SeqIO.parse("sequences.fasta", "fasta"))

# Filter sequences by length
long_seqs = [r for r in SeqIO.parse("sequences.fasta", "fasta") if len(r.seq) > 500]

2.4 — Working with BED Files

import pandas as pd

# ──────────────────────────────────
# Load a BED file
# ──────────────────────────────────
bed = pd.read_csv("peaks.bed", sep="\t", header=None,
                   names=["chr", "start", "end", "name", "score", "strand"])

# ──────────────────────────────────
# Common operations
# ──────────────────────────────────

# Peak lengths
bed["length"] = bed["end"] - bed["start"]

# Filter by chromosome
chr1_peaks = bed[bed["chr"] == "chr1"]

# Filter by score
strong_peaks = bed[bed["score"] > 50]

# Count peaks per chromosome
bed["chr"].value_counts()

# Sort by genomic position
bed_sorted = bed.sort_values(["chr", "start"])

# Save as BED (no header, tab-separated)
bed_sorted.to_csv("sorted.bed", sep="\t", header=False, index=False)

2.5 — Working with BAM/SAM Files

import pysam

# ──────────────────────────────────
# Read a BAM file
# ──────────────────────────────────
bam = pysam.AlignmentFile("sample.bam", "rb")

# Count total reads
total = bam.count()

# Count mapped reads
mapped = bam.count(read_callback=lambda r: not r.is_unmapped)

# Iterate over reads in a region
for read in bam.fetch("chr1", 1000, 2000):
    print(read.query_name, read.mapping_quality)

# Get reads as a list
reads = list(bam.fetch("chr1", 1000, 2000))

bam.close()

2.6 — Running BLAST from Python

# Note: Bio.Blast.Applications is deprecated in recent Biopython releases
# (new code should run the blast+ binaries via subprocess), but it still
# works where available:
from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast import NCBIXML

# ──────────────────────────────────
# Run local BLASTp
# ──────────────────────────────────
blastp = NcbiblastpCommandline(
    query="query.fasta",
    db="my_database",
    evalue=1e-5,
    outfmt=5,            # XML format
    out="blast_results.xml"
)
stdout, stderr = blastp()

# ──────────────────────────────────
# Parse BLAST results
# ──────────────────────────────────
with open("blast_results.xml") as f:
    records = NCBIXML.parse(f)
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-10:
                    print(f"Hit: {alignment.title}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identity: {hsp.identities}/{hsp.align_length}")

2.7 — Plotting for Bioinformatics

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# ──────────────────────────────────
# Histogram (e.g., peak length distribution)
# ──────────────────────────────────
plt.figure(figsize=(8, 5))
plt.hist(bed["length"], bins=50, color="steelblue", edgecolor="black")
plt.xlabel("Peak Length (bp)")
plt.ylabel("Count")
plt.title("Peak Length Distribution")
plt.savefig("peak_lengths.png", dpi=300, bbox_inches="tight")
plt.show()

# ──────────────────────────────────
# Heatmap (e.g., gene presence/absence)
# ──────────────────────────────────
plt.figure(figsize=(12, 8))
sns.heatmap(matrix, cmap="YlOrRd", xticklabels=genes, yticklabels=species)
plt.title("Gene Presence/Absence")
plt.savefig("heatmap.png", dpi=300, bbox_inches="tight")
plt.show()

# ──────────────────────────────────
# Volcano plot
# ──────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(df["log2FC"], -np.log10(df["pvalue"]),
            c=df["significant"].map({True: "red", False: "grey"}),
            alpha=0.5, s=10)
plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=0.5)
plt.title("Volcano Plot")
plt.savefig("volcano.png", dpi=300, bbox_inches="tight")
plt.show()

# ──────────────────────────────────
# Save any figure
# ──────────────────────────────────
# Always use these options:
plt.savefig("figure.png", dpi=300, bbox_inches="tight")
# For publication: use PDF or SVG
plt.savefig("figure.pdf", bbox_inches="tight")

2.8 — Useful Python One-Liners

# Reverse complement a DNA sequence (handles upper- and lowercase)
def reverse_complement(seq):
    comp = str.maketrans("ATCGatcg", "TAGCtagc")
    return seq.translate(comp)[::-1]

# GC content (as a percentage)
def gc_content(seq):
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) * 100

# Read a two-column file into a dictionary
with open("mapping.txt") as f:
    d = dict(line.strip().split("\t") for line in f)

# Flatten a list of lists
flat = [item for sublist in nested_list for item in sublist]

# Quick timer for a block of code
import time
start = time.time()
# ... your code ...
print(f"Elapsed: {time.time() - start:.2f}s")

📊 Part 3 — R for Bioinformatics

3.1 — R Basics

# ──────────────────────────────────
# Variables & Types
# ──────────────────────────────────
x <- 5                          # assignment (use <- not =)
name <- "CEBPB"
is_significant <- TRUE
my_vector <- c(1, 2, 3, 4, 5)  # vector (R's basic data structure)

# Check type
class(x)
typeof(x)
length(my_vector)

# ──────────────────────────────────
# Vectors (everything in R is a vector)
# ──────────────────────────────────
nums <- c(10, 20, 30, 40, 50)
genes <- c("TP53", "BRCA1", "EGFR", "MYC")
logicals <- c(TRUE, FALSE, TRUE)

# Access elements (R is 1-indexed!)
nums[1]           # 10 (not 0-indexed like Python!)
nums[2:4]         # 20, 30, 40
genes[c(1, 3)]    # "TP53", "EGFR"

# Operations are vectorized
nums * 2          # c(20, 40, 60, 80, 100)
nums > 25         # c(FALSE, FALSE, TRUE, TRUE, TRUE)
sum(nums > 25)    # 3

# ──────────────────────────────────
# Useful vector functions
# ──────────────────────────────────
length(nums)
sum(nums)
mean(nums)
median(nums)
sd(nums)
min(nums)
max(nums)
range(nums)
sort(nums, decreasing = TRUE)
unique(c(1, 1, 2, 3, 3))    # c(1, 2, 3)
table(c("A", "B", "A", "C")) # counts per value

3.2 — Data Frames (R's Equivalent of Pandas)

# ──────────────────────────────────
# Creating a data frame
# ──────────────────────────────────
df <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  log2FC = c(2.5, -1.3, 4.1),
  pvalue = c(0.001, 0.03, 0.0001)
)

# ──────────────────────────────────
# Reading files
# ──────────────────────────────────
df <- read.csv("data.csv")
df <- read.csv("data.tsv", sep = "\t")
df <- read.delim("data.bed", header = FALSE,
                  col.names = c("chr", "start", "end", "name", "score"))

# Tidyverse alternative (faster for large files)
library(readr)
df <- read_tsv("data.tsv")
df <- read_csv("data.csv")

# ──────────────────────────────────
# First look
# ──────────────────────────────────
head(df)            # first 6 rows
tail(df)            # last 6 rows
str(df)             # structure (types, dimensions)
summary(df)         # statistics
dim(df)             # rows × columns
nrow(df)            # number of rows
ncol(df)            # number of columns
colnames(df)        # column names

# ──────────────────────────────────
# Accessing data
# ──────────────────────────────────
df$gene             # one column (as vector)
df[, 1]             # first column
df[1, ]             # first row
df[1:3, ]           # rows 1 to 3
df[, c("gene", "pvalue")]  # select columns by name

# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
sig <- df[df$pvalue < 0.05, ]
up <- df[df$log2FC > 2 & df$pvalue < 0.05, ]

# ──────────────────────────────────
# Adding columns
# ──────────────────────────────────
df$neg_log_p <- -log10(df$pvalue)
df$significant <- df$pvalue < 0.05

# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df <- df[order(df$pvalue), ]              # sort by pvalue (ascending)
df <- df[order(-df$log2FC), ]             # sort by log2FC (descending)

# ──────────────────────────────────
# Saving
# ──────────────────────────────────
write.csv(df, "output.csv", row.names = FALSE)
write.table(df, "output.tsv", sep = "\t", row.names = FALSE, quote = FALSE)

3.3 — Tidyverse (Modern R)

library(dplyr)
library(tidyr)

# ──────────────────────────────────
# The pipe operator: %>% (read as "then")
# ──────────────────────────────────

# Base R (nested, hard to read):
head(sort(df$pvalue))

# Tidyverse (piped, reads left-to-right):
df %>%
  arrange(pvalue) %>%
  head()

# ──────────────────────────────────
# dplyr verbs — the 6 essential functions
# ──────────────────────────────────

# filter() — select rows
df %>% filter(pvalue < 0.05)
df %>% filter(log2FC > 2, pvalue < 0.01)
df %>% filter(gene %in% c("TP53", "BRCA1"))

# select() — select columns
df %>% select(gene, pvalue)
df %>% select(-log2FC)              # exclude a column
df %>% select(starts_with("log"))   # by pattern

# mutate() — add/modify columns
df %>% mutate(
  neg_log_p = -log10(pvalue),
  significant = pvalue < 0.05,
  length = end - start
)

# arrange() — sort rows
df %>% arrange(pvalue)              # ascending
df %>% arrange(desc(log2FC))        # descending

# summarise() + group_by() — aggregate
df %>%
  group_by(cancer_type) %>%
  summarise(
    n_genes = n(),
    mean_FC = mean(log2FC),
    median_p = median(pvalue)
  )

# count() — quick frequency table
df %>% count(cancer_type, sort = TRUE)

# ──────────────────────────────────
# Joining data frames (like SQL joins)
# ──────────────────────────────────
merged <- left_join(df1, df2, by = "gene")
merged <- inner_join(df1, df2, by = c("chr", "start"))

# ──────────────────────────────────
# Reshaping
# ──────────────────────────────────
# Wide to long
long <- df %>% pivot_longer(cols = c(sample1, sample2, sample3),
                             names_to = "sample",
                             values_to = "expression")

# Long to wide
wide <- long %>% pivot_wider(names_from = sample,
                              values_from = expression)

3.4 — Plotting with ggplot2

library(ggplot2)

# ──────────────────────────────────
# The ggplot2 formula:
# ggplot(data, aes(x, y)) + geom_*() + theme_*()
# ──────────────────────────────────

# Scatter plot
ggplot(df, aes(x = log2FC, y = -log10(pvalue))) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Volcano Plot", x = "log2 Fold Change", y = "-log10(p-value)")

# Volcano plot (colored by significance)
ggplot(df, aes(x = log2FC, y = -log10(pvalue), color = significant)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_color_manual(values = c("grey", "red")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_minimal()

# Bar plot
ggplot(df, aes(x = cancer_type, fill = cancer_type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Boxplot
ggplot(df, aes(x = cancer_type, y = expression, fill = cancer_type)) +
  geom_boxplot() +
  theme_minimal()

# Histogram
ggplot(df, aes(x = peak_length)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  theme_minimal()

# Heatmap
library(pheatmap)
pheatmap(matrix_data,
         scale = "row",
         clustering_distance_rows = "euclidean",
         color = colorRampPalette(c("blue", "white", "red"))(100))

# ──────────────────────────────────
# Save a plot
# ──────────────────────────────────
ggsave("my_plot.png", width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", width = 8, height = 6)

3.5 — Bioconductor Essentials

# ──────────────────────────────────
# Install Bioconductor (do this ONCE)
# ──────────────────────────────────
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install packages
BiocManager::install("DESeq2")
BiocManager::install("edgeR")
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
BiocManager::install("clusterProfiler")

# ──────────────────────────────────
# DESeq2 — Differential Expression Analysis
# ──────────────────────────────────
library(DESeq2)

# Create DESeq dataset from count matrix
dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,     # genes × samples (integers)
  colData = sample_info,        # sample metadata
  design = ~ condition          # what to compare
)

# Run DESeq2
dds <- DESeq(dds)
res <- results(dds)   # avoid naming this "results": it would mask the function

# Get significant genes
sig_genes <- res[which(res$padj < 0.05), ]
sig_genes <- sig_genes[order(sig_genes$padj), ]

# MA plot of DESeq2 results (log2 fold change vs mean expression)
plotMA(res)

# PCA plot
vsd <- vst(dds)
plotPCA(vsd, intgroup = "condition")

# ──────────────────────────────────
# GenomicRanges — Working with genomic intervals
# ──────────────────────────────────
library(GenomicRanges)

# Create genomic ranges
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), end = c(150, 250, 350)),
  strand = c("+", "-", "+")
)

# Find overlaps between two sets of regions
overlaps <- findOverlaps(peaks_gr, promoters_gr)

# Reduce overlapping ranges to a single set
reduced <- reduce(gr)

# Get distance between ranges
dist <- distanceToNearest(peaks_gr, tss_gr)

# ──────────────────────────────────
# Read BED/BW/GFF files
# ──────────────────────────────────
library(rtracklayer)

bed <- import("peaks.bed")
bw <- import("signal.bw")
gff <- import("annotations.gff3")

# Export
export(gr, "output.bed", format = "BED")

# ──────────────────────────────────
# Gene Ontology / Pathway Analysis
# ──────────────────────────────────
library(clusterProfiler)
library(org.Hs.eg.db)

# GO enrichment
ego <- enrichGO(
  gene = significant_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",               # Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)
dotplot(ego, showCategory = 20)

# KEGG pathway enrichment
ekegg <- enrichKEGG(
  gene = entrez_ids,
  organism = "hsa",
  pvalueCutoff = 0.05
)
dotplot(ekegg)

3.6 — R Useful One-Liners

# Check if a package is installed
"DESeq2" %in% rownames(installed.packages())

# Install a CRAN package
install.packages("ggplot2")

# Load multiple libraries at once
lapply(c("dplyr", "ggplot2", "readr"), library, character.only = TRUE)

# Apply a function to every column
sapply(df, function(x) sum(is.na(x)))    # count NAs per column

# Quick summary of a BED file
bed <- read.delim("peaks.bed", header = FALSE)
cat("Peaks:", nrow(bed), "\n")
cat("Chromosomes:", length(unique(bed$V1)), "\n")
cat("Mean length:", mean(bed$V3 - bed$V2), "bp\n")

# Set working directory
setwd("/data/my_project")
getwd()

# List files matching a pattern (pattern is a regex, not a glob)
list.files(path = "results/", pattern = "\\.csv$")

🗄️ Part 4 — SQL Essentials

4.1 — Why SQL in Bioinformatics?

Many biological databases (Ensembl, UCSC Genome Browser, GEO, clinical databases) use SQL. You'll encounter it when:

  • Querying genomic annotation databases
  • Working with clinical/patient metadata
  • Accessing TCGA or GEO programmatically
  • Managing lab sample databases (LIMS)
  • Pulling data from REDCap or similar platforms

4.2 — SQL Basics

-- ──────────────────────────────────
-- SELECT — get data from a table
-- ──────────────────────────────────
SELECT * FROM patients;                          -- all columns
SELECT patient_id, cancer_type FROM patients;    -- specific columns
SELECT DISTINCT cancer_type FROM patients;       -- unique values only

-- ──────────────────────────────────
-- WHERE — filter rows
-- ──────────────────────────────────
SELECT * FROM patients WHERE cancer_type = 'BRCA';
SELECT * FROM patients WHERE age > 50;
SELECT * FROM patients WHERE cancer_type = 'BRCA' AND stage = 'III';
SELECT * FROM patients WHERE cancer_type IN ('BRCA', 'LUAD', 'COAD');
SELECT * FROM genes WHERE gene_name LIKE 'CEBP%';     -- starts with CEBP
SELECT * FROM results WHERE pvalue IS NOT NULL;

-- ──────────────────────────────────
-- ORDER BY — sort results
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue ASC;              -- ascending
SELECT * FROM results ORDER BY log2fc DESC;             -- descending
SELECT * FROM results ORDER BY cancer_type, pvalue;     -- multi-column

-- ──────────────────────────────────
-- LIMIT — get only N rows
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue LIMIT 10;         -- top 10

4.3 — Aggregation & Grouping

-- ──────────────────────────────────
-- COUNT, SUM, AVG, MIN, MAX
-- ──────────────────────────────────
SELECT COUNT(*) FROM patients;                           -- total rows
SELECT COUNT(DISTINCT cancer_type) FROM patients;        -- unique cancer types
SELECT AVG(age) FROM patients;                           -- average age
SELECT MIN(pvalue), MAX(pvalue) FROM results;

-- ──────────────────────────────────
-- GROUP BY — aggregate per group
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n_patients
FROM patients
GROUP BY cancer_type
ORDER BY n_patients DESC;

SELECT cancer_type,
       COUNT(*) AS n_samples,
       AVG(age) AS mean_age,
       MIN(age) AS min_age,
       MAX(age) AS max_age
FROM patients
GROUP BY cancer_type;

-- ──────────────────────────────────
-- HAVING — filter AFTER grouping
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n
FROM patients
GROUP BY cancer_type
HAVING COUNT(*) >= 10;            -- only cancer types with ≥10 samples

4.4 — Joins (Combining Tables)

-- ──────────────────────────────────
-- JOIN types visualized:
--
-- INNER JOIN:  only matching rows from BOTH tables
-- LEFT JOIN:   all rows from LEFT + matching from RIGHT
-- RIGHT JOIN:  all rows from RIGHT + matching from LEFT
-- FULL JOIN:   all rows from BOTH tables
-- ──────────────────────────────────

-- INNER JOIN — patients with results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
INNER JOIN results r ON p.patient_id = r.patient_id;

-- LEFT JOIN — all patients, even without results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
LEFT JOIN results r ON p.patient_id = r.patient_id;

-- Join on multiple columns
SELECT *
FROM peaks p
JOIN annotations a ON p.chr = a.chr
                   AND p.start >= a.start
                   AND p.end <= a.end;

4.5 — Creating & Modifying Tables

-- ──────────────────────────────────
-- CREATE TABLE
-- ──────────────────────────────────
CREATE TABLE patients (
    patient_id   VARCHAR(50) PRIMARY KEY,
    cancer_type  VARCHAR(20) NOT NULL,
    age          INTEGER,
    sex          VARCHAR(1),
    stage        VARCHAR(10)
);

CREATE TABLE peaks (
    peak_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    chr          VARCHAR(10),
    start_pos    INTEGER,
    end_pos      INTEGER,
    score        FLOAT,
    sample_id    VARCHAR(50),
    FOREIGN KEY (sample_id) REFERENCES patients(patient_id)
);

-- ──────────────────────────────────
-- INSERT data
-- ──────────────────────────────────
INSERT INTO patients (patient_id, cancer_type, age, sex, stage)
VALUES ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA');

-- Insert multiple rows
INSERT INTO patients VALUES
    ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA'),
    ('TCGA-BC-A3KF', 'LIHC', 62, 'M', 'III'),
    ('TCGA-OR-A5J2', 'ACC', 48, 'F', 'II');

-- ──────────────────────────────────
-- UPDATE data
-- ──────────────────────────────────
UPDATE patients SET stage = 'IIB' WHERE patient_id = 'TCGA-A1-A0SB';

-- ──────────────────────────────────
-- DELETE data (⚠️ careful!)
-- ──────────────────────────────────
DELETE FROM patients WHERE patient_id = 'TCGA-A1-A0SB';

-- ──────────────────────────────────
-- ALTER TABLE
-- ──────────────────────────────────
ALTER TABLE patients ADD COLUMN treatment VARCHAR(50);
ALTER TABLE patients DROP COLUMN treatment;

4.6 — Subqueries & Advanced Patterns

-- ──────────────────────────────────
-- Subquery — query inside a query
-- ──────────────────────────────────

-- Find patients with above-average age
SELECT * FROM patients
WHERE age > (SELECT AVG(age) FROM patients);

-- Find genes significant in BRCA
SELECT gene FROM results
WHERE patient_id IN (
    SELECT patient_id FROM patients WHERE cancer_type = 'BRCA'
)
AND pvalue < 0.05;

-- ──────────────────────────────────
-- CASE — conditional logic (like if/else)
-- ──────────────────────────────────
SELECT gene, pvalue,
    CASE
        WHEN pvalue < 0.001 THEN 'highly_significant'
        WHEN pvalue < 0.05  THEN 'significant'
        ELSE 'not_significant'
    END AS significance
FROM results;

-- ──────────────────────────────────
-- Window functions — rank, running totals
-- ──────────────────────────────────

-- Rank genes by pvalue within each cancer type
-- (alias it as something other than "rank", a reserved word in MySQL 8+)
SELECT cancer_type, gene, pvalue,
    RANK() OVER (PARTITION BY cancer_type ORDER BY pvalue) AS gene_rank
FROM results;

-- ──────────────────────────────────
-- Common Table Expressions (CTE) — readable subqueries
-- ──────────────────────────────────
WITH significant_genes AS (
    SELECT * FROM results WHERE pvalue < 0.05
)
SELECT cancer_type, COUNT(*) AS n_sig_genes
FROM significant_genes
GROUP BY cancer_type
ORDER BY n_sig_genes DESC;

4.7 — SQL from Python

import sqlite3
import pandas as pd

# ──────────────────────────────────
# SQLite — lightweight, no server needed
# ──────────────────────────────────

# Connect (creates file if it doesn't exist)
conn = sqlite3.connect("my_database.db")
cursor = conn.cursor()

# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS peaks (
        chr TEXT,
        start INTEGER,
        end INTEGER,
        score REAL,
        sample TEXT
    )
""")

# Insert data from a pandas DataFrame
df.to_sql("peaks", conn, if_exists="replace", index=False)

# Query into a DataFrame (most common pattern!)
results = pd.read_sql_query("SELECT * FROM peaks WHERE score > 50", conn)

# Run any SQL
cursor.execute("SELECT chr, COUNT(*) FROM peaks GROUP BY chr")
rows = cursor.fetchall()

# Close connection
conn.close()

# ──────────────────────────────────
# MySQL / PostgreSQL (for larger databases)
# ──────────────────────────────────
# pip install sqlalchemy pymysql

from sqlalchemy import create_engine

# Connect to MySQL
engine = create_engine("mysql+pymysql://user:password@host/database")

# Query directly into pandas
df = pd.read_sql("SELECT * FROM patients WHERE cancer_type = 'BRCA'", engine)
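Two habits worth building on top of the snippets above: pass values with `?` placeholders (never string formatting, which invites SQL injection), and use the connection as a context manager so inserts commit or roll back as a unit. A sketch, sticking to the standard library:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "end" is a reserved word in SQLite, so quote it (or name the column end_pos)
conn.execute('CREATE TABLE peaks (chr TEXT, start INTEGER, "end" INTEGER, score REAL)')

peaks = [("chr1", 100, 500, 87.5), ("chr2", 2000, 2600, 42.0)]
with conn:  # commits on success, rolls back if an insert fails
    conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", peaks)

min_score = 50
rows = conn.execute(
    "SELECT chr, score FROM peaks WHERE score > ?", (min_score,)
).fetchall()
print(rows)
conn.close()
```

Note that `with conn:` manages the transaction, not the connection itself, so `conn.close()` is still needed.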

4.8 — SQL for Genomic Databases

# ──────────────────────────────────
# Query UCSC Genome Browser MySQL (public, no password)
# ──────────────────────────────────
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("mysql+pymysql://genome@genome-mysql.soe.ucsc.edu/hg38")

# Get all RefSeq genes on chr1
genes = pd.read_sql("""
    SELECT name2 AS gene, chrom, txStart, txEnd, strand
    FROM refGene
    WHERE chrom = 'chr1'
    ORDER BY txStart
    LIMIT 100
""", engine)

# Get CpG islands
cpg = pd.read_sql("""
    SELECT chrom, chromStart, chromEnd, length, cpgNum, gcNum
    FROM cpgIslandExt
    WHERE chrom = 'chr1'
""", engine)
# ──────────────────────────────────
# Query Ensembl BioMart from R
# ──────────────────────────────────
library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Get gene coordinates
genes <- getBM(
  attributes = c("hgnc_symbol", "chromosome_name",
                 "start_position", "end_position", "strand"),
  filters = "hgnc_symbol",
  values = c("TP53", "BRCA1", "EGFR", "CEBPB"),
  mart = ensembl
)

4.9 — SQL Cheatsheet

| Command | What It Does |
|---|---|
| `SELECT` | Choose which columns to return |
| `FROM` | Which table to query |
| `WHERE` | Filter rows (before grouping) |
| `AND` / `OR` | Combine conditions |
| `IN (...)` | Match any value in a list |
| `LIKE '%pattern%'` | Pattern matching |
| `IS NULL` / `IS NOT NULL` | Check for missing values |
| `ORDER BY ... ASC/DESC` | Sort results |
| `LIMIT N` | Return only N rows |
| `GROUP BY` | Aggregate per group |
| `HAVING` | Filter after grouping |
| `COUNT()` / `SUM()` / `AVG()` | Aggregate functions |
| `INNER JOIN ... ON` | Combine tables (matching rows only) |
| `LEFT JOIN ... ON` | Combine tables (keep all rows from the left) |
| `INSERT INTO` | Add rows |
| `UPDATE ... SET` | Modify rows |
| `DELETE FROM` | Remove rows |
| `CREATE TABLE` | Make a new table |
| `ALTER TABLE` | Modify table structure |
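Several cheatsheet rows can be exercised in a single query; a sketch against throwaway SQLite tables (table and column names mirror the examples earlier in this part):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER, cancer_type TEXT);
    CREATE TABLE results  (patient_id INTEGER, gene TEXT, pvalue REAL);
    INSERT INTO patients VALUES (1, 'BRCA'), (2, 'BRCA'), (3, 'LUAD');
    INSERT INTO results  VALUES (1, 'TP53', 0.01), (2, 'EGFR', 0.02), (3, 'KRAS', 0.50);
""")

# SELECT + INNER JOIN + WHERE + GROUP BY + HAVING + ORDER BY in one statement
rows = conn.execute("""
    SELECT p.cancer_type, COUNT(*) AS n_sig
    FROM results r
    INNER JOIN patients p ON r.patient_id = p.patient_id
    WHERE r.pvalue < 0.05          -- filter rows before grouping
    GROUP BY p.cancer_type
    HAVING COUNT(*) >= 1           -- filter groups after aggregation
    ORDER BY n_sig DESC
""").fetchall()
print(rows)
```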

⚙️ Part 5 — Environment Setup

5.1 — Check What's Already Installed

Run these commands first to understand your environment:

# ──────────────────────────────────
# System
# ──────────────────────────────────
cat /etc/os-release          # OS version
uname -a                     # Kernel info
nproc                        # Number of CPU cores
free -h                      # Available RAM
df -h                        # Disk space

# ──────────────────────────────────
# Tools
# ──────────────────────────────────
git --version                # Git
python3 --version            # Python
pip --version                # pip
conda --version              # Conda
R --version                  # R
java -version                # Java

# ──────────────────────────────────
# Bioinformatics tools
# ──────────────────────────────────
samtools --version           # SAMtools
bedtools --version           # BEDTools
blastn -version              # BLAST
mafft --version              # MAFFT
module avail                 # Available modules (HPC)
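
Rather than typing each check by hand, a short script can report everything at once; a sketch using only the Python standard library (the tool list is illustrative, so adjust it to your lab's stack):

```python
import shutil

TOOLS = ["git", "python3", "conda", "samtools", "bedtools", "mafft"]

def check_tools(tools):
    """Map each tool name to its path on $PATH, or None if missing."""
    return {t: shutil.which(t) for t in tools}

for tool, path in check_tools(TOOLS).items():
    print(f"{tool:10s} {path or 'MISSING'}")
```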

5.2 — Install & Configure Git

# Install (if not present)
sudo apt install git         # Ubuntu/Debian
# or
conda install git            # via Conda

# Configure (do this ONCE)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"
git config --global core.editor "code --wait"    # VS Code as editor
git config --global init.defaultBranch main

# Verify
git config --list

Generate SSH Key (for GitHub without password)

# Generate key
ssh-keygen -t ed25519 -C "your.email@example.com"
# Press Enter to accept the default location; a passphrase is optional but recommended

# Copy the public key
cat ~/.ssh/id_ed25519.pub

# → Go to GitHub → Settings → SSH Keys → New SSH Key → Paste

# Test connection
ssh -T git@github.com
# Expected: "Hi YourName! You've successfully authenticated"

5.3 — Install & Configure Conda

# Download Miniconda (lightweight)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Follow prompts, say "yes" to initialize

# Restart terminal, then verify
conda --version

# ──────────────────────────────────
# Create a project environment
# ──────────────────────────────────
conda create -n bioinfo python=3.10
conda activate bioinfo

# Install common bioinformatics packages
conda install -c bioconda samtools bedtools blast mafft
conda install -c conda-forge pandas numpy matplotlib seaborn biopython pysam

# ──────────────────────────────────
# Managing environments
# ──────────────────────────────────
conda env list                          # List all environments
conda activate bioinfo                  # Activate
conda deactivate                        # Deactivate
conda env export > environment.yml      # Export (for reproducibility)
conda env create -f environment.yml     # Recreate from file
conda remove -n bioinfo --all           # Delete an environment

5.4 — Install & Configure VS Code

Option A — Local Machine

  1. Download from code.visualstudio.com
  2. Install and open

Option B — Remote Server (SSH)

  1. Install VS Code locally
  2. Install the Remote - SSH extension
  3. Press Ctrl+Shift+P → type "Remote-SSH: Connect to Host"
  4. Enter: username@server-address
  5. VS Code opens a window connected to the server

Essential Extensions

# Must-have extensions (install from Extensions panel):
- Python                    # Python support
- Remote - SSH              # Work on remote servers
- GitLens                   # Enhanced Git integration
- Jupyter                   # Run notebooks in VS Code
- indent-rainbow            # Colored indentation
# Note: bracket pair colorization is now built into VS Code
#   (setting: "editor.bracketPairColorization.enabled"), no extension needed

VS Code Settings for Bioinformatics

Press Ctrl+Shift+P → "Preferences: Open Settings (JSON)" and add:

{
    "editor.fontSize": 14,
    "editor.wordWrap": "on",
    "files.autoSave": "afterDelay",
    "python.defaultInterpreterPath": "~/miniconda3/envs/bioinfo/bin/python",
    "terminal.integrated.defaultProfile.linux": "bash"
}

5.5 — Connect Everything Together

Full Setup Workflow

# 1. SSH into server
ssh username@server-address

# 2. Check what's available
git --version && python3 --version && conda --version

# 3. Configure Git (if first time)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"

# 4. Clone your project
git clone git@github.com:YourName/my-project.git
cd my-project

# 5. Create/activate conda environment
conda create -n my_project python=3.10
conda activate my_project
pip install -r requirements.txt

# 6. Open in VS Code (from local machine)
# VS Code → Remote-SSH → Connect → open the project folder

# 7. Start working!
git checkout -b feature/my-task
# ... code ...
git add .
git commit -m "Add new analysis"
git push origin feature/my-task

5.6 — Project Folder Template

When starting a new bioinformatics project, create this structure:

mkdir -p my_project/{data,scripts,results,figures,logs,docs}
touch my_project/README.md
touch my_project/requirements.txt
touch my_project/.gitignore
cd my_project && git init

Resulting structure:

my_project/
│
├── data/               # Input data (add to .gitignore if large)
├── scripts/            # Analysis scripts
├── results/            # Output files
├── figures/            # Plots and visualizations
├── logs/               # Job logs, error logs
├── docs/               # Notes, reports
│
├── .gitignore          # Files to ignore
├── requirements.txt    # Python dependencies
└── README.md           # Project description
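The same scaffold can be created programmatically (handy inside a setup script); a sketch with `pathlib`, mirroring the folder names above:

```python
from pathlib import Path

def scaffold(root):
    """Create the project template: subfolders plus empty top-level files."""
    base = Path(root)
    for sub in ["data", "scripts", "results", "figures", "logs", "docs"]:
        (base / sub).mkdir(parents=True, exist_ok=True)
    for name in ["README.md", "requirements.txt", ".gitignore"]:
        (base / name).touch()
    return base

scaffold("my_project")
```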

Recommended .gitignore

# Data files (too large for Git)
*.bam
*.bam.bai
*.fastq
*.fastq.gz
*.bw
*.bigwig
*.bedgraph
*.h5
*.hdf5

# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# Environment
.env
*.egg-info/

# OS
.DS_Store
Thumbs.db

# Logs
*.log
logs/

# Results (regenerable)
results/
figures/

📋 Part 6 — Quick Reference Cards

Linux Commands — Top 30

| Command | What It Does |
|---|---|
| `pwd` | Print current directory |
| `ls -la` | List all files with details |
| `cd` | Change directory |
| `mkdir -p` | Create nested directories |
| `cp -r` | Copy files/directories |
| `mv` | Move or rename |
| `rm -rf` | Delete (⚠️ careful!) |
| `ln -s` | Create symbolic link |
| `cat` | Print file content |
| `head -n 20` | First 20 lines |
| `tail -n 20` | Last 20 lines |
| `less` | Scroll through a file |
| `wc -l` | Count lines |
| `grep` | Search for a pattern |
| `find` | Find files |
| `cut -f1,2` | Extract columns |
| `sort` | Sort lines |
| `uniq -c` | Count unique lines |
| `gzip` / `gunzip` | Compress / decompress |
| `tar -xzvf` | Extract archive |
| `chmod +x` | Make executable |
| `ssh` | Remote connection |
| `scp` | Copy files to/from server |
| `top` / `htop` | Monitor processes |
| `kill` | Stop a process |
| `nohup ... &` | Run in background |
| `df -h` | Disk space |
| `du -sh` | Folder size |
| \| (pipe) | Chain commands |
| `>` / `>>` | Redirect output |

Python Packages — Bioinformatics Essentials

| Package | Purpose |
|---|---|
| `pandas` | Tabular data manipulation |
| `numpy` | Numerical computing |
| `matplotlib` | Plotting |
| `seaborn` | Statistical visualization |
| `biopython` | Sequence analysis, BLAST, FASTA |
| `pysam` | BAM/SAM file handling |
| `pybedtools` | BED file operations |
| `scikit-learn` | Machine learning |
| `scipy` | Statistics, distance metrics |
| `gzip` | Read compressed files |

Install All at Once

# Python + bioinformatics tools
conda create -n bioinfo python=3.10
conda activate bioinfo
conda install -c bioconda -c conda-forge \
    pandas numpy matplotlib seaborn biopython pysam \
    samtools bedtools blast mafft
pip install pybedtools scikit-learn

R Packages — Bioinformatics Essentials

| Package | Purpose |
|---|---|
| `dplyr` | Data manipulation (filter, mutate, summarise) |
| `ggplot2` | Plotting |
| `readr` | Fast file reading (CSV, TSV) |
| `tidyr` | Reshaping data (pivot) |
| `DESeq2` | Differential expression analysis |
| `edgeR` | Differential expression (alternative) |
| `GenomicRanges` | Genomic interval operations |
| `rtracklayer` | Read/write BED, BigWig, GFF files |
| `clusterProfiler` | GO / KEGG enrichment |
| `pheatmap` | Heatmaps |
| `biomaRt` | Query Ensembl databases |
| `Biostrings` | Sequence manipulation |

Install R Essentials

# CRAN packages
install.packages(c("dplyr", "ggplot2", "readr", "tidyr", "pheatmap"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "GenomicRanges",
                        "rtracklayer", "clusterProfiler", "biomaRt"))

SQL Quick Install

# SQLite (lightweight, no server)
sudo apt install sqlite3

# From Python (no install needed)
import sqlite3    # built into Python

# MySQL command-line client
sudo apt install mysql-client
# or the Python driver via Conda
conda install -c conda-forge mysqlclient

# PostgreSQL client
sudo apt install postgresql-client

About

Day-one guide for bioinformatics: essential Linux commands, R, SQL, Python functions for genomics (FASTA, BED, BAM, BLAST), and full environment setup (Git, SSH, Conda, VS Code). Everything you need when starting at a new lab.
