Everything you need when starting at a new lab or office: Linux commands, Python & R bioinformatics essentials, SQL for databases, and full environment setup (Git, VS Code, Conda, SSH).
- Day-One Checklist
- Part 1 — Linux Survival Guide
- Part 2 — Python for Bioinformatics
- Part 3 — R for Bioinformatics
- Part 4 — SQL Essentials
- Part 5 — Environment Setup
- Part 6 — Quick Reference Cards
When you arrive at a new office/lab, go through this list:
□ Can I open a terminal?
□ Do I have SSH access to the server?
□ What OS and version is running?
□ Do I have sudo access?
□ Is Git installed? What version?
□ Is Conda/Miniconda installed? Where?
□ Is Python installed? What version?
□ Where is the shared data stored?
□ What job scheduler is used? (SLURM, SGE, PBS?)
□ Do I have a GitHub/GitLab account connected?
□ Is VS Code installed or can I install it?
□ Who do I contact for server issues?
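Several checklist items can be answered in one go with a small shell loop. The tool list below is only an illustrative starting point — adjust it to your lab's stack:

```shell
# Report which common tools are on the PATH (tool list is an assumption; edit freely)
for tool in git python3 conda R samtools ssh; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK:      $tool -> $(command -v "$tool")"
    else
        echo "MISSING: $tool"
    fi
done
```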
# Who am I?
whoami
# Where am I?
pwd
# What machine am I on?
hostname
# What OS is running?
cat /etc/os-release
# What's the date and time?
date
# How long has the server been running?
uptime
# List files in current directory
ls
# List with details (permissions, size, date)
ls -la
# List with human-readable sizes
ls -lh
# List sorted by modification date (newest first)
ls -lt
# Go to a directory
cd /data/my_project
# Go back to home directory
cd ~
# or just
cd
# Go up one level
cd ..
# Go back to previous directory
cd -
# Show directory structure (2 levels deep)
tree -L 2
# If tree is not installed
find . -maxdepth 2 -type d
# How much disk space is left?
df -h
# How much space does this folder use?
du -sh /data/my_project/
# How much space does each subfolder use?
du -sh /data/my_project/*
# Sort folders by size (largest first)
du -sh /data/my_project/* | sort -rh
# Find the 10 largest files in a directory
find /data/my_project -type f -exec ls -lS {} + | head -10
# Check your disk quota (if applicable)
quota -s
# Create an empty file
touch my_file.txt
# Create a directory
mkdir my_folder
# Create nested directories
mkdir -p project/data/raw
# Copy a file
cp source.txt destination.txt
# Copy a directory (recursive)
cp -r source_folder/ destination_folder/
# Move or rename a file
mv old_name.txt new_name.txt
# Move a file to another directory
mv my_file.txt /data/my_project/
# Delete a file
rm my_file.txt
# Delete a directory and everything inside (⚠️ BE CAREFUL)
rm -rf my_folder/
# Create a symbolic link (shortcut)
ln -s /data/shared/big_file.bam ~/my_project/big_file.bam
⚠️ DANGER: rm -rf is irreversible. Always double-check the path before running it. Never run rm -rf / or rm -rf * without being 100% sure of your current directory.
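A few defensive habits make rm -rf less dangerous. This sketch uses placeholder paths:

```shell
# 1) Preview what you are about to delete
ls -la my_folder/
# 2) Delete interactively, confirming each file (slower but safe)
rm -ri my_folder/
# 3) Or move it aside instead of deleting, so you can recover it later
mkdir -p ~/.trash
mv my_folder ~/.trash/
```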
# Print entire file content
cat my_file.txt
# View first 10 lines
head my_file.txt
# View first 50 lines
head -n 50 my_file.txt
# View last 10 lines
tail my_file.txt
# View last 50 lines
tail -n 50 my_file.txt
# Scroll through a file (press q to quit)
less my_file.txt
# Count lines, words, characters
wc my_file.txt
# Count only lines
wc -l my_file.txt
# Count number of files in a directory
ls -1 | wc -l
# View a BED/TSV/CSV file nicely
column -t my_file.bed | head
# Check file type
file my_file.bam
# Search for a word in a file
grep "CEBPB" my_file.txt
# Search case-insensitive
grep -i "cebpb" my_file.txt
# Search recursively in all files in a directory
grep -r "CEBPB" /data/my_project/
# Count how many times a pattern appears
grep -c "chr1" peaks.bed
# Find a file by name
find /data/ -name "*.bam"
# Find files modified in the last 7 days
find /data/ -mtime -7
# Find files larger than 1GB
find /data/ -size +1G
# Extract specific columns from a tabular file (e.g., columns 1,2,3)
cut -f1,2,3 my_file.bed
# Sort a file
sort my_file.txt
# Sort numerically by column 2
sort -k2,2n my_file.bed
# Remove duplicate lines
sort my_file.txt | uniq
# Count occurrences of each unique line
sort my_file.txt | uniq -c | sort -rn
# Compress a file
gzip my_file.fastq
# Decompress
gunzip my_file.fastq.gz
# Read a gzipped file without decompressing
zcat my_file.fastq.gz | head
# Create a tar.gz archive
tar -czvf archive.tar.gz my_folder/
# Extract a tar.gz archive
tar -xzvf archive.tar.gz
# Extract a .zip file
unzip archive.zip
# Check permissions of a file
ls -la my_file.txt
# Permission format: -rwxr-xr-x
# r=read, w=write, x=execute
# [owner][group][others]
# Make a script executable
chmod +x my_script.sh
# Give read/write to owner only
chmod 600 my_file.txt
# Give read/write/execute to owner, read to others
chmod 755 my_script.sh
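Each octal digit is a sum of r=4, w=2, x=1 (so 7=rwx, 5=r-x, 6=rw-). A quick throwaway check, using a scratch file:

```shell
touch demo.txt
chmod 640 demo.txt    # 6 = rw- (owner), 4 = r-- (group), 0 = --- (others)
ls -l demo.txt        # first column reads -rw-r-----
rm demo.txt
```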
# Change owner (needs sudo)
chown user:group my_file.txt
# See running processes
top
# Better version (if installed)
htop
# See YOUR running processes
ps aux | grep $(whoami)
# Kill a process by PID
kill 12345
# Force kill
kill -9 12345
# Run a command in the background
my_script.sh &
# Run a command that survives logout
nohup my_script.sh &
# See background jobs
jobs
# Check SLURM queue (if using HPC)
squeue -u $(whoami)
# Submit a SLURM job
sbatch my_job.sh
# Cancel a SLURM job
scancel JOB_ID
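sbatch expects a job script like the sketch below; the job name, resource limits, and commands are placeholders to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=my_job_%j.log     # %j expands to the job ID
#SBATCH --time=02:00:00            # walltime limit (HH:MM:SS)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Placeholder commands — replace with your actual pipeline
module load samtools
samtools sort -@ 4 -o sorted.bam input.bam
```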
# See past SLURM jobs
sacct --format=JobID,JobName,State,Elapsed,MaxRSS
# Connect to a remote server
ssh username@server-address
# Connect with a specific port
ssh -p 2222 username@server-address
# Copy a file TO the server
scp my_file.txt username@server:/data/my_project/
# Copy a file FROM the server
scp username@server:/data/my_project/results.txt ./
# Copy a directory (recursive)
scp -r my_folder/ username@server:/data/my_project/
# Mount a remote folder locally (if sshfs installed)
sshfs username@server:/data/my_project/ ~/remote_mount/
# Pipe: send output of one command to another
cat my_file.bed | grep "chr1" | wc -l
# Save output to a file (overwrite)
grep "chr1" peaks.bed > chr1_peaks.bed
# Append to a file
echo "new line" >> my_file.txt
# Redirect errors to a file
my_command 2> errors.log
# Redirect both output and errors
my_command > output.log 2>&1
# Count reads in a FASTQ file
echo $(( $(wc -l < reads.fastq) / 4 ))
# Count reads in a gzipped FASTQ
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))
# Count reads in a BAM file
samtools view -c my_file.bam
# Count mapped reads only
samtools view -c -F 4 my_file.bam
# Extract chromosome names from a BAM
samtools idxstats my_file.bam | cut -f1
# Count peaks per chromosome in a BED file
cut -f1 peaks.bed | sort | uniq -c | sort -rn
# Get unique gene names from column 4 of a BED
cut -f4 annotated.bed | sort -u
# Quick look at a BAM header
samtools view -H my_file.bam | head -20
# Check if a BAM is sorted
samtools view -H my_file.bam | grep "@HD"
# Convert BAM to BED
bedtools bamtobed -i my_file.bam > my_file.bed
# ──────────────────────────────────
# Reading & writing text files
# ──────────────────────────────────
# Read entire file
with open("file.txt", "r") as f:
    content = f.read()
# Read line by line
with open("file.txt", "r") as f:
    for line in f:
        line = line.strip()  # remove newline
        print(line)
# Write to a file
with open("output.txt", "w") as f:
    f.write("Hello\n")
# Append to a file
with open("output.txt", "a") as f:
    f.write("New line\n")
# Read a gzipped file
import gzip
with gzip.open("file.txt.gz", "rt") as f:
    for line in f:
        print(line.strip())
import pandas as pd
import numpy as np
# ──────────────────────────────────
# Loading data
# ──────────────────────────────────
df = pd.read_csv("data.csv")
df = pd.read_csv("data.tsv", sep="\t")
df = pd.read_csv("data.bed", sep="\t", header=None,
                 names=["chr", "start", "end", "name", "score"])
# ──────────────────────────────────
# First look at data
# ──────────────────────────────────
df.head() # first 5 rows
df.tail() # last 5 rows
df.shape # (rows, columns)
df.columns # column names
df.dtypes # data types
df.info() # summary
df.describe() # statistics
df.nunique() # unique values per column
df.isnull().sum() # missing values per column
# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
chr1 = df[df["chr"] == "chr1"]
significant = df[df["pvalue"] < 0.05]
high_score = df[(df["score"] > 100) & (df["chr"] == "chr1")]
# ──────────────────────────────────
# Selecting columns
# ──────────────────────────────────
subset = df[["chr", "start", "end"]]
single_col = df["score"]
# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df_sorted = df.sort_values("score", ascending=False)
df_sorted = df.sort_values(["chr", "start"])
# ──────────────────────────────────
# Grouping & aggregation
# ──────────────────────────────────
counts = df.groupby("chr").size()
means = df.groupby("cancer_type")["score"].mean()
summary = df.groupby("cancer_type").agg({"score": ["mean", "std", "count"]})
# ──────────────────────────────────
# Adding / modifying columns
# ──────────────────────────────────
df["length"] = df["end"] - df["start"]
df["log_score"] = np.log10(df["score"] + 1)
# ──────────────────────────────────
# Saving
# ──────────────────────────────────
df.to_csv("output.csv", index=False)
df.to_csv("output.tsv", sep="\t", index=False)
# ──────────────────────────────────
# Using Biopython
# ──────────────────────────────────
from Bio import SeqIO
# Read a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")
# Load all sequences into a dictionary
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
my_seq = sequences["gene_name"].seq
# Write sequences to a FASTA file
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for name, seq in my_sequences.items():
    record = SeqRecord(Seq(seq), id=name, description="")
    records.append(record)
SeqIO.write(records, "output.fasta", "fasta")
# Count sequences in a FASTA
count = sum(1 for _ in SeqIO.parse("sequences.fasta", "fasta"))
# Filter sequences by length
long_seqs = [r for r in SeqIO.parse("sequences.fasta", "fasta") if len(r.seq) > 500]
import pandas as pd
# ──────────────────────────────────
# Load a BED file
# ──────────────────────────────────
bed = pd.read_csv("peaks.bed", sep="\t", header=None,
                  names=["chr", "start", "end", "name", "score", "strand"])
# ──────────────────────────────────
# Common operations
# ──────────────────────────────────
# Peak lengths
bed["length"] = bed["end"] - bed["start"]
# Filter by chromosome
chr1_peaks = bed[bed["chr"] == "chr1"]
# Filter by score
strong_peaks = bed[bed["score"] > 50]
# Count peaks per chromosome
bed["chr"].value_counts()
# Sort by genomic position
bed_sorted = bed.sort_values(["chr", "start"])
# Save as BED (no header, tab-separated)
bed_sorted.to_csv("sorted.bed", sep="\t", header=False, index=False)
import pysam
# ──────────────────────────────────
# Read a BAM file
# ──────────────────────────────────
bam = pysam.AlignmentFile("sample.bam", "rb")
# Count total reads
total = bam.count()
# Count mapped reads
mapped = bam.count(read_callback=lambda r: not r.is_unmapped)
# Iterate over reads in a region
for read in bam.fetch("chr1", 1000, 2000):
    print(read.query_name, read.mapping_quality)
# Get reads as a list
reads = list(bam.fetch("chr1", 1000, 2000))
bam.close()
from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast import NCBIXML
# ──────────────────────────────────
# Run local BLASTp
# ──────────────────────────────────
blastp = NcbiblastpCommandline(
    query="query.fasta",
    db="my_database",
    evalue=1e-5,
    outfmt=5,   # XML format
    out="blast_results.xml"
)
stdout, stderr = blastp()
# ──────────────────────────────────
# Parse BLAST results
# ──────────────────────────────────
with open("blast_results.xml") as f:
    records = NCBIXML.parse(f)
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-10:
                    print(f"Hit: {alignment.title}")
                    print(f"E-value: {hsp.expect}")
                    print(f"Identity: {hsp.identities}/{hsp.align_length}")
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# ──────────────────────────────────
# Histogram (e.g., peak length distribution)
# ──────────────────────────────────
plt.figure(figsize=(8, 5))
plt.hist(bed["length"], bins=50, color="steelblue", edgecolor="black")
plt.xlabel("Peak Length (bp)")
plt.ylabel("Count")
plt.title("Peak Length Distribution")
plt.savefig("peak_lengths.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Heatmap (e.g., gene presence/absence)
# ──────────────────────────────────
plt.figure(figsize=(12, 8))
sns.heatmap(matrix, cmap="YlOrRd", xticklabels=genes, yticklabels=species)
plt.title("Gene Presence/Absence")
plt.savefig("heatmap.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Volcano plot
# ──────────────────────────────────
plt.figure(figsize=(8, 6))
plt.scatter(df["log2FC"], -np.log10(df["pvalue"]),
            c=df["significant"].map({True: "red", False: "grey"}),
            alpha=0.5, s=10)
plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=0.5)
plt.title("Volcano Plot")
plt.savefig("volcano.png", dpi=300, bbox_inches="tight")
plt.show()
# ──────────────────────────────────
# Save any figure
# ──────────────────────────────────
# Always use these options:
plt.savefig("figure.png", dpi=300, bbox_inches="tight")
# For publication: use PDF or SVG
plt.savefig("figure.pdf", bbox_inches="tight")
# Reverse complement a DNA sequence
def reverse_complement(seq):
    comp = str.maketrans("ATCG", "TAGC")
    return seq.translate(comp)[::-1]
# GC content
def gc_content(seq):
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) * 100
# Read a two-column file into a dictionary
with open("mapping.txt") as f:
    d = dict(line.strip().split("\t") for line in f)
# Flatten a list of lists
flat = [item for sublist in nested_list for item in sublist]
# Quick timer for a block of code
import time
start = time.time()
# ... your code ...
print(f"Elapsed: {time.time() - start:.2f}s")
# ──────────────────────────────────
# Variables & Types
# ──────────────────────────────────
x <- 5 # assignment (use <- not =)
name <- "CEBPB"
is_significant <- TRUE
my_vector <- c(1, 2, 3, 4, 5) # vector (R's basic data structure)
# Check type
class(x)
typeof(x)
length(my_vector)
# ──────────────────────────────────
# Vectors (everything in R is a vector)
# ──────────────────────────────────
nums <- c(10, 20, 30, 40, 50)
genes <- c("TP53", "BRCA1", "EGFR", "MYC")
logicals <- c(TRUE, FALSE, TRUE)
# Access elements (R is 1-indexed!)
nums[1] # 10 (not 0-indexed like Python!)
nums[2:4] # 20, 30, 40
genes[c(1, 3)] # "TP53", "EGFR"
# Operations are vectorized
nums * 2 # c(20, 40, 60, 80, 100)
nums > 25 # c(FALSE, FALSE, TRUE, TRUE, TRUE)
sum(nums > 25) # 3
# ──────────────────────────────────
# Useful vector functions
# ──────────────────────────────────
length(nums)
sum(nums)
mean(nums)
median(nums)
sd(nums)
min(nums)
max(nums)
range(nums)
sort(nums, decreasing = TRUE)
unique(c(1, 1, 2, 3, 3)) # c(1, 2, 3)
table(c("A", "B", "A", "C"))     # counts per value
# ──────────────────────────────────
# Creating a data frame
# ──────────────────────────────────
df <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  log2FC = c(2.5, -1.3, 4.1),
  pvalue = c(0.001, 0.03, 0.0001)
)
# ──────────────────────────────────
# Reading files
# ──────────────────────────────────
df <- read.csv("data.csv")
df <- read.csv("data.tsv", sep = "\t")
df <- read.delim("data.bed", header = FALSE,
                 col.names = c("chr", "start", "end", "name", "score"))
# Tidyverse alternative (faster for large files)
library(readr)
df <- read_tsv("data.tsv")
df <- read_csv("data.csv")
# ──────────────────────────────────
# First look
# ──────────────────────────────────
head(df) # first 6 rows
tail(df) # last 6 rows
str(df) # structure (types, dimensions)
summary(df) # statistics
dim(df) # rows × columns
nrow(df) # number of rows
ncol(df) # number of columns
colnames(df) # column names
# ──────────────────────────────────
# Accessing data
# ──────────────────────────────────
df$gene # one column (as vector)
df[, 1] # first column
df[1, ] # first row
df[1:3, ] # rows 1 to 3
df[, c("gene", "pvalue")] # select columns by name
# ──────────────────────────────────
# Filtering
# ──────────────────────────────────
sig <- df[df$pvalue < 0.05, ]
up <- df[df$log2FC > 2 & df$pvalue < 0.05, ]
# ──────────────────────────────────
# Adding columns
# ──────────────────────────────────
df$neg_log_p <- -log10(df$pvalue)
df$significant <- df$pvalue < 0.05
# ──────────────────────────────────
# Sorting
# ──────────────────────────────────
df <- df[order(df$pvalue), ] # sort by pvalue (ascending)
df <- df[order(-df$log2FC), ] # sort by log2FC (descending)
# ──────────────────────────────────
# Saving
# ──────────────────────────────────
write.csv(df, "output.csv", row.names = FALSE)
write.table(df, "output.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
library(dplyr)
library(tidyr)
# ──────────────────────────────────
# The pipe operator: %>% (read as "then")
# ──────────────────────────────────
# Base R (nested, hard to read):
head(sort(df$pvalue))
# Tidyverse (piped, reads left-to-right):
df %>%
  arrange(pvalue) %>%
  head()
# ──────────────────────────────────
# dplyr verbs — the 6 essential functions
# ──────────────────────────────────
# filter() — select rows
df %>% filter(pvalue < 0.05)
df %>% filter(log2FC > 2, pvalue < 0.01)
df %>% filter(gene %in% c("TP53", "BRCA1"))
# select() — select columns
df %>% select(gene, pvalue)
df %>% select(-log2FC) # exclude a column
df %>% select(starts_with("log")) # by pattern
# mutate() — add/modify columns
df %>% mutate(
  neg_log_p = -log10(pvalue),
  significant = pvalue < 0.05,
  length = end - start
)
# arrange() — sort rows
df %>% arrange(pvalue) # ascending
df %>% arrange(desc(log2FC)) # descending
# summarise() + group_by() — aggregate
df %>%
  group_by(cancer_type) %>%
  summarise(
    n_genes = n(),
    mean_FC = mean(log2FC),
    median_p = median(pvalue)
  )
# count() — quick frequency table
df %>% count(cancer_type, sort = TRUE)
# ──────────────────────────────────
# Joining data frames (like SQL joins)
# ──────────────────────────────────
merged <- left_join(df1, df2, by = "gene")
merged <- inner_join(df1, df2, by = c("chr", "start"))
# ──────────────────────────────────
# Reshaping
# ──────────────────────────────────
# Wide to long
long <- df %>% pivot_longer(cols = c(sample1, sample2, sample3),
                            names_to = "sample",
                            values_to = "expression")
# Long to wide
wide <- long %>% pivot_wider(names_from = sample,
                             values_from = expression)
library(ggplot2)
# ──────────────────────────────────
# The ggplot2 formula:
# ggplot(data, aes(x, y)) + geom_*() + theme_*()
# ──────────────────────────────────
# Scatter plot
ggplot(df, aes(x = log2FC, y = -log10(pvalue))) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Volcano Plot", x = "log2 Fold Change", y = "-log10(p-value)")
# Volcano plot (colored by significance)
ggplot(df, aes(x = log2FC, y = -log10(pvalue), color = significant)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_color_manual(values = c("grey", "red")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_minimal()
# Bar plot
ggplot(df, aes(x = cancer_type, fill = cancer_type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot
ggplot(df, aes(x = cancer_type, y = expression, fill = cancer_type)) +
  geom_boxplot() +
  theme_minimal()
# Histogram
ggplot(df, aes(x = peak_length)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  theme_minimal()
# Heatmap
library(pheatmap)
pheatmap(matrix_data,
         scale = "row",
         clustering_distance_rows = "euclidean",
         color = colorRampPalette(c("blue", "white", "red"))(100))
# ──────────────────────────────────
# Save a plot
# ──────────────────────────────────
ggsave("my_plot.png", width = 8, height = 6, dpi = 300)
ggsave("my_plot.pdf", width = 8, height = 6)
# ──────────────────────────────────
# Install Bioconductor (do this ONCE)
# ──────────────────────────────────
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Install packages
BiocManager::install("DESeq2")
BiocManager::install("edgeR")
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
BiocManager::install("clusterProfiler")
# ──────────────────────────────────
# DESeq2 — Differential Expression Analysis
# ──────────────────────────────────
library(DESeq2)
# Create DESeq dataset from count matrix
dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,   # genes × samples (integers)
  colData = sample_info,      # sample metadata
  design = ~ condition        # what to compare
)
# Run DESeq2
dds <- DESeq(dds)
res <- results(dds)   # avoid naming it "results" (masks the results() function)
# Get significant genes
sig_genes <- res[which(res$padj < 0.05), ]
sig_genes <- sig_genes[order(sig_genes$padj), ]
# MA plot from DESeq2 results
plotMA(res)
# PCA plot
vsd <- vst(dds)
plotPCA(vsd, intgroup = "condition")
# ──────────────────────────────────
# GenomicRanges — Working with genomic intervals
# ──────────────────────────────────
library(GenomicRanges)
# Create genomic ranges
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), end = c(150, 250, 350)),
  strand = c("+", "-", "+")
)
# Find overlaps between two sets of regions
overlaps <- findOverlaps(peaks_gr, promoters_gr)
# Reduce overlapping ranges to a single set
reduced <- reduce(gr)
# Get distance between ranges
dist <- distanceToNearest(peaks_gr, tss_gr)
# ──────────────────────────────────
# Read BED/BW/GFF files
# ──────────────────────────────────
library(rtracklayer)
bed <- import("peaks.bed")
bw <- import("signal.bw")
gff <- import("annotations.gff3")
# Export
export(gr, "output.bed", format = "BED")
# ──────────────────────────────────
# Gene Ontology / Pathway Analysis
# ──────────────────────────────────
library(clusterProfiler)
library(org.Hs.eg.db)
# GO enrichment
ego <- enrichGO(
  gene = significant_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",   # Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)
dotplot(ego, showCategory = 20)
# KEGG pathway enrichment
ekegg <- enrichKEGG(
  gene = entrez_ids,
  organism = "hsa",
  pvalueCutoff = 0.05
)
dotplot(ekegg)
# Check if a package is installed
"DESeq2" %in% rownames(installed.packages())
# Install a CRAN package
install.packages("ggplot2")
# Load multiple libraries at once
lapply(c("dplyr", "ggplot2", "readr"), library, character.only = TRUE)
# Apply a function to every column
sapply(df, function(x) sum(is.na(x))) # count NAs per column
# Quick summary of a BED file
bed <- read.delim("peaks.bed", header = FALSE)
cat("Peaks:", nrow(bed), "\n")
cat("Chromosomes:", length(unique(bed$V1)), "\n")
cat("Mean length:", mean(bed$V3 - bed$V2), "bp\n")
# Set working directory
setwd("/data/my_project")
getwd()
# List files matching a pattern
list.files(path = "results/", pattern = "\\.csv$")   # pattern is a regex, not a glob
Many biological databases (Ensembl, UCSC Genome Browser, GEO, clinical databases) use SQL. You'll encounter it when:
- Querying genomic annotation databases
- Working with clinical/patient metadata
- Accessing TCGA or GEO programmatically
- Managing lab sample databases (LIMS)
- Pulling data from REDCap or similar platforms
-- ──────────────────────────────────
-- SELECT — get data from a table
-- ──────────────────────────────────
SELECT * FROM patients; -- all columns
SELECT patient_id, cancer_type FROM patients; -- specific columns
SELECT DISTINCT cancer_type FROM patients; -- unique values only
-- ──────────────────────────────────
-- WHERE — filter rows
-- ──────────────────────────────────
SELECT * FROM patients WHERE cancer_type = 'BRCA';
SELECT * FROM patients WHERE age > 50;
SELECT * FROM patients WHERE cancer_type = 'BRCA' AND stage = 'III';
SELECT * FROM patients WHERE cancer_type IN ('BRCA', 'LUAD', 'COAD');
SELECT * FROM genes WHERE gene_name LIKE 'CEBP%'; -- starts with CEBP
SELECT * FROM results WHERE pvalue IS NOT NULL;
-- ──────────────────────────────────
-- ORDER BY — sort results
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue ASC; -- ascending
SELECT * FROM results ORDER BY log2fc DESC; -- descending
SELECT * FROM results ORDER BY cancer_type, pvalue; -- multi-column
-- ──────────────────────────────────
-- LIMIT — get only N rows
-- ──────────────────────────────────
SELECT * FROM results ORDER BY pvalue LIMIT 10;   -- top 10
-- ──────────────────────────────────
-- COUNT, SUM, AVG, MIN, MAX
-- ──────────────────────────────────
SELECT COUNT(*) FROM patients; -- total rows
SELECT COUNT(DISTINCT cancer_type) FROM patients; -- unique cancer types
SELECT AVG(age) FROM patients; -- average age
SELECT MIN(pvalue), MAX(pvalue) FROM results;
-- ──────────────────────────────────
-- GROUP BY — aggregate per group
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n_patients
FROM patients
GROUP BY cancer_type
ORDER BY n_patients DESC;
SELECT cancer_type,
       COUNT(*) AS n_samples,
       AVG(age) AS mean_age,
       MIN(age) AS min_age,
       MAX(age) AS max_age
FROM patients
GROUP BY cancer_type;
-- ──────────────────────────────────
-- HAVING — filter AFTER grouping
-- ──────────────────────────────────
SELECT cancer_type, COUNT(*) AS n
FROM patients
GROUP BY cancer_type
HAVING COUNT(*) >= 10;   -- only cancer types with ≥10 samples
-- ──────────────────────────────────
-- JOIN types visualized:
--
-- INNER JOIN: only matching rows from BOTH tables
-- LEFT JOIN: all rows from LEFT + matching from RIGHT
-- RIGHT JOIN: all rows from RIGHT + matching from LEFT
-- FULL JOIN: all rows from BOTH tables
-- ──────────────────────────────────
-- INNER JOIN — patients with results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
INNER JOIN results r ON p.patient_id = r.patient_id;
-- LEFT JOIN — all patients, even without results
SELECT p.patient_id, p.cancer_type, r.gene, r.pvalue
FROM patients p
LEFT JOIN results r ON p.patient_id = r.patient_id;
-- Join on multiple columns
SELECT *
FROM peaks p
JOIN annotations a ON p.chr = a.chr
                  AND p.start >= a.start
                  AND p.end <= a.end;
-- ──────────────────────────────────
-- CREATE TABLE
-- ──────────────────────────────────
CREATE TABLE patients (
    patient_id VARCHAR(50) PRIMARY KEY,
    cancer_type VARCHAR(20) NOT NULL,
    age INTEGER,
    sex VARCHAR(1),
    stage VARCHAR(10)
);
CREATE TABLE peaks (
    peak_id INTEGER PRIMARY KEY AUTOINCREMENT,
    chr VARCHAR(10),
    start_pos INTEGER,
    end_pos INTEGER,
    score FLOAT,
    sample_id VARCHAR(50),
    FOREIGN KEY (sample_id) REFERENCES patients(patient_id)
);
-- ──────────────────────────────────
-- INSERT data
-- ──────────────────────────────────
INSERT INTO patients (patient_id, cancer_type, age, sex, stage)
VALUES ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA');
-- Insert multiple rows
INSERT INTO patients VALUES
    ('TCGA-A1-A0SB', 'BRCA', 55, 'F', 'IIA'),
    ('TCGA-BC-A3KF', 'LIHC', 62, 'M', 'III'),
    ('TCGA-OR-A5J2', 'ACC', 48, 'F', 'II');
-- ──────────────────────────────────
-- UPDATE data
-- ──────────────────────────────────
UPDATE patients SET stage = 'IIB' WHERE patient_id = 'TCGA-A1-A0SB';
-- ──────────────────────────────────
-- DELETE data (⚠️ careful!)
-- ──────────────────────────────────
DELETE FROM patients WHERE patient_id = 'TCGA-A1-A0SB';
-- ──────────────────────────────────
-- ALTER TABLE
-- ──────────────────────────────────
ALTER TABLE patients ADD COLUMN treatment VARCHAR(50);
ALTER TABLE patients DROP COLUMN treatment;
-- ──────────────────────────────────
-- Subquery — query inside a query
-- ──────────────────────────────────
-- Find patients with above-average age
SELECT * FROM patients
WHERE age > (SELECT AVG(age) FROM patients);
-- Find genes significant in BRCA
SELECT gene FROM results
WHERE patient_id IN (
    SELECT patient_id FROM patients WHERE cancer_type = 'BRCA'
)
AND pvalue < 0.05;
-- ──────────────────────────────────
-- CASE — conditional logic (like if/else)
-- ──────────────────────────────────
SELECT gene, pvalue,
       CASE
           WHEN pvalue < 0.001 THEN 'highly_significant'
           WHEN pvalue < 0.05 THEN 'significant'
           ELSE 'not_significant'
       END AS significance
FROM results;
-- ──────────────────────────────────
-- Window functions — rank, running totals
-- ──────────────────────────────────
-- Rank genes by pvalue within each cancer type
SELECT cancer_type, gene, pvalue,
       RANK() OVER (PARTITION BY cancer_type ORDER BY pvalue) AS gene_rank
FROM results;   -- "gene_rank" avoids the reserved word RANK
-- ──────────────────────────────────
-- Common Table Expressions (CTE) — readable subqueries
-- ──────────────────────────────────
WITH significant_genes AS (
    SELECT * FROM results WHERE pvalue < 0.05
)
SELECT cancer_type, COUNT(*) AS n_sig_genes
FROM significant_genes
GROUP BY cancer_type
ORDER BY n_sig_genes DESC;
import sqlite3
import pandas as pd
# ──────────────────────────────────
# SQLite — lightweight, no server needed
# ──────────────────────────────────
# Connect (creates file if it doesn't exist)
conn = sqlite3.connect("my_database.db")
cursor = conn.cursor()
# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS peaks (
        chr TEXT,
        start INTEGER,
        end INTEGER,
        score REAL,
        sample TEXT
    )
""")
# Insert data from a pandas DataFrame
df.to_sql("peaks", conn, if_exists="replace", index=False)
# Query into a DataFrame (most common pattern!)
results = pd.read_sql_query("SELECT * FROM peaks WHERE score > 50", conn)
# Run any SQL
cursor.execute("SELECT chr, COUNT(*) FROM peaks GROUP BY chr")
rows = cursor.fetchall()
# Close connection
conn.close()
# ──────────────────────────────────
# MySQL / PostgreSQL (for larger databases)
# ──────────────────────────────────
# pip install sqlalchemy pymysql
from sqlalchemy import create_engine
# Connect to MySQL
engine = create_engine("mysql+pymysql://user:password@host/database")
# Query directly into pandas
df = pd.read_sql("SELECT * FROM patients WHERE cancer_type = 'BRCA'", engine)
# ──────────────────────────────────
# Query UCSC Genome Browser MySQL (public, no password)
# ──────────────────────────────────
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("mysql+pymysql://genome@genome-mysql.soe.ucsc.edu/hg38")
# Get all RefSeq genes on chr1
genes = pd.read_sql("""
    SELECT name2 AS gene, chrom, txStart, txEnd, strand
    FROM refGene
    WHERE chrom = 'chr1'
    ORDER BY txStart
    LIMIT 100
""", engine)
# Get CpG islands
cpg = pd.read_sql("""
    SELECT chrom, chromStart, chromEnd, length, cpgNum, gcNum
    FROM cpgIslandExt
    WHERE chrom = 'chr1'
""", engine)
# ──────────────────────────────────
# Query Ensembl BioMart from R
# ──────────────────────────────────
library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Get gene coordinates
genes <- getBM(
  attributes = c("hgnc_symbol", "chromosome_name",
                 "start_position", "end_position", "strand"),
  filters = "hgnc_symbol",
  values = c("TP53", "BRCA1", "EGFR", "CEBPB"),
  mart = ensembl
)
| Command | What It Does |
|---|---|
| `SELECT` | Choose which columns to return |
| `FROM` | Which table to query |
| `WHERE` | Filter rows (before grouping) |
| `AND` / `OR` | Combine conditions |
| `IN (...)` | Match any value in a list |
| `LIKE '%pattern%'` | Pattern matching |
| `IS NULL` / `IS NOT NULL` | Check for missing values |
| `ORDER BY ... ASC/DESC` | Sort results |
| `LIMIT N` | Return only N rows |
| `GROUP BY` | Aggregate per group |
| `HAVING` | Filter after grouping |
| `COUNT()` / `SUM()` / `AVG()` | Aggregate functions |
| `INNER JOIN ... ON` | Combine tables (matching only) |
| `LEFT JOIN ... ON` | Combine tables (keep all from left) |
| `INSERT INTO` | Add rows |
| `UPDATE ... SET` | Modify rows |
| `DELETE FROM` | Remove rows |
| `CREATE TABLE` | Make a new table |
| `ALTER TABLE` | Modify table structure |
Run these commands first to understand your environment:
# ──────────────────────────────────
# System
# ──────────────────────────────────
cat /etc/os-release # OS version
uname -a # Kernel info
nproc # Number of CPU cores
free -h # Available RAM
df -h # Disk space
# ──────────────────────────────────
# Tools
# ──────────────────────────────────
git --version # Git
python3 --version # Python
pip --version # pip
conda --version # Conda
R --version # R
java -version # Java
# ──────────────────────────────────
# Bioinformatics tools
# ──────────────────────────────────
samtools --version # SAMtools
bedtools --version # BEDTools
blastn -version # BLAST
mafft --version # MAFFT
module avail            # Available modules (HPC)
# Install (if not present)
sudo apt install git # Ubuntu/Debian
# or
conda install git # via Conda
# Configure (do this ONCE)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"
git config --global core.editor "code --wait" # VS Code as editor
git config --global init.defaultBranch main
# Verify
git config --list
# Generate key
ssh-keygen -t ed25519 -C "your.email@example.com"
# Press Enter for all prompts (default location, no passphrase)
# Copy the public key
cat ~/.ssh/id_ed25519.pub
# → Go to GitHub → Settings → SSH Keys → New SSH Key → Paste
# Test connection
ssh -T git@github.com
# Expected: "Hi YourName! You've successfully authenticated"
# Download Miniconda (lightweight)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Follow prompts, say "yes" to initialize
# Restart terminal, then verify
conda --version
# ──────────────────────────────────
# Create a project environment
# ──────────────────────────────────
conda create -n bioinfo python=3.10
conda activate bioinfo
# Install common bioinformatics packages
conda install -c bioconda samtools bedtools blast mafft
conda install -c conda-forge pandas numpy matplotlib seaborn biopython pysam
# ──────────────────────────────────
# Managing environments
# ──────────────────────────────────
conda env list # List all environments
conda activate bioinfo # Activate
conda deactivate # Deactivate
conda env export > environment.yml # Export (for reproducibility)
conda env create -f environment.yml # Recreate from file
conda remove -n bioinfo --all       # Delete an environment

- Download from code.visualstudio.com
- Install and open

To work on a remote server:
- Install VS Code locally
- Install the Remote - SSH extension
- Press Ctrl+Shift+P → type "Remote-SSH: Connect to Host"
- Enter: username@server-address
- VS Code opens a window connected to the server
# Must-have extensions (install from Extensions panel):
- Python # Python support
- Remote - SSH # Work on remote servers
- GitLens # Enhanced Git integration
- Jupyter # Run notebooks in VS Code
# (Bracket Pair Colorizer is deprecated — VS Code now colorizes bracket pairs natively)
- indent-rainbow # Colored indentation
Press Ctrl+Shift+P → "Preferences: Open Settings (JSON)" and add:
{
"editor.fontSize": 14,
"editor.wordWrap": "on",
"files.autoSave": "afterDelay",
"python.defaultInterpreterPath": "~/miniconda3/envs/bioinfo/bin/python",
"terminal.integrated.defaultProfile.linux": "bash"
}
# 1. SSH into server
ssh username@server-address
# 2. Check what's available
git --version && python3 --version && conda --version
# 3. Configure Git (if first time)
git config --global user.name "YourName"
git config --global user.email "your.email@example.com"
# 4. Clone your project
git clone git@github.com:YourName/my-project.git
cd my-project
# 5. Create/activate conda environment
conda create -n my_project python=3.10
conda activate my_project
pip install -r requirements.txt
# 6. Open in VS Code (from local machine)
# VS Code → Remote-SSH → Connect → open the project folder
# 7. Start working!
git checkout -b feature/my-task
# ... code ...
git add .
git commit -m "Add new analysis"
git push origin feature/my-task
When starting a new bioinformatics project, create this structure:
mkdir -p my_project/{data,scripts,results,figures,logs,docs}
touch my_project/README.md
touch my_project/requirements.txt
touch my_project/.gitignore
cd my_project && git init
my_project/
│
├── data/ # Input data (add to .gitignore if large)
├── scripts/ # Analysis scripts
├── results/ # Output files
├── figures/ # Plots and visualizations
├── logs/ # Job logs, error logs
├── docs/ # Notes, reports
│
├── .gitignore # Files to ignore
├── requirements.txt # Python dependencies
└── README.md # Project description
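The same scaffold can also be created from Python — a standard-library-only sketch whose folder and file names mirror the tree above:

```python
from pathlib import Path

def scaffold(root: str) -> Path:
    """Create the project skeleton shown above under `root`."""
    base = Path(root)
    # Subdirectories for data, code, outputs, and documentation
    for sub in ("data", "scripts", "results", "figures", "logs", "docs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    # Empty top-level files to fill in later
    for name in ("README.md", "requirements.txt", ".gitignore"):
        (base / name).touch()
    return base

project = scaffold("my_project")
print(sorted(p.name for p in project.iterdir()))
```

Because `mkdir(exist_ok=True)` and `touch()` are idempotent, the script is safe to rerun on an existing project.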
# Data files (too large for Git)
*.bam
*.bam.bai
*.fastq
*.fastq.gz
*.bw
*.bigwig
*.bedgraph
*.h5
*.hdf5
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/
# Environment
.env
*.egg-info/
# OS
.DS_Store
Thumbs.db
# Logs
*.log
logs/
# Results (regenerable)
results/
figures/

| Command | What It Does |
|---|---|
| `pwd` | Print current directory |
| `ls -la` | List all files with details |
| `cd` | Change directory |
| `mkdir -p` | Create nested directories |
| `cp -r` | Copy files/directories |
| `mv` | Move or rename |
| `rm -rf` | Delete recursively (no undo — careful!) |
| `ln -s` | Create symbolic link |
| `cat` | Print file content |
| `head -n 20` | First 20 lines |
| `tail -n 20` | Last 20 lines |
| `less` | Scroll through a file |
| `wc -l` | Count lines |
| `grep` | Search for a pattern |
| `find` | Find files |
| `cut -f1,2` | Extract columns |
| `sort` | Sort lines |
| `uniq -c` | Count unique lines |
| `gzip` / `gunzip` | Compress / decompress |
| `tar -xzvf` | Extract archive |
| `chmod +x` | Make executable |
| `ssh` | Remote connection |
| `scp` | Copy files to/from server |
| `top` / `htop` | Monitor processes |
| `kill` | Stop a process |
| `nohup ... &` | Run in background |
| `df -h` | Disk space |
| `du -sh` | Folder size |
| `\|` (pipe) | Chain commands |
| `>` / `>>` | Redirect output |
| Package | Purpose |
|---|---|
| `pandas` | Tabular data manipulation |
| `numpy` | Numerical computing |
| `matplotlib` | Plotting |
| `seaborn` | Statistical visualization |
| `biopython` | Sequence analysis, BLAST, FASTA |
| `pysam` | BAM/SAM file handling |
| `pybedtools` | BED file operations |
| `scikit-learn` | Machine learning |
| `scipy` | Statistics, distance metrics |
| `gzip` | Read compressed files (built-in) |
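Most sequencing-adjacent text files arrive gzipped, and the built-in `gzip` module reads them without decompressing to disk. A standard-library-only sketch that writes and reads back a small gzipped TSV (the filename and values are invented for the example):

```python
import csv
import gzip

# Write a tiny gzipped TSV (hypothetical example data)
with gzip.open("peaks.tsv.gz", "wt", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerows([["chrom", "start", "end"],
                      ["chr1", "1000", "1500"],
                      ["chr2", "2000", "2600"]])

# Read it back: "rt" opens the compressed file as text
with gzip.open("peaks.tsv.gz", "rt", newline="") as fh:
    rows = list(csv.reader(fh, delimiter="\t"))

print(rows[0])        # → ['chrom', 'start', 'end']
print(len(rows) - 1)  # → 2 data rows
```

With pandas installed, the same file can be read in one line — `pd.read_csv("peaks.tsv.gz", sep="\t")` — since compression is inferred from the extension.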
# Python + bioinformatics tools
conda create -n bioinfo python=3.10
conda activate bioinfo
conda install -c bioconda -c conda-forge \
pandas numpy matplotlib seaborn biopython pysam \
samtools bedtools blast mafft
pip install pybedtools scikit-learn

| Package | Purpose |
|---|---|
| `dplyr` | Data manipulation (filter, mutate, summarise) |
| `ggplot2` | Plotting |
| `readr` | Fast file reading (CSV, TSV) |
| `tidyr` | Reshaping data (pivot) |
| `DESeq2` | Differential expression analysis |
| `edgeR` | Differential expression (alternative) |
| `GenomicRanges` | Genomic interval operations |
| `rtracklayer` | Read/write BED, BW, GFF files |
| `clusterProfiler` | GO / KEGG enrichment |
| `pheatmap` | Heatmaps |
| `biomaRt` | Query Ensembl databases |
| `Biostrings` | Sequence manipulation |
# CRAN packages
install.packages(c("dplyr", "ggplot2", "readr", "tidyr", "pheatmap"))
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "GenomicRanges",
"rtracklayer", "clusterProfiler", "biomaRt"))# SQLite (lightweight, no server)
sudo apt install sqlite3
# From Python (no install needed)
import sqlite3 # built into Python
# MySQL client
sudo apt install mysql-client
# or
conda install -c conda-forge mysqlclient
# PostgreSQL client
sudo apt install postgresql-client
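Because `sqlite3` ships with Python, a database is just a single file — handy for small lab inventories before reaching for MySQL/PostgreSQL. A minimal sketch (the filename and schema below are invented for the example):

```python
import sqlite3

# SQLite needs no server: the whole database lives in one file
con = sqlite3.connect("lab_samples.db")
con.execute("""CREATE TABLE IF NOT EXISTS samples (
    sample_id TEXT PRIMARY KEY,
    condition TEXT,
    replicate INTEGER
)""")
# INSERT OR REPLACE makes the script safe to rerun
con.execute("INSERT OR REPLACE INTO samples VALUES ('S1', 'control', 1)")
con.commit()

n = con.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(n)
con.close()
```

The resulting `lab_samples.db` file can be copied, versioned (if small), or inspected from the shell with `sqlite3 lab_samples.db`.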