
Introduction to Nextflow: Building Reproducible Computational Pipelines

Welcome to the Nextflow workshop! In this hands-on tutorial, you'll learn how to build scalable, reproducible computational pipelines using Nextflow.

What is Nextflow?

Nextflow is a workflow management system for building computational pipelines that are:

  • Portable: Run on laptops, HPC clusters, or cloud
  • Reproducible: Track every step with consistent results
  • Scalable: Automatic parallelization across resources
  • Flexible: Deploy anywhere (local, SLURM, AWS, etc.)
  • Tool-agnostic: Integrate any command-line program (Python, R, bash scripts, compiled binaries)

Why Use Nextflow?

Traditional computational workflows require:

  • Complex bash scripts with nested loops
  • Manual file tracking
  • Custom parallelization logic
  • Difficult result reproduction
  • Messy integration of diverse tools

Nextflow handles:

  • Automatic task parallelization
  • Data flow management
  • Failure recovery and checkpointing
  • Clear maintainable code
  • Seamless tool integration

Key Concepts

Before we start, here are the core concepts you'll learn (the short sketch after this list shows all four working together):

  1. Processes: Individual computational tasks (running any command-line tool or script)
  2. Channels: Data streams that connect processes together
  3. Workflows: The orchestration of processes and data flow
  4. Operators: Methods to transform and manipulate channels (.map(), .collect(), etc.)
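All four appear together in even the smallest pipeline. Here's a minimal sketch (a hypothetical example, not one of the workshop scripts) with each concept labeled:

// Hypothetical sketch - process name and values are for illustration only
process shout {                                  // 1. Process: one computational task
    input:
    val word

    output:
    stdout

    script:
    """
    echo "${word}!"
    """
}

workflow {                                       // 3. Workflow: orchestrates processes
    greetings = Channel.of('hello', 'nextflow')  // 2. Channel: a stream of data
        .map { it.toUpperCase() }                // 4. Operator: transforms the channel
    shout(greetings)                             // each channel value becomes one task
    shout.out.view()                             // print the results
}

Don't worry about the details yet - Scripts 01-05 build up each piece in turn.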

About This Workshop

Example Domain: We use bioinformatics examples in this workshop, but the concepts apply to any computational domain:

  • Image processing and computer vision
  • Climate modeling and simulations
  • Text analysis and natural language processing
  • Machine learning pipelines
  • Statistical analysis workflows
  • Any field requiring batch processing of data

Tools Used: Throughout this workshop, you'll see how Nextflow integrates:

  • Published tools: FastQC and Fastp (widely-used bioinformatics programs)
  • Custom scripts: Python scripts written specifically for this analysis
  • Your own tools: The same principles apply to any command-line program you use

The key insight: Nextflow doesn't care what your tools do - it just manages how data flows between them!

What You'll Build Today

In this workshop, you'll progress through increasingly complex pipelines:

Scripts 01-05 (This tutorial):

  • Script 01: Hello World - Your first Nextflow process
  • Script 02: Working with files - Reading and writing data
  • Script 03: Using parameters - Making pipelines configurable
  • Script 04: Multiple inputs - Processing several files
  • Script 05: Channels - Understanding data flow

Scripts 06-10 (Implementation tutorial):

  • Script 06: Quality control with FastQC (published tool)
  • Script 07: Data trimming with Fastp (published tool)
  • Script 08: Parallel workflows (running multiple tools simultaneously)
  • Script 09: Collecting results with ReadLenDist (custom Python script)
  • Script 10: Complete multi-step pipeline (integrating everything)

Learning Approach

This tutorial is designed to be:

  • Hands-on: You'll run every script and see the results
  • Progressive: Each script builds on concepts from the previous one
  • Practical: Real tools and data from actual research workflows
  • Interactive: Experiment, break things, and learn!
  • Transferable: Apply these skills to your own research domain

Let's get started!


Getting a compute node on Nova using OnDemand

Use the link: Nova OnDemand

Please login using your ISU credentials.

We will be using the VS Code Server on the ISU HPC cluster for this tutorial.

Setting up the environment

Let's clone the tutorial repo:

cd /work/short_term/<ISU_NetID>
git clone https://github.com/ISUgenomics/isu-nextflow-workshop.git

Copying data required for the tutorial

cp -a /work/short_term/workshop2_bash/01_data .

Script 01: Hello World - Your First Nextflow Process

Learning Goals:

  • Understand the basic structure of a Nextflow script
  • Learn what a process is and how it works
  • See how the workflow block orchestrates processes
  • Use the .view() operator to display output

Before Nextflow: Manual Command

First, let's see what we're automating. Run this command directly in your terminal:

echo 'Welcome to the world of Nextflow!'

Output:

Welcome to the world of Nextflow!

This prints a greeting directly in the shell. Simple, right? But what if you need to:

  • Run this on 100 different inputs?
  • Track when it was run and with what parameters?
  • Resume if it fails?
  • Run it on different compute systems?

That's where Nextflow comes in!

Step 1: Load Nextflow

module load nextflow
module list

This loads the Nextflow module on the HPC cluster.

Step 2: Examine the Script

Let's look at the Nextflow version of this command:

cat pipelines/01_hello_screen.nf

File: pipelines/01_hello_screen.nf

#!/usr/bin/env nextflow

/*
 * Use echo to print a message to the screen
 */

process hello {

    output:
    stdout

    script:
    """
    echo "Welcome to the world of Nextflow!"
    """
}

workflow {
    // Run the hello process
    hello().view()
}

Understanding the Components

The Shebang Line
#!/usr/bin/env nextflow
  • Declares this file as a Nextflow script
  • Allows the script to be executed directly (like a bash script)
  • Not strictly required, but good practice
The Process Block
process hello {
    output:
    stdout

    script:
    """
    echo "Welcome to the world of Nextflow!"
    """
}

What is a Process?

A process is a basic computing unit in Nextflow. Think of it as a wrapper around any command-line tool or script.

Components:

  • process hello: Names the process "hello"
  • output: stdout: Captures standard output (what the command prints)
    • This creates a channel containing the output
    • Channels are how data flows between processes
  • script: """ ... """: Contains the actual command(s) to run
    • Triple quotes allow multi-line commands
    • Can contain any bash/shell commands
    • Variables can be interpolated with ${variable}

Key Insight: The process doesn't run immediately - it's just a definition. The workflow block decides when to run it.

The Workflow Block
workflow {
    // Run the hello process
    hello().view()
}

What is a Workflow?

The workflow block is where you orchestrate your processes - deciding which to run, in what order, and with what data.

Components:

  • hello(): Executes the hello process
    • Returns a channel containing the process output
    • In this case, the channel contains "Welcome to the world of Nextflow!"
  • .view(): A channel operator that prints channel contents to the terminal
    • Useful for debugging and seeing what's in a channel
    • Without .view(), the output would be captured but not displayed

Data Flow:

hello process → stdout channel → .view() → terminal
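A handy variation: .view() also accepts a closure that formats each item before printing, which helps when you're inspecting several channels at once. For example:

workflow {
    // Same pipeline, but label the output as it is printed
    hello().view { msg -> "hello said: ${msg.trim()}" }
}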

Step 3: Run the Pipeline

nextflow run pipelines/01_hello_screen.nf

Expected Output:

N E X T F L O W  ~  version 24.04.4

Launching `pipelines/01_hello_screen.nf` [maniac_albattani] DSL2 - revision: 91b2c7c409

executor >  local (1)
[9c/4c931d] process > hello [100%] 1 of 1 ✔
Welcome to the world of Nextflow!

Understanding the Output

What does each line mean?
N E X T F L O W  ~  version 24.04.4
  • Nextflow version being used
Launching `pipelines/01_hello_screen.nf` [maniac_albattani] DSL2 - revision: 91b2c7c409
  • Script being run
  • Random name assigned to this run ("maniac_albattani")
  • DSL2: Nextflow's Domain Specific Language version 2
  • Revision: a checksum identifying this version of the script (for pipelines pulled from a git repository, the git revision)
executor >  local (1)
  • Executor: where the process runs (local machine, SLURM, AWS, etc.)
  • (1): One process executed
[9c/4c931d] process > hello [100%] 1 of 1 ✔
  • [9c/4c931d]: Unique hash for this process execution
    • Used to find the work directory: work/9c/4c931d.../
  • process > hello: Name of the process
  • [100%] 1 of 1 ✔: Progress indicator - 1 task completed successfully
Welcome to the world of Nextflow!
  • The actual output from our process (displayed by .view())

Key Takeaways

You've learned:

  • A Nextflow script has processes (what to do) and a workflow (when to do it)
  • Processes capture output into channels
  • The .view() operator displays channel contents
  • Nextflow tracks every execution with a unique hash
  • Each process runs in its own work directory

Script 02: Writing to Files - Understanding File Outputs

Learning Goals:

  • Learn how to create file outputs instead of stdout
  • Understand the Nextflow work directory structure
  • See where process outputs are stored
  • Learn to inspect process execution artifacts

Building on Script 01:

In Script 01, we used output: stdout to print to the terminal. But real pipelines need to save results to files. Script 02 shows you how!

Step 1: Examine the Script

cat pipelines/02_hello_redirect.nf

File: pipelines/02_hello_redirect.nf

#!/usr/bin/env nextflow

/*
 * Hello redirect: write greeting to file
 */

process hello {
    output:
    path 'result.txt'

    script:
        """
        echo "Welcome to the world of Nextflow!" > result.txt
        """
}

workflow {
    hello()
}

What Changed from Script 01?

Comparing the two scripts

Script 01:

process hello {
    output:
    stdout              // Captures what's printed to terminal

    script:
    """
    echo "Welcome to the world of Nextflow!"
    """
}

workflow {
    hello().view()      // .view() displays the stdout
}

Script 02:

process hello {
    output:
    path 'result.txt'   // Captures a file

    script:
    """
    echo "Welcome to the world of Nextflow!" > result.txt
    """
}

workflow {
    hello()             // No .view() needed - output is a file
}

Key Differences:

  1. Output type: stdout → path 'result.txt'
  2. Script command: Direct echo → Redirect to file with >
  3. Workflow: No .view() needed (file is automatically saved)

Understanding File Outputs

The path output qualifier
output:
path 'result.txt'

What does path mean?

  • path tells Nextflow: "This process creates a file"
  • 'result.txt' is the filename to capture
  • Nextflow will look for this file after the process completes
  • The file is automatically added to a channel (for downstream processes)

Important: The filename in output: must match the filename created in script:

output:
path 'result.txt'        // Nextflow expects this file

script:
"""
echo "..." > result.txt  // Script must create this file
"""

Step 2: Run the Pipeline

nextflow run pipelines/02_hello_redirect.nf

Expected Output:

N E X T F L O W  ~  version 24.04.4

Launching `pipelines/02_hello_redirect.nf` [determined_hopper] DSL2 - revision: 5ff76c72d0

executor >  local (1)
[1a/e8c8c0] process > hello [100%] 1 of 1 ✔

Notice: the greeting isn't printed this time! That's because the result went to a file, not stdout.

Step 3: Find Your Output

Where did the file go?

The Work Directory:

Nextflow stores all process outputs in the work/ directory. Let's explore it:

tree -a work

Output:

work
└── 1a
    └── e8c8c0a1b2c3d4e5f60718293a4b5c6d
        ├── .command.begin
        ├── .command.err
        ├── .command.log
        ├── .command.out
        ├── .command.run
        ├── .command.sh
        ├── .exitcode
        └── result.txt        ← Your output file!

2 directories, 8 files

Structure:

  • work/: Main directory for all process executions
  • 1a/: First two characters of the process hash
  • e8c8c0.../: Full unique hash for this specific execution
  • result.txt: Your output file
  • .command.*: Nextflow's internal files

Understanding the Work Directory

What are all these .command files?

Each process execution creates several files:

Your Files:

  • result.txt: The output file your process created

Nextflow's Files:

  • .command.sh: The actual shell script Nextflow generated and ran
    • Look inside to see exactly what was executed!
  • .command.run: Wrapper script that sets up the environment
  • .command.out: Standard output (stdout) from the process
  • .command.err: Standard error (stderr) from the process
  • .command.log: Combined log of stdout and stderr
  • .exitcode: Exit status (0 = success, non-zero = error)
  • .command.begin: Timestamp when process started

Why is this useful?

  • Debugging: Check .command.err if something fails
  • Reproducibility: .command.sh shows exactly what ran
  • Verification: .exitcode confirms success/failure
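For example, you can inspect these artifacts directly (substitute the hash from your own run):

cat work/1a/e8c8c0*/.command.sh    # the exact shell script that ran
cat work/1a/e8c8c0*/.exitcode      # 0 means success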

Step 4: View the Output

Let's read the file using the hash from the Nextflow output:

# Use the hash from your output (e.g., [1a/e8c8c0])
cat work/1a/e8c8c0*/result.txt

Output:

Welcome to the world of Nextflow!

Pro tip: You can also use:

find work -name result.txt -exec cat {} \;

Key Takeaways

You've learned:

  • Use output: path 'filename' to create file outputs
  • The filename in output: must match what the script creates
  • All process outputs go to the work/ directory
  • Each process gets a unique subdirectory with a hash
  • The work directory contains debugging artifacts (.command.* files)
  • You can inspect these files to understand what happened

Script 03: Publishing Outputs - Making Results Accessible

Learning Goals:

  • Learn to use publishDir to copy outputs to accessible locations
  • Understand the difference between work directory and published directory
  • Learn about different publishDir modes
  • Make your results easy to find and share

Building on Script 02:

In Script 02, outputs went to the work/ directory with cryptic hashes. That's great for Nextflow's internal management, but terrible for humans! Script 03 shows you how to publish outputs to user-friendly locations.

The Problem with Work Directories

Script 02 output location:

work/1a/e8c8c0a1b2c3d4e5f60718293a4b5c6d/result.txt

Problems:

  • Hard to find (need to look up the hash)
  • Changes every run (new hash each time)
  • Gets deleted if you clean the work directory
  • Not suitable for sharing results

Solution: Use publishDir to copy/link outputs to a permanent, accessible location!

Step 1: Examine the Script

cat pipelines/03_hello_publishdir.nf

File: pipelines/03_hello_publishdir.nf

#!/usr/bin/env nextflow

/*
* Create an output directory where the output will be saved
*/

process hello {
  publishDir 'output', mode: 'copy'

  output:
  path 'result.txt'

  script:
  """
  echo "Hello Nextflow World!" > result.txt
  """
}

workflow {
  // Run the hello process
  hello()
}

What Changed from Script 02?

The publishDir directive

Script 02:

process hello {
    output:
    path 'result.txt'
    
    script:
    """
    echo "Welcome to the world of Nextflow!" > result.txt
    """
}

Script 03:

process hello {
  publishDir 'output', mode: 'copy'    // NEW!

  output:
  path 'result.txt'

  script:
  """
  echo "Hello Nextflow World!" > result.txt
  """
}

The only change: Added publishDir 'output', mode: 'copy'

Understanding publishDir

How publishDir works
publishDir 'output', mode: 'copy'

What it does:

  • Takes files from the work directory
  • Copies (or links) them to the specified directory
  • Happens after the process completes successfully

Components:

  • 'output': Target directory name
    • Can be any path: 'results', 'my_outputs', '/absolute/path'
    • Created automatically if it doesn't exist
  • mode: 'copy': How to publish the file

Data Flow:

Process runs → Creates result.txt in work/ → publishDir copies to output/
publishDir modes

Common modes:

  1. mode: 'copy' (most common)

    • Creates a copy of the file
    • Original stays in work directory
    • Safe: deleting published file doesn't affect work directory
  2. mode: 'symlink'

    • Creates a symbolic link
    • Saves disk space (no duplicate)
    • Faster than copy
    • Warning: If you clean the work directory, the links break!
  3. mode: 'move'

    • Moves the file (removes from work directory)
    • Saves disk space
    • Warning: breaks -resume and downstream steps, because the file no longer exists in the work directory!

Recommendation: Use 'copy' unless disk space is critical.
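All three modes share the same directive syntax. For reference, here are two hypothetical variants (not used in the workshop scripts), including the common pattern of taking the target directory from a parameter:

publishDir 'output', mode: 'symlink'          // link back into the work directory
publishDir params.output_dir, mode: 'copy'    // hypothetical: target directory from a parameter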

Step 2: Run the Pipeline

nextflow run pipelines/03_hello_publishdir.nf

Expected Output:

N E X T F L O W  ~  version 24.04.4

Launching `pipelines/03_hello_publishdir.nf` [determined_hopper] DSL2 - revision: 5ff76c72d0

executor >  local (1)
[2b/f9d1e2] process > hello [100%] 1 of 1 ✔

Step 3: Find Your Published Output

Now the output is in an easy-to-find location!

ls -la output/

Output:

total 8
drwxr-xr-x  3 user  group   96 Oct 19 01:30 .
drwxr-xr-x  8 user  group  256 Oct 19 01:30 ..
-rw-r--r--  1 user  group   23 Oct 19 01:30 result.txt

View the file:

cat output/result.txt

Output:

Hello Nextflow World!

Much easier than:

cat work/2b/f9d1e2a3b4c5d6e7f80912a3b4c5d6e7/result.txt

Both Locations Exist!

Work directory vs. Published directory

After running Script 03, you have TWO copies:

  1. Work directory (Nextflow's internal copy):

    work/2b/f9d1e2.../result.txt
    
    • Used for pipeline management
    • Used for -resume functionality
    • Can be cleaned up later
  2. Published directory (Your accessible copy):

    output/result.txt
    
    • Easy to find and share
    • Permanent location
    • Safe to use in downstream analysis

Best Practice: Keep the work directory during development and clean it periodically. Keep published outputs permanently.
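You can confirm the published copy matches the work-directory original (substitute the hash from your own run):

md5sum output/result.txt work/2b/f9d1e2*/result.txt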

Key Takeaways

You've learned:

  • publishDir copies outputs to accessible locations
  • Use mode: 'copy' for safety (most common)
  • Published files are separate from work directory
  • The work directory is still created (for resume functionality)
  • Published outputs are easy to find, share, and use

Script 04: Process Inputs - Passing Data to Processes

Learning Goals:

  • Learn how to define process inputs
  • Understand the val input qualifier
  • See how to pass data from workflow to process
  • Use variables in process scripts

Building on Script 03:

Scripts 01-03 had hardcoded messages. But real pipelines need to process different data! Script 04 introduces process inputs - the foundation for making processes reusable.

The Problem with Hardcoded Values

Script 03:

process hello {
  script:
  """
  echo "Hello Nextflow World!" > result.txt  // Always the same!
  """
}

Problems:

  • Can't change the message without editing the script
  • Can't process different inputs
  • Not reusable

Solution: Add an input block to accept data!

Step 1: Examine the Script

cat pipelines/04_hello_input.nf

File: pipelines/04_hello_input.nf

#!/usr/bin/env nextflow

process hello {
    publishDir 'output', mode: 'copy'

    input:
    val welcome

    output:
    path 'result.txt'

    script:
    """
    echo "$welcome" > result.txt
    """
}

workflow {
    hello("Hello, welcome to the world of Nextflow!")
}

What Changed from Script 03?

Adding the input block

Script 03:

process hello {
  publishDir 'output', mode: 'copy'

  output:
  path 'result.txt'

  script:
  """
  echo "Hello Nextflow World!" > result.txt
  """
}

workflow {
  hello()  // No arguments
}

Script 04:

process hello {
    publishDir 'output', mode: 'copy'

    input:                    // NEW!
    val welcome              // NEW!

    output:
    path 'result.txt'

    script:
    """
    echo "$welcome" > result.txt    // Uses variable!
    """
}

workflow {
    hello("Hello, welcome to the world of Nextflow!")  // Pass data
}

Key Changes:

  1. Added input: block with val welcome
  2. Changed script to use $welcome variable
  3. Workflow now passes a string to hello()

Understanding Process Inputs

The input block
input:
val welcome

What does this mean?

  • input:: Declares what data the process needs
  • val: Input qualifier meaning "value" (string, number, etc.)
  • welcome: Variable name to use in the script

Input qualifiers:

  • val: Simple values (strings, numbers, booleans)
  • path: Files or directories (we'll use this in Scripts 06-10)
  • tuple: Multiple values grouped together (previewed in the sketch below)
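As a preview of Scripts 06-10, here's a hypothetical process (names invented for illustration) that takes a tuple input - a sample name paired with a data file:

// Hypothetical example - process and variable names are for illustration only
process count_lines {
    input:
    tuple val(sample_id), path(data_file)

    output:
    stdout

    script:
    """
    echo "$sample_id: \$(wc -l < $data_file) lines"
    """
}

Note the \$(...): bash command substitutions must be escaped so Nextflow doesn't try to interpolate them itself (more on escaping below).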
Using inputs in scripts
script:
"""
echo "$welcome" > result.txt
"""

Variable interpolation:

  • $welcome: a Nextflow variable reference (Groovy string interpolation)
  • Nextflow replaces $welcome with the actual value before the script runs
  • It looks like a bash variable, but bash never sees it - the substitution happens first

Example: If welcome = "Hello!", the script becomes:

echo "Hello!" > result.txt
Passing data in the workflow
workflow {
    hello("Hello, welcome to the world of Nextflow!")
}

How it works:

  1. Workflow calls hello() with a string argument
  2. The string is passed to the process's input block
  3. The value is assigned to the welcome variable
  4. The process script uses $welcome

Data Flow:

Workflow: "Hello, welcome..." → Process input: welcome → Script: $welcome

Step 2: Run the Pipeline

nextflow run pipelines/04_hello_input.nf

Expected Output:

N E X T F L O W  ~  version 24.04.4

Launching `pipelines/04_hello_input.nf` [determined_hopper] DSL2 - revision: 5ff76c72d0

executor >  local (1)
[3c/a1b2c3] process > hello [100%] 1 of 1 ✔

Step 3: Check the Output

cat output/result.txt

Output:

Hello, welcome to the world of Nextflow!

The message came from the workflow instead of being hardcoded in the process!

Key Takeaways

You've learned:

  • Processes can accept inputs using the input: block
  • Use val for simple values (strings, numbers)
  • Variables are accessed with $variable_name in scripts
  • The workflow passes data to processes as arguments
  • This makes processes reusable with different data

Script 05: Parameters - Making Pipelines Configurable

Learning Goals:

  • Learn to use params for configurable pipelines
  • Understand default parameter values
  • Learn to override parameters from the command line
  • Make pipelines flexible and reusable

Building on Script 04:

Script 04 required editing the workflow to change the message. But what if you want users to customize the pipeline without editing code? Script 05 introduces parameters - the standard way to make Nextflow pipelines configurable!

The Problem with Hardcoded Workflow Values

Script 04:

workflow {
    hello("Hello, welcome to the world of Nextflow!")  // Hardcoded!
}

To change the message, you must:

  1. Open the script file
  2. Edit the string
  3. Save the file

Not user-friendly!

Solution: Use params to allow command-line configuration!

Step 1: Examine the Script

cat pipelines/05_hello_default.nf

File: pipelines/05_hello_default.nf

#!/usr/bin/env nextflow

process hello {
    publishDir 'output', mode: 'copy'

    input:
    val welcome

    output:
    path 'result.txt'

    script:
    """
    echo "$welcome" > result.txt
    """
}

params.welcome = "Hello, welcome to the world of Nextflow!"

workflow {
    hello(params.welcome)
}

What Changed from Script 04?

Adding parameters

Script 04:

workflow {
    hello("Hello, welcome to the world of Nextflow!")
}

Script 05:

params.welcome = "Hello, welcome to the world of Nextflow!"  // NEW!

workflow {
    hello(params.welcome)  // Uses parameter
}

Key Changes:

  1. Added params.welcome with a default value
  2. Workflow uses params.welcome instead of hardcoded string

Understanding Parameters

What are params?

Parameters are Nextflow's way of making pipelines configurable.

params.welcome = "Hello, welcome to the world of Nextflow!"

Key Concepts:

  • params: Special Nextflow object for parameters
  • .welcome: Parameter name (you choose this)
  • = "...": Default value (used if not overridden)

Naming convention: Use lowercase with underscores

  • Good: params.input_file, params.output_dir, params.quality_threshold
  • Avoid: params.InputFile, params.OUTDIR
How parameters work

Three ways to set parameters:

  1. Default value in script (lowest priority):

    params.welcome = "Default message"
  2. Command-line argument (highest priority):

    nextflow run script.nf --welcome "Custom message"
  3. Config file (medium priority - covered in advanced topics):

    params {
        welcome = "Config message"
    }

Priority: Command-line > Config file > Script default
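A quick way to see which value won is to print the effective parameter. A minimal sketch (not one of the workshop scripts):

params.welcome = "Default message"

workflow {
    log.info "welcome = ${params.welcome}"
}

Run it plain and the default prints; run it with --welcome "Hi there" and the override prints instead.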

Step 2: Run with Default Parameters

nextflow run pipelines/05_hello_default.nf

Expected Output:

N E X T F L O W  ~  version 24.04.4

Launching `pipelines/05_hello_default.nf` [determined_hopper] DSL2 - revision: 5ff76c72d0

executor >  local (1)
[4d/b2c3d4] process > hello [100%] 1 of 1 ✔

Check the output:

cat output/result.txt

Output:

Hello, welcome to the world of Nextflow!

The default value was used!

Step 3: Override Parameters

Now the magic - change the message without editing the script:

nextflow run pipelines/05_hello_default.nf --welcome "Greetings from the command line!"

Check the output:

cat output/result.txt

Output:

Greetings from the command line!

It worked! The command-line value overrode the default.

Real-World Example

How this applies to real pipelines

Typical bioinformatics pipeline parameters:

// Input/Output
params.reads = "data/*_{R1,R2}.fastq.gz"
params.output_dir = "results"

// Quality control
params.min_quality = 20
params.min_length = 50

// Analysis
params.genome = "/path/to/reference.fa"
params.threads = 4

Users can customize without editing:

nextflow run pipeline.nf \
  --reads "my_data/*.fastq.gz" \
  --output_dir "my_results" \
  --min_quality 30 \
  --threads 8

This is how Scripts 06-10 work!

Common Parameter Patterns

Best practices for parameters

1. Provide sensible defaults:

params.threads = 4          // Good default
params.output_dir = "results"  // Reasonable

2. Document your parameters:

// Input files (glob pattern)
params.reads = "data/*.fastq.gz"

// Quality threshold (Phred score)
params.min_quality = 20

3. Group related parameters:

// Input/Output
params.input_dir = "data"
params.output_dir = "results"

// Quality Control
params.min_quality = 20
params.min_length = 50

4. Use descriptive names:

  • Good: params.quality_threshold, params.input_fastq
  • Bad: params.qt, params.in

Key Takeaways

You've learned:

  • Use params.name = value to define parameters
  • Parameters provide default values
  • Override with --param_name value on command line
  • Command-line values take priority over defaults
  • Parameters make pipelines user-friendly and reusable
  • This is the foundation for configurable bioinformatics pipelines

Congratulations! You've completed the foundational scripts (01-05). You now understand:

  • Processes and workflows
  • File outputs and publishDir
  • Process inputs
  • Parameters

You're ready for the implementation tutorial (Scripts 06-10) with real bioinformatics tools!


Workshop Checkpoint: What You've Learned

Core Concepts Mastered

Script 01 - Hello World:

  • Basic Nextflow script structure
  • Process definitions and workflow orchestration
  • Using .view() for debugging

Script 02 - File Outputs:

  • Creating file outputs with path
  • Understanding the work directory
  • Inspecting .command.* files for debugging

Script 03 - Publishing:

  • Using publishDir to make outputs accessible
  • Different publishing modes (copy, symlink, move)
  • Separating work directory from final outputs

Script 04 - Process Inputs:

  • Defining process inputs with val
  • Variable interpolation in scripts
  • Passing data from workflow to process

Script 05 - Parameters:

  • Making pipelines configurable with params
  • Setting default values and overriding from command line
  • Best practices for parameter naming

Next Steps: Implementation Tutorial

What's Coming in Scripts 06-10

Now you'll apply these concepts to build a real bioinformatics pipeline:

Script 06 - FastQC Quality Control:

  • Run a published tool (FastQC) on multiple files in parallel

Script 07 - Paired-End Read Trimming:

  • Handle paired-end sequencing data with Fastp

Script 08 - Parallel Workflows:

  • Run multiple processes simultaneously

Script 09 - Collecting Results:

  • Use .collect() to aggregate files
  • Run custom Python scripts

Script 10 - Complete Pipeline:

  • Chain processes together with channel transformations

Ready to Continue?

Open the implementation tutorial:

cat implementation.md

Or in your editor:

code implementation.md

Understanding the Work Directory

What is the work directory for?

Throughout Scripts 01-05, Nextflow created a work/ directory with cryptic subdirectories. Let's understand why this exists and when to clean it up.

The work directory serves two critical purposes:

  1. Pipeline execution workspace

    • Each process runs in its own isolated subdirectory
    • Contains all inputs, outputs, and execution logs
    • Enables debugging (you can inspect exactly what happened)
  2. Resume functionality

    • Nextflow tracks which tasks completed successfully
    • Allows skipping already-completed work
    • Essential for long-running pipelines

The -resume Feature

How resume works

Imagine you run a pipeline with 100 samples, and it fails on sample 95. Without -resume, you'd have to rerun all 100 samples!

With -resume, Nextflow is smart:

# First run (fails at sample 95)
nextflow run script.nf

# Fix the issue, then resume
nextflow run script.nf -resume

What happens:

  • Nextflow checks the work directory
  • Finds that samples 1-94 completed successfully
  • Skips those tasks (uses cached results)
  • Only runs sample 95 onwards

How it knows:

  • Each task has a unique hash based on:
    • Process script
    • Input files
    • Parameters
  • If hash matches and task succeeded, reuse the result!
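If you're curious, you can list the task hashes and statuses for the most recent run (field names may vary slightly between Nextflow versions):

nextflow log last -f 'hash,name,status'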
When to use -resume

Use -resume when:

  • Your pipeline failed and you fixed the issue
  • You stopped the pipeline and want to continue
  • You added more samples and want to process only the new ones
  • You're iterating during development

Example scenario:

# Run pipeline on 10 samples
nextflow run script.nf --input "data/*.fastq"
# (Completes successfully)

# Add 5 more samples to data/
# Run again with -resume
nextflow run script.nf --input "data/*.fastq" -resume
# Only processes the 5 new samples!

Should You Clean the Work Directory?

During development (now): Keep it!

  • You'll experiment with scripts
  • You'll use -resume frequently
  • Disk space is minimal (Scripts 01-05 are tiny)

After pipeline completion: Optional cleanup

  • Published outputs are in output/ (safe to keep)
  • Work directory can be deleted to save space
  • Only matters when working with large files

Cleaning Up (Optional)

If you want to clean up before moving to the implementation tutorial:

# Remove work directory (safe - outputs are published)
rm -rf work/

# Remove output directory (if you want a fresh start)
rm -rf output/

Or keep everything:

# Do nothing! The work directory from Scripts 01-05 is tiny.
# You'll learn more about cleanup strategies in the implementation
# tutorial when working with real sequencing data (GB of files).

Advanced: Nextflow Clean Command

Using nextflow clean (optional)

Nextflow provides a clean command for more selective cleanup:

# See what would be deleted (dry run)
nextflow clean -n

# Delete all work files
nextflow clean -f

# Remove task files but keep the execution logs and metadata
nextflow clean -f -k

# Delete work files older than a certain run
nextflow clean -f -before <run_name>

For now, simple rm -rf work/ is fine. You'll learn more about nextflow clean in the implementation tutorial.


Quick Reference

Essential Commands

# Run a pipeline
nextflow run script.nf

# Override parameters
nextflow run script.nf --param_name value

# Resume from checkpoint
nextflow run script.nf -resume

Essential Syntax

// Parameters
params.input = "data/*.fastq"

// Process
process MyProcess {
    publishDir 'results', mode: 'copy'
    input:
    path input_file
    output:
    path 'output.txt'
    script:
    """
    my_tool $input_file > output.txt
    """
}

// Workflow
workflow {
    Channel.fromPath(params.input) | MyProcess
}

Continue to Implementation Tutorial

You've built a solid foundation. Now let's apply it to real-world analysis!

Next: Implementation Tutorial (Scripts 06-10)

In the implementation tutorial, you'll process real sequencing data and build a complete analysis pipeline.