This tutorial explains how to run a Nextflow pipeline on Batch. Specifically, this tutorial runs the sample rnaseq-nf life sciences pipeline from Nextflow, which quantifies genomic features from short read data using RNA-Seq.
This tutorial is intended for Batch users who want to use Nextflow with Batch.
Nextflow is open-source software for orchestrating bioinformatics workflows.
Create a Cloud Storage bucket
To create a Cloud Storage bucket to store temporary work and output files from the Nextflow pipeline, use the Google Cloud console or the gcloud CLI.
Console
To create a Cloud Storage bucket using the Google Cloud console, follow these steps:
In the Google Cloud console, go to the Buckets page.
Click Create.
On the Create a bucket page, enter a globally unique name for your bucket.
Click Create.
In the Public access will be prevented window, click Confirm.
gcloud
To create a Cloud Storage bucket using the Google Cloud CLI, use the gcloud storage buckets create command:
gcloud storage buckets create gs://BUCKET_NAME
Replace BUCKET_NAME with a globally unique name for your bucket.
If the request is successful, the output should be similar to the following:
Creating gs://BUCKET_NAME/...
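Optionally, to confirm that the bucket is accessible before continuing, you can list it with the gcloud storage ls command. A newly created, empty bucket produces no output, while an error indicates that the bucket wasn't created. This check isn't required for the rest of the tutorial:
gcloud storage ls gs://BUCKET_NAME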
Configure Nextflow
To configure the Nextflow pipeline to run on Batch, follow these steps from the command line:
Clone the sample pipeline repository:
git clone https://github.com/nextflow-io/rnaseq-nf.git
Go to the rnaseq-nf folder:
cd rnaseq-nf
Open the nextflow.config file:
nano nextflow.config
The file should contain the following gcb section:
gcb {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-batch'
    process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
    workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
    google.region = 'REGION'
}
In the gcb section, do the following:
Replace BUCKET_NAME with the name of the Cloud Storage bucket you created in the previous steps.
Replace WORK_DIRECTORY with the name for a new folder that the pipeline can use to store logs and outputs. For example, enter workDir.
Replace REGION with the region to use. For example, enter us-central1.
After the google.region field, add the following fields:
Add the google.project field:
google.project = 'PROJECT_ID'
Replace PROJECT_ID with the project ID of the current Google Cloud project.
If you aren't using the Compute Engine default service account as the job's service account, add the google.batch.serviceAccountEmail field:
google.batch.serviceAccountEmail = 'SERVICE_ACCOUNT_EMAIL'
Replace SERVICE_ACCOUNT_EMAIL with the email address of the job's service account that you prepared for this tutorial.
A completed example of this section is shown below.
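For reference, here is what a completed gcb section might look like. The values example-bucket, workdir, example-project, and the service account email are placeholder examples only; substitute your own bucket name, folder name, region, and project ID:
gcb {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-batch'
    process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
    workDir = 'gs://example-bucket/workdir'  // placeholder: your bucket and work folder
    google.region = 'us-central1'  // placeholder: your region
    google.project = 'example-project'  // placeholder: your project ID
    // Add only if you aren't using the Compute Engine default service account:
    // google.batch.serviceAccountEmail = 'nextflow-sa@example-project.iam.gserviceaccount.com'
}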
To save your edits, do the following:
Press Control+S.
Enter Y.
Press Enter.
Run the pipeline
Run the sample Nextflow pipeline from the command line:
../nextflow run nextflow-io/rnaseq-nf -profile gcb
The pipeline runs a small dataset using the settings you provided in the previous steps. This operation might take up to 10 minutes to complete.
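While the pipeline runs, Nextflow submits jobs to Batch on your behalf. If you want to watch their progress, one optional approach is to list the Batch jobs in the region that you configured; the following command assumes the example region us-central1:
gcloud batch jobs list --location=us-central1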
After the pipeline finishes running, the output should be similar to the following:
N E X T F L O W ~ version 23.04.1
Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir : results
Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
executor > google-batch (4)
[67/71b856] process > RNASEQ:INDEX (transcript) [100%] 1 of 1 ✔
[0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[a9/571723] process > RNASEQ:QUANT (gut) [100%] 1 of 1 ✔
[9a/1f0dd4] process > MULTIQC [100%] 1 of 1 ✔
Done! Open the following report in your browser --> results/multiqc_report.html
Completed at: 20-Apr-2023 15:44:55
Duration : 10m 13s
CPU hours : (a few seconds)
Succeeded : 4
View outputs of the pipeline
After the pipeline finishes running, it saves the final report as results/multiqc_report.html and stores output files, logs, errors, and temporary files from each task in the WORK_DIRECTORY folder of your Cloud Storage bucket.
To check the pipeline's output files in the WORK_DIRECTORY folder of your Cloud Storage bucket, you can use the Google Cloud console or the gcloud CLI.
Console
To check the pipeline's output files using the Google Cloud console, follow these steps:
In the Google Cloud console, go to the Buckets page.
In the Name column, click the name of the bucket you created in the previous steps.
On the Bucket details page, open the WORK_DIRECTORY folder.
There is a folder for each separate task that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.
gcloud
To check the pipeline's output files using the gcloud CLI, use the gcloud storage ls command.
gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY
Replace the following:
BUCKET_NAME: the name of the bucket you created in the previous steps.
WORK_DIRECTORY: the directory you specified in the nextflow.config file.
The output lists a folder for each separate task that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.
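If you want to browse the contents of every task folder at once instead of opening them one at a time, you can optionally add the --recursive flag to the same command:
gcloud storage ls --recursive gs://BUCKET_NAME/WORK_DIRECTORY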