Orchestrate jobs by running Nextflow pipelines on Batch

This tutorial explains how to run a Nextflow pipeline on Batch. Specifically, this tutorial runs the sample rnaseq-nf life sciences pipeline from Nextflow, which quantifies genomic features from short-read RNA-Seq data.

This tutorial is intended for Batch users who want to use Nextflow with Batch.

Nextflow is open-source software for orchestrating bioinformatics workflows.

Create a Cloud Storage bucket

To create a Cloud Storage bucket to store temporary work and output files from the Nextflow pipeline, use the Google Cloud console or the command line.

Console

To create a Cloud Storage bucket using the Google Cloud console, follow these steps:

  1. In the Google Cloud console, go to the Buckets page.

    Go to Buckets

  2. Click Create.

  3. On the Create a bucket page, enter a globally unique name for your bucket.

  4. Click Create.

  5. In the Public access will be prevented window, click Confirm.

gcloud

To create a Cloud Storage bucket using the Google Cloud CLI, use the gcloud storage buckets create command.

gcloud storage buckets create gs://BUCKET_NAME

Replace BUCKET_NAME with a globally unique name for your bucket.

If the request is successful, the output should be similar to the following:

Creating gs://BUCKET_NAME/...
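
Optionally, to verify that the bucket exists and to review its settings, you can use the gcloud storage buckets describe command:

gcloud storage buckets describe gs://BUCKET_NAME

Replace BUCKET_NAME with the name of your bucket.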

Configure Nextflow

To configure the Nextflow pipeline to run on Batch, follow these steps on the command line:

  1. Clone the sample pipeline repository:

    git clone https://github.com/nextflow-io/rnaseq-nf.git
    
  2. Go to the rnaseq-nf folder:

    cd rnaseq-nf
    
  3. Open the nextflow.config file:

    nano nextflow.config
    

    The file should contain the following gcb section:

    gcb {
      params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
      params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
      params.multiqc = 'gs://rnaseq-nf/multiqc'
      process.executor = 'google-batch'
      process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
      workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
      google.region  = 'REGION'
    }
    
  4. In the gcb section, do the following (a completed example appears after these steps):

    1. Replace BUCKET_NAME with the name of the Cloud Storage bucket you created in the previous steps.

    2. Replace WORK_DIRECTORY with the name for a new folder that the pipeline can use to store logs and outputs.

      For example, enter workDir.

    3. Replace REGION with the region to use.

      For example, enter us-central1.

    4. After the google.region field, add the following fields:

      1. Add the google.project field:

        google.project = 'PROJECT_ID'
        

        Replace PROJECT_ID with the project ID of the current Google Cloud project.

      2. If you aren't using the Compute Engine default service account as the job's service account, add the google.batch.serviceAccountEmail field:

        google.batch.serviceAccountEmail = 'SERVICE_ACCOUNT_EMAIL'
        

        Replace SERVICE_ACCOUNT_EMAIL with the email address of the job's service account that you prepared for this tutorial.

  5. To save your edits and exit nano, do the following:

    1. Press Control+X.

    2. To confirm saving the file, enter Y.

    3. To confirm the file name, press Enter.
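
For reference, after these edits, the gcb section might look similar to the following sketch. The bucket name, work directory, region, project ID, and service account email shown here are placeholder example values, not values from this tutorial:

gcb {
  params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
  params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
  params.multiqc = 'gs://rnaseq-nf/multiqc'
  process.executor = 'google-batch'
  process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
  workDir = 'gs://example-bucket/workDir'  // example value
  google.region = 'us-central1'            // example value
  google.project = 'example-project-id'    // example value
  google.batch.serviceAccountEmail = 'example-sa@example-project-id.iam.gserviceaccount.com'  // example value
}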

Run the pipeline

From the rnaseq-nf folder, run the sample Nextflow pipeline using the command line. The ../nextflow path assumes that the Nextflow executable is in the parent folder; adjust the path to match your installation:

../nextflow run nextflow-io/rnaseq-nf -profile gcb

The pipeline processes a small dataset using the settings that you provided in the previous steps. This operation might take up to 10 minutes to complete.
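
If a run fails or is interrupted, you can optionally retry it with Nextflow's -resume flag, which reuses the cached results of tasks that already completed in your work directory:

../nextflow run nextflow-io/rnaseq-nf -profile gcb -resume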

After the pipeline finishes running, the output should be similar to the following:

N E X T F L O W  ~  version 23.04.1
Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
 R N A S E Q - N F   P I P E L I N E
 ===================================
 transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
 reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
 outdir       : results

Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
executor >  google-batch (4)
[67/71b856] process > RNASEQ:INDEX (transcript)     [100%] 1 of 1 ✔
[0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[a9/571723] process > RNASEQ:QUANT (gut)            [100%] 1 of 1 ✔
[9a/1f0dd4] process > MULTIQC                       [100%] 1 of 1 ✔

Done! Open the following report in your browser --> results/multiqc_report.html

Completed at: 20-Apr-2023 15:44:55
Duration    : 10m 13s
CPU hours   : (a few seconds)
Succeeded   : 4
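
Because the gcb profile sets process.executor to google-batch, Nextflow submits each pipeline process as a Batch job on your behalf. Optionally, to inspect those jobs directly, one option is to list them using the gcloud CLI, assuming REGION is the region that you set in the nextflow.config file:

gcloud batch jobs list --location=REGION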

View outputs of the pipeline

After the pipeline finishes running, it stores its output files, logs, errors, and temporary files in the WORK_DIRECTORY folder of your Cloud Storage bucket, and it saves the final report as the results/multiqc_report.html file shown in the pipeline's output.

To check the pipeline's output files in the WORK_DIRECTORY folder of your Cloud Storage bucket, you can use the Google Cloud console or the command line.

Console

To check the pipeline's output files using the Google Cloud console, follow these steps:

  1. In the Google Cloud console, go to the Buckets page.

    Go to Buckets

  2. In the Name column, click the name of the bucket you created in the previous steps.

  3. On the Bucket details page, open the WORK_DIRECTORY folder.

There is a folder for each separate task that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.

gcloud

To check the pipeline's output files using the gcloud CLI, use the gcloud storage ls command.

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY

Replace the following:

  • BUCKET_NAME: the name of the bucket you created in the previous steps.

  • WORK_DIRECTORY: the directory you specified in the nextflow.config file.

The output lists a folder for each separate task that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.
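
Optionally, to download the pipeline's files for local inspection, you can copy the work directory recursively using the gcloud CLI. Note that the work directory can contain large temporary files:

gcloud storage cp --recursive gs://BUCKET_NAME/WORK_DIRECTORY .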