This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, which you can configure on Dataproc. You can configure Dataproc to run Dask either with its standalone scheduler or with YARN for resource management.
This tutorial configures Dataproc with a hosted JupyterLab instance to run a notebook featuring a single-cell genomics analysis. Using a Jupyter Notebook on Dataproc lets you combine the interactive capabilities of Jupyter with the workload scaling that Dataproc enables. With Dataproc, you can scale out your workloads from one to many machines, which you can configure with as many GPUs as you need.
This tutorial is intended for data scientists and researchers. It assumes that you are experienced with Python and have basic knowledge of the tools that this tutorial uses: Dataproc, Dask, RAPIDS, and Jupyter.
Prepare your environment
Select a location for your resources.
REGION=REGION
Create a Cloud Storage bucket.
gcloud storage buckets create gs://BUCKET --location=REGION
Copy the following initialization actions to your bucket.
SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh gs://BUCKET/gpu/install_gpu_driver.sh
gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh gs://BUCKET/dask/dask.sh
gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh gs://BUCKET/rapids/rapids.sh
gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh gs://BUCKET/python/pip-install.sh
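The four copy commands above follow one pattern, so they can be sketched as a loop. This is a minimal sketch and not part of the original tutorial: the `copy_init_actions` function and the `DRY_RUN` variable are hypothetical names, and the sketch assumes that you pass your region and your bucket name (without the `gs://` prefix) as arguments.

```shell
# Hypothetical helper: copy the four initialization actions into your bucket.
# Usage: copy_init_actions REGION BUCKET
# Set DRY_RUN=1 to print the commands instead of running them.
copy_init_actions() {
  local region="$1"
  local bucket="$2"
  local src="gs://goog-dataproc-initialization-actions-${region}"
  local script
  for script in gpu/install_gpu_driver.sh dask/dask.sh rapids/rapids.sh python/pip-install.sh; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Dry run: show the command that would be executed.
      echo "gcloud storage cp ${src}/${script} gs://${bucket}/${script}"
    else
      # Stop at the first failed copy so that a missing script is not overlooked.
      gcloud storage cp "${src}/${script}" "gs://${bucket}/${script}" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1 copy_init_actions us-central1 my-bucket` prints the four commands without copying anything; `us-central1` and `my-bucket` are example values only.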
Create a Dataproc cluster with JupyterLab and open source components
- Create a Dataproc cluster.
gcloud dataproc clusters create CLUSTER_NAME \
  --region REGION \
  --image-version 2.0-ubuntu18 \
  --master-machine-type n1-standard-32 \
  --master-accelerator type=nvidia-tesla-t4,count=4 \
  --initialization-actions gs://BUCKET/gpu/install_gpu_driver.sh,gs://BUCKET/dask/dask.sh,gs://BUCKET/rapids/rapids.sh,gs://BUCKET/python/pip-install.sh \
  --initialization-action-timeout=60m \
  --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1,wget" \
  --optional-components JUPYTER \
  --enable-component-gateway \
  --single-node
The cluster has the following properties:
- --region: the region where your cluster is located.
- --image-version: 2.0-ubuntu18, the cluster image version.
- --master-machine-type: n1-standard-32, the main machine type.
- --master-accelerator: the type and count of GPUs on the main node, four nvidia-tesla-t4 GPUs.
- --initialization-actions: the Cloud Storage paths to the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
- --initialization-action-timeout: the timeout for the initialization actions.
- --metadata: passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, the Dask runtime set in dask-runtime (yarn in this command), and RAPIDS version 21.06.
- --optional-components: configures the cluster with the Jupyter optional component.
- --enable-component-gateway: allows access to web UIs on the cluster.
- --single-node: configures the cluster as a single node (no workers).
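Because the initialization actions can take a while (the command allows up to 60 minutes), it can help to confirm that the cluster has finished provisioning before you open JupyterLab. The following is a minimal sketch, not part of the original tutorial: `cluster_state` and `cluster_is_running` are hypothetical helper names, and the sketch assumes that `gcloud` is authenticated for your project.

```shell
# Hypothetical helper: print the cluster's current state (for example, RUNNING).
# Usage: cluster_state CLUSTER_NAME REGION
cluster_state() {
  gcloud dataproc clusters describe "$1" \
    --region "$2" \
    --format="value(status.state)"
}

# Hypothetical helper: succeed only when the cluster reports RUNNING.
# Usage: cluster_is_running CLUSTER_NAME REGION
cluster_is_running() {
  [ "$(cluster_state "$1" "$2")" = "RUNNING" ]
}
```

For example, `cluster_is_running CLUSTER_NAME REGION && echo "Cluster is ready"` prints the message only after provisioning has completed.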
Access the Jupyter Notebook
- Open the Dataproc Clusters page in the Google Cloud console.
- Click your cluster and click the Web Interfaces tab.
- Click JupyterLab.
- Open a new terminal in JupyterLab.
Clone the clara-parabricks/rapids-single-cell-examples repository and check out the dataproc/multi-gpu branch.
git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
cd rapids-single-cell-examples
git checkout dataproc/multi-gpu
In JupyterLab, navigate to the rapids-single-cell-examples/notebooks directory and open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter Notebook.
To clear all the outputs in the notebook, select Edit > Clear All Outputs.
Read the instructions in the cells of the notebook. The notebook uses Dask and RAPIDS on Dataproc to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see Accelerating Single Cell Genomic Analysis using RAPIDS.
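Before you run the notebook cells, you can check from the JupyterLab terminal that the GPUs requested with --master-accelerator are visible to the driver. The following is a minimal sketch, not part of the original tutorial: `count_gpus`, `check_gpus`, and the `NVIDIA_SMI` override variable are hypothetical names, and the sketch assumes that the GPU driver initialization action installed `nvidia-smi` on the node.

```shell
# Hypothetical helper: count the GPUs that the NVIDIA driver reports.
# NVIDIA_SMI can point to a different executable; it defaults to nvidia-smi.
count_gpus() {
  "${NVIDIA_SMI:-nvidia-smi}" --query-gpu=name --format=csv,noheader | grep -c . || true
}

# Hypothetical helper: check that at least the expected number of GPUs is visible.
# Usage: check_gpus EXPECTED_COUNT
check_gpus() {
  local expected="$1"
  local found
  found="$(count_gpus)"
  if [ "$found" -ge "$expected" ]; then
    echo "OK: found ${found} GPU(s)"
  else
    echo "Expected ${expected} GPU(s), found ${found}" >&2
    return 1
  fi
}
```

For the cluster created in this tutorial, `check_gpus 4` should succeed, because the main node is configured with four GPUs.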