Run a genomics analysis in a JupyterLab notebook on Dataproc

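This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, all of which you can configure on Dataproc. You can configure Dataproc to run Dask either with its standalone scheduler or with YARN for resource management.

This tutorial configures Dataproc with a hosted JupyterLab instance to run a notebook that contains a single-cell genomics analysis. Using a Jupyter notebook on Dataproc lets you combine the interactive capabilities of Jupyter with the workload scaling that Dataproc provides. With Dataproc, you can scale out your workloads from one machine to many machines, and you can configure those machines with as many GPUs as you need.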

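This tutorial is intended for data scientists and researchers. It assumes that you are experienced with Python and have basic knowledge of the tools it uses: Dataproc, Dask, RAPIDS, and Jupyter.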

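Objectives

Create a Dataproc instance that is configured with GPUs, JupyterLab, and open source components.
Run a notebook on Dataproc.

Costs

In this document, you use the following billable components of Google Cloud: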

Dataproc

Cloud Storage

GPUs

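To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin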

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Google Cloud project.
Enable the Dataproc API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API
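
If you prefer the command line to the console, the API can also be enabled with gcloud. The following is a minimal sketch, not the documented procedure: the project ID `my-project-id` and the `run` dry-run helper are assumptions introduced here for illustration. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical project ID (assumption); replace with your own.
PROJECT_ID="my-project-id"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Select the project, then enable the Dataproc API in it.
enable_dataproc_api() {
  run gcloud config set project "$1"
  run gcloud services enable dataproc.googleapis.com
}

enable_dataproc_api "$PROJECT_ID"
```

Enabling the API still requires the Service Usage Admin role described above.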

Prepare your environment

Select a location for your resources.
```
REGION=REGION
```

Create a Cloud Storage bucket.

gcloud storage buckets create gs://BUCKET --location=REGION
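
If you rerun this step, creating a bucket that already exists fails. The sketch below wraps the create command in a check so that the step can be repeated safely. The function names `bucket_exists` and `ensure_bucket` and the example values in the final comment are hypothetical, introduced only for illustration.

```shell
# Return success if the bucket already exists and is visible to you.
bucket_exists() {
  gcloud storage buckets describe "gs://$1" >/dev/null 2>&1
}

# Create the bucket in the given region unless it already exists.
# $1 = bucket name, $2 = region
ensure_bucket() {
  if bucket_exists "$1"; then
    echo "bucket gs://$1 already exists"
  else
    gcloud storage buckets create "gs://$1" --location="$2"
  fi
}

# Example call with hypothetical values:
# ensure_bucket my-genomics-bucket us-central1
```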

Copy the following initialization actions to your bucket.

SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh gs://BUCKET/gpu/install_gpu_driver.sh
gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh gs://BUCKET/dask/dask.sh
gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh gs://BUCKET/rapids/rapids.sh
gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh gs://BUCKET/python/pip-install.sh
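
The four copy commands follow one pattern, so they can also be generated from a list of script paths. This is a sketch under stated assumptions: the region `us-central1`, the bucket name `my-genomics-bucket`, and the helper functions are hypothetical. The script only prints the commands so that you can review them first.

```shell
REGION="us-central1"         # hypothetical example region
BUCKET="my-genomics-bucket"  # hypothetical bucket name
SCRIPT_BUCKET="gs://goog-dataproc-initialization-actions-${REGION}"

# Relative paths of the four initialization actions used in this tutorial.
init_action_paths() {
  printf '%s\n' \
    gpu/install_gpu_driver.sh \
    dask/dask.sh \
    rapids/rapids.sh \
    python/pip-install.sh
}

# Print one "gcloud storage cp" command per script.
print_copy_commands() {
  init_action_paths | while read -r path; do
    echo "gcloud storage cp ${SCRIPT_BUCKET}/${path} gs://${BUCKET}/${path}"
  done
}

print_copy_commands
```

After checking the printed commands, you could run them by piping the output to a shell, for example `print_copy_commands | sh`.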

Create a Dataproc cluster with JupyterLab and open source components

Create a Dataproc cluster.

gcloud dataproc clusters create CLUSTER_NAME \
    --region REGION \
    --image-version 2.0-ubuntu18 \
    --master-machine-type n1-standard-32 \
    --master-accelerator type=nvidia-tesla-t4,count=4 \
    --initialization-actions gs://BUCKET/gpu/install_gpu_driver.sh,gs://BUCKET/dask/dask.sh,gs://BUCKET/rapids/rapids.sh,gs://BUCKET/python/pip-install.sh \
    --initialization-action-timeout=60m \
    --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1 wget" \
    --optional-components JUPYTER \
    --enable-component-gateway \
    --single-node

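The cluster has the following properties:

--region: The region where the cluster is located.
--image-version: 2.0-ubuntu18, the cluster image version.
--master-machine-type: n1-standard-32, the main machine type.
--master-accelerator: The type and count of GPUs on the main node (four nvidia-tesla-t4 GPUs).
--initialization-actions: The Cloud Storage paths to the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
--initialization-action-timeout: The timeout for the initialization actions.
--metadata: Passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, the Dask runtime selected by dask-runtime (yarn in this command), RAPIDS version 21.06, and the extra pip packages listed in PIP_PACKAGES.
--optional-components: Configures the cluster with the Jupyter optional component.
--enable-component-gateway: Allows access to web interfaces on the cluster.
--single-node: Configures the cluster as a single node (no workers).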
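
The initialization actions can take a while to finish, so it can help to confirm the cluster state before you open JupyterLab. The sketch below polls the state with `gcloud dataproc clusters describe`. The function names, the `POLL_SECONDS` variable, and the default of 60 attempts are assumptions made for illustration. Polling is mainly useful if you created the cluster asynchronously, because the create command otherwise waits for completion on its own.

```shell
# Query the cluster state (for example CREATING, RUNNING, or ERROR).
# $1 = cluster name, $2 = region
cluster_state() {
  gcloud dataproc clusters describe "$1" --region="$2" \
    --format="value(status.state)"
}

# Poll until the cluster reports RUNNING.
# $1 = cluster name, $2 = region, $3 = maximum attempts (default 60)
# Returns 0 on RUNNING, 1 on an ERROR state or when attempts run out.
wait_until_running() {
  attempts=0
  while [ "$attempts" -lt "${3:-60}" ]; do
    state="$(cluster_state "$1" "$2")"
    echo "state: ${state}"
    case "$state" in
      RUNNING) return 0 ;;
      ERROR*) return 1 ;;
    esac
    attempts=$((attempts + 1))
    sleep "${POLL_SECONDS:-30}"
  done
  return 1
}

# Example call with placeholder values:
# wait_until_running CLUSTER_NAME REGION
```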

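Access the Jupyter notebook

In the Google Cloud console, open the Dataproc Clusters page.
Open the Clusters page
Click your cluster, and then click the Web Interfaces tab.
Click JupyterLab.
Open a new terminal in JupyterLab.

Clone the clara-parabricks/rapids-single-cell-examples repository and check out the dataproc/multi-gpu branch.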

git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
cd rapids-single-cell-examples
git checkout dataproc/multi-gpu

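In JupyterLab, go to the rapids-single-cell-examples/notebooks directory and open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter notebook.
To clear all of the outputs in the notebook, select Edit > Clear All Outputs.
Read the instructions in the notebook cells. The notebook uses Dask and RAPIDS on Dataproc to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see Accelerating Single Cell Genomic Analysis using RAPIDS.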
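
Before running the notebook, you might want to confirm from the JupyterLab terminal that the GPUs are visible to the node. The following sketch counts the devices listed by `nvidia-smi -L`. The `count_gpus` helper is a hypothetical name introduced here. On the cluster configured in this tutorial, four GPUs would be expected.

```shell
# Count GPUs in an `nvidia-smi -L` listing read from standard input.
# Each device appears on its own line that starts with "GPU <index>".
count_gpus() {
  grep -c '^GPU [0-9]'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  echo "visible GPUs: $(nvidia-smi -L | count_gpus)"
else
  echo "nvidia-smi not found; run this in the JupyterLab terminal on the cluster"
fi
```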

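Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project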

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete individual resources

Delete your Dataproc cluster.

gcloud dataproc clusters delete CLUSTER_NAME \
    --region=REGION

Delete the bucket:
```
gcloud storage buckets delete gs://BUCKET
```
Important: You must empty the bucket before you can delete it.
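
Because the bucket must be emptied first, one option is to remove the objects and the bucket in a single recursive command. The sketch below is an illustration under stated assumptions: the `run` dry-run helper, the `cleanup` function, and the example names are hypothetical. By default it only prints the commands; set `DRY_RUN=0` to execute them. Deleted objects cannot be recovered.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Delete the cluster, then the bucket together with its contents.
# $1 = cluster name, $2 = region, $3 = bucket name
cleanup() {
  run gcloud dataproc clusters delete "$1" --region="$2" --quiet
  # --recursive removes all objects in the bucket and then the bucket itself.
  run gcloud storage rm --recursive "gs://$3"
}

# Hypothetical example values; replace with your own.
cleanup "genomics-cluster" "us-central1" "my-genomics-bucket"
```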

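What's next

Learn more about Dataproc.
Explore reference architectures, diagrams, tutorials, and best practices. Take a look at our Cloud Architecture Center.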