Manage GPU container workloads

You can enable and manage graphics processing unit (GPU) resources on your containers. For example, you might prefer running artificial intelligence (AI) and machine learning (ML) notebooks in a GPU environment. To run GPU container workloads, you must have a Kubernetes cluster that supports GPU devices. GPU support is enabled by default for Kubernetes clusters that have GPU machines provisioned for them.

This document is for application developers within the application operator group who are responsible for creating application workloads for their organization. For more information, see Audiences for GDC air-gapped documentation.

Before you begin

To deploy GPUs to your containers, you must have the following:

  • A Kubernetes cluster with a GPU machine class. See the supported GPU cards section for the GPU options that you can configure for your cluster machines.

  • The User Cluster Admin role (user-cluster-admin) to check GPUs in a shared cluster, and the Namespace Admin role (namespace-admin) in your project namespace to deploy GPU workloads to a shared cluster.

  • The Standard Cluster Admin role (standard-cluster-admin) to check GPUs in a standard cluster, and the Cluster Developer role (cluster-developer) in your standard cluster to deploy GPU workloads to a standard cluster.

  • To run commands against a Kubernetes cluster, make sure you complete the following:

    • Locate the Kubernetes cluster name, or ask a member of the platform administrator group what the cluster name is.

    • Sign in and generate the kubeconfig file for the Kubernetes cluster if you don't have one.

    • Use the kubeconfig path of the Kubernetes cluster to replace KUBERNETES_CLUSTER_KUBECONFIG in these instructions. An optional way to verify that a kubeconfig file works is sketched after this list.

  • The kubeconfig path for the zonal management API server that hosts your Kubernetes cluster. Sign in and generate the kubeconfig file if you don't have one.

  • The kubeconfig path for the org infrastructure cluster in the zone intended to host your GPUs. Sign in and generate the kubeconfig file if you don't have one.
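
As an optional check before you begin, you can confirm that a kubeconfig file loads and reaches the environment you expect. The following is a minimal sketch that uses the Kubernetes cluster kubeconfig as an example:

    # Confirm that the kubeconfig file loads and shows the expected context.
    kubectl config current-context --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG

    # Confirm that the cluster is reachable with this kubeconfig.
    kubectl get nodes --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG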

Configure a container to use GPU resources

To use GPU resources in a container, complete the following steps:

  1. Verify your Kubernetes cluster has node pools that support GPUs:

    kubectl describe clusters.cluster.gdc.goog/KUBERNETES_CLUSTER_NAME \
        -n KUBERNETES_CLUSTER_NAMESPACE \
        --kubeconfig MANAGEMENT_API_SERVER
    

    Replace the following:

    • KUBERNETES_CLUSTER_NAME: the name of the cluster.
    • KUBERNETES_CLUSTER_NAMESPACE: the namespace of the cluster. For shared clusters, use the platform namespace. For standard clusters, use the project namespace of the cluster.
    • MANAGEMENT_API_SERVER: the kubeconfig path of the zonal management API server that hosts the Kubernetes cluster. If you have not yet generated a kubeconfig file for the API server in your targeted zone, see Sign in.

    The relevant output is similar to the following snippet:

    # Several lines of code are omitted here.
    spec:
      nodePools:
      - machineTypeName: a2-ultragpu-1g-gdc
        nodeCount: 2
    # Several lines of code are omitted here.
    

    For a full list of supported GPU machine types and Multi-Instance GPU (MIG) profiles, see Cluster node machine types.

  2. Add the .containers.resources.requests and .containers.resources.limits fields to your container spec. Resource names differ depending on your machine class. Check your GPU resource allocation to find your GPU resource names.

    For example, the following container spec requests three partitions of a GPU from an a2-ultragpu-1g-gdc node:

     ...
     containers:
     - name: my-container
       image: "my-image"
       resources:
         requests:
           nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
         limits:
           nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
     ...
    
  3. Containers also require additional permissions to access GPUs. For each container that requests GPUs, add the following permissions to your container spec. A complete manifest that combines these permissions with the resource requests from the previous step is sketched after this procedure:

    ...
    securityContext:
      seLinuxOptions:
        type: unconfined_t
    ...
    
  4. Apply your container manifest file:

    kubectl apply -f CONTAINER_MANIFEST_FILE \
        -n KUBERNETES_CLUSTER_NAMESPACE \
        --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG
    

    Replace the following:

    • CONTAINER_MANIFEST_FILE: the YAML manifest file for your container workload.
    • KUBERNETES_CLUSTER_NAMESPACE: the namespace of the cluster. For shared clusters, use the platform namespace. For standard clusters, use the project namespace of the cluster.
    • KUBERNETES_CLUSTER_KUBECONFIG: the kubeconfig path of the cluster.
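
The following is a minimal sketch of a complete Pod manifest that combines the resource requests from step 2 with the permissions from step 3. The Pod name, container image, and command are placeholders, and the resource name assumes the a2-ultragpu-1g-gdc example from earlier; substitute the GPU resource names that your cluster reports:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-gpu-pod             # Placeholder Pod name.
    spec:
      containers:
      - name: my-container
        image: "my-image"          # Placeholder container image.
        command: ["sleep", "infinity"]
        resources:
          requests:
            nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
          limits:
            nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 3
        securityContext:
          seLinuxOptions:
            type: unconfined_t

If your container image includes the nvidia-smi tool, you can run it inside the container after the Pod starts to confirm that the GPU partitions are visible:

    kubectl exec my-gpu-pod \
        -n KUBERNETES_CLUSTER_NAMESPACE \
        --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG \
        -- nvidia-smi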

Check GPU resource allocation

  • To check your GPU resource allocation, use the following command:

    kubectl describe nodes NODE_NAME --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG
    

    Replace the following:

    • NODE_NAME: the name of the node that hosts the GPUs you want to inspect.
    • KUBERNETES_CLUSTER_KUBECONFIG: the kubeconfig path of the cluster.

    The relevant output is similar to the following snippet:

    # Several lines of code are omitted here.
    Capacity:
      nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 7
    Allocatable:
      nvidia.com/mig-1g.10gb-NVIDIA_A100_80GB_PCIE: 7
    # Several lines of code are omitted here.
    

Note the resource names for your GPUs; you must specify them when configuring a container to use GPU resources.
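
To see at a glance how many GPU partitions each node exposes, you can also list the allocatable GPU resources across all nodes. This is a sketch that assumes the resource name from the preceding example; replace it with the resource names that your nodes report:

    kubectl get nodes \
        -o custom-columns='NODE:.metadata.name,ALLOCATABLE:.status.allocatable.nvidia\.com/mig-1g\.10gb-NVIDIA_A100_80GB_PCIE' \
        --kubeconfig KUBERNETES_CLUSTER_KUBECONFIG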