This guide describes how to enable, disable, and manage the Collective Communication Analyzer (CoMMA) library. CoMMA collects NCCL telemetry for Google Cloud services. For more information about CoMMA, see Collective Communication Analyzer (CoMMA).
Enable CoMMA
CoMMA is pre-installed and enabled if you use images that contain the NCCL gIB plugin. For a list of these images, see Images that have CoMMA enabled.
Installation options
If you don't use any of these images and want to install CoMMA, use one of the following methods.
| Installation method | Supported machine types | 
|---|---|
| NCCL Google Infrastructure Bundle (gIB) image (Recommended for newer machine types) | A4X, A4 High, and A3 Ultra | 
| CoMMA installer image | A4X, A4 High, and A3 Ultra | 
| Build from source (Required for older machine types) | A3 Mega, A3 High, A3 Edge, A2 Ultra, A2 Standard, and N1 with attached GPUs | 
Install CoMMA
To install CoMMA, select one of the following options:
NCCL gIB image
To install CoMMA by using the NCCL gIB image, run the following command.
docker run --rm --name nccl-gib-installer --volume /usr/local/gib:/var/lib/gib \
  us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib install \
  --install-nccl
CoMMA installer image
CoMMA binaries are also available in a standalone Docker image. You can use the
CoMMA Docker image, us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/comma-installer,
as an init container (initContainers) to install CoMMA binaries into your workload container.
The container stores the binaries in the /artifacts directory.
To use the CoMMA installer image, complete the following steps:
- Install CoMMA into your workload by adding the following snippet to your initContainers:
  - name: profiler-plugin-installer
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/comma-installer:latest
    imagePullPolicy: Always
    volumeMounts:
    - name: nccl-plugin-volume
      mountPath: /usr/local/nccl-plugin
    resources:
      requests:
        cpu: 150m
    command:
    - /bin/sh
    - -c
    - |
      set -ex
      rm -rf /usr/local/nccl-plugin/lib64/libnccl-profiler.so
The YAML configuration snippet specifies a container for installing a profiler
plugin. The snippet specifies the Docker image, its pull policy, and a volume
mount for the plugin. The container requires a small amount of CPU resources.
The command section is central to the configuration. The section runs a
shell script to remove any existing profiler library. It then copies the new
profiler library into the designated plugin directory. The script verifies that
the correct version of the profiler plugin is installed and ready for use.
Build from source
To build the CoMMA library from source, install the following software:
- The Rust programming language toolchain, including the compiler and Cargo.
- libclang-dev, which bindgen requires.
- CMake version 3.10 or later.
To build from source, complete the following steps:
- Clone the repository and its submodules.
  git clone --recurse-submodules https://github.com/google/CoMMA
- Compile the binaries by using Cargo.
  cargo build --release
  Cargo saves the binary in target/release/libnccl_profiler.so.
- Enable NCCL to load the CoMMA libraries by using one of the following methods:
  - Copy the compiled libnccl_profiler.so to a directory in your LD_LIBRARY_PATH. Rename it to libnccl-profiler.so (use a hyphen instead of an underscore).
  - Alternatively, set the NCCL_PROFILER_PLUGIN environment variable to specify the path of the .so file.
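The copy-and-rename method can be sketched as the following shell commands. This is a minimal illustration, not part of the CoMMA tooling: the PLUGIN_DIR variable and the $HOME/nccl-plugins directory are hypothetical names chosen for the example, and the sketch assumes that you run it from the root of the cloned repository after the build finishes.

```shell
# Assumption: run from the root of the cloned CoMMA repository after
# `cargo build --release` has finished.
PLUGIN_SRC="target/release/libnccl_profiler.so"

# Hypothetical destination directory; any directory works as long as it is
# on LD_LIBRARY_PATH when the workload starts.
PLUGIN_DIR="${PLUGIN_DIR:-$HOME/nccl-plugins}"
mkdir -p "$PLUGIN_DIR"

# Copy and rename in one step: the build output uses an underscore, and the
# name that NCCL looks for uses a hyphen.
if [ -f "$PLUGIN_SRC" ]; then
  cp "$PLUGIN_SRC" "$PLUGIN_DIR/libnccl-profiler.so"
else
  echo "Build output not found at $PLUGIN_SRC; run cargo build --release first." >&2
fi

# Prepend the directory so that NCCL can find the renamed library.
export LD_LIBRARY_PATH="$PLUGIN_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Alternative method: skip the copy and point NCCL at the build output directly.
# export NCCL_PROFILER_PLUGIN="$PWD/$PLUGIN_SRC"
```

Use only one of the two methods. The commented-out last line shows the NCCL_PROFILER_PLUGIN alternative.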
Verify installation or enablement
To verify that NCCL loads the CoMMA libraries, review the NCCL logs:
- Enable NCCL debug logging. Enable logging by setting the NCCL_DEBUG=INFO environment variable. You can also specify a more detailed debug level. For more debug options, see the NCCL_DEBUG section in the NVIDIA documentation.
- Specify the INIT subsystem for debugging. Specify INIT by setting the NCCL_DEBUG_SUBSYS=INIT environment variable. You can also specify other subsystems. For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
- Find a line in the NCCL log that is similar to the following:
NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN
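The verification steps above can be sketched in shell as follows. The comma_plugin_loaded helper and the LOG_FILE variable are hypothetical names introduced for this example, and the temporary file only stands in for your real workload log; it contains the sample line from this guide so that the sketch runs without a GPU workload.

```shell
# Turn on NCCL debug logging for the INIT subsystem before the workload starts.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT

# Hypothetical helper: succeeds if the given log file contains a
# PROFILER/Plugin line.
comma_plugin_loaded() {
  grep -q "NCCL INFO PROFILER/Plugin" "$1"
}

# Stand-in for a real workload log. In practice, set LOG_FILE to the file
# that captures your workload's output.
LOG_FILE="$(mktemp)"
echo "NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN" > "$LOG_FILE"

if comma_plugin_loaded "$LOG_FILE"; then
  echo "Profiler plugin line found in $LOG_FILE"
else
  echo "Profiler plugin line not found in $LOG_FILE" >&2
fi
```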
Disable CoMMA
If CoMMA is already installed, prevent it from collecting
NCCL telemetry by setting the NCCL_TELEMETRY_MODE=0 CoMMA environment variable
before running your workloads. To set CoMMA environment
variables, see Set environment variables.
To re-enable CoMMA after disabling it, set the NCCL_TELEMETRY_MODE environment variable to a non-zero value. For example, to use the default mode, specify NCCL_TELEMETRY_MODE=3. To review the full list of options, see NCCL_TELEMETRY_MODE in the Configuration options table.
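As a brief illustration, disabling and later re-enabling CoMMA in a shell session might look like the following. The sketch uses only the values that this guide names (0 to disable, 3 for the default mode) and assumes that you launch the workload from the same shell.

```shell
# Disable telemetry collection before launching a workload.
export NCCL_TELEMETRY_MODE=0
echo "NCCL_TELEMETRY_MODE=$NCCL_TELEMETRY_MODE"

# Later, re-enable CoMMA with the default mode before the next workload run.
export NCCL_TELEMETRY_MODE=3
echo "NCCL_TELEMETRY_MODE=$NCCL_TELEMETRY_MODE"
```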
Configure and view CoMMA NCCL telemetry
If CoMMA is enabled in your environment, you can configure the type of telemetry data that it collects by setting the level of data granularity. This section explains how to set data granularity and the available options.
You can also review the data that CoMMA collects to verify that it aligns with your organization's security policies or to analyze it with your own NCCL telemetry analysis tools. To do so, export the raw data to a local file.
Set data granularity
CoMMA collects NCCL telemetry at different granularity levels. Configure the granularity level by using environment variables. To set CoMMA environment variables, see Set environment variables.
- Default behavior: By default, CoMMA tracks NCCL operations, including both collective and peer-to-peer, the metadata of those operations, and completion times. It uses the following environment variables:
  - NCCL_PROFILER_TRACK_NCCLOP=true
  - NCCL_PROFILER_AGGREGATE_STEPS=true
  - NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP=true
- To enable more granular levels of data collection, set the following environment variables:
  - Track completion time for proxy operations by setting NCCL_PROFILER_TRACK_PROXYOP=true.
  - Track the time spent on each networking I/O operation by setting NCCL_PROFILER_TRACK_STEPS=true. This setting provides the highest level of granularity.
To review the full list of environment variables, see Configuration options.
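For example, a session that keeps the default tracking and adds both of the more granular levels could be configured as in the following sketch. Setting the three default variables explicitly is optional and is shown here only to make the full configuration visible.

```shell
# Defaults, written out explicitly for illustration. CoMMA uses these values
# even if you don't set them.
export NCCL_PROFILER_TRACK_NCCLOP=true
export NCCL_PROFILER_AGGREGATE_STEPS=true
export NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP=true

# More granular: track completion time for proxy operations.
export NCCL_PROFILER_TRACK_PROXYOP=true

# Most granular: track the time spent on each networking I/O operation.
export NCCL_PROFILER_TRACK_STEPS=true
```

Set these variables before the workload starts, because the workload reads them during NCCL initialization.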
Export data to a local file
To view the raw data, export it to a local file by completing the following steps:
- Set NCCL_TELEMETRY_MODE to either 1 or 4. To learn about the NCCL_TELEMETRY_MODE environment variable, see Configuration options.
- Set one of the following export paths:
  - Set NCCL_PROFILER_LATENCY_FILE=PATH to export detailed event traces to a local file. Replace PATH with a path such as /tmp/latency-%p.txt.
  - Set NCCL_PROFILER_SUMMARY_FILE=PATH to export aggregated summary statistics. Replace PATH with a path such as /tmp/summary-%p.txt.
  The system replaces %p with the process ID.
- Review the output. The raw output is a JSON file.
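The export steps above can be sketched as follows, using mode 1 and the example latency path from this guide. The final command is an assumption about how you might locate the output: it lists files that match the path template with %p replaced by a wildcard, and it prints a message if the workload has not written any files yet.

```shell
# Step 1: choose a telemetry mode that supports local export (1 or 4).
export NCCL_TELEMETRY_MODE=1

# Step 2: choose one export path. %p is replaced with the process ID.
export NCCL_PROFILER_LATENCY_FILE=/tmp/latency-%p.txt
# Alternative: export aggregated summary statistics instead.
# export NCCL_PROFILER_SUMMARY_FILE=/tmp/summary-%p.txt

# Step 3: after the workload runs, look for the per-process output files.
ls /tmp/latency-*.txt 2>/dev/null || echo "No latency files found yet."
```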
Configuration options
The following sections summarize all the environment variables that you can configure for CoMMA. They also explain how to set any environment variable.
Set CoMMA environment variables
To change a CoMMA setting from its default value, set the corresponding environment
variable. You can set environment variables on the command line for the
instance or add them to a startup script. If you set the environment variables
at the command line, the values persist only for that session. To make the
environment variables permanent, add them to the ~/.bashrc file, the ~/.profile file,
or whichever startup file your operating system uses. For more information,
review your operating system's documentation.
You need to set CoMMA environment variables before your workload starts, because the workload reads the variables during NCCL initialization. You can set environment variables as follows:
export ENVIRONMENT_VARIABLE=VALUE
Replace the following:
- ENVIRONMENT_VARIABLE: the environment variable you want to set; for example, NCCL_TELEMETRY_MODE.
- VALUE: the value for the environment variable; for example, 0.
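To make a setting permanent, one possible approach is to append the export line to a startup file only if it is not already there. The persist_env_var function below is a hypothetical helper written for this example, not something that CoMMA provides. It assumes a Bourne-style shell and a startup file such as ~/.bashrc.

```shell
# Hypothetical helper: append "export NAME=VALUE" to a startup file unless the
# exact same line is already present.
# Usage: persist_env_var NAME VALUE STARTUP_FILE
persist_env_var() {
  line="export $1=$2"
  grep -qxF -- "$line" "$3" 2>/dev/null || echo "$line" >> "$3"
}

# Set the variable for the current session.
export NCCL_TELEMETRY_MODE=0

# To also persist it for future sessions, uncomment the following line.
# persist_env_var NCCL_TELEMETRY_MODE 0 "$HOME/.bashrc"
```

Because the helper checks for an exact matching line, running it repeatedly with the same arguments does not create duplicate entries. It does not remove an older line that sets a different value, so review the startup file if you change a setting.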
CoMMA environment variables
This section lists the environment variables that you can set for CoMMA and their default values.
| Name | Description | Default |
|---|---|---|
| NCCL_PROFILER_AGGREGATE_STEPS | Enables (true) or disables (false) aggregating network chunk operations. | true |
| NCCL_PROFILER_GPUVIZ_LIB | Specifies the path to libGPUViz.so, a library that uploads NCCL telemetry to Google Cloud services. This library wraps the agent communication API. The agent communication API is the interface that agents, such as processes running within your guest operating system, use to initiate secure and reliable connections with Google Cloud services. If you use an NCCL gIB image as an installer or use any of the images that bundle the NCCL gIB plugin, you don't need to set this environment variable. | |
| NCCL_PROFILER_LATENCY_FILE | Specifies the path template for the latency trace file. For example, /tmp/latency-%p.txt. The system replaces %p in the name with the process ID (pid). To disable file-based export, unset this variable. | |
| NCCL_PROFILER_PLUGIN | Specifies the path to the profiler plugin binary. If you don't specify this setting, NCCL looks for libnccl-profiler.so in the LD_LIBRARY_PATH. | |
| NCCL_PROFILER_SUMMARY_FILE | Specifies the path for the aggregated summary file. For example, /tmp/summary-%p.txt. The system replaces %p in the name with the process ID (pid). To disable file-based export, unset this variable. | |
| NCCL_PROFILER_SUMMARY_INTERVAL | Specifies the interval for summary reporting. For example, 10s or 1m. Supports the units d, h, m, s, ms, us, and ns. | 1m |
| NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP | Enables (true) or disables (false) monitoring inter-process NCCL proxy operations. | true |
| NCCL_PROFILER_TRACK_NCCLOP | Enables (true) or disables (false) tracking and reporting for NCCL operations, including both collective and point-to-point communications. | true |
| NCCL_PROFILER_TRACK_PROXYOP | Enables (true) or disables (false) proxy operation tracking and reporting. | false |
| NCCL_PROFILER_TRACK_STEPS | Enables (true) or disables (false) tracking and reporting network chunk operations. | false |
| NCCL_TELEMETRY_MODE | Controls the export location of the NCCL telemetry data. Values that this guide refers to: 0 prevents CoMMA from collecting NCCL telemetry, 1 and 4 allow export to a local file, and 3 is the default mode. | 3 |
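To check which of these variables are set in the current environment before you start a workload, a small loop such as the following can help. The show_comma_env function is a hypothetical helper written for this example. It reports only exported variables, and it prints <unset> for variables that are not set, in which case the defaults in the preceding table apply.

```shell
# Hypothetical helper: print the current value of every CoMMA environment
# variable from the table, or "<unset>" if the variable is not exported.
show_comma_env() {
  for name in \
    NCCL_PROFILER_AGGREGATE_STEPS \
    NCCL_PROFILER_GPUVIZ_LIB \
    NCCL_PROFILER_LATENCY_FILE \
    NCCL_PROFILER_PLUGIN \
    NCCL_PROFILER_SUMMARY_FILE \
    NCCL_PROFILER_SUMMARY_INTERVAL \
    NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP \
    NCCL_PROFILER_TRACK_NCCLOP \
    NCCL_PROFILER_TRACK_PROXYOP \
    NCCL_PROFILER_TRACK_STEPS \
    NCCL_TELEMETRY_MODE
  do
    # printenv exits with a non-zero status when the variable is not set.
    value="$(printenv "$name" || echo "<unset>")"
    printf '%s=%s\n' "$name" "$value"
  done
}

show_comma_env
```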
What's next
- Learn how to troubleshoot issues with CoMMA.
- Learn how to detect and resolve stragglers.