This page shows you how to resolve common issues that you might encounter when using the Collective Communication Analyzer (CoMMA). CoMMA is a library that collects telemetry data for Google Cloud services. For more information, see Collective Communication Analyzer (CoMMA).
Troubleshoot CoMMA loading issues
CoMMA might not load correctly. To verify that the binaries load correctly, complete these steps:
- Enable NCCL debug logging. To enable logging, set the environment variable NCCL_DEBUG=INFO. You might also use a more detailed debug level. For options, see the NCCL_DEBUG section in the NVIDIA documentation.
- Specify the INIT subsystem for debugging. To specify INIT, set NCCL_DEBUG_SUBSYS=INIT. You might also add other subsystems. For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
- Look for a line in the NCCL log that is similar to the following:

  NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN

If the NCCL_PROFILER_PLUGIN environment variable is unset, NCCL might attempt to load the libnccl-profiler.so binary from the path specified in the LD_LIBRARY_PATH environment variable.
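The verification steps above can be scripted. The following Python sketch is illustrative only: the helper names `debug_env` and `find_plugin_path` are hypothetical, and the regular expression assumes that the log line matches the sample wording shown above, which might differ between NCCL versions.

```python
import os
import re

# Assumption: the plugin line uses the same wording as the sample line above.
PLUGIN_LINE = re.compile(
    r"NCCL INFO PROFILER/Plugin: Plugin name set by env to (\S+)"
)


def debug_env(base=None):
    """Return a copy of the environment with NCCL INIT logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT"
    return env


def find_plugin_path(log_text):
    """Return the plugin path reported in the log, or None if no line matches."""
    match = PLUGIN_LINE.search(log_text)
    return match.group(1) if match else None
```

You might pass `debug_env()` as the `env` argument of `subprocess.run` when you launch your workload, and then pass the captured log text to `find_plugin_path`. A result of `None` suggests that NCCL did not report a plugin name set through the environment.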
To resolve this issue, consider the following solutions:
- Verify that the plugin shared library (libnccl-profiler.so) is correctly named.
- Check that it is located in a directory specified in the LD_LIBRARY_PATH environment variable. Alternatively, check that the NCCL_PROFILER_PLUGIN environment variable points directly to the location of the libnccl-profiler.so binary.
- Check that your NCCL version is 2.23 or later, as the NCCL profiler API requires this version.
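The checks in this list can be approximated in a few lines of Python. This is a minimal sketch, not part of CoMMA or NCCL: `locate_plugin` and `nccl_version_ok` are hypothetical helpers, the sketch treats NCCL_PROFILER_PLUGIN as a full file path only, and it does not reproduce the dynamic loader's complete search order.

```python
import os

PLUGIN_NAME = "libnccl-profiler.so"


def locate_plugin(env):
    """Return the plugin path that the checks above would find, or None."""
    explicit = env.get("NCCL_PROFILER_PLUGIN")
    if explicit:
        # Assumption: the variable holds a full path to the binary.
        return explicit if os.path.isfile(explicit) else None
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, PLUGIN_NAME)
        if directory and os.path.isfile(candidate):
            return candidate
    return None


def nccl_version_ok(version, minimum=(2, 23)):
    """Return True if a 'major.minor[.patch]' string is at least `minimum`."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= minimum
```

For example, `locate_plugin(os.environ)` returns `None` when neither variable leads to an existing file, and `nccl_version_ok` accepts a version string that you obtain separately, such as from your framework or package manager.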
Troubleshoot missing output files
If you configured your environment to send data collected by CoMMA to a local file, but the output file is missing, check the NCCL logs or application logs for messages that are similar to the following:
Failed to open file
Failed to log <telemetry type> to file
These errors indicate an underlying file system issue, such as a missing directory or insufficient free space. CoMMA ceases to export telemetry to files after these errors occur.
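To search a large log for these messages, a small filter such as the following might help. It is a sketch only: `find_export_errors` is a hypothetical helper, and the patterns assume that the messages appear with the wording quoted above.

```python
import re

# Patterns for the two messages quoted above. The telemetry type varies,
# so that part of the second message is matched loosely.
EXPORT_ERRORS = (
    re.compile(r"Failed to open file"),
    re.compile(r"Failed to log .+ to file"),
)


def find_export_errors(log_lines):
    """Return the log lines that contain either file-export error message."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(pattern.search(line) for pattern in EXPORT_ERRORS)
    ]
```

Because an open file object iterates over its lines, you can pass one directly to `find_export_errors`.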
To resolve this issue, consider these solutions:
- Check that the NCCL_PROFILER_LATENCY_FILE or NCCL_PROFILER_SUMMARY_FILE environment variables are set correctly. Provide a valid path and filename template, such as /tmp/latency-%p.txt.
- Check that the process has write permissions to the specified output directory.
- If you modified the NCCL_TELEMETRY_MODE environment variable, check that you set it to a value that enables local file output (for example, 1 or 4).
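The first two checks in this list can be sketched in Python as follows. The helpers `check_output_target` and `check_comma_outputs` are hypothetical. The sketch inspects only the directory part of each template, leaves placeholders such as %p in the filename untouched, and does not validate NCCL_TELEMETRY_MODE.

```python
import os


def check_output_target(template):
    """Return a list of problems with an output path template."""
    directory = os.path.dirname(template) or "."
    problems = []
    if not os.path.basename(template):
        problems.append("template has no filename component")
    if not os.path.isdir(directory):
        problems.append(f"directory does not exist: {directory}")
    elif not os.access(directory, os.W_OK | os.X_OK):
        problems.append(f"directory is not writable: {directory}")
    return problems


def check_comma_outputs(env):
    """Map each configured output variable to its list of problems."""
    names = ("NCCL_PROFILER_LATENCY_FILE", "NCCL_PROFILER_SUMMARY_FILE")
    return {name: check_output_target(env[name]) for name in names if env.get(name)}
```

Run `check_comma_outputs(os.environ)` as the same user and on the same host as the workload, because permissions and mounted file systems can differ between environments. An empty list for a variable means that the sketch found no problem with its directory.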
Troubleshoot unexpected data or missing events
CoMMA might capture unexpected data or miss expected events.
To resolve this issue, check that the required level of granularity is set.