Troubleshoot the Collective Communication Analyzer (CoMMA)
This page shows you how to resolve common issues that you might encounter when
using the Collective Communication Analyzer (CoMMA). CoMMA is a
library that collects telemetry data for Google Cloud services.
For more information, see Collective Communication Analyzer (CoMMA).
Troubleshoot CoMMA loading issues
CoMMA might not load correctly.
To verify that the binaries load correctly, complete these steps:
1. Enable NCCL debug logging. To enable logging, set the environment variable NCCL_DEBUG=INFO. You can also use a more detailed debug level. For options, see the NCCL_DEBUG section in the NVIDIA documentation.
2. Specify the INIT subsystem for debugging. To specify INIT, set NCCL_DEBUG_SUBSYS=INIT. You can also add other subsystems. For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
3. Look for a line in the NCCL log that is similar to the following: NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN
   If the NCCL_PROFILER_PLUGIN environment variable is unset, NCCL might attempt to load the libnccl-profiler.so binary from the path specified in the LD_LIBRARY_PATH environment variable.
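The verification steps above can be scripted. The following is a minimal sketch, not part of CoMMA: the helper names debug_env and profiler_plugin_messages are hypothetical, and the regular expression is an assumption based only on the log line quoted above, so the exact wording might differ between NCCL versions.

```python
import os
import re

# Assumption: plugin messages contain this prefix, as in the log line quoted above.
PLUGIN_LINE = re.compile(r"NCCL INFO PROFILER/Plugin: (.*)")


def debug_env(base_env=None):
    """Return a copy of the environment with NCCL init logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT"
    return env


def profiler_plugin_messages(log_text):
    """Return the message part of every PROFILER/Plugin line in an NCCL log."""
    messages = []
    for line in log_text.splitlines():
        match = PLUGIN_LINE.search(line)
        if match:
            messages.append(match.group(1).strip())
    return messages
```

One possible use is to launch the workload with subprocess.run(command, env=debug_env(), capture_output=True, text=True) and pass both the captured stdout and stderr to profiler_plugin_messages. An empty result suggests that NCCL did not report loading a profiler plugin.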
To resolve this issue, consider the following solutions:
Verify that the plugin shared library (libnccl-profiler.so) is correctly
named.
Check that the library is located in a directory specified in the LD_LIBRARY_PATH environment variable. Alternatively, check that the NCCL_PROFILER_PLUGIN environment variable points directly to the location of the libnccl-profiler.so binary.
Check that your NCCL version is 2.23 or later, as the NCCL profiler API
requires this version.
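The checks in this list can be approximated with a short script. This is a sketch under stated assumptions: find_plugin and nccl_version_ok are hypothetical helpers, NCCL_PROFILER_PLUGIN is assumed to hold a file path, and the search covers only LD_LIBRARY_PATH, whereas the dynamic loader might also consult other locations. You supply the NCCL version string yourself, for example from the NCCL startup log.

```python
import os

PLUGIN_NAME = "libnccl-profiler.so"  # file name taken from the text above


def find_plugin(env):
    """Return candidate plugin paths that exist on disk for an environment mapping."""
    candidates = []
    explicit = env.get("NCCL_PROFILER_PLUGIN")
    if explicit:
        # Assumption: the variable holds a path to the shared library.
        candidates.append(explicit)
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if directory:
            candidates.append(os.path.join(directory, PLUGIN_NAME))
    return [path for path in candidates if os.path.isfile(path)]


def nccl_version_ok(version, minimum=(2, 23)):
    """Compare a dotted version string such as '2.23.4' with the minimum version."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= minimum
```

Calling find_plugin(os.environ) returns an empty list when no matching file is found, which points to a naming or path problem. The version comparison uses integer tuples so that a version such as 2.9 is correctly treated as older than 2.23.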
Troubleshoot missing output files
If you configured your environment to send data collected by CoMMA
to a local file, but the output file is missing, check the NCCL logs
or application logs for messages that are similar to the following:
Failed to open file
Failed to log <telemetry type> to file
These errors indicate an underlying file system issue, such as a missing directory or insufficient free space. After these errors occur, CoMMA stops exporting telemetry to files.
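To find these messages in a long log, a small scanner can help. The following sketch is illustrative only: file_export_failures is a hypothetical helper, and the patterns are derived solely from the two messages quoted above, with the telemetry type matched loosely because its possible values are not listed on this page.

```python
import re

# Patterns based on the two messages quoted above.
FAILURE_PATTERNS = (
    re.compile(r"Failed to open file"),
    re.compile(r"Failed to log .+ to file"),
)


def file_export_failures(log_text):
    """Return (line_number, line) pairs for lines that report a file export failure."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Because CoMMA stops exporting to files after these errors, the first reported line number is usually the most useful place to start reading the log.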
To resolve this issue, consider these solutions:
Check that the NCCL_PROFILER_LATENCY_FILE or NCCL_PROFILER_SUMMARY_FILE environment variables are set correctly. Provide a valid path and filename
template, such as /tmp/latency-%p.txt.
Check that the process has write permissions to the specified output
directory.
If you modified the NCCL_TELEMETRY_MODE environment variable, check that
you set it to a value that enables local file output (for example, 1 or 4).
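The path and permission checks in this list can be sketched as follows. This is a rough illustration, not a CoMMA tool: check_output_template is a hypothetical helper, the 1 MiB free-space threshold is an arbitrary assumption, and the script inspects only the directory part of the template, leaving placeholders such as %p unexpanded.

```python
import os
import shutil


def check_output_template(template, min_free_bytes=1024 * 1024):
    """Return a list of problems found with an output path template."""
    directory = os.path.dirname(template) or "."
    problems = []
    if not os.path.basename(template):
        problems.append("template has no file name")
    if not os.path.isdir(directory):
        problems.append(f"directory does not exist: {directory}")
        return problems  # the remaining checks need an existing directory
    if not os.access(directory, os.W_OK | os.X_OK):
        problems.append(f"directory is not writable: {directory}")
    # Assumption: min_free_bytes is an arbitrary threshold chosen for this sketch.
    if shutil.disk_usage(directory).free < min_free_bytes:
        problems.append(f"low free space in: {directory}")
    return problems
```

For example, check_output_template(os.environ.get("NCCL_PROFILER_LATENCY_FILE", "")) returns an empty list when no problem is detected. Run the check as the same user and on the same host as the NCCL workload, because permissions and mounts can differ between them.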
Troubleshoot unexpected data or missing events
CoMMA might capture unexpected
data or miss expected events.
To resolve this issue, check that the required level of granularity is set.
Last updated 2025-09-04 UTC.