NVIDIA NCP-AII - NVIDIA AI Infrastructure
Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE host channel adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which of the following solutions would best meet this requirement?
A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?
An InfiniBand server stops working, and a system administrator runs the "ibstat" command that provides the following output:
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: Linkup
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand
What is the cause of the issue?
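When reading output like this, the fields that usually matter most are State, Physical state, and SM lid. A port whose physical link is up but whose State is still Initializing, with an SM lid of 0, has typically not yet been configured by a subnet manager. The following is a minimal Python sketch of that check, assuming ibstat-style text as input; the function names parse_ibstat_ports and waiting_for_subnet_manager are hypothetical helpers written for this illustration, not part of any NVIDIA tool.

```python
import re

# Abbreviated ibstat-style text, taken from the fields shown in the question.
SAMPLE = """\
CA 'mlx5_1'
CA type: MT4115
Port 1:
State: Initializing
Physical state: Linkup
Base lid: 0
SM lid: 0
Port GUID: 0x0002c90300002f79
Link layer: InfiniBand
"""


def parse_ibstat_ports(text):
    """Return one dict per 'Port N:' block found in ibstat-style text."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Port (\d+):$", line)
        if header:
            # Start of a new port block.
            current = {"port": int(header.group(1))}
            ports.append(current)
        elif current is not None and ":" in line:
            # 'Key: value' line belonging to the current port.
            key, value = line.split(":", 1)
            current[key.strip().lower()] = value.strip()
    return ports


def waiting_for_subnet_manager(port):
    """Heuristic (an assumption of this sketch): the physical link is up,
    the logical state is Initializing, and no SM LID has been assigned."""
    return (
        port.get("state", "").lower() == "initializing"
        and port.get("physical state", "").lower() == "linkup"
        and int(port.get("sm lid", "0")) == 0
    )


if __name__ == "__main__":
    for port in parse_ibstat_ports(SAMPLE):
        if waiting_for_subnet_manager(port):
            print(f"Port {port['port']}: link is up but no subnet manager has configured it")
```

The check is deliberately narrow: it only flags the combination of fields shown in the question, and a real diagnosis would still confirm whether a subnet manager is running on the fabric.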
A system administrator is installing a GPU into a server and needs to avoid damaging the device. What item should be used?
Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?
For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?
After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?
A system administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?
To validate bisection bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?
After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?
