NVIDIA NCP-AII - NVIDIA AI Infrastructure

Total 71 questions

Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE host channel adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which of the following solutions best meets this requirement?

A. SFP Connectors

B. SFP to 1G BASE-T (RJ45) adapter

C. QSA Adapter

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

A. The command output is ignored if the system powers on without errors.

B. At least half of the GPUs report Status_Health = OK.

C. All GPUs report Status_Health = OK and Health = OK for each device.

D. Only the head node's GPUs need to be healthy.
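The strictest criterion among these options, every GPU on every node reporting OK, can be expressed as an all-devices check. The following is a minimal Python sketch, assuming the health report has already been parsed into one record per GPU. The record layout and the `node` and `gpu` keys are hypothetical and do not represent actual cmsh output; only the `Status_Health` and `Health` field names come from the question.

```python
from typing import Iterable, List, Mapping, Tuple


def unhealthy_gpus(records: Iterable[Mapping[str, str]]) -> List[Tuple[str, str]]:
    """Return (node, gpu) pairs where either health field is not 'OK'."""
    return [
        (r["node"], r["gpu"])
        for r in records
        if r.get("Status_Health") != "OK" or r.get("Health") != "OK"
    ]


def ready_for_workloads(records: Iterable[Mapping[str, str]]) -> bool:
    """True only if there is at least one record and every GPU is healthy."""
    records = list(records)
    return bool(records) and not unhealthy_gpus(records)


# Hypothetical parsed report: two nodes, one GPU each, one of them degraded.
report = [
    {"node": "dgx01", "gpu": "gpu0", "Status_Health": "OK", "Health": "OK"},
    {"node": "dgx02", "gpu": "gpu0", "Status_Health": "OK", "Health": "WARN"},
]
print(ready_for_workloads(report), unhealthy_gpus(report))
```

Treating an empty report as "not ready" is a design choice of this sketch: a check that returned no devices is more likely a failed query than a healthy cluster.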

An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which produces the following output:

CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: Linkup
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand

What is the cause of the issue?

A. The HCA port is faulty.

B. There is no running SM in the fabric.

C. The neighboring switch port is faulty.

D. The cable is disconnected.
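The reasoning behind this kind of diagnosis can be sketched in code. A port whose physical state is up but whose logical state stays at Initializing, with both the base LID and the SM LID at 0, has a working link that no subnet manager has configured yet. The Python sketch below parses "Key: value" lines from ibstat-style text and applies that rule. The function names and the wording of the returned messages are illustrative assumptions, and the parser only handles a single port.

```python
SAMPLE = """\
CA 'mlx5_1'
    Port 1:
        State: Initializing
        Physical state: Linkup
        Base lid: 0
        SM lid: 0
        Link layer: InfiniBand
"""


def parse_fields(text: str) -> dict:
    """Collect 'Key: value' pairs; a later port would overwrite an earlier one."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # skips lines such as "Port 1:" that carry no value
            fields[key.lower()] = value.strip()
    return fields


def diagnose(fields: dict) -> str:
    """Map port state fields to a likely cause (a heuristic, not a full triage)."""
    physical = fields.get("physical state", "").lower()
    state = fields.get("state", "").lower()
    base_lid = int(fields.get("base lid", "0"))
    sm_lid = int(fields.get("sm lid", "0"))
    if physical != "linkup":
        return "physical link is not up: check the cable and both port ends"
    if state == "active":
        return "port is active"
    if state == "initializing" and base_lid == 0 and sm_lid == 0:
        return "link is up but no LID is assigned: likely no subnet manager running"
    return "undetermined: inspect further"


print(diagnose(parse_fields(SAMPLE)))
```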

A system administrator is installing a GPU into a server and needs to avoid damaging the device. What item should be used?

A. Anti-ESD strap

B. Gloves

C. Protective film

D. Electric screwdriver

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

A. Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B. Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C. Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D. A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.
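The traffic-reduction argument can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, assuming the dataset fits entirely in the local cache and every epoch reads the full dataset exactly once. The function name and the 10 TB and 50 epoch figures are hypothetical.

```python
def network_read_tb(dataset_tb: float, epochs: int, cached: bool) -> float:
    """Total data pulled over the network during training, in TB.

    Without a local cache every epoch re-reads the dataset from NFS.
    With a cache that holds the whole dataset, only the first epoch does.
    """
    if epochs <= 0:
        return 0.0
    return dataset_tb if cached else dataset_tb * epochs


# Hypothetical workload: 10 TB dataset, 50 epochs.
without_cache = network_read_tb(10, 50, cached=False)
with_cache = network_read_tb(10, 50, cached=True)
print(without_cache, with_cache)
```

Under these assumptions the network traffic shrinks by a factor equal to the number of epochs.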

For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?

A. broadcast_perf -b 4G -e 16G -w 160

B. all_reduce_perf -b 8G -e 32G -c 1000 -z 1 -G 1000

C. all_reduce_perf -b 8G -e 32G -z 1 -G 1000

D. reduce_scatter_perf -f 2 -g 8
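The flags in these options can be read against the nccl-tests README, which documents `-b` and `-e` as the minimum and maximum message sizes, `-c` as the number of iterations on which results are checked for correctness, `-z` as blocking mode, and `-G` as the number of CUDA graph launches. The following is a minimal Python sketch that assembles such a command line. The helper name and the binary path are hypothetical, and the sketch only builds the argument list without running anything.

```python
import shlex
from typing import List, Optional


def all_reduce_cmd(
    binary: str = "./all_reduce_perf",  # hypothetical path to the test binary
    min_bytes: str = "8G",
    max_bytes: str = "32G",
    check_iters: Optional[int] = None,
    blocking: bool = False,
    graph_launches: int = 0,
) -> List[str]:
    """Build an argv list for all_reduce_perf from the documented flags."""
    argv = [binary, "-b", min_bytes, "-e", max_bytes]
    if check_iters is not None:
        argv += ["-c", str(check_iters)]  # correctness-check iterations
    if blocking:
        argv += ["-z", "1"]  # make collectives blocking
    if graph_launches:
        argv += ["-G", str(graph_launches)]  # CUDA graph replays
    return argv


cmd = all_reduce_cmd(check_iters=1000, blocking=True, graph_launches=1000)
print(shlex.join(cmd))
```

Only a command that enables result checking compares the received data against expected values, which is what separates a pure bandwidth stress run from one that can also surface corrupted data.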

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

A. Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B. Reduce message size to decrease network utilization

C. Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D. Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

A system administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. Which command achieves this?

A. esxcli system module parameters set -m nvidia -p

B. esxcli -i 0 -mig 18

C. nvidia-smi -i 0 -mig 1

D. mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2

To validate bisection bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

A. NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B. Run without splits and analyze per-rack averages.

C. NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D. NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
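How each split expression partitions the ranks can be explored with a small model. The following is a minimal Python sketch, assuming, per the nccl-tests README, that `NCCL_TESTS_SPLIT` assigns every rank a "color" by applying the operator to its global rank and that ranks sharing a color form one communicator. The 16-rank layout (two nodes with 8 GPUs each, ranks numbered contiguously per node) and the function names are hypothetical.

```python
from typing import Dict, List


def split_color(rank: int, spec: str) -> int:
    """Compute the color of a rank for a spec such as 'MOD 2' or 'OR 0x7'."""
    op, raw = spec.split()
    value = int(raw, 0)  # base 0 accepts both decimal and 0x-prefixed hex
    op = op.upper()
    if op == "AND":
        return rank & value
    if op == "OR":
        return rank | value
    if op == "MOD":
        return rank % value
    if op == "DIV":
        return rank // value
    raise ValueError(f"unknown split operator: {op}")


def groups(n_ranks: int, spec: str) -> List[List[int]]:
    """Group ranks by color, in order of first appearance."""
    by_color: Dict[int, List[int]] = {}
    for rank in range(n_ranks):
        by_color.setdefault(split_color(rank, spec), []).append(rank)
    return list(by_color.values())


for spec in ("OR 0x7", "DIV 8", "MOD 8", "MOD 2"):
    print(spec, groups(16, spec))
```

In this model, "OR 0x7" and "DIV 8" keep each group inside one 8-GPU node, while the "MOD" splits produce groups whose members sit on different nodes, so their traffic has to cross the network.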

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

A. Reduction of problem size (N) to accelerate computation.

B. MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C. Doubling of GPU clock speeds through firmware updates and relevant configuration.

D. Automatic NVLink bandwidth doubling via driver updates.