NVIDIA NCP-AII - NVIDIA AI Infrastructure

Total 123 questions

An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?

A. Increase batch sizes to reduce the frequency of storage access.
B. Migrate datasets to SATA SSDs with RAID 0 for higher sequential read speeds.
C. Add more GPUs to the cluster to parallelize data loading tasks.
D. Implement GPUDirect Storage to enable direct data transfers.

What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?

A. To benchmark production training speed and ensure all GPUs are running at identical clock speeds.
B. To stress test the hardware and software stack with representative NeMo workloads, ensuring reliability.
C. To tune NeMo model hyperparameters for maximum accuracy on user datasets during cluster deployment.

What command sequence is used to identify the exact name of the server that runs as the master SM in a multi-node fabric?

A. sminfo, then smpquery ND
B. ibstat, then sminfo
C. ibnetdiscover, then ibsim
D. sminfo, then smpquery NI
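
A two-step lookup of this kind (read the SM's LID from `sminfo`, then query that LID with `smpquery`) can be scripted. The following is a minimal sketch, assuming `sminfo` prints a line shaped like `SAMPLE_SMINFO` below; that string is illustrative, not captured from a real fabric, and the helper names `parse_sm_lid` and `build_smpquery` are hypothetical. The `ND` operation requests the NodeDescription attribute, which is typically where the host name string appears; `NI` requests NodeInfo instead.

```python
import re
import shlex

# Illustrative sminfo output line (an assumption about the format, not real fabric data).
SAMPLE_SMINFO = (
    "sminfo: sm lid 1 sm guid 0x2c9030009b6d1, "
    "activity count 2187 priority 0 state 3 SMINFO_MASTER"
)


def parse_sm_lid(sminfo_output: str) -> int:
    """Return the LID of the subnet manager reported by sminfo."""
    match = re.search(r"sm lid (\d+)", sminfo_output)
    if match is None:
        raise ValueError("no 'sm lid' field found in sminfo output")
    return int(match.group(1))


def build_smpquery(lid: int, op: str = "ND") -> list[str]:
    """Build the argv for an smpquery call against the given LID."""
    return ["smpquery", op, str(lid)]


if __name__ == "__main__":
    lid = parse_sm_lid(SAMPLE_SMINFO)
    # Print the follow-up command rather than running it, so the sketch
    # works on a machine without InfiniBand tools installed.
    print(shlex.join(build_smpquery(lid)))
```

On a real fabric the printed command would be run on a host with the InfiniBand diagnostic tools installed, and its output read for the node description.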

An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?

A. Set up Active Directory and LDAP, configure role-based access controls and security settings first, create user accounts, and skip network or hardware performance validation.
B. Perform application benchmarking first, use performance logs to identify bottlenecks, update switch and server firmware afterward, and then tune the network using performance tests.
C. Validate the software stack, test link connectivity and port health, run network benchmarks, run OSPF, ensure neighbors are exchanging route information, then stage AI workload tests.
D. Confirm switch and server firmware configuration, test link connectivity and port health, run network benchmarks, validate the software stack, then stage AI workload tests.

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

A. Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.
B. Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.
C. Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.
D. Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.
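
The link speeds and the 3 GB/s per-node figure mentioned in the options can be compared with simple arithmetic. The sketch below assumes raw line rate only (8 bits per byte, no protocol overhead) and uses a hypothetical 16-node cluster for the aggregate figure; the function names are illustrative.

```python
BITS_PER_BYTE = 8  # raw conversion; protocol overhead is ignored (assumption)


def link_gbytes_per_s(link_gbit: float) -> float:
    """Convert a link speed in Gb/s to GB/s at raw line rate."""
    return link_gbit / BITS_PER_BYTE


def link_meets_target(link_gbit: float, per_node_gbytes: float = 3.0) -> bool:
    """True if a single link can carry the per-node read target."""
    return link_gbytes_per_s(link_gbit) >= per_node_gbytes


def aggregate_gbytes_per_s(nodes: int, per_node_gbytes: float = 3.0) -> float:
    """Aggregate bandwidth the storage system must sustain for all nodes."""
    return nodes * per_node_gbytes


if __name__ == "__main__":
    for speed in (10, 40, 100):
        print(speed, "GbE:", link_gbytes_per_s(speed), "GB/s,",
              "meets 3 GB/s target:", link_meets_target(speed))
    # 16 nodes is a hypothetical cluster size chosen for illustration.
    print("aggregate for 16 nodes:", aggregate_gbytes_per_s(16), "GB/s")
```

Real throughput sits below line rate, so these numbers are an upper bound on what each link can deliver.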

To validate bisection bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

A. NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8
B. Run without splits and analyze per-rack averages.
C. NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8
D. NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
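
To see which ranks each split expression places in the same group, the color rule can be modeled in a few lines. This is a sketch of the rule as the nccl-tests README describes it (each rank's color is the rank combined with the value by the named operation, and ranks with equal color form one group), not the tool's own code. The layout of 16 ranks with ranks 0-7 on one node and 8-15 on another is an assumption, and only the word forms `AND`, `OR`, `MOD` and `DIV` are handled.

```python
from collections import defaultdict

# Word-form operations modeled here; symbol forms (&, |, %, /) are not parsed.
OPERATIONS = {
    "AND": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
}


def split_color(rank: int, spec: str) -> int:
    """Compute the group color of a rank for a spec such as 'DIV 8'."""
    op, value = spec.split()
    # Base 0 lets int() accept decimal, 0x-prefixed hex and 0b-prefixed binary.
    return OPERATIONS[op.upper()](rank, int(value, 0))


def groups(n_ranks: int, spec: str) -> dict[int, list[int]]:
    """Map each color to the sorted list of ranks that share it."""
    result: dict[int, list[int]] = defaultdict(list)
    for rank in range(n_ranks):
        result[split_color(rank, spec)].append(rank)
    return dict(result)


if __name__ == "__main__":
    # Assumed layout: ranks 0-7 on node 0, ranks 8-15 on node 1.
    for spec in ("OR 0x7", "DIV 8", "MOD 2", "AND 0x7"):
        print(spec, "->", list(groups(16, spec).values()))
```

Under the assumed layout, a group whose ranks all fall in 0-7 or all in 8-15 stays inside one node, while a group that mixes the two ranges must cross the network.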

An engineer needs to validate NVLink Switch functionality on a DGX H100 system with 8 GPUs. Which NCCL command verifies intra-node NVLink bandwidth?

A. broadcast_perf -b 8 -e 16G -f 2 -g 8 without split configuration
B. all_reduce_perf -b 8 -e 16G -f 2 -g 4 with NCCL_TESTS_SPLIT="MOD 2"
C. all_reduce_perf -b 8 -e 16G -f 2 -g 1 repeated 8 times
D. all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT="OR 0x7"
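
All four options share the flags `-b 8 -e 16G -f 2`, which set the minimum message size, the maximum message size and the multiplication factor between steps. The sketch below lists the message sizes such a sweep would cover, assuming the `K`, `M` and `G` suffixes are binary multiples (an assumption about how the tool parses sizes); `parse_size` and `sweep` are hypothetical helper names.

```python
# Suffix multipliers, assumed to be binary (1G = 2**30 bytes).
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Turn a size such as '8' or '16G' into a byte count."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def sweep(min_bytes: int, max_bytes: int, factor: int) -> list[int]:
    """List message sizes from min to max, multiplying by factor each step."""
    if factor < 2:
        raise ValueError("factor must be at least 2 for a multiplicative sweep")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    sizes = sweep(parse_size("8"), parse_size("16G"), 2)
    print(len(sizes), "sizes from", sizes[0], "to", sizes[-1], "bytes")
```

The small sizes at the start of the sweep are dominated by latency, so the bandwidth figures of interest come from the large sizes at the end.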

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

A. Run a deep learning workload to stress test the GPUs and check whether the issue persists.
B. Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.
C. Power-drain the DGX, restart it, and check whether the performance degradation resolves.
D. Increase the fan speed to maximum and check whether the performance improves.
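
A GPU status and temperature check with `nvidia-smi` can be automated by parsing its CSV query output. The following is a minimal sketch, assuming the output of `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`; the sample text and the 85 °C limit are illustrative assumptions, not values from any DGX documentation, and `hot_gpus` is a hypothetical helper.

```python
import csv
import io

# Command whose output this sketch expects (run separately on the system):
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
# Illustrative sample of that output: "<gpu index>, <temperature in C>" per line.
SAMPLE_OUTPUT = "0, 41\n1, 43\n2, 88\n3, 42\n"

HYPOTHETICAL_LIMIT_C = 85  # assumed alert threshold, chosen for illustration


def hot_gpus(query_output: str, limit_c: int = HYPOTHETICAL_LIMIT_C) -> list[int]:
    """Return the indices of GPUs whose temperature exceeds limit_c."""
    reader = csv.reader(io.StringIO(query_output), skipinitialspace=True)
    flagged = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        index, temperature = int(row[0]), int(row[1])
        if temperature > limit_c:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    print("GPUs above limit:", hot_gpus(SAMPLE_OUTPUT))
```

A flagged GPU is a starting point for further diagnosis rather than a conclusion on its own.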

During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?

A. Enable HPL_USE_NVSHMEM=1 for shared memory acceleration
B. Set HPL_RUN_GEMM_TESTS to skip validation
C. Set --gpu-affinity and --cpu-affinity to align GPU and NUMA nodes
D. Set HPL_OOC_TILE_M to 8192 for larger blocks

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

A. ipmitool raw 0x32 0x6a 1
B. systemctl restart rshim
C. systemctl enable bmc-rshim.service
D. scp <path_to_bfb> root@<bmc_ip>:/dev/rshim0/boot