Pre-Summer Sale Special Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: xmas50

NVIDIA NCP-AII - NVIDIA AI Infrastructure

Page: 3 / 4
Total 123 questions

After configuring NGC CLI with ngc config set, a user receives ”Authentication failed” errors when pulling containers. What step was most likely omitted?

A.

Installing the CLI with apt-get instead of manual extraction.

B.

Entering the API key during ngc config set or storing it in ~/.ngc/config.

C.

Setting --format_type=json to enable API interactions.

D.

Running sudo systemctl restart docker after configuration.

A system administrator needs to validate a GPU-based server and ensure that no errors occur under load. What command should be used?

A.

nvsm dump health

B.

stress-test --usage

C.

nvsm show health

D.

nvsm stress-test

During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?

Pick the 2 correct responses below.

A.

MPI configuration error; rerun with --cpu-affinity adjustments.

B.

Network packet loss; analyze ibdiagnet reports.

C.

Thermal throttling due to cooling issues; check nvidia-smi dmon.

D.

Memory corruption; reboot the node and reduce problem size N.

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

A.

Set blocksize= " 1GB " for data loading and enable RMM asynchronous allocation.

B.

Switch from FP16 to FP32 precision for numerical stability.

C.

Disable add_filename for Parquet files to reduce metadata.

D.

Increase files_per_partition to 1000 for larger batch processing.

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

A.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

B.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

A.

Reboot the host system to apply the repository changes and proceed.

B.

Install the nvidia-container-toolkit package using your package manager.

C.

Format the disk to clear any existing NVIDIA-related dependencies first.

D.

Download the CUDA toolkit installer from NVIDIA ' S official website.

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

A.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Your company is planning to expand its AI capabilities significantly over the next five years. To future-proof your storage infrastructure, you need a solution that can scale in both capacity and performance. Which of the following strategies best ensures that your storage infrastructure remains adaptable to future AI demands?

A.

Deploy an all-flash array and remove data tiering to reduce latency.

B.

Implement single-tier cloud storage solution to leverage cloud scalability.

C.

Use a hybrid cloud model combining scalable cloud resources with on-premises infrastructure.

D.

Implement on-premises block storage system with periodic hardware upgrades.

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

A.

Implement redundant switches with spanning tree protocol.

B.

MLAG for bonded interfaces across redundant switches.

C.

Use only one switch for all management and storage traffic.

D.

Disable VLANs and use unmanaged switches.