NVIDIA NCP-AII - NVIDIA Certified Professional AI Infrastructure Certification Exam
Question #1 (Topic: Demo Questions)
A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?
Correct Answer: D
Explanation:
The Baseboard Management Controller (BMC) gives an administrator full out-of-band control over the DGX system, including the ability to flash firmware, cycle power, and access the serial console. That makes it a high-value target for attackers. The secure approach (Option D) combines two critical layers:
Network Isolation: The BMC port should never be exposed to the public internet (Option A) or even the general production network (Option B). It must reside on a dedicated Out-of-Band (OOB) network that is firewalled and accessible only to authorized administrators.
Credential Management: Any factory-default credentials (such as admin/admin) must be changed immediately upon first access. As part of the DGX first-boot wizard, the system prompts the administrator to create a strong, unique password for the primary user, which is then synchronized to the BMC.
Leaving the port disconnected (Option C) is impractical for modern data center operations, because the BMC is required for remote monitoring and "headless" deployment. Placing the BMC on an isolated, firewalled network keeps the AI Factory resilient against both external attacks and internal lateral movement.
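The network-isolation rule above can be sanity-checked in a script. The following is a minimal sketch, assuming a hypothetical OOB management subnet (10.10.0.0/24) and hypothetical BMC addresses; none of these values come from NVIDIA documentation, and the function name `bmc_is_isolated` is illustrative only.

```python
import ipaddress

# Hypothetical out-of-band management subnet; substitute your site's real range.
OOB_SUBNET = ipaddress.ip_network("10.10.0.0/24")


def bmc_is_isolated(bmc_ip: str, oob_subnet=OOB_SUBNET) -> bool:
    """Return True if the BMC address is private and inside the OOB subnet."""
    addr = ipaddress.ip_address(bmc_ip)
    return addr.is_private and addr in oob_subnet


if __name__ == "__main__":
    # Illustrative addresses only.
    for ip in ("10.10.0.21", "192.168.5.7", "8.8.8.8"):
        status = "ok" if bmc_is_isolated(ip) else "NOT on the OOB network"
        print(f"{ip}: {status}")
```

A check like this only confirms addressing. Firewall rules and access control on the OOB network still have to be verified separately.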
Question #2 (Topic: Demo Questions)
A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)
Correct Answer: A, B
Explanation:
NVIDIA Run:ai is an orchestration platform designed to optimize GPU resource allocation within Kubernetes environments. Because Run:ai is cloud-native, its control plane and worker agents are deployed as Kubernetes resources. The first prerequisite is therefore a running Kubernetes cluster (Option B) to host the services. Second, Run:ai uses Helm, the package manager for Kubernetes, to manage its installation charts, deployments, and service configurations. Without Helm installed on the administrative machine (Option A), the chart-based installation cannot run. While GPUs (Option C) are ultimately required on the worker nodes, the control plane itself can be installed on a cluster before all GPU hardware is physically present. Disabling NTP (Option D) is never recommended; accurate time synchronization is vital for the TLS certificates and logging used by Run:ai and Kubernetes.
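The two prerequisites can be checked before starting the installation. The sketch below is a generic preflight check, not part of the Run:ai installer: it assumes `helm` and `kubectl` are the relevant command-line tools and that `kubectl get nodes` succeeding is a reasonable proxy for a running cluster. The function names are hypothetical.

```python
import shutil
import subprocess


def missing_tools(tools=("helm", "kubectl"), which=shutil.which):
    """Return the names of required CLI tools that are not on PATH."""
    return [tool for tool in tools if which(tool) is None]


def cluster_reachable(run=subprocess.run) -> bool:
    """Return True if `kubectl get nodes` exits successfully."""
    try:
        result = run(["kubectl", "get", "nodes"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # kubectl is absent or the API server did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    elif not cluster_reachable():
        print("kubectl is installed but the cluster is not reachable")
    else:
        print("Helm and a reachable Kubernetes cluster were found")
```

The `which` and `run` parameters exist only so the checks can be exercised without a real cluster.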
Question #3 (Topic: Demo Questions)
What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?
Correct Answer: B
Explanation:
The primary purpose of a NeMo burn-in is to stress test the hardware and software stack using representative NeMo workloads before releasing the AI infrastructure to production. NeMo workloads can exercise GPU compute, GPU memory, CUDA libraries, NCCL communication, storage access, checkpointing, container runtime, scheduler integration, and distributed training behavior. This makes NeMo burn-in more realistic than simply checking that GPUs are visible or that a small synthetic benchmark runs successfully. The goal is not to tune hyperparameters for model accuracy, because burn-in validates infrastructure reliability rather than model quality. It is also not mainly about ensuring all GPUs run at identical clock speeds; clock behavior can vary based on power, thermals, workload, and GPU boost behavior. What matters is that the workload runs reliably, without stalls, NCCL failures, GPU Xid errors, storage bottlenecks, memory faults, or unstable performance. In NVIDIA AI infrastructure validation, representative workload burn-in bridges the gap between low-level diagnostics and real production training, helping detect issues that synthetic tests alone may miss.
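One of the failure signals listed above, GPU Xid errors, can be detected by scanning the kernel log collected during the burn-in window. The sketch below is a minimal illustration: it assumes Xid events appear in the log in the form `NVRM: Xid (PCI:...): <code>, ...`, and the sample log lines and function names are hypothetical. It does not cover NCCL failures, storage bottlenecks, or performance instability, which need their own checks.

```python
import re

# Assumed kernel-log shape for GPU Xid events, for example:
# "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text: str):
    """Return (pci_address, xid_code) tuples for every Xid line in the log."""
    return [
        (match.group("pci"), int(match.group("code")))
        for match in XID_PATTERN.finditer(log_text)
    ]


def burn_in_passed(log_text: str) -> bool:
    """Treat any Xid event logged during the burn-in window as a failure."""
    return not find_xid_events(log_text)


if __name__ == "__main__":
    # Hypothetical log excerpt used only to demonstrate the parser.
    sample = (
        "kernel: usb 1-1: new high-speed USB device\n"
        "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.\n"
    )
    print(find_xid_events(sample))
    print("burn-in passed:", burn_in_passed(sample))
```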
Question #4 (Topic: Demo Questions)
If two ports must be connected, but one is SFP and one is QSFP (for example, to connect a 25 GbE host channel adapter to a QSFP port capable of both 100 GbE and 25 GbE), which of the following solutions would best meet this requirement?
Correct Answer: C
Explanation:
The QSA (QSFP to SFP Adapter) is a mechanical and electrical bridge that allows a single-lane SFP/SFP28 transceiver (typically 10G or 25G) to be plugged into a four-lane QSFP/QSFP28 switch port. In AI infrastructure, this is commonly used to connect low-speed management servers or legacy nodes to a high-speed backbone switch without requiring specialized breakout cables. The QSA adapter maps the single lane of the SFP module to the first lane of the QSFP port. This is a "pass-through" solution that maintains the signal integrity and latency characteristics of the link. It is the standard hardware solution for this SFP-to-QSFP form-factor mismatch in NVIDIA networking environments.
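The lane mapping explains why a 100 GbE QSFP28 port links at 25 GbE through a QSA. The short sketch below works through that arithmetic, assuming the usual four-lane QSFP and single-lane SFP layouts; the function names are illustrative, and the speeds a given switch or adapter actually supports should be confirmed in its documentation.

```python
QSFP_LANES = 4  # a QSFP/QSFP28 port carries four electrical lanes
SFP_LANES = 1   # an SFP/SFP28 module uses a single lane


def per_lane_gbps(port_gbps: float) -> float:
    """Per-lane rate of a four-lane QSFP port running at port_gbps."""
    return port_gbps / QSFP_LANES


def qsa_link_gbps(port_gbps: float) -> float:
    """Link rate through a QSA: only the first lane of the QSFP port is used."""
    return per_lane_gbps(port_gbps) * SFP_LANES


if __name__ == "__main__":
    for port_speed in (100, 40):
        print(f"{port_speed}G QSFP port via QSA -> {qsa_link_gbps(port_speed)}G link")
```

With a 100G port the single active lane runs at 25G, and with a 40G port it runs at 10G, which matches the SFP28 and SFP+ rates mentioned above.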
Question #5 (Topic: Demo Questions)
A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?
Correct Answer: A
Explanation: