Forums.developer.nvidia.com

GPU diagnostics How to test a GPU

MisterAnderson42 August 14, 2008, 2:47pm 3. I haven’t heard of any GPU diagnostic programs either. All I can suggest is to run a known stable CUDA app and …

URL: https://forums.developer.nvidia.com/t/gpu-diagnostics-how-to-test-a-gpu/5004

Mlx5_core poll_health raise an error: device's health compromised

After I created several VFs on ConnectX-5 adapter port 0, I got the following system message: [ 1810.527156] mlx5_core 0000:51:00.3: poll_health:853:(pid 0): …

Breaking Barriers in Healthcare with New Models for Generative AI …

Originally published at: https://developer.nvidia.com/blog/breaking-barriers-in-healthcare-with-new-models-for-generative-ai-and-cellular-imaging/ Driving the future

Triton Inference Server's health status shows 'Connection peer …

GET /v2/health/ready HTTP/1.1 Host: localhost:8000 User-Agent: curl/7.58.0 Accept: */* Recv failure: Connection reset by peer; stopped the pause stream! Closing …

Health ready check failed

When running the command bash riva_start.sh, I get this error: Health ready check failed. I have attached the detailed logs in the text file. riva_log.txt (146.7 KB) …

Infiniband Error, device's health compromised

Hello, Yes - asserts should not occur. It means something unexpected occurred - either in the Firmware or the Hardware itself. Best Regards, Jonathan.

Device's health compromised: firmware internal error

Hello Mansunc Thanks for clearing this out. Currently we do not have any impact on the network operation. Just wanted to understand that this is not an issue to …

mlx5_pcie_event:Detected insufficient power on the PCIe slot (27W)

After I restarted the OFED driver using the command (sudo /etc/init.d/openibd restart), the kernel log displayed the following information: mlx5_pcie_event:301:(pid …

EMMC health/remaining life reporting tool

Autonomous Machines Jetson & Embedded Systems Jetson AGX Orin. cadelsey October 19, 2023, 7:14pm 1. I thought I …

Mlx5_core 0000:41:00.0: poll_health:853:(pid 0): device's health

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Testing Jetson Nano RTC Battery health

Hello! We are using a Jetson Nano on a carrier board with an RTC battery. In production, we have had 1-2 boards that, despite having the battery installed, don’t …

Tx1-eMMC Health Status tool

I am not sure what part was used on TX1 module. There is confusion part number mentioned in TX1 module schematic (Schematic reference number: P2180 …

nVidia Healthmon Cluster Management Tools!

I just saw a GTC2010 presentation on GPU Cluster management. There is a talk about a tool called “NVIDIA Healthmon” – looks like a close cousin of nvidia-smi. Can …

Triton server died before reaching ready state. Terminating Jarvis

Hi, I want to set up the Jarvis server with jarvis_init.sh, but I am facing this problem: Triton server died before reaching ready state. Terminating Jarvis startup. I …

Triton server died before reaching ready state. Terminating Riva

Hi, the edge device that I used is a Jetson Nano. I know that Jetson is not a supported hardware for Riva, so I just use it to download docker images from NGC via …

Firmware 24.30.1004

Hi, in my recent thread we determined that our cards (MBF2M332A-AEEO_Ax) are an older model. As I’m still trying to figure out the reason behind an …

Command to check if GPU's are enabled on NVIDIA Jetson TX2

Hi Honey_Patouceul. The command head -1 /etc/nv_tegra_release shows: R28 (release), REVISION: 1.0, GCID: 9379712, BOARD: t186ref, EABI: aarch64, DATE: Thu …

Error code -16 at startup on MCX516-GCAT once in a while.

Hello Lennart, Thank you for posting your inquiry on the NVIDIA Networking Community. Based on the information provided, we want to continue to debug this issue …

[HPC-Benchmarks] Discrepancy between A100 PCIe and A100 SXM4

Hi, We are running hpc benchmark 21.4 on our GPU servers. There is a pronounced discrepancy in terms of HPL-AI performance as follows: 4 x A100 PCIe: ~ 110 …

How to add a correct battery for RTC on Jetson Nano

An optional backup battery can be attached to the PMIC_BBAT module input. It is used to maintain the RTC voltage when VDD_IN is not present. This pin is …

Nvidia-smi failed to initialize NVML (driver/library version mismatch)

I am running Ubuntu 20.04 and I am often facing this nvidia driver issue. I have the recommended proprietary nvidia driver installed. Reboot did not work. Logs are …

GPU not getting detected on running lspci | grep NVIDIA command

nvidia-bug-report.log.gz (539.2 KB) generix April 18, 2022, 11:49am 4. To get a proper output in lspci, you just need to update the pci db: sudo update-pciids. …
