| GPU link speed [0000:0e:00.0][8GT/s]................................. Healthy | | GPU link speed [0000:0e:00.0][8GT/s]................................. Healthy |
| GPU link width [0000:0e:00.0][x16]................................... Healthy | | GPU link width [0000:0e:00.0][x16]................................... Healthy |
| GPU link speed [0000:07:00.0][8GT/s]................................. Healthy | | GPU link speed [0000:07:00.0][8GT/s]................................. Healthy |
| GPU link width [0000:07:00.0][x16]................................... Healthy | | GPU link width [0000:07:00.0][x16]................................... Healthy |
| Verify PCIe switches................................................. Healthy | | Verify PCIe switches................................................. Healthy |
| [sanity] Page retirement support on GPU None......................... Unknown | | |
| Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_bit_retirement.retired_count" | | |
| [sanity] Return value of 'nvidia-smi nvlink -s' command.............. Unhealthy | | |
| /usr/share/nvhealth/collect/nvidia-smi/nvidia-smi-nvlink-status.sh: | | |
| line 14: [: too many arguments | | |
| ERROR: This version of nvidia-smi does not query NVLink speed | | |
| ERROR: NVLink speed query requires NVIDIA driver version 384.98 or | | |
| later | | |
| | | |
| nvidia-smi-nvlink-status.sh returned '1' | | |
| Observed value "1" when "0" was expected | | |
| [sanity] Parsing 'nvidia-smi' topology matrx......................... Unhealthy | | |
| nvidia-smi-topo.py returned '1' | | |
| Observed value "1" when "0" was expected | | |
| NVIDIA Driver Version [None]......................................... | | NVIDIA Driver Version [384.145]...................................... |
| Inforom Storage Version [GPU None][None]............................. | | Inforom Storage Version [GPU 0][G500.0201.00.02]..................... |
| | | Retired pages pending [GPU 0][No].................................... Healthy |
| | | NVLink Topology...................................................... Healthy |
| | | Checking NVLink speed [GPU 0 Link 0][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 0 Link 1][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 0 Link 2][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 0 Link 3][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 1 Link 0][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 1 Link 1][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 1 Link 2][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 1 Link 3][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 2 Link 0][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 2 Link 1][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 2 Link 2][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 2 Link 3][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 3 Link 0][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 3 Link 1][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 3 Link 2][25.781 GB/s].................... Healthy |
| | | Checking NVLink speed [GPU 3 Link 3][25.781 GB/s].................... Healthy |
| | | Total retired page count [GPU 0][0 retired pages].................... Healthy |
| VBIOS Version [GPU None][None]....................................... | | VBIOS Version [GPU 0][88.00.24.00.01]................................ |
| Inforom Storage Version [GPU 1][G500.0201.00.02]..................... | | Inforom Storage Version [GPU 1][G500.0201.00.02]..................... |
| Retired pages pending [GPU None][None]............................... Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" | | |
| Suggested action: Reboot the system to blacklist GPU pages flagged for retirement. | | |
| See: http://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#blacklisting | | |
| Total retired page count [GPU None][None retired pages].............. Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" | | |
| Retired pages pending [GPU 1][No].................................... Healthy | | Retired pages pending [GPU 1][No].................................... Healthy |
| Total retired page count [GPU 1][0 retired pages].................... Healthy | | Total retired page count [GPU 1][0 retired pages].................... Healthy |
| VBIOS Version [GPU 1][88.00.24.00.01]................................ | | VBIOS Version [GPU 1][88.00.24.00.01]................................ |
| Inforom Storage Version [GPU 2][G500.0201.00.02]..................... | | Inforom Storage Version [GPU 2][G500.0201.00.02]..................... |
| Retired pages pending [GPU 2][No].................................... Healthy | | Retired pages pending [GPU 2][No].................................... Healthy |
| VBIOS Version [GPU 2][88.00.24.00.01]................................ | | VBIOS Version [GPU 2][88.00.24.00.01]................................ |
| Inforom Storage Version [GPU 3][G500.0201.00.02]..................... | | Inforom Storage Version [GPU 3][G500.0201.00.02]..................... |
| Retired pages pending [GPU 3][No].................................... Healthy | | Retired pages pending [GPU 3][No].................................... Healthy |
| Total retired page count [GPU 3][0 retired pages].................... Healthy | | Total retired page count [GPU 3][0 retired pages].................... Healthy |
| VBIOS Version [GPU 3][88.00.24.00.01]................................ | | VBIOS Version [GPU 3][88.00.24.00.01]................................ |
| NVIDIA Driver Version [None]......................................... | | |
| Inforom Storage Version [GPU None][None]............................. | | |
| [sanity] Page retirement support on GPU None......................... Unknown | | |
| Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_bit_retirement.retired_count" | | |
| Retired pages pending [GPU None][None]............................... Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" | | |
| Suggested action: Reboot the system to blacklist GPU pages flagged for retirement. | | |
| See: http://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#blacklisting | | |
| Total retired page count [GPU None][None retired pages].............. Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" | | |
| VBIOS Version [GPU None][None]....................................... | | |
| NVIDIA Driver Version [None]......................................... | | |
| Inforom Storage Version [GPU None][None]............................. | | |
| [sanity] Page retirement support on GPU None......................... Unknown | | |
| Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_bit_retirement.retired_count" | | |
| Retired pages pending [GPU None][None]............................... Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" | | |
| Suggested action: Reboot the system to blacklist GPU pages flagged for retirement. | | |
| See: http://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#blacklisting | | |
| Total retired page count [GPU None][None retired pages].............. Unknown | | |
| Could not find key | | |
| "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" | | |
| VBIOS Version [GPU None][None]....................................... | | |
| Verify installed disks............................................... Healthy | | Verify installed disks............................................... Healthy |
| Linux kernel version [4.4.0-127-generic]............................. | | Linux kernel version [4.4.0-127-generic]............................. |
| System Uptime [up 3 days, 20 hours, 20 minutes ]..................... | | System Uptime [up 5 minutes ]........................................ |
| Verify DIMM vendors.................................................. Healthy | | Verify DIMM vendors.................................................. Healthy |
| | | |
| System Summary | | System Summary |
| -------------- | | -------------- |
| Product Name: DGX Station | | Product Name: DGX Station |
| Manufacturer: NVIDIA | | Manufacturer: NVIDIA |
| DGX Serial Number: 0154217000027 | | DGX Serial Number: 0154217000027 |
| Uptime: up 3 days, 20 hours, 20 minutes | | Uptime: up 5 minutes |
| Motherboard: | | Motherboard: |
| BIOS Version: 0406 | | BIOS Version: 0406 |
| Serial Number: 170295316900143 | | Serial Number: 170295316900143 |
| GPU: | | GPU: |
| NVIDIA Driver Version: Unknown | | NVIDIA Driver Version: 384.145 |
| Product Name(s): Unknown | | Product Name(s): Tesla V100-DGXS-16GB |
| VBIOS Version(s): Unknown | | VBIOS Version(s): 88.00.24.00.01 |
| Software: | | Software: |
| DGX BaseOS Version: 3.1.7 | | DGX BaseOS Version: 3.1.7 |
| Kernel Version: 4.4.0-127-generic | | Kernel Version: 4.4.0-127-generic |
| | | |
| Health Summary | | Health Summary |
| -------------- | | -------------- |
| 25 out of 36 checks are Healthy | | 44 out of 44 checks are Healthy |
| 2 out of 36 checks are Unhealthy | | |
| 9 out of 36 checks are Unknown | | |
| Overall system status is Unhealthy | | Overall system status is Healthy |
| | | |
| Problem detected. | | |
| Please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin | | |
| And create a support ticket with the log file attached. | | |
| | | |
| Log file written to: /tmp/nvhealth-log.kE3XdxPBlo.jsonl | | Log file written to: /tmp/nvhealth-log.q4DASkL1aD.jsonl |
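
The Health Summary counts above can be cross-checked mechanically. The following is a minimal sketch in Python, assuming the summary text of a single run (one column of the comparison) has been saved as plain text; the names `parse_health_summary` and `is_consistent` are hypothetical helpers for illustration and are not part of nvhealth.

```python
import re

# Matches summary lines such as "25 out of 36 checks are Healthy".
SUMMARY_RE = re.compile(r"(\d+) out of (\d+) checks are (\w+)")


def parse_health_summary(text):
    """Return (counts_by_status, total) parsed from one run's summary text.

    Assumption: `text` holds a single run, not the side-by-side comparison.
    """
    counts = {}
    total = None
    for count, out_of, status in SUMMARY_RE.findall(text):
        counts[status] = counts.get(status, 0) + int(count)
        total = int(out_of)
    return counts, total


def is_consistent(counts, total):
    """True when the per-status counts add up to the reported total."""
    return total is not None and sum(counts.values()) == total


# Summary lines copied from the failing run (left-hand column).
failing_run = """
25 out of 36 checks are Healthy
2 out of 36 checks are Unhealthy
9 out of 36 checks are Unknown
"""

counts, total = parse_health_summary(failing_run)
print(counts, total, is_consistent(counts, total))
```

For the failing run the three statuses add up to the reported 36 checks (25 + 2 + 9), while the healthy run reports 44 checks, so the two runs did not execute the same number of checks.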