Diff Report


nvhealth-bugs.log | nvhealth-normal.log
1 Info | 1 Info
2 ---- | 2 ----
3 Timestamp:      Tue Aug 14 14:15:43 2018 -0500 | 3 Timestamp:      Tue Aug 14 14:28:15 2018 -0500
4 Version:        18.04-4 | 4 Version:        18.04-4
5 | 5
6 Checks | 6 Checks
7 ------ | 7 ------
8 DGX BaseOS Version [3.1.7]........................................... | 8 DGX BaseOS Version [3.1.7]...........................................
23 GPU link speed [0000:0e:00.0][8GT/s]................................. Healthy | 23 GPU link speed [0000:0e:00.0][8GT/s]................................. Healthy
24 GPU link width [0000:0e:00.0][x16]................................... Healthy | 24 GPU link width [0000:0e:00.0][x16]................................... Healthy
25 GPU link speed [0000:07:00.0][8GT/s]................................. Healthy | 25 GPU link speed [0000:07:00.0][8GT/s]................................. Healthy
26 GPU link width [0000:07:00.0][x16]................................... Healthy | 26 GPU link width [0000:07:00.0][x16]................................... Healthy
27 Verify PCIe switches................................................. Healthy | 27 Verify PCIe switches................................................. Healthy
28 [sanity] Page retirement support on GPU None......................... Unknown |
29         Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_b |
30         it_retirement.retired_count" |
31 [sanity] Return value of 'nvidia-smi nvlink -s' command.............. Unhealthy |
32         /usr/share/nvhealth/collect/nvidia-smi/nvidia-smi-nvlink-status.sh: |
33         line 14: [: too many arguments |
34         ERROR: This version of nvidia-smi does not query NVLink speed |
35         ERROR: NVLink speed query requires NVIDIA driver version 384.98 or |
36         later |
37 |
38         nvidia-smi-nvlink-status.sh returned '1' |
39         Observed value "1" when "0" was expected |
40 [sanity] Parsing 'nvidia-smi' topology matrx......................... Unhealthy |
41         nvidia-smi-topo.py returned '1' |
42         Observed value "1" when "0" was expected |
43 NVIDIA Driver Version [None]......................................... | 28 NVIDIA Driver Version [384.145]......................................
44 Inforom Storage Version [GPU None][None]............................. | 29 Inforom Storage Version [GPU 0][G500.0201.00.02].....................
| 30 Retired pages pending [GPU 0][No].................................... Healthy
| 31 NVLink Topology...................................................... Healthy
| 32 Checking NVLink speed [GPU 0 Link 0][25.781 GB/s].................... Healthy
| 33 Checking NVLink speed [GPU 0 Link 1][25.781 GB/s].................... Healthy
| 34 Checking NVLink speed [GPU 0 Link 2][25.781 GB/s].................... Healthy
| 35 Checking NVLink speed [GPU 0 Link 3][25.781 GB/s].................... Healthy
| 36 Checking NVLink speed [GPU 1 Link 0][25.781 GB/s].................... Healthy
| 37 Checking NVLink speed [GPU 1 Link 1][25.781 GB/s].................... Healthy
| 38 Checking NVLink speed [GPU 1 Link 2][25.781 GB/s].................... Healthy
| 39 Checking NVLink speed [GPU 1 Link 3][25.781 GB/s].................... Healthy
| 40 Checking NVLink speed [GPU 2 Link 0][25.781 GB/s].................... Healthy
| 41 Checking NVLink speed [GPU 2 Link 1][25.781 GB/s].................... Healthy
| 42 Checking NVLink speed [GPU 2 Link 2][25.781 GB/s].................... Healthy
| 43 Checking NVLink speed [GPU 2 Link 3][25.781 GB/s].................... Healthy
| 44 Checking NVLink speed [GPU 3 Link 0][25.781 GB/s].................... Healthy
| 45 Checking NVLink speed [GPU 3 Link 1][25.781 GB/s].................... Healthy
| 46 Checking NVLink speed [GPU 3 Link 2][25.781 GB/s].................... Healthy
| 47 Checking NVLink speed [GPU 3 Link 3][25.781 GB/s].................... Healthy
| 48 Total retired page count [GPU 0][0 retired pages].................... Healthy
45 VBIOS Version [GPU None][None]....................................... | 49 VBIOS Version [GPU 0][88.00.24.00.01]................................
46 Inforom Storage Version [GPU 1][G500.0201.00.02]..................... | 50 Inforom Storage Version [GPU 1][G500.0201.00.02].....................
47 Retired pages pending [GPU None][None]............................... Unknown |
48         Could not find key |
49         "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" |
50         Suggested action: Reboot the system to blacklist GPU pages flagged |
51         for retirement. See: http://docs.nvidia.com/deploy/dynamic-page- |
52         retirement/index.html#blacklisting |
53 Total retired page count [GPU None][None retired pages].............. Unknown |
54         Could not find key |
55         "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" |
56 Retired pages pending [GPU 1][No].................................... Healthy | 51 Retired pages pending [GPU 1][No].................................... Healthy
57 Total retired page count [GPU 1][0 retired pages].................... Healthy | 52 Total retired page count [GPU 1][0 retired pages].................... Healthy
58 VBIOS Version [GPU 1][88.00.24.00.01]................................ | 53 VBIOS Version [GPU 1][88.00.24.00.01]................................
59 Inforom Storage Version [GPU 2][G500.0201.00.02]..................... | 54 Inforom Storage Version [GPU 2][G500.0201.00.02].....................
60 Retired pages pending [GPU 2][No].................................... Healthy | 55 Retired pages pending [GPU 2][No].................................... Healthy
62 VBIOS Version [GPU 2][88.00.24.00.01]................................ | 57 VBIOS Version [GPU 2][88.00.24.00.01]................................
63 Inforom Storage Version [GPU 3][G500.0201.00.02]..................... | 58 Inforom Storage Version [GPU 3][G500.0201.00.02].....................
64 Retired pages pending [GPU 3][No].................................... Healthy | 59 Retired pages pending [GPU 3][No].................................... Healthy
65 Total retired page count [GPU 3][0 retired pages].................... Healthy | 60 Total retired page count [GPU 3][0 retired pages].................... Healthy
66 VBIOS Version [GPU 3][88.00.24.00.01]................................ | 61 VBIOS Version [GPU 3][88.00.24.00.01]................................
67 NVIDIA Driver Version [None]......................................... |
68 Inforom Storage Version [GPU None][None]............................. |
69 [sanity] Page retirement support on GPU None......................... Unknown |
70         Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_b |
71         it_retirement.retired_count" |
72 Retired pages pending [GPU None][None]............................... Unknown |
73         Could not find key |
74         "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" |
75         Suggested action: Reboot the system to blacklist GPU pages flagged |
76         for retirement. See: http://docs.nvidia.com/deploy/dynamic-page- |
77         retirement/index.html#blacklisting |
78 Total retired page count [GPU None][None retired pages].............. Unknown |
79         Could not find key |
80         "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" |
81 VBIOS Version [GPU None][None]....................................... |
82 NVIDIA Driver Version [None]......................................... |
83 Inforom Storage Version [GPU None][None]............................. |
84 [sanity] Page retirement support on GPU None......................... Unknown |
85         Could not find key "output.nvidia_smi_log.gpu.retired_pages.double_b |
86         it_retirement.retired_count" |
87 Retired pages pending [GPU None][None]............................... Unknown |
88         Could not find key |
89         "output.nvidia_smi_log.gpu.retired_pages.pending_retirement" |
90         Suggested action: Reboot the system to blacklist GPU pages flagged |
91         for retirement. See: http://docs.nvidia.com/deploy/dynamic-page- |
92         retirement/index.html#blacklisting |
93 Total retired page count [GPU None][None retired pages].............. Unknown |
94         Could not find key |
95         "output.nvidia_smi_log.gpu.retired_pages.total_retired_count" |
96 VBIOS Version [GPU None][None]....................................... |
97 Verify installed disks............................................... Healthy | 62 Verify installed disks............................................... Healthy
98 Linux kernel version [4.4.0-127-generic]............................. | 63 Linux kernel version [4.4.0-127-generic].............................
99 System Uptime [up 3 days, 20 hours, 20 minutes ]..................... | 64 System Uptime [up 5 minutes ]........................................
100 Verify DIMM vendors.................................................. Healthy | 65 Verify DIMM vendors.................................................. Healthy
101 | 66
102 System Summary | 67 System Summary
103 -------------- | 68 --------------
104     Product Name: DGX Station | 69     Product Name: DGX Station
105     Manufacturer: NVIDIA | 70     Manufacturer: NVIDIA
106     DGX Serial Number: 0154217000027 | 71     DGX Serial Number: 0154217000027
107     Uptime: up 3 days, 20 hours, 20 minutes | 72     Uptime: up 5 minutes
108 Motherboard: | 73 Motherboard:
109     BIOS Version: 0406 | 74     BIOS Version: 0406
110     Serial Number: 170295316900143 | 75     Serial Number: 170295316900143
111 GPU: | 76 GPU:
112     NVIDIA Driver Version: Unknown | 77     NVIDIA Driver Version: 384.145
113     Product Name(s): Unknown | 78     Product Name(s): Tesla V100-DGXS-16GB
114     VBIOS Version(s): Unknown | 79     VBIOS Version(s): 88.00.24.00.01
115 Software: | 80 Software:
116     DGX BaseOS Version: 3.1.7 | 81     DGX BaseOS Version: 3.1.7
117     Kernel Version: 4.4.0-127-generic | 82     Kernel Version: 4.4.0-127-generic
118 | 83
119 Health Summary | 84 Health Summary
120 -------------- | 85 --------------
121 25 out of 36 checks are Healthy | 86 44 out of 44 checks are Healthy
122 2 out of 36 checks are Unhealthy |
123 9 out of 36 checks are Unknown |
124 Overall system status is Unhealthy | 87 Overall system status is Healthy
125 | 88
126 Problem detected. |
127 Please visit the ESP portal: https://nvid.nvidia.com/enterpriselogin |
128 And create a support ticket with the log file attached. |
129 |
130 Log file written to: /tmp/nvhealth-log.kE3XdxPBlo.jsonl | 89 Log file written to: /tmp/nvhealth-log.q4DASkL1aD.jsonl