Report on the periodic hibernation of the Nvidia DGX 230, Aug. 11, 2018

Date: Aug 11, 2018
Last Updated: Aug 11, 2018
Categories:
Playground Technique Bug
Tags:
linux docker compiler NVIDIA


Problem statement

  • What is it?: We suspect that this problem is caused by the graphics cards going into a hibernation state.
  • How does it happen?: The problem occurs periodically, roughly once a week. It is not caused by any particular application; it happens even when there is no load on the GPUs.
  • What does it cause?: These are the phenomena we observe when the problem appears:
    • Cannot turn on the screen: The DGX's screen stays off. Whatever we do (moving the mouse, typing on the keyboard), the screen will not wake up. However, we can still reach the device through remote access (e.g. with PuTTY).
    • Cannot use NVIDIA apps: Commands such as nvidia-smi and nvidia-docker run return an error like "GPU is lost". However, some commands still work, such as nvidia-docker images.
    • Cannot exit from the container: If a Docker container is running when the problem occurs, we can still use it normally (except for processes that need the GPU). But if we type exit inside the container, the terminal freezes.
    • Processes on the GPU freeze: Any process that is using the GPU when the problem occurs becomes frozen, and we cannot stop it with terminate or kill.
    • A core is fully occupied (not verified): It seems that when the problem occurs, one CPU core (out of 40 logical cores) is fully occupied, but we could not find out which application is occupying it.
  • How to solve it?: For now, every time it happens we have to restart the machine, after which everything recovers. But after about one week it happens again.
  • Why do we need to solve it?: Many of our programs need to run for a long time (several days), and we cannot take the risk of restarting a test run because of the hibernation; see the sketch after this list for a simple pre-flight check.
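
Because the hibernation cannot be prevented yet, one small mitigation is to verify that the GPUs are reachable before launching a long job. The script below is only a sketch of such a pre-flight check (the script name, messages, and placeholder command are our own illustration, not an NVIDIA tool):

#!/bin/bash
# check-gpu.sh -- hypothetical pre-flight check before starting a long job.
# If nvidia-smi cannot reach the GPUs (e.g. "GPU is lost"), abort instead of
# starting a run that would freeze later.
if nvidia-smi > /dev/null 2>&1; then
    echo "GPUs are reachable, starting the job."
    # replace the next line with the real long-running command
    # python train.py
else
    echo "nvidia-smi failed; the GPU may be lost. Restart the machine first." >&2
    exit 1
fi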

Problem records

Screenshots

The problem can be detected by running nvidia-smi. For example, if we keep watch nvidia-smi running, we can spot the problem when the output changes like this:

nvidia-smi tells us that GPU is lost
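
We can also record when the GPU becomes lost by polling nvidia-smi in the background. The script below is just a sketch (the log path and polling interval are arbitrary choices of ours):

#!/bin/bash
# gpu-watch.sh -- hypothetical watcher that polls nvidia-smi once a minute
# and logs every failure, so we know roughly when the hibernation started.
while true; do
    if ! nvidia-smi > /dev/null 2>&1; then
        echo "$(date '+%F %T') nvidia-smi failed (GPU may be lost)" >> /tmp/gpu-watch.log
    fi
    sleep 60
done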

Here we show a screenshot taken when the problem occurs. The screenshot is captured inside the container, i.e. on a VNC remote desktop provided by the container. We are running a test program that only uses the CPU, so we can see that the CPU is busy and the program keeps running normally. However, we can no longer access the GPU.

A full screenshot taken when the problem happens

Nvidia bug reports

Run this command outside Docker (through remote access; as mentioned above, although we cannot turn the screen on, we can still reach a Bash shell remotely) to get a full record of the hardware:

$ sudo nvidia-bug-report.sh

To help us find out what happens and what changes when the problem occurs, we ran the command twice: once during the hibernation and once in the normal state (where the GPU works correctly). The comparison of the two reports is shown as follows:

The left side is the report from the normal state; the right side is the report taken while the problem is happening. You can download the comparison of the reports via this link:

Comparison Report

Or you can use the following links to download the original txt reports directly:

Nvidia healthy report
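
If you want to reproduce the comparison yourself, the two text reports can simply be diffed. The commands below are a rough sketch, assuming nvidia-bug-report.sh writes nvidia-bug-report.log.gz into the current directory; the renamed output files are our own naming:

$ sudo nvidia-bug-report.sh                                # run once in the normal state, once in the bug state
$ gunzip -c nvidia-bug-report.log.gz > report-normal.txt   # use report-bug.txt for the second run
$ diff report-normal.txt report-bug.txt > bug-report-comparison.diff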

Follow the instructions here:

Instructions

First, we use this command to collect the system status:

$ sudo nvsysinfo 

This command dumps a file with a name like /tmp/nvsysinfo-timestamp.random-number.out. No further information is printed.

Then, run this command:

$ sudo nvhealth &> nvhealth.log

This command also dumps a file with a name like /tmp/nvhealth-log.random-string.jsonl, and the printed information is written to nvhealth.log.
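
Since both tools write their dumps into /tmp with random suffixes, it helps to copy them into a clearly named directory right after each run. A minimal sketch, with illustrative paths of our own:

$ sudo nvsysinfo
$ sudo nvhealth &> nvhealth.log
$ mkdir -p ~/dgx-logs/with-bug            # use ~/dgx-logs/normal for the healthy run
$ cp /tmp/nvsysinfo-*.out /tmp/nvhealth-log.*.jsonl nvhealth.log ~/dgx-logs/with-bug/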

We ran these commands in both the bug state and the normal state. The comparison of the two nvhealth.log files is shown as follows:

Here we only provide the comparison report for download:

Comparison Report

We have collected all of the dumped files into a zip archive. The file names are listed in the table below:

Status     System information                  Healthy report
with-bug   nvsysinfo-201808141356.4n835u.out   nvhealth-log.kE3XdxPBlo.jsonl
normal     nvsysinfo-201808141425.Gn7yxn.out   nvhealth-log.q4DASkL1aD.jsonl
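
For reference, an archive like the one linked below can be rebuilt from the files in the table with a single zip command (the archive name here is our own choice):

$ zip dgx-hibernation-logs.zip \
      nvsysinfo-201808141356.4n835u.out nvhealth-log.kE3XdxPBlo.jsonl \
      nvsysinfo-201808141425.Gn7yxn.out nvhealth-log.q4DASkL1aD.jsonl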

You can download the zip archive here:

Zipped logs