Advanced Linux Skills for Using NVIDIA Docker: Questions and Answers

Date: Jun 14, 2018
Last Updated: Jul 13, 2018
Categories:
Playground Technique Command Q and A
Tags:
linux docker putty remote-access remote-GUI NVIDIA


Introduction

Here we collect some common errors caused by mistaken operations. Because of the limitations of the docker environment, some errors cannot be solved. When you meet such an error, it is better to delete the broken image and restore the backup version.

In the following parts, the questions are grouped by category. Please use CTRL+F, or browse the categories, to find your problem. If you meet a problem that is not discussed here, you are welcome to post it on the discussion board at the bottom of this page. I will update this page whenever a new problem is found.

Solvable problems

Docker

Q1.1: How do I close a container that I forgot to close?

A1.1: This happens when you run the container without the --rm option, or when you leave docker because the connection failed. Follow these steps:

  1. Type nvidia-docker ps -a to check whether the container is still running.
  2. Type nvidia-docker kill container_ID to stop the container that is running in the background. A killed container still appears in nvidia-docker ps -a until you remove it with nvidia-docker rm container_ID.
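
The two steps above can be combined into a small helper. This is only a sketch: it assumes that nvidia-docker forwards the stop, kill and rm subcommands to docker unchanged, and the DOCKER variable and the function name cleanup_container are illustrative names that do not come from the original setup.

```shell
# Assumption: DOCKER names the Docker front end; this page uses nvidia-docker.
DOCKER="${DOCKER:-nvidia-docker}"

# Stop a container, then delete it so that it no longer shows up in `ps -a`.
cleanup_container() {
    id=$1
    # `stop` sends SIGTERM first and SIGKILL after the timeout (10 s here);
    # fall back to an immediate `kill` only if `stop` itself fails.
    "$DOCKER" stop -t 10 "$id" || "$DOCKER" kill "$id"
    "$DOCKER" rm "$id"
}
```

Call it as cleanup_container container_ID, using the ID shown by nvidia-docker ps -a. Trying stop before kill gives the programs inside the container a chance to shut down cleanly.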

Q1.2: How do I remove a container that exited with an abnormal exit code?

A1.2: Sometimes nvidia-docker kill cannot act on a container because it has already exited due to a fatal error (with a non-zero exit code). In this case, use the following command to remove all exited containers:

$ nvidia-docker ps -a | grep Exit | cut -d ' ' -f 1 | xargs nvidia-docker rm
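
The pipeline above matches the word Exit anywhere on a line, so it could also catch a container whose name or command happens to contain that word. A possibly more robust sketch filters by container status instead. It assumes that nvidia-docker passes the ps and rm options through to docker; the DOCKER variable and the function names are illustrative only.

```shell
# Assumption: DOCKER names the Docker front end; this page uses nvidia-docker.
DOCKER="${DOCKER:-nvidia-docker}"

# Print the IDs of all containers whose status is "exited".
list_exited() {
    "$DOCKER" ps -a -q --filter status=exited
}

# Remove every exited container; do nothing when there are none.
remove_exited() {
    ids=$(list_exited)
    if [ -z "$ids" ]; then
        return 0
    fi
    # Word splitting of $ids is intentional: one argument per container ID.
    "$DOCKER" rm $ids
}
```

Running list_exited first shows what remove_exited would delete.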

Q1.3: Why can't I open a GUI inside docker?

A1.3: Maybe you forgot to run xhost + before starting your container. Some options are also required. Please compare your command with this:

$ xhost +
$ nvidia-docker run --rm -it -e DISPLAY=xx.xx.xx.xx:0 -v /tmp/.X11-unix:/tmp/.X11-unix -v local_dir:container_dir image_name:tag

(When you run docker locally, use -e DISPLAY instead of -e DISPLAY=xx.xx.xx.xx:0.)

Q1.4: Can two users disturb each other by running the same image?

A1.4: Not at all. The image is like the storage of a VM. Although you run the same image, you are working in different containers, just like different VMs loading the same storage.

Q1.5: Can I remove an image that other images depend on?

A1.5: Yes, but you need to do that with the image name rather than the image ID. For example

$ nvidia-docker rmi image_name:tag

If other images depend on this image, rmi does not delete the data. It only removes the tag, so the image is no longer accessible by that name. The other images keep working unaffected.

Q1.6: Why can't I enter the container, with an error named “GPU not found”?

A1.6: When this error appears, the remote server has met a fatal error. You need to wait until it is restarted. To check whether the error has really happened, use

$ nvidia-smi

If you cannot see anything, the error has really happened.
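
The same check can be scripted, for example before launching a container. This is a minimal sketch that relies on nvidia-smi -L, which prints one line per detected GPU. The NVIDIA_SMI variable and the function name gpu_available are illustrative names, not part of the original setup.

```shell
# Assumption: NVIDIA_SMI may be overridden; it defaults to nvidia-smi.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Succeed only if the driver answers and reports at least one GPU.
gpu_available() {
    # `nvidia-smi -L` prints one line per GPU, starting with "GPU <index>:".
    "$NVIDIA_SMI" -L 2>/dev/null | grep -q '^GPU '
}
```

For example, gpu_available || echo "GPU not visible, wait for the server to be restarted" prints the message only when no GPU is reported.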

Q1.7: Why can't I run docker without sudo?

A1.7: There are two possible reasons:

  1. You are not in the user group docker. Please contact the administrator to solve this problem.
  2. You have not deployed your docker config file. Check here to learn how to do that.
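
To tell whether the first reason applies, you can check your group membership yourself. A minimal sketch follows; the function name in_group is illustrative.

```shell
# Succeed if the current login session belongs to the named group.
in_group() {
    # `id -nG` prints the session's group names separated by spaces;
    # put one name per line and look for an exact, literal match.
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}
```

For example, in_group docker && echo yes. Note that a newly granted group membership only shows up after you log out and log in again.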

Q1.8: Should we install/uninstall apps outside docker?

A1.8: It is better not to. Because the machine is a fully customized one designed by NVIDIA, the driver version, the OS configuration and some other settings need to stay as they are. But you can still use

$ sudo apt-get update && sudo apt-get upgrade

to upgrade your packages, as long as you have not changed the repositories of your package manager.

Network Service

Q2.1: Why can't I open noVNC, with an error about websocket?

A2.1: This error is caused by an earlier mistaken operation. Maybe you closed noVNC in the wrong way. Use

$ ls -a

to see whether there are temporary files with names like x.xx.x-shared.sock in your home folder ~. If so, delete them with rm.

Then close all sessions and log in again. This may solve the problem.
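
Instead of scanning the ls -a output by eye, the leftover files can be listed with find. This sketch assumes that every stale file ends in -shared.sock, a pattern inferred from the example name above; the function name list_stale_socks is illustrative.

```shell
# List files whose names end in "-shared.sock" directly under a directory
# (the home folder by default), without descending into subfolders.
list_stale_socks() {
    find "${1:-$HOME}" -maxdepth 1 -name '*-shared.sock' -print
}
```

Review the list printed by list_stale_socks before deleting anything with rm.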

Q2.2: How do I release occupied ports when noVNC reports that a port is in use?

A2.2: Some mistaken operations may leave noVNC services running in the background even though nobody is using them. In this case, check whether such leftover processes exist:

$ ps aux | grep noVNC

Verify that these leftover processes are occupying the ports. Note their PIDs and use

$ kill -9 PID

to kill them.

Sometimes these processes were created by other users. In that case, add sudo before the command to gain the permission to kill them.
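
kill -9 gives a process no chance to release its resources, so it can be worth asking politely first. The following sketch sends the default SIGTERM, waits a few seconds, and uses -9 only as a last resort. The function name terminate and the three-second wait are arbitrary choices for illustration.

```shell
# Ask a process to exit, and force it only if it ignores the request.
terminate() {
    pid=$1
    # Plain `kill` sends SIGTERM; if the process is already gone, we are done.
    kill "$pid" 2>/dev/null || return 0
    # Poll for about three seconds to see whether the process has disappeared.
    for i in 1 2 3; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    # Last resort, as in the text above.
    kill -9 "$pid" 2>/dev/null
}
```

Call it as terminate PID for each PID found in the process list.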

Q2.3: Why can't I log in to the server even though I can open the noVNC page?

A2.3: This problem has different causes in different cases. Here are some common ones:

  1. The host IP used in noVNC does not match your container’s IP.
  2. The port entered in noVNC does not match your container’s port. Check New container_ID:port on the screen to find the right port.
  3. The password does not match your server. Use vncpasswd inside the container to change it.
  4. You did not close your container securely before saving it in the last session. Follow these steps:

    1. Remove the X11 log files. They are usually in /tmp/.X11-unix.
    2. Kill all VNC servers with vncserver -kill :port.
    3. Save the image again.
    4. Re-open the container.

    After that, follow the instructions about secure saving strictly.

  5. You may have mounted your local X11 temporary folder, which changes your default port. Remove the -v /tmp/.X11-unix:/tmp/.X11-unix option when running the container.

Q2.4: Why does my remote desktop shut down suddenly?

A2.4: You may have closed the main terminal while the desktop was running. Exit the container without saving it and re-enter it.

Applications

Q3.1: Why can't I open an application on the desktop?

A3.1: This problem has several possible causes, some of which cannot be solved. Here are the possible reasons:

  1. Some dependent libraries are missing. Use apt-get install -f to fix it.
  2. The application does not allow you to run it as root. You may have to add some options to run it, for example chromium-browser --no-sandbox to run the browser.
  3. One of the libraries that the application depends on crashes. For example, Qt5Cursor cannot run in the VNC desktop; TeXstudio may call it, which makes the application crash. This kind of problem is generally unsolvable. The known unsolvable application errors are listed in the "Unsolvable problems" section.
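
For the first reason, it may help to see which shared libraries are actually missing. A minimal sketch using ldd follows; the function name missing_libs is illustrative.

```shell
# Print the shared libraries that a binary needs but the loader cannot find.
# The exit status is 0 when at least one library is missing, 1 otherwise.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}
```

For example, missing_libs "$(command -v kile)" prints nothing when every library of kile is resolved. This only covers libraries linked at build time; plugins loaded at run time do not show up.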

Q3.2: Why can't I open chromium, with an error about GPU?

A3.2: Note that you should not log in to an account while using chromium in the container. When you reload the image, the container ID changes, which makes chromium block access to your previous account. To solve this problem, delete all your chromium user configuration files with

$ rm -rf ~/.config/chromium

The same problem occurs with google-chrome.
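
Deleting the folder also discards bookmarks and settings. A more cautious sketch moves the folder aside instead, so that it can be restored later. The function name backup_config and the .bak.<timestamp> suffix are illustrative choices.

```shell
# Move a configuration folder aside instead of deleting it.
# Prints the backup path; does nothing if the folder does not exist.
backup_config() {
    src=$1
    [ -d "$src" ] || return 0
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dest" && echo "$dest"
}
```

For example, backup_config ~/.config/chromium. Chromium creates a fresh configuration folder the next time it starts.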

Q3.3: Which applications are suitable for processing documents?

A3.3: We recommend installing AcroRead, TeXLive, Kile and LibreOffice. To do that, follow these instructions:

  1. Install AcroRead:

    $ add-apt-repository "deb http://archive.canonical.com/ precise partner"
    $ apt-get update
    $ apt install adobereader-enu
    
  2. Install TeXLive:

    $ add-apt-repository ppa:jonathonf/texlive
    $ apt update && apt install texlive-full
    
  3. Install Kile:

    $ apt-get install kile
    
  4. Install LibreOffice:

    $ apt-get install libreoffice
    

Q3.4: Why can't I use mex in MATLAB?

A3.4: Note that MATLAB R2018a requires gcc-6.3.x. If you overwrite the default gcc (which is gcc-6.3.0), you may not be able to use mex. I recommend going back to the previous backup image. Alternatively, you could try to reinstall gcc-6.3.0 by referring to these documents:

Q3.5: Is it necessary to save the image every time?

A3.5: Not really. When you only work with data in the mounted folder (if you use the provided command, the mounted folder is /homelocal), you can exit the container directly without saving it. Your changes in the mounted folder are kept.

Q3.6: How do I remove a repository that may be invalid?

A3.6: When you add an invalid repository to apt, you may be unable to run apt-get update because it fails with a 404 Not Found error. To remove the invalid repository, use this command:

$ add-apt-repository ppa:repository-name -r
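
If you do not remember the exact repository name, you can search the apt source lists for a fragment of the URL shown in the 404 message. This sketch assumes the standard Debian/Ubuntu locations /etc/apt/sources.list and /etc/apt/sources.list.d; the function name find_repo_entries is illustrative.

```shell
# Print file:line:entry for every apt source entry that contains the given text.
find_repo_entries() {
    pattern=$1
    shift
    # Default to the system locations when no paths are given.
    [ "$#" -gt 0 ] || set -- /etc/apt/sources.list /etc/apt/sources.list.d
    grep -rnF -- "$pattern" "$@" 2>/dev/null
}
```

For example, find_repo_entries launchpad lists every PPA entry together with the file that contains it.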

Q3.7: Why can't I use my pip tool?

A3.7: Sometimes you may meet an error like this when you use pip

Traceback (most recent call last):
  File "/usr/local/bin/pip", line 7, in <module>
    from pip._internal import main
ImportError: No module named _internal

This problem is caused by an incompatibility between python and the newest pip. Use these commands to fix pip for python2

$ cd ~ && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
$ python2.7 get-pip.py --force-reinstall

And use these commands to fix pip for python3

$ apt-get remove python3-pip
$ apt-get install python3-pip --reinstall
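
The traceback above comes from the launcher script /usr/local/bin/pip, not necessarily from the pip package itself. Starting pip through the interpreter with -m pip bypasses that launcher, which gives a quick way to check whether a repair is needed or has worked. A minimal sketch follows; the function name pip_status is illustrative.

```shell
# Report whether pip can be started through the given interpreter.
pip_status() {
    if "$1" -m pip --version >/dev/null 2>&1; then
        echo "ok"
    else
        echo "broken"
    fi
}
```

For example, run pip_status python2.7 and pip_status python3 before and after the fix.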

Q3.8: Why is a GUI displayed abnormally?

A3.8: Maybe this GUI requires 24-bit depth mode, whereas vnc4server uses 16-bit depth by default. To fix this problem, change your launch script and add the depth option to your vnc4server command like this:

$ vnc4server -depth 24 -geometry ...
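
To confirm which depth the running desktop actually uses, you can read it from the output of xdpyinfo. This sketch assumes that xdpyinfo is installed in the container and that it prints a line of the form "depth of root window: 24 planes". The function name root_depth is illustrative.

```shell
# Read xdpyinfo output on standard input and print the root window depth.
root_depth() {
    # The line looks like "  depth of root window:    24 planes",
    # so the number is the fifth whitespace-separated field.
    awk '/depth of root window/ { print $5 }'
}
```

Inside the VNC session, run xdpyinfo | root_depth before and after changing the launch script.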

Unsolvable problems

Note that these problems are not solvable yet. If you do not know how to tackle them, it is better to restore your backup images. If you have ideas about these problems, please tell us on the discussion board. We appreciate your help!

Here we collect error reports for problems that cannot be solved when you use the VNC server. The VNC server disables RGB24 mode and some advanced graphics settings to accelerate the connection to the desktop, which makes some applications crash.

Packages that could not be “made”

  1. OpenCV-3.4.1 with CMake: Although the build succeeds, most of the tests fail when you check it with make test -j32.
  2. MatCaffe with CMake: You cannot use make to build caffe if you install opencv3 (because it cannot be installed normally), and you cannot use cmake to build matlab-caffe.
  3. FFmpeg: Cannot be linked with the non-free library libaom. If we do not enable it, ffmpeg builds successfully.

Applications

  1. TeXStudio: Fails with the error qt_xcb_createCursorXRender: query_pict_formats failed.
  2. ParaView: Fails with the error X Error: BadLength (poly request too large or internal Xlib length error) 16.
  3. VLC: Fails with the error Unsupported screen format: depth: 16, red_mask: 3f, blue_mask: f800.