
This page describes how to monitor a multi-node GPU cluster on Gadi using gpustat, a GPU monitoring tool that displays information from multiple GPU nodes on a single page. gpustat is an open-source project hosted in a public repository; we run a modified version compiled specifically for Gadi. It works for both interactive and batch jobs.

To run the monitoring tool, log in to Gadi, load the module, and launch the tool from a login node. The monitoring tool itself runs on the first node of the GPU cluster. When gpustat is called, it prints an SSH command that can be copied and used to connect directly from your local computer to the Gadi GPU cluster, so multiple GPU nodes can be reached with a single command. On Mac or Linux, the command can be run from a terminal. On Windows, use an SSH client such as PuTTY.

You need to set up passwordless SSH login between Gadi nodes to run gpustat.

First get the task running

Submit a PBS script and make sure the GPU job is running before launching gpustat, because the tool needs the GPU node addresses to establish a connection. The PBS job must include "gdata/dk92" in its storage request to use gpustat.
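As a minimal sketch of such a job script, the snippet below writes a hypothetical PBS file whose storage request includes gdata/dk92. The file name, queue name, resource amounts, and workload command are assumptions for illustration and should be replaced with values that suit your project; only the gdata/dk92 entry comes from the requirement above.

```shell
# Write a hypothetical PBS job script (file name and resource values are assumptions).
cat > gpu_job.pbs <<'EOF'
#!/bin/bash
#PBS -q gpuvolta
#PBS -l ngpus=16
#PBS -l ncpus=192
#PBS -l mem=256GB
#PBS -l walltime=01:00:00
#PBS -l storage=gdata/dk92

# Placeholder workload: replace with your own GPU application.
./my_gpu_application
EOF

# Confirm that the storage request needed by gpustat is present.
grep -- 'storage=' gpu_job.pbs
```

The script would then be submitted with qsub gpu_job.pbs. If the job already requests other storage, the entries are typically joined with a plus sign in the same storage directive.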


Next, we explain how to run gpustat in a few easy steps. All the files are packaged as a module to make this easier for Gadi users. There are four steps: the first three are performed on Gadi, and the last is performed on your local machine.

On Gadi

  1. First, submit the PBS script and wait for the GPUs to be allocated. gpustat requires CUDA drivers to operate; running it on a non-GPU node will cause a run-time error. The following example shows the submission of a PBS job in which 16 GPUs across 4 nodes have been allocated. You can request any other number of GPUs. Wait for the job to start and note the job number.

    Get the Job Number
    qsub: waiting for job 49068096.gadi-pbs to start
    qsub: job 49068096.gadi-pbs ready
  2. Once the job has started, we can load the module and run gpustat in less than a minute. The following module is required for gpustat.

    Load module
    $  module use /g/data/dk92/apps/Modules/modulefiles
    $  module load gpustat/1.0
  3. After the module is loaded, run the gpustat-run command from the Gadi login node. It requires only the PBS job number as input and uses the following format:
    gpustat-run <job_no>.gadi-pbs

    The command above starts a web server on Gadi. The output shows key information about the cluster, including all allocated GPU nodes and the server port number. An example output is shown below.

    The last line is the most important: it prints the exact command for port forwarding. Copy this line, which contains all the information required to start a port forwarding session from your local machine.
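The job number needed by gpustat-run can also be picked out of the qsub messages by a script. The sketch below is only an illustration: the helper name extract_job_id is hypothetical, and it assumes the message format shown in the example above.

```shell
# Extract the PBS job ID from qsub's "ready" message.
# extract_job_id is a hypothetical helper; the message format is taken from the example above.
extract_job_id() {
  sed -n 's/^qsub: job \([0-9][0-9]*\.gadi-pbs\) ready$/\1/p'
}

# Feed the two example messages through the helper; only the "ready" line matches.
job_id=$(printf '%s\n' \
  'qsub: waiting for job 49068096.gadi-pbs to start' \
  'qsub: job 49068096.gadi-pbs ready' | extract_job_id)

# Print the command to run on the login node once the module is loaded.
echo "gpustat-run ${job_id}"
```

With the example messages, this prints gpustat-run 49068096.gadi-pbs, which matches the format given in step 3.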

On the local machine 

  1. Now run the copied command in a terminal on your local computer. This establishes a connection to the web server and collects information from all GPU nodes. There is no need to connect to each GPU node manually, which makes GPU cluster monitoring easier.

    Run on local terminal
    % ssh -N -L
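The command printed by gpustat-run is specific to each session and should be copied verbatim. For orientation only, the sketch below assembles the general shape of an SSH local port-forwarding command. The user name, host names, and ports are hypothetical placeholders, not values produced by gpustat-run.

```shell
# All values below are hypothetical placeholders for illustration.
local_port=16881                 # port opened on your local machine
remote_host="gpu-node.example"   # host running the web server, as seen from the login host
remote_port=16881                # port the web server listens on
login="user@login.example"       # account and host you connect to over SSH

# -N: do not run a remote command.
# -L: forward local_port on this machine to remote_host:remote_port.
forward_cmd="ssh -N -L ${local_port}:${remote_host}:${remote_port} ${login}"
echo "${forward_cmd}"
```

The session stays open without showing a prompt because of -N; closing it, for example with Ctrl+C, ends the forwarding.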

New command

Each time you run the gpustat-run command, a new command is generated with new domain names and port numbers. Copy the new command and run it from the local terminal to start a new session. Old commands cannot be reused.

Display on local browser

Note the port number in the above command. Open a browser and enter localhost followed by the port number; the output will be displayed in the browser window. The port number appears in the output of the gpustat-run command above.
For example, if the port number is 16881, use the following address in your local browser: localhost:16881. A screenshot of the output is shown below. One can see that 16 GPUs are grouped in four nodes. All information is updated at five-second intervals. The port number changes each time the command is run, and an old port number cannot be reused.
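If you prefer not to read the port off by eye, it can be taken from the forwarding command itself. The sketch below assumes the command has the usual ssh -L form with the local port as the first field; the command string is a hypothetical stand-in for the one printed by gpustat-run.

```shell
# Hypothetical forwarding command; the real one is printed by gpustat-run.
forward_cmd='ssh -N -L 16881:gpu-node.example:16881 user@login.example'

# The first field of the -L argument is the local port.
port=$(printf '%s\n' "$forward_cmd" | sed -n 's/.*-L \([0-9][0-9]*\):.*/\1/p')

# Address to open in the local browser.
echo "http://localhost:${port}"
```

With this stand-in command, the printed address is http://localhost:16881.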

Display disconnect

The server is running on the first node of the GPU cluster. When the allocated wall time for the task is over and GPUs are released back to the system, the web server will stop operating. 

Module file

All files are located in: /g/data/dk92/apps/gpustat.
The module file is located in:  /g/data/dk92/apps/Modules/modulefiles.

If the files are to be relocated, then make sure all the executable files are moved together and placed in the same folder. Furthermore, the working directory path in the module file must point to the new folder.
