Pulling Job Statistics

This page will help researchers gather system usage information about their running and completed jobs.

Info on Running Jobs

seff and sacct cannot pull accurate statistics on currently running jobs. To see the current CPU, memory, and GPU usage, you will need to connect to the node during execution.

If you have an sbatch script running, use the myjobs command to find the node your job is running on. The target node to connect to is listed under NODELIST.

[rcsparky@login02:~]$ myjobs
   JobID ...  PARTITION/QOS   NAME   STATE    ... Node/Core/GPU  NODELIST(REASON)
11273558 ...  general/public  myjob  RUNNING  ... 1/1/NA         sc008

In the example above, the job is running on node sc008. You can connect directly to that node with ssh.

tip

You can only ssh to nodes on which you currently have a job running.

[rcsparky@login01:~]$ ssh sc008
[rcsparky@sc008:~]$

The prompt changed from username@login01 to username@sc008, indicating a successful connection to the compute node sc008. Once on the node, you can see real-time resource usage.
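The node lookup can also be sketched with standard squeue options, which may help when scripting. The function names below are hypothetical helpers, not site-provided commands, and the example assumes a single-node job; multi-node jobs report a compressed list such as sc[008-010].

```shell
# List "JOBID NODELIST" for the current user's running jobs.
# Standard squeue options: -h (no header), -t (state filter), -o (output format).
running_job_nodes() {
    squeue -u "${USER:-$(id -un)}" -h -t RUNNING -o "%i %N"
}

# Print the node list for one job ID; prints nothing if that job is not running.
node_for_job() {
    running_job_nodes | awk -v id="$1" '$1 == id { print $2 }'
}
```

For example, `ssh "$(node_for_job 11273558)"` would connect to the node of that job, provided it is running on exactly one node.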

CPU Usage - htop

htop shows the CPU and memory usage of the processes running in your job.

$ htop
top - 15:12:17 up 62 days, 16:04,  6 users,  load average: 83.96, 83.95, 83.37
Tasks: 1490 total,   7 running, 1483 sleeping,   0 stopped,   0 zombie
%Cpu(s): 63.6 us,  0.8 sy,  0.0 ni, 35.2 id,  0.1 wa,  0.3 hi,  0.0 si,  0.0 st
KiB Mem : 52768358+total, 43774208+free, 44024912 used, 45916584 buff/cache
KiB Swap:  4194300 total,  1393524 free,  2800776 used. 47617328+avail Mem 

    PID    PPID USER      PR    VIRT    RES    SHR   SWAP    DATA   TIME S  %CPU  %MEM nMaj nMin COMMAND    P nDRT   CODE   UID   GID     TIME+ RUSER    WCHAN      nTH  NI Flags    TTY      OOMs ENVIRON  
 931372  928130 rcsparky  20 7162680   1.0g 341492      0 1145200   0:07 S  69.1   0.2  138 191k /packag+ 125    0    144 1724+ 9999+   0:07.10 rcsparky -           44   0 4...4... ?         668 SLURM_M+ 
 931369  929093 rcsparky  20   67228   7016   4456      0    3072   0:00 R   1.0   0.0    0  828 top -u + 101    0    108 1724+ 9999+   0:00.34 rcsparky -            1   0 ....4... pts/0     666 LS_COLO+ 
 928130  928126 rcsparky  20   12868   3100   2908    168     536   5:02 S   0.0   0.0  516 1417 /bin/ba+ 125    0   1056 1724+ 9999+   5:02.62 rcsparky -            1   0 ....41.. ?         666 LS_COLO+ 
 929092  929086 rcsparky  20  175560   5592   4208      0    1232   0:00 S   0.0   0.0    0  175 sshd: j+  99    0    824 1724+ 9999+   0:00.01 rcsparky -            1   0 ....414. ?         666 -        
 929093  929092 rcsparky  20   24656   4516   3420      0    1196   0:00 S   0.0   0.0    0 2805 -bash    100    0   1056 1724+ 9999+   0:00.70 rcsparky -            1   0 ....4... pts/0     666 LANG=en+

This shows the processes and their CPU and memory usage. CPU usage is listed as a percentage. Full utilization of 1 CPU is represented by 100.0%, and likewise full utilization of 8 CPUs is represented by 800.0%.

tip

Multithreaded applications may also be shown in htop as 8 threads at 100.0% each, which likewise indicates the usage of 8 CPU cores.
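The percentage-to-cores conversion above can be sketched as a small helper. Both function names are hypothetical. Note one assumption that matters: ps reports %CPU averaged over each process's lifetime, so the result will not exactly match the instantaneous figures in htop.

```shell
# Sum %CPU values (one per line on stdin) and print the equivalent
# number of fully busy cores, e.g. 800.0 -> 8.00.
pcpu_to_cores() {
    awk '{ total += $1 } END { printf "%.2f\n", total / 100 }'
}

# Hypothetical wrapper: feed the %CPU of all of your processes on this node
# into the converter. "pcpu=" prints the column without a header.
my_cores_in_use() {
    ps -u "${USER:-$(id -un)}" -o pcpu= | pcpu_to_cores
}
```

Comparing the result with the number of cores your job requested gives a rough idea of whether the allocation is being used.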

GPU Usage - nvtop

nvtop displays GPU usage for jobs that have been allocated resources on GPU nodes.

$ nvtop
Device 0 [NVIDIA A100-SXM4-80GB] PCIe GEN 4@16x RX: 1.089 GiB/s TX: 1.178 GiB/s
GPU 1410MHz MEM 1593MHz TEMP  38°C FAN N/A% POW 116 / 500 W
GPU[||||||||||||||||         44%] MEM[                1.238Gi/80.000Gi]
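For a one-shot, scriptable reading rather than the interactive nvtop view, a sketch using nvidia-smi's CSV query mode is shown below. It assumes nvidia-smi is installed on the GPU node, and gpu_summary is a hypothetical helper name.

```shell
# Print utilization and memory use for each visible GPU.
# With "nounits", memory.used and memory.total are plain MiB values.
gpu_summary() {
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits |
    awk -F', ' '{ printf "GPU %s: %s%% util, %.1f%% memory\n", $1, $2, 100 * $3 / $4 }'
}
```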

Info on Completed Jobs

Once a job has completed, been canceled, or failed, you can pull its resource statistics from the job scheduler's accounting database using seff and sacct (or its shortcut, mysacct).

seff

seff is short for "Slurm efficiency" and displays the percentage of the allocated CPU and memory that a job actually used over the time it ran. The goal is high efficiency, so that jobs do not allocate resources they are not using.

seff is detailed here.
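To make the percentages concrete, here is a minimal sketch of the arithmetic, assuming the usual definitions: CPU efficiency is CPU time used divided by (allocated cores × elapsed time), and memory efficiency is peak memory used divided by memory requested. The function names are hypothetical and are not part of seff.

```shell
# CPU efficiency (%) = CPU seconds used / (allocated cores * elapsed seconds) * 100
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v elapsed="$3" \
        'BEGIN { printf "%.2f\n", 100 * used / (cores * elapsed) }'
}

# Memory efficiency (%) = peak memory used / memory requested * 100
# (both arguments must be in the same unit)
mem_efficiency() {
    awk -v used="$1" -v requested="$2" \
        'BEGIN { printf "%.2f\n", 100 * used / requested }'
}
```

For example, a job that requested 4 cores, ran for one hour, and consumed one hour of CPU time in total is 25% CPU-efficient: `cpu_efficiency 3600 4 3600` prints 25.00.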

sacct

Specify either a job ID or a username with the --jobs or --user flag, respectively, to pull up all information on a job:

By job ID:

sacct --jobs=12345678
sacct --jobs=12345678,12345679

By user:

sacct --user=<username>

Some of the available --format variables are listed in the table below; they may be passed as a comma-separated list.

sacct --user=<username> --format=<var_1[,var_2,...,var_N]>

Variable	Description
account	Account the job ran under.
allocTRES	Allocated trackable resources (e.g. cores/RAM).
avecpu	Average CPU time of all tasks in the job.
cputime	Core time used (elapsed time * number of cores), formatted.
elapsed	The job's elapsed time, formatted as DD-HH:MM:SS.
state	The job's state.
jobid	The ID of the job.
jobname	The name of the job.
maxdiskread	Maximum number of bytes read.
maxdiskwrite	Maximum number of bytes written.
maxrss	Maximum RAM use of all tasks in the job.
ncpus	The number of allocated CPUs.
nnodes	The number of allocated nodes.
ntasks	The number of tasks in the job.
priority	Slurm priority.
qos	Quality of service.
user	Username of the person who ran the job.
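As an illustration of working with these fields in a script, the sketch below converts maxrss values to GiB. It assumes sacct is run with the standard --parsable2 and --noheader options so the output is pipe-delimited, and that maxrss carries a K, M, or G suffix (or none, for bytes). maxrss_report is a hypothetical helper name.

```shell
# Read "JobID|MaxRSS" lines on stdin and print each job step's peak memory in GiB.
# Lines with an empty MaxRSS (typically the top-level job record) are skipped.
maxrss_report() {
    awk -F'|' '$2 != "" {
        value = $2 + 0                  # numeric part, e.g. 1048576 from "1048576K"
        unit = substr($2, length($2))   # last character, e.g. "K"
        if (unit == "K")      gib = value / (1024 * 1024)
        else if (unit == "M") gib = value / 1024
        else if (unit == "G") gib = value
        else                  gib = value / (1024 * 1024 * 1024)   # no suffix: bytes
        printf "%s %.2f GiB\n", $1, gib
    }'
}
```

A possible invocation is `sacct --jobs=12345678 --noheader --parsable2 --format=jobid,maxrss | maxrss_report`.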

mysacct

For convenience, we have created the command alias mysacct. It provides a quick, easy-to-remember shortcut for retrieving the most commonly needed job information.

This is equivalent to sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state and accepts the same flags that sacct would, e.g. --starttime=YYYY-MM-DD or --endtime=YYYY-MM-DD.
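As a rough sketch of how such a shortcut can pass extra flags through to sacct, consider the function below. This is an assumption about the mechanism, not the actual site implementation, and mysacct_sketch is a hypothetical name chosen to avoid shadowing the real command.

```shell
# Hypothetical stand-in for the site-provided mysacct shortcut.
# "$@" forwards any extra flags (e.g. --starttime=YYYY-MM-DD) to sacct unchanged.
mysacct_sketch() {
    sacct --user="${USER}" \
          --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state "$@"
}
```

With this definition, `mysacct_sketch --starttime=2020-12-15` runs the full sacct command with the start time appended.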

Examples for better understanding job hardware utilization

Note that by default, only jobs run on the current day will be listed. To search within a different period of time, use the --starttime flag. The --long flag can also be used to show a non-abbreviated version of sacct output. For example, to list detailed job characteristics for a user's jobs since December 15th, 2020:

sacct --user=$USER --starttime=2020-12-15 --long
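Instead of typing a fixed date, the start time can be computed relative to today. The sketch below assumes GNU date (standard on Linux), whose -d option accepts phrases such as "7 days ago"; sacct_last_days is a hypothetical helper name.

```shell
# List your jobs from the last N days (default: 7) in sacct's long format.
# date +%F prints YYYY-MM-DD, the format --starttime expects.
sacct_last_days() {
    days="${1:-7}"
    sacct --user="${USER}" --starttime="$(date -d "${days} days ago" +%F)" --long
}
```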

This produces a lot of output. For more focused output, the following complete command lists the jobs a user ran today, showing each job's ID, average CPU time, maximum amount of RAM (memory) used, core time (wall time multiplied by the number of cores allocated), and state:

sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,state
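Time fields such as avecpu and cputime are printed as formatted strings (for example DD-HH:MM:SS), which are awkward to compare or divide. The following sketch converts such a string to seconds; slurm_time_to_seconds is a hypothetical helper, and it assumes the day part, when present, is separated by a hyphen.

```shell
# Convert a Slurm time string ([DD-]HH:MM:SS or MM:SS) to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $1
        # Split off the optional "DD-" prefix.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        # Fold the colon-separated parts into seconds, left to right.
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        print days * 86400 + secs
    }'
}
```

For instance, `slurm_time_to_seconds 1-02:03:04` prints 93784. Dividing the core time in seconds by the number of allocated cores recovers the wall time.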

warning

If a + is listed at the end of a field, that field has likely been truncated to fit into a fixed number of characters. Consider increasing the width by appending a % followed by a number to specify a new width. For example, allocTRES%42 overrides the default width to 42 characters.
