Pulling Job Statistics

This page will help researchers gather system usage information about their running and completed jobs.

Info on Running Jobs

seff and sacct cannot pull accurate statistics for currently running jobs. To see the current CPU, memory, and GPU usage, you will need to connect to the node while the job is executing.

If you have an sbatch script running, you can use the myjobs command to find the node your job is running on. The node to connect to is listed under NODELIST.

[rcsparky@login02:~]$ myjobs
JobID ... PARTITION/QOS NAME STATE ... Node/Core/GPU NODELIST(REASON)
11273558 ... general/public myjob RUNNING ... 1/1/NA sc008

In the example above, the job is running on node sc008. You can connect directly to that node with ssh.

tip

You can only ssh to nodes on which you currently have a job running.

[rcsparky@login01:~]$ ssh sc008
[rcsparky@sc008:~]$

The prompt changed from username@login01 to username@sc008, indicating a successful connection to the compute node sc008. Once on the node, we can see the real-time resource usage.
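
The node lookup above can also be scripted. The sketch below is not part of the cluster's tooling: node_for_job is a hypothetical helper that reads myjobs-style output on standard input and prints the last column (NODELIST) of the row whose first column matches a job ID. It assumes the column layout shown in the example, and it only parses text.

```shell
#!/usr/bin/env bash
# node_for_job JOBID: read myjobs-style output on stdin and print the
# NODELIST column (assumed to be the last field) of the matching row.
# Hypothetical helper for illustration only.
node_for_job() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Example using the sample output from this page:
sample='JobID ... PARTITION/QOS NAME STATE ... Node/Core/GPU NODELIST(REASON)
11273558 ... general/public myjob RUNNING ... 1/1/NA sc008'
printf '%s\n' "$sample" | node_for_job 11273558
```

On the cluster, the same function could be fed from the real command, for example myjobs | node_for_job 11273558, provided the output keeps the layout shown. For a job that is not yet running, the last column holds a reason rather than a node name.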

CPU Usage - htop

htop will show the CPU and memory usage of the processes running in our job.

$ htop
top - 15:12:17 up 62 days, 16:04, 6 users, load average: 83.96, 83.95, 83.37
Tasks: 1490 total, 7 running, 1483 sleeping, 0 stopped, 0 zombie
%Cpu(s): 63.6 us, 0.8 sy, 0.0 ni, 35.2 id, 0.1 wa, 0.3 hi, 0.0 si, 0.0 st
KiB Mem : 52768358+total, 43774208+free, 44024912 used, 45916584 buff/cache
KiB Swap: 4194300 total, 1393524 free, 2800776 used. 47617328+avail Mem

PID PPID USER PR VIRT RES SHR SWAP DATA TIME S %CPU %MEM nMaj nMin COMMAND P nDRT CODE UID GID TIME+ RUSER WCHAN nTH NI Flags TTY OOMs ENVIRON
931372 928130 rcsparky 20 7162680 1.0g 341492 0 1145200 0:07 S 69.1 0.2 138 191k /packag+ 125 0 144 1724+ 9999+ 0:07.10 rcsparky - 44 0 4...4... ? 668 SLURM_M+
931369 929093 rcsparky 20 67228 7016 4456 0 3072 0:00 R 1.0 0.0 0 828 top -u + 101 0 108 1724+ 9999+ 0:00.34 rcsparky - 1 0 ....4... pts/0 666 LS_COLO+
928130 928126 rcsparky 20 12868 3100 2908 168 536 5:02 S 0.0 0.0 516 1417 /bin/ba+ 125 0 1056 1724+ 9999+ 5:02.62 rcsparky - 1 0 ....41.. ? 666 LS_COLO+
929092 929086 rcsparky 20 175560 5592 4208 0 1232 0:00 S 0.0 0.0 0 175 sshd: j+ 99 0 824 1724+ 9999+ 0:00.01 rcsparky - 1 0 ....414. ? 666 -
929093 929092 rcsparky 20 24656 4516 3420 0 1196 0:00 S 0.0 0.0 0 2805 -bash 100 0 1056 1724+ 9999+ 0:00.70 rcsparky - 1 0 ....4... pts/0 666 LANG=en+

This shows the processes and their CPU and memory usage. CPU usage is listed as a percentage. Full utilization of 1 CPU is represented by 100.0%, and likewise full utilization of 8 CPUs is represented by 800.0%.

tip

Multithreaded applications may also be shown in htop as 8 threads at 100.0% each, which likewise indicates that 8 CPU cores are in use.
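
To make the percentage convention concrete, the sketch below converts a %CPU reading into an approximate number of busy cores by dividing by 100. The function name cores_busy is hypothetical, and the inputs are the 69.1% reading from the sample output and the 800.0% case described above.

```shell
#!/usr/bin/env bash
# cores_busy PCT: convert a %CPU value (100.0 = one fully used core)
# into an approximate number of busy cores.
cores_busy() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", pct / 100 }'
}

cores_busy 69.1    # the 69.1% process in the sample output -> 0.69
cores_busy 800.0   # eight fully used cores -> 8.00
```

Comparing this number with the cores the job requested gives a quick sense of whether the allocation is being used.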

GPU Usage - nvtop

nvtop will display GPU usage for a job allocated resources on GPU nodes.

$ nvtop
Device 0 [NVIDIA A100-SXM4-80GB] PCIe GEN 4@16x RX: 1.089 GiB/s TX: 1.178 GiB/s
GPU 1410MHz MEM 1593MHz TEMP 38°C FAN N/A% POW 116 / 500 W
GPU[|||||||||||||||| 44%] MEM[ 1.238Gi/80.000Gi]

Info on Completed Jobs

Once a job has completed, been canceled, or failed, you can pull its resource statistics from the job scheduler's accounting database with seff and sacct (or the mysacct shortcut).

seff

seff is short for "Slurm efficiency" and displays the percentage of the allocated CPU and memory that a job actually used over the time it ran. The goal is high efficiency, so that jobs do not allocate resources they are not using.

seff is detailed here.
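
As a rough illustration of the idea behind CPU efficiency, the sketch below divides the CPU time a job actually used by the core time it had available (allocated cores multiplied by wall time). The function name and its arguments are hypothetical, the numbers are made up, and seff's own calculation may differ in detail.

```shell
#!/usr/bin/env bash
# cpu_efficiency USED_CPU_SECONDS CORES WALL_SECONDS
# Print the share of available core time that was actually used, in percent.
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.1f\n", 100 * used / (cores * wall) }'
}

# A job holding 4 cores for 1800 s of wall time that used 3600 s of CPU time
# kept, on average, 2 of its 4 cores busy.
cpu_efficiency 3600 4 1800   # -> 50.0
```

A low figure suggests the job could request fewer cores. To see the real figures for a finished job, run seff followed by the job ID.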

sacct

Specify either a job ID or a username with the --jobs or --user flag, respectively, to pull up all information on a job:

sacct --jobs=12345678
sacct --jobs=12345678,12345679
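
The choice between the two flags can be wrapped in a small helper. The sketch below is hypothetical and not part of Slurm or this site's tooling: sacct_for assumes that job IDs start with a digit and usernames do not, and passes the argument to --jobs or --user accordingly.

```shell
#!/usr/bin/env bash
# sacct_for TARGET: hypothetical wrapper that picks the flag for you.
# Assumption: job IDs start with a digit (e.g. 12345678 or
# 12345678,12345679) and usernames do not.
sacct_for() {
    case "$1" in
        [0-9]*) sacct --jobs="$1" ;;
        *)      sacct --user="$1" ;;
    esac
}

# Usage (requires Slurm's sacct on the PATH):
#   sacct_for 12345678,12345679
#   sacct_for "$USER"
```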

mysacct

For convenience, we have created the command alias mysacct. It provides a quick, easy-to-remember shortcut for getting the most common and useful pieces of information.

This is equivalent to sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state and accepts the same flags that sacct would, e.g. --starttime=YYYY-MM-DD or --endtime=YYYY-MM-DD.
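
For readers who want the same shortcut elsewhere, the sketch below shows one way such a wrapper could be written as a shell function. It is an assumed definition under the name mysacct_sketch. The site's real mysacct may be implemented differently.

```shell
#!/usr/bin/env bash
# mysacct_sketch: assumed stand-in for the mysacct shortcut described above.
# It fixes the user and the format string, then forwards any extra flags
# (for example --starttime=YYYY-MM-DD) to sacct unchanged.
mysacct_sketch() {
    sacct --user="$USER" \
          --format=jobid,avecpu,maxrss,cputime,allocTRES%42,state "$@"
}

# Usage (requires Slurm's sacct on the PATH):
#   mysacct_sketch --starttime=2020-12-15
```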

Examples for better understanding job hardware utilization

Note that by default, only jobs run on the current day will be listed. To search within a different period of time, use the --starttime flag. The --long flag can also be used to show a non-abbreviated version of sacct output. For example, to list detailed job characteristics for a user's jobs since December 15th, 2020:

sacct --user=$USER --starttime=2020-12-15 --long
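
When the start of the window is relative ("the last week") rather than a fixed date, the value for --starttime can be computed. The sketch below assumes GNU date, which is the usual date on Linux systems. The helper name days_ago is hypothetical.

```shell
#!/usr/bin/env bash
# days_ago N: print the date N days before today as YYYY-MM-DD.
# Assumes GNU date (the -d option).
days_ago() {
    date -d "$1 days ago" +%Y-%m-%d
}

days_ago 7

# Usage (requires Slurm's sacct on the PATH):
#   sacct --user="$USER" --starttime="$(days_ago 7)"
```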

This produces a lot of output. As an example for formatted output, the following complete command will list information about jobs that ran today for a user, specifically information about the job's id, average CPU use, maximum amount of RAM (memory) used, the core time (wall time multiplied by number of cores allocated), and the job's state:

sacct --user=$USER --format=jobid,avecpu,maxrss,cputime,state
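
The core time mentioned above is simple arithmetic: wall time multiplied by the number of allocated cores. The sketch below works through one made-up case. The function name core_time is hypothetical, and the output is plain HH:MM:SS, so unlike sacct it does not split out days for long jobs.

```shell
#!/usr/bin/env bash
# core_time WALL_SECONDS CORES: print wall time x cores as HH:MM:SS.
core_time() {
    local total=$(( $1 * $2 ))
    printf '%02d:%02d:%02d\n' \
        "$(( total / 3600 ))" "$(( total % 3600 / 60 ))" "$(( total % 60 ))"
}

core_time 5400 8   # 1.5 hours of wall time on 8 cores -> 12:00:00
```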

warning

If a + is listed at the end of a field, then that field has likely been truncated to fit into a fixed number of characters. Consider increasing the width by appending a % followed by a number to the field name. For example, allocTRES%42 overrides the default width and sets it to 42 characters.
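
A quick way to spot truncated fields in saved output is to look for values that end in +. The sketch below is a hypothetical helper that prints every whitespace-separated field ending in +. The sample row is made up for illustration and is not real sacct output.

```shell
#!/usr/bin/env bash
# truncated_fields: read sacct-style text on stdin and print each
# whitespace-separated field that ends in '+', i.e. was likely cut off.
truncated_fields() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\+$/) print $i }'
}

# Made-up sample row with one truncated field:
printf '%s\n' '11273558 billing=1,cpu=1,me+ COMPLETED' | truncated_fields
```

Any field this reports is a candidate for a wider column, for example allocTRES%42 in the --format list.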