numatop - a tool for memory access locality characterization and analysis.

Description

This manual page briefly documents the numatop command.

Most modern systems use a Non-Uniform Memory Access (NUMA) design for multiprocessing. In NUMA systems,
memory and processors are organized in such a way that some parts of memory are closer to a given
processor, while other parts are farther from it. A processor can access memory that is closer to it much
faster than the memory that is farther from it. Hence, the latency between the processors and different
portions of the memory in a NUMA machine may be significantly different.

numatop is an observation tool for runtime memory locality characterization and analysis of processes and
threads running on a NUMA system. It helps the user to characterize the NUMA behavior of processes and
threads and to identify where the NUMA-related performance bottlenecks reside. The tool uses hardware
performance counter sampling technologies and associates the performance data with Linux system runtime
information to provide real-time analysis in production systems. The tool can be used to:

A) Characterize the locality of all running processes and threads to identify those with the poorest
locality in the system.

B) Identify the "hot" memory areas, report average memory access latency, and provide the location where
accessed memory is allocated. A "hot" memory area is where process/thread(s) accesses are most frequent.
numatop has a metric called "ACCESS%" that specifies what percentage of memory accesses are attributable
to each memory area.

Note: numatop records only the memory accesses which have latencies greater than a predefined threshold
(128 CPU cycles).

C) Provide the call-chain(s) in the process/thread code that accesses a given hot memory area.

D) Provide the call-chain(s) when the process/thread generates certain counter events (RMA/LMA/IR/CYCLE).
The call-chain(s) help to locate the source code that generates the events.

RMA: Remote Memory Access.
LMA: Local Memory Access.
IR: Instruction Retired.
CYCLE: CPU cycles.

E) Provide per-node statistics for memory and CPU utilization. A node is a region of memory in which
every byte has the same distance from each CPU.

F) Show, using a user-friendly interface, the list of processes/threads sorted by a chosen metric (by
default, CPU utilization), with the top process having the highest CPU utilization in the system and
the bottom one having the lowest. Users can also use hotkeys to re-sort the output by these metrics:
RMA, LMA, RMA/LMA, CPI, and CPU%.

RMA/LMA: ratio of RMA/LMA.
CPI: CPU cycles per instruction.
CPU%: CPU utilization.

numatop is a GUI tool that periodically tracks and analyzes the NUMA activity of processes and threads
and displays useful metrics. Users can scroll up/down by using the up or down key to navigate in the
current window and can use the hotkeys shown at the bottom of the window to switch between windows or
to change the running state of the tool. For example, hotkey 'R' refreshes the data in the current
window.

Below is a detailed description of the various display windows and the data items that they display:

[WIN1 - Monitoring processes and threads]:
Get the locality characterization of all processes. This is the first window shown at startup and serves
as numatop's "Home" window. It displays a list of processes. The top process has the highest system CPU
utilization (CPU%), while the bottom process has the lowest CPU% in the system. Memory-intensive
processes are generally also CPU-intensive, so the processes shown in this window are sorted by CPU% by
default. The user can press hotkeys '1', '2', '3', '4', or '5' to re-sort the output by "RMA", "LMA",
"RMA/LMA", "CPI", or "CPU%".

[KEY METRICS]:
RMA(K): number of Remote Memory Access (unit is 1000).
RMA(K) = RMA / 1000;
LMA(K): number of Local Memory Access (unit is 1000).
LMA(K) = LMA / 1000;
RMA/LMA: ratio of RMA/LMA.
CPI: CPU cycles per instruction.
CPU%: system CPU utilization (busy time across all CPUs).
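
The formulas above can be sketched in Python as follows. This only illustrates the arithmetic: the
function name `win1_metrics` and the sample counter values are hypothetical, and returning 0.0 for a
zero denominator is an assumption, not documented numatop behavior.

```python
def win1_metrics(rma, lma, cycles, instructions):
    """Derive the WIN1 key metrics from raw counter values.

    All four inputs are raw event counts for one process over one
    sampling interval (hypothetical values, not numatop output).
    """
    # Assumption: report 0.0 when a denominator is zero.
    ratio = rma / lma if lma else 0.0
    cpi = cycles / instructions if instructions else 0.0
    return {
        "RMA(K)": rma / 1000,
        "LMA(K)": lma / 1000,
        "RMA/LMA": ratio,
        "CPI": cpi,
    }


metrics = win1_metrics(rma=250_000, lma=1_000_000,
                       cycles=3_000_000_000, instructions=2_000_000_000)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these sample counts, one remote access occurs for every four local ones (RMA/LMA of 0.25).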

[HOTKEY]:
Q: Quit the application.
H: WIN1 refresh.
R: Refresh to show the latest data.
I: Switch to WIN2 to show the normalized data.
N: Switch to WIN11 to show the per-node statistics.
1: Sort by RMA.
2: Sort by LMA.
3: Sort by RMA/LMA.
4: Sort by CPI.
5: Sort by CPU%.
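
The effect of the sort hotkeys can be pictured with a small Python sketch. The process rows are made
up, `sort_rows` is a hypothetical helper, and ordering every metric from highest to lowest is an
assumption based on the default CPU% ordering described above.

```python
# Hypothetical per-process rows; the numbers are made up for illustration.
processes = [
    {"PID": 101, "RMA": 900, "LMA": 3000, "CPI": 1.2, "CPU%": 40.0},
    {"PID": 202, "RMA": 5000, "LMA": 1000, "CPI": 2.5, "CPU%": 25.0},
    {"PID": 303, "RMA": 100, "LMA": 8000, "CPI": 0.8, "CPU%": 60.0},
]

# Sort hotkeys of WIN1 mapped to the metric they select.
SORT_KEYS = {"1": "RMA", "2": "LMA", "3": "RMA/LMA", "4": "CPI", "5": "CPU%"}


def sort_rows(rows, hotkey="5"):
    """Return rows ordered from highest to lowest value of the chosen metric."""
    metric = SORT_KEYS[hotkey]

    def value(row):
        if metric == "RMA/LMA":
            # Assumption: treat a zero LMA count as ratio 0.0.
            return row["RMA"] / row["LMA"] if row["LMA"] else 0.0
        return row[metric]

    return sorted(rows, key=value, reverse=True)


print([row["PID"] for row in sort_rows(processes)])       # default: CPU%
print([row["PID"] for row in sort_rows(processes, "3")])  # RMA/LMA
```

Sorting by RMA/LMA moves the process with the poorest locality to the top, even though it is not the
busiest one by CPU%.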

[WIN2 - Monitoring processes and threads (normalized)]:
Get the normalized locality characterization of all processes.

[KEY METRICS]:
RPI(K): RMA normalized by 1000 instructions.
RPI(K) = RMA / (IR / 1000);
LPI(K): LMA normalized by 1000 instructions.
LPI(K) = LMA / (IR / 1000);
Other metrics remain the same.
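
A short Python sketch of the normalization, showing why it is useful when comparing processes that
retire very different numbers of instructions. The helper name and both sets of counts are
hypothetical, and returning 0.0 when no instructions were retired is an assumption.

```python
def per_kilo_instructions(count, instructions):
    """Normalize an event count by 1000 retired instructions."""
    if instructions == 0:
        # Assumption: report 0.0 when no instructions were retired.
        return 0.0
    return count / (instructions / 1000)


# Hypothetical raw counts for two processes over the same interval.
busy = {"RMA": 400_000, "LMA": 1_600_000, "IR": 8_000_000_000}
idle = {"RMA": 40_000, "LMA": 10_000, "IR": 50_000_000}

for name, p in (("busy", busy), ("idle", idle)):
    rpi = per_kilo_instructions(p["RMA"], p["IR"])
    lpi = per_kilo_instructions(p["LMA"], p["IR"])
    print(f"{name}: RPI(K)={rpi:.3f} LPI(K)={lpi:.3f}")
```

Here the "idle" process issues far fewer remote accesses in absolute terms, yet its RPI(K) is higher,
which the raw RMA count in WIN1 would not reveal.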

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.
1: Sort by RPI.
2: Sort by LPI.
3: Sort by RMA/LMA.
4: Sort by CPI.
5: Sort by CPU%.

[WIN3 - Monitoring the process]:
Get the locality characterization with node affinity of a specified process.

[KEY METRICS]:
NODE: the node ID.
CPU%: per-node CPU utilization.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.
L: Show the latency information.
C: Show the call-chain.

[WIN4 - Monitoring all threads]:
Get the locality characterization of all threads in a specified process.

[KEY METRICS]:
CPU%: per-CPU CPU utilization.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.

[WIN5 - Monitoring the thread]:
Get the locality characterization with node affinity of a specified thread.

[KEY METRICS]:
CPU%: per-CPU CPU utilization.
Other metrics remain the same.

[WIN6 - Monitoring memory areas]:
Get the memory area usage, with the associated access latency, of a specified process/thread.

[KEY METRICS]:
ADDR: starting address of the memory area.
SIZE: size of memory area (K/M/G bytes).
ACCESS%: percentage of memory accesses that are to this memory area.
LAT(ns): the average latency (nanoseconds) of memory accesses.
DESC: description of memory area (from /proc/<pid>/maps).
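
As a rough illustration of where ADDR, SIZE, and DESC can come from, the following Python sketch
parses one line in the /proc/<pid>/maps format. The sample line is made up, the helper names are
hypothetical, and the 1024-based K/M/G formatting is an assumption rather than numatop's exact output.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into start address, size, and description."""
    fields = line.split(maxsplit=5)
    start_hex, end_hex = fields[0].split("-")
    start, end = int(start_hex, 16), int(end_hex, 16)
    # The sixth field (pathname) is absent for anonymous mappings.
    desc = fields[5].strip() if len(fields) > 5 else ""
    return {"ADDR": start, "SIZE": end - start, "DESC": desc}


def human_size(num_bytes):
    """Format a byte count with a K/M/G suffix (assumed 1024-based units)."""
    for unit, scale in (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10)):
        if num_bytes >= scale:
            return f"{num_bytes / scale:.1f}{unit}"
    return f"{num_bytes}B"


# Sample line in the kernel's maps format; the values are made up.
sample = "55d0c8a00000-55d0c8c00000 rw-p 00000000 00:00 0    [heap]"
area = parse_maps_line(sample)
print(hex(area["ADDR"]), human_size(area["SIZE"]), area["DESC"])
```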

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
A: Show the memory access node distribution.
C: Show the call-chain when process/thread accesses the memory area.

[WIN7 - Memory access node distribution overview]:
Get the percentage of memory accesses originating from the process/thread to each node.

[KEY METRICS]:
NODE: the node ID.
ACCESS%: percentage of memory accesses that are to this node.
LAT(ns): the average latency (nanoseconds) of memory accesses to this node.
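
A minimal Python sketch of how per-node ACCESS% and LAT(ns) could be derived from latency samples.
The 128-cycle threshold comes from this manual page; the sample data, the function name, and the
cycles-to-nanoseconds conversion using an assumed fixed CPU frequency are illustrative only and do
not describe numatop's internal implementation.

```python
from collections import defaultdict

LATENCY_THRESHOLD_CYCLES = 128  # threshold stated in this manual page


def node_distribution(samples, cpu_ghz):
    """Summarize sampled memory accesses per node.

    samples: iterable of (node_id, latency_in_cycles) pairs.
    cpu_ghz: assumed CPU frequency used to convert cycles to nanoseconds.
    """
    counts = defaultdict(int)
    cycles = defaultdict(int)
    for node, latency in samples:
        # Only accesses slower than the threshold are recorded.
        if latency > LATENCY_THRESHOLD_CYCLES:
            counts[node] += 1
            cycles[node] += latency
    total = sum(counts.values())
    return {
        node: {
            "ACCESS%": 100.0 * counts[node] / total,
            "LAT(ns)": cycles[node] / counts[node] / cpu_ghz,
        }
        for node in sorted(counts)
    }


# Hypothetical samples: (node, latency in CPU cycles).
samples = [(0, 200), (0, 300), (0, 100), (1, 600), (1, 400), (0, 250)]
for node, row in node_distribution(samples, cpu_ghz=2.0).items():
    print(f"node {node}: ACCESS%={row['ACCESS%']:.1f} LAT(ns)={row['LAT(ns)']:.1f}")
```

The 100-cycle sample falls below the threshold and is dropped, so the percentages are computed over
the five remaining samples.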

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN8 - Break down the memory area into physical memory on node]:
Break down the memory area into the physical mapping on node with the associated accessing latency of a
process/thread.

[KEY METRICS]:
NODE: the node ID.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN9 - Call-chain when process/thread generates the event ("RMA"/"LMA"/"CYCLE"/"IR")]:
Determine the call-chains to the code that generates "RMA"/"LMA"/"CYCLE"/"IR".

[KEY METRICS]:
Call-chain list: a list of call-chains.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to the previous window.
R: Refresh to show the latest data.
1: Locate call-chain when process/thread generates "RMA".
2: Locate call-chain when process/thread generates "LMA".
3: Locate call-chain when process/thread generates "CYCLE" (CPU cycle).
4: Locate call-chain when process/thread generates "IR" (Instruction Retired).

[WIN10 - Call-chain when process/thread accesses the memory area]:
Determine the call-chains to the code that references this memory area. The latency must be greater than
the predefined latency threshold (128 CPU cycles).

[KEY METRICS]:
Call-chain list: a list of call-chains.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN11 - Node Overview]:
Show the basic per-node statistics for this system.

[KEY METRICS]:
MEM.ALL: total usable RAM (physical RAM minus a few reserved bits and the kernel binary code).
MEM.FREE: sum of LowFree + HighFree (overall stat).
CPU%: per-node CPU utilization.
Other metrics remain the same.

[WIN12 - Information of Node N]:
Show the memory use and CPU utilization for the selected node.

[KEY METRICS]:
CPU: array of logical CPUs which belong to this node.
CPU%: per-node CPU utilization.
MEM active: the amount of memory that has been used more recently and is not usually reclaimed unless
absolutely necessary.
MEM inactive: the amount of memory that has not been used for a while and is eligible to be swapped to
disk.
Dirty: the amount of memory waiting to be written back to the disk.
Writeback: the amount of memory actively being written back to the disk.
Mapped: all pages mapped into a process.
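
On Linux, per-node values of this kind are exposed in /sys/devices/system/node/node<N>/meminfo. The
Python sketch below parses text in that layout. The sample content is made up, the helper name is
hypothetical, and whether numatop itself reads this file is an assumption.

```python
def parse_node_meminfo(text):
    """Parse per-node meminfo text into a {field: value} dictionary.

    Expected line shape: "Node 0 Active:   123456 kB".
    """
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # parts: ["Node", "<id>", "<Field>:", "<value>", optional "kB"]
        if len(parts) >= 4 and parts[0] == "Node" and parts[2].endswith(":"):
            values[parts[2].rstrip(":")] = int(parts[3])
    return values


# Made-up sample in the sysfs per-node meminfo layout.
sample = """\
Node 0 MemTotal:       32768000 kB
Node 0 MemFree:        12000000 kB
Node 0 Active:          8000000 kB
Node 0 Inactive:        6000000 kB
Node 0 Dirty:               512 kB
Node 0 Writeback:             0 kB
Node 0 Mapped:           400000 kB
"""

info = parse_node_meminfo(sample)
for field in ("Active", "Inactive", "Dirty", "Writeback", "Mapped"):
    print(f"{field}: {info[field]} kB")
```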

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

Examples

       Example 1: Launch numatop with high sampling precision
       numatop -s high

       Example 2: Write all warning messages to /tmp/numatop.log
       numatop -l 2 -f /tmp/numatop.log

       Example 3: Dump screen data in /tmp/dump.log
       numatop -d /tmp/dump.log

Exit Status

       0: successful operation.
       Other value: an error occurred.

Name

       numatop - a tool for memory access locality characterization and analysis.

Options

       The following options are supported by numatop:

       -s sampling_precision
       normal: balance precision and overhead (default)
       high: high sampling precision (high overhead)
       low: low sampling precision, suitable for high load system

       -l log_level
       Specifies the level of logging in the log file. Valid values are:
       1: unknown (reserved for future use)
       2: all

       -f log_file
       Specifies the log file where output will be written. If the log file is not writable, the tool will
       prompt "Cannot open '<file name>' for writting.".

       -d dump_file
       Specifies the dump file where the screen data will be written. Generally the dump file is used for
       automated tests. If the dump file is not writable, the tool will prompt "Cannot open <file name> for
       dump writing."

       -h Displays the command's usage.

       -t duration
       Specifies run time duration in seconds.
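
       The options above can be assembled programmatically, for example when numatop is driven from an
       automated test. The following Python sketch is a hypothetical helper, not part of numatop. It only
       encodes the flags and values listed in this section; rejecting a non-positive duration is an
       assumption.

```python
def build_numatop_argv(precision=None, log_level=None, log_file=None,
                       dump_file=None, duration=None):
    """Build a numatop argument list from the options documented above."""
    argv = ["numatop"]
    if precision is not None:
        if precision not in ("normal", "high", "low"):
            raise ValueError(f"unknown sampling precision: {precision!r}")
        argv += ["-s", precision]
    if log_level is not None:
        if log_level not in (1, 2):
            raise ValueError(f"unknown log level: {log_level!r}")
        argv += ["-l", str(log_level)]
    if log_file is not None:
        argv += ["-f", log_file]
    if dump_file is not None:
        argv += ["-d", dump_file]
    if duration is not None:
        # Assumption: only positive durations are meaningful.
        if duration <= 0:
            raise ValueError("duration must be a positive number of seconds")
        argv += ["-t", str(duration)]
    return argv


argv = build_numatop_argv(precision="low", dump_file="/tmp/dump.log", duration=60)
print(" ".join(argv))
```

       The resulting list can be passed to `subprocess.run`, whose `returncode` can then be compared with
       the values listed under Exit Status.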

Synopsis

       numatop [-s sampling_precision] [-l log_level] [-f log_file] [-d dump_file] [-t duration]

       numatop [-h]

Usage

       You must have root privileges to run numatop, or set /proc/sys/kernel/perf_event_paranoid to -1.

       Note: The perf_event_paranoid setting has security implications, and a non-root user probably doesn't
       have authority to access /proc. It is highly recommended that the user runs numatop as root.
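
       The requirement above can be checked before launching the tool. The following Python sketch is a
       hypothetical pre-flight check, not part of numatop. It simply encodes the rule stated in this
       section (root, or perf_event_paranoid set to -1).

```python
import os
from pathlib import Path

PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_paranoid(path=PARANOID_PATH):
    """Return the perf_event_paranoid value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def may_run_numatop(euid, paranoid):
    """Apply the rule from this section: root, or perf_event_paranoid set to -1."""
    return euid == 0 or paranoid == -1


if __name__ == "__main__":
    level = read_paranoid()
    print("perf_event_paranoid:", level)
    print("requirements met:", may_run_numatop(os.geteuid(), level))
```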

Version

       numatop requires a patch set to support PEBS Load Latency functionality in the kernel. The patch set has
       not been integrated in 3.8. Probably it will be integrated in 3.9. The following steps show how to get
       and apply the patch set.

       1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
       2. cd tip
       3. git checkout perf/x86
       4. build kernel as usual

       numatop supports the Intel Xeon processors: 5500-series, 6500/7500-series, 5600-series, E7-x8xx-series,
       and E5-16xx/24xx/26xx/46xx-series. Note: CPU microcode version 0x618 or 0x70c or later is required on
       E5-16xx/24xx/26xx/46xx-series. It also supports IBM Power8, Power9, Power10 and Power11 processors.

                                                 August 1, 2024                                       NUMATOP(8)
