Create the sqlhist executable:
man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c
gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs`
As described above, for testing purposes, make a copy of the event directory:
$ mkdir /tmp/tracing
$ sudo cp -r /sys/kernel/tracing/events /tmp/tracing/
$ sudo chmod -R 0644 /tmp/tracing/
For an example of simple histogram output using the copy of the tracefs directory.
$ ./sqlhist -t /tmp/tracing/ 'SELECT CAST(call_site as SYM-OFFSET), bytes_req, CAST(bytes_alloc AS _COUNTER_) FROM kmalloc'
Produces the output:
echo 'hist:keys=call_site.sym-offset,bytes_req:vals=bytes_alloc' > /sys/kernel/tracing/events/kmem/kmalloc/trigger
Which could be used by root:
# echo 'hist:keys=call_site.sym-offset,bytes_req:vals=bytes_alloc' > /sys/kernel/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# event histogram
#
# trigger info: hist:keys=call_site.sym-offset,bytes_req:vals=hitcount,bytes_alloc:sort=hitcount:size=2048 [active]
#
{ call_site: [ffffffff813f8d8a] load_elf_phdrs+0x4a/0xb0 , bytes_req: 728 } hitcount: 1 bytes_alloc: 1024
{ call_site: [ffffffffc0c69e74] nf_ct_ext_add+0xd4/0x1d0 [nf_conntrack] , bytes_req: 128 } hitcount: 1 bytes_alloc: 128
{ call_site: [ffffffff818355e6] dma_resv_get_fences+0xf6/0x440 , bytes_req: 8 } hitcount: 1 bytes_alloc: 8
{ call_site: [ffffffffc06dc73f] intel_gt_get_buffer_pool+0x15f/0x290 [i915] , bytes_req: 424 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffff813f8d8a] load_elf_phdrs+0x4a/0xb0 , bytes_req: 616 } hitcount: 1 bytes_alloc: 1024
{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 32 } hitcount: 1 bytes_alloc: 32
{ call_site: [ffffffffc070749d] shmem_get_pages+0xad/0x5d0 [i915] , bytes_req: 16 } hitcount: 1 bytes_alloc: 16
{ call_site: [ffffffffc07507f5] intel_framebuffer_create+0x25/0x60 [i915] , bytes_req: 408 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffffc06fc20f] eb_parse+0x34f/0x910 [i915] , bytes_req: 408 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffffc0700ebd] i915_gem_object_get_pages_internal+0x5d/0x270 [i915] , bytes_req: 16 } hitcount: 1 bytes_alloc: 16
{ call_site: [ffffffffc0771188] intel_frontbuffer_get+0x38/0x220 [i915] , bytes_req: 400 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 128 } hitcount: 1 bytes_alloc: 128
{ call_site: [ffffffff813f8f45] load_elf_binary+0x155/0x1680 , bytes_req: 28 } hitcount: 1 bytes_alloc: 32
{ call_site: [ffffffffc07038c8] __assign_mmap_offset+0x208/0x3d0 [i915] , bytes_req: 288 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffff813737b2] alloc_bprm+0x32/0x2f0 , bytes_req: 416 } hitcount: 1 bytes_alloc: 512
{ call_site: [ffffffff813f9027] load_elf_binary+0x237/0x1680 , bytes_req: 64 } hitcount: 1 bytes_alloc: 64
{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 64 } hitcount: 1 bytes_alloc: 64
{ call_site: [ffffffffc040ffe7] drm_vma_node_allow+0x27/0xe0 [drm] , bytes_req: 40 } hitcount: 2 bytes_alloc: 128
{ call_site: [ffffffff813cda98] __do_sys_timerfd_create+0x58/0x1c0 , bytes_req: 336 } hitcount: 2 bytes_alloc: 1024
{ call_site: [ffffffff818355e6] dma_resv_get_fences+0xf6/0x440 , bytes_req: 40 } hitcount: 2 bytes_alloc: 128
{ call_site: [ffffffff8139b75a] single_open+0x2a/0xa0 , bytes_req: 32 } hitcount: 2 bytes_alloc: 64
{ call_site: [ffffffff815df715] bio_kmalloc+0x25/0x80 , bytes_req: 136 } hitcount: 2 bytes_alloc: 384
{ call_site: [ffffffffc071e5cd] i915_vma_work+0x1d/0x50 [i915] , bytes_req: 416 } hitcount: 3 bytes_alloc: 1536
{ call_site: [ffffffff81390d0d] alloc_fdtable+0x4d/0x100 , bytes_req: 56 } hitcount: 3 bytes_alloc: 192
{ call_site: [ffffffffc06ff65f] i915_gem_do_execbuffer+0x158f/0x2440 [i915] , bytes_req: 16 } hitcount: 4 bytes_alloc: 64
{ call_site: [ffffffff8137713c] alloc_pipe_info+0x5c/0x230 , bytes_req: 384 } hitcount: 5 bytes_alloc: 2560
{ call_site: [ffffffff813771b4] alloc_pipe_info+0xd4/0x230 , bytes_req: 640 } hitcount: 5 bytes_alloc: 5120
{ call_site: [ffffffff81834cdb] dma_resv_list_alloc+0x1b/0x40 , bytes_req: 40 } hitcount: 6 bytes_alloc: 384
{ call_site: [ffffffff81834cdb] dma_resv_list_alloc+0x1b/0x40 , bytes_req: 56 } hitcount: 9 bytes_alloc: 576
{ call_site: [ffffffff8120086e] tracing_map_sort_entries+0x9e/0x3e0 , bytes_req: 24 } hitcount: 60 bytes_alloc: 1920
Totals:
Hits: 122
Entries: 30
Dropped: 0
Note, although the examples use uppercase for the SQL keywords, they do not have to be. SELECT could also
be select or even sElEcT.
By using the full SQL language, synthetic events can be made and processed. For example, using sqlhist
along with trace-cmd(1), wake up latency can be recorded by creating a synthetic event by attaching the
sched_waking and the sched_switch events.
# sqlhist -n wakeup_lat -e -T -m lat 'SELECT end.next_comm AS comm, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat FROM ' \
'sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_prio < 100 && end.next_comm == "cyclictest"'
# trace-cmd start -e all -e wakeup_lat -R stacktrace
# cyclictest -l 1000 -p80 -i250 -a -t -q -m -d 0 -b 1000 --tracemark
# trace-cmd show -s | tail -30
<idle>-0 [002] dNh4 23454.902246: sched_wakeup: comm=cyclictest pid=12272 prio=120 target_cpu=002
<idle>-0 [005] ...1 23454.902246: cpu_idle: state=4294967295 cpu_id=5
<idle>-0 [007] d..1 23454.902246: cpu_idle: state=0 cpu_id=7
<idle>-0 [002] dNh1 23454.902247: hrtimer_expire_exit: hrtimer=0000000037956dc2
<idle>-0 [005] d..1 23454.902248: cpu_idle: state=0 cpu_id=5
<idle>-0 [002] dNh1 23454.902248: write_msr: 6e0, value 4866ce957272
<idle>-0 [006] ...1 23454.902248: cpu_idle: state=4294967295 cpu_id=6
<idle>-0 [002] dNh1 23454.902249: local_timer_exit: vector=236
<idle>-0 [006] d..1 23454.902250: cpu_idle: state=0 cpu_id=6
<idle>-0 [002] .N.1 23454.902250: cpu_idle: state=4294967295 cpu_id=2
<idle>-0 [002] dN.1 23454.902251: rcu_utilization: Start context switch
<idle>-0 [002] dN.1 23454.902252: rcu_utilization: End context switch
<idle>-0 [001] ...1 23454.902252: cpu_idle: state=4294967295 cpu_id=1
<idle>-0 [002] dN.3 23454.902253: prandom_u32: ret=3692516021
<idle>-0 [001] d..1 23454.902254: cpu_idle: state=0 cpu_id=1
<idle>-0 [002] d..2 23454.902254: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cyclictest next_pid=12275 next_prio=19
<idle>-0 [002] d..4 23454.902256: wakeup_lat: next_comm=cyclictest lat=17
<idle>-0 [002] d..5 23454.902258: <stack trace>
=> trace_event_raw_event_synth
=> action_trace
=> event_hist_trigger
=> event_triggers_call
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __traceiter_sched_switch
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> secondary_startup_64_no_verify
Here’s the options for sqlhist explained:
-nwakeup_lat
Name the synthetic event to use wakeup_lat.
-e
Execute the commands that are printed.
-T
Perform both a trace action and then a snapshot action (swap the buffer into the static snapshot
buffer).
-mlat
Trigger the actions whenever lat hits a new maximum value.
Now a breakdown of the SQL statement:
'SELECT end.next_comm AS comm, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat FROM ' \
'sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_prio < 100 && end.next_comm == "cyclictest"'
end.next_commAScomm
Save the sched_switch field next_comm and place it into the comm field of the wakeup_lat synthetic
event.
(end.TIMESTAMP_USECS-start.TIMESTAMP_USECS)ASlat
Take the delta of the time stamps from the sched_switch event and the sched_waking event. As time
stamps are usually recorded in nanoseconds, TIMESTAMP would give the full nanosecond time stamp, but
here, the TIMESTAMP_USECS will truncate it into microseconds. The value is saved in the variable lat,
which will also be recorded in the synthetic event.
FROMsched_wakingASstartJOINsched_switchASendONstart.pid=end.next_pid
Create the synthetic event by joining sched_waking to sched_switch, matching the sched_wakingpid
field with the sched_switchnext_pid field. Also make start an alias for sched_waking and end an
alias for sched_switch which then an use start and end as a subsitute for sched_waking and
sched_switch respectively through out the rest of the SQL statement.
WHEREend.next_prio<100&&end.next_comm=="cyclictest"
Filter the logic where it executes only if sched_wakingnext_prio field is less than 100. (Note, in
the Kernel, priorities are inverse, and the real-time priorities are represented from 0-100 where 0
is the highest priority). Also only trace when the next_comm (the task scheduling in) of the
sched_switch event has the name "cyclictest".
For the trace-cmd(3) command:
trace-cmd start -e all -e wakeup_lat -R stacktrace
trace-cmdstart
Enables tracing (does not record to a file).
-eall
Enable all events
-ewakeup_lat-Rstacktrace
have the "wakeup_lat" event (our synthetic event) enable the stacktrace trigger, were for every
instance of the "wakeup_lat" event, a kernel stack trace will be recorded in the ring buffer.
After calling cyclictest (a real-time tool to measure wakeup latency), read the snapshot buffer.
trace-cmdshow-strace-cmdshow reads the kernel ring buffer, and the -s option will read the snapshot buffer instead
of the normal one.
<idle>-0 [002] d..4 23454.902256: wakeup_lat: next_comm=cyclictest lat=17
We see on the "wakeup_lat" event happened on CPU 2, with a wake up latency 17 microseconds.
This can be extracted into a trace.dat file that trace-cmd(3) can read and do further analysis, as well
as kernelshark.
# trace-cmd extract -s
# trace-cmd report --cpu 2 | tail -30
<idle>-0 [002] 23454.902238: prandom_u32: ret=1633425088
<idle>-0 [002] 23454.902239: sched_wakeup: cyclictest:12275 [19] CPU:002
<idle>-0 [002] 23454.902241: hrtimer_expire_exit: hrtimer=0xffffbbd68286fe60
<idle>-0 [002] 23454.902241: hrtimer_cancel: hrtimer=0xffffbbd6826efe70
<idle>-0 [002] 23454.902242: hrtimer_expire_entry: hrtimer=0xffffbbd6826efe70 now=23455294430750 function=hrtimer_wakeup/0x0
<idle>-0 [002] 23454.902243: sched_waking: comm=cyclictest pid=12272 prio=120 target_cpu=002
<idle>-0 [002] 23454.902244: prandom_u32: ret=1102749734
<idle>-0 [002] 23454.902246: sched_wakeup: cyclictest:12272 [120] CPU:002
<idle>-0 [002] 23454.902247: hrtimer_expire_exit: hrtimer=0xffffbbd6826efe70
<idle>-0 [002] 23454.902248: write_msr: 6e0, value 4866ce957272
<idle>-0 [002] 23454.902249: local_timer_exit: vector=236
<idle>-0 [002] 23454.902250: cpu_idle: state=4294967295 cpu_id=2
<idle>-0 [002] 23454.902251: rcu_utilization: Start context switch
<idle>-0 [002] 23454.902252: rcu_utilization: End context switch
<idle>-0 [002] 23454.902253: prandom_u32: ret=3692516021
<idle>-0 [002] 23454.902254: sched_switch: swapper/2:0 [120] R ==> cyclictest:12275 [19]
<idle>-0 [002] 23454.902256: wakeup_lat: next_comm=cyclictest lat=17
<idle>-0 [002] 23454.902258: kernel_stack: <stack trace >
=> trace_event_raw_event_synth (ffffffff8121a0db)
=> action_trace (ffffffff8121e9fb)
=> event_hist_trigger (ffffffff8121ca8d)
=> event_triggers_call (ffffffff81216c72)
=> trace_event_buffer_commit (ffffffff811f7618)
=> trace_event_raw_event_sched_switch (ffffffff8110fda4)
=> __traceiter_sched_switch (ffffffff8110d449)
=> __schedule (ffffffff81c02002)
=> schedule_idle (ffffffff81c02c86)
=> do_idle (ffffffff8111e898)
=> cpu_startup_entry (ffffffff8111eba9)
=> secondary_startup_64_no_verify (ffffffff81000107)