ply - dynamically instrument the kernel

Authors

       Tobias Waldekranz tobias@waldekranz.com

Description

       ply dynamically instruments the running kernel to aggregate and extract user-defined data. It compiles an
       input  program  to  one or more Linux bpf(2) binaries and attaches them to arbitrary points in the kernel
       using kprobes and tracepoints.

Example

   Extracting data
       Print all files opened on the system, and who opened them:

           kprobe:SyS_openat
           {
               print(comm, pid, str(arg1));
           }

   Quantize
       Record the distribution of the return value of read(2):

           kretprobe:SyS_read
           {
               @["dist"] = quantize(retval);
           }

   Wildcards
       Count all syscalls made on the system, grouped by function:

           kprobe:SyS_*
           {
               @[caller] = count();
           }

       Count all syscalls made by every dd(1) process, grouped by function:

           kprobe:SyS_* / !strcmp(execname, "dd") /
           {
               @[caller] = count();
           }

   Object Tracking
       Record the distribution of the time it takes an skb to go from netif_receive to ip_rcv:

           kprobe:__netif_receive_skb_core
           {
               rx[arg0] = time;
           }

           kprobe:ip_rcv / rx[arg0] /
           {
               @["diff"] = quantize(time - rx[arg0]);
           }

Name

ply - dynamically instrument the kernel

Options

       -c command, --command=command
              When all probes are running, run command. When the command exits, stop all probes  and  exit.  The
              command is run as if invoked with sh -c <command>.

       -d, --debug
              Enable debugging output.

       -e, --dry-run
              Exit  after  compilation, without actually instrumenting the system. Typically used in conjunction
              with --dump.

       -h, --help
              Print usage message.

       -k, --keep-going
              Instead of printing a warning and exiting whenever any trace data is lost, only print the  warning
              and keep going.

       -S, --dump
              After  compilation,  dump  the  internal  AST,  generated  BPF  instructions  and  other  internal
              information. This is very useful to include when reporting a bug.

       -T, --self-test
              Run the built-in self-test which verifies that the required features in the kernel  are  available
              and that all providers are operational.

       -u, --unbuffer
              Unconditionally  disable  buffering  of stdout/stderr - even when they are not connected to a TTY.
              This is useful in interactive sessions where the output is further processed by another program.

       -v, --version
              Print version information.

Providers

       A provider makes data available to the user by exporting functions and variables to the  probe.  Function
       calls  use  the  same  syntax as most languages that inherit from C. In addition to the provider-specific
       functions, all providers inherit a set of common functions and variables:

       •   char[16] comm, char[16] execname: Name of the running process's executable.

       •   u32 cpu: CPU ID of the processor on which the probe fired.

       •   u32 gid: Group ID of the running process.

       •   u32 kpid: Kernel PID of the running process. Also known as pid by the kernel. For a single-threaded
           process kpid is equal to pid. For multi-threaded processes, kpid will be unique while pid will be the
           same across all threads.

       •   char[N] mem(void *address[, int size]): Copy size bytes from address. If size is omitted, 64 bytes
           will be copied.

       •   s64 time, s64 walltime: Nanoseconds elapsed since system boot. time is intended for time deltas and
           walltime  should  be  used  for  timestamps.  They refer to the same data, but with different default
           output formats.

       •   u32 pid: Process ID of the running process. Also known as thread group ID (tgid) by the kernel.

       •   void print(...): Print each expression with its default output  format,  separated  by  commas  and
           terminated with a newline, to ply's standard out.

       •   void printf(format, ...): Prints formatted output to ply's standard out. In addition to the formats
           recognized by the printf sitting in your

       •   int strcmp(char *a, char *b): Returns -1, 0 or 1 if the first argument is less than,  equal  to  or
           greater than the second argument respectively. Strings are compared by their lexicographical order.

       •   u32 uid: User ID of the running process.
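
       As a rough illustration of the difference between pid and kpid, the following sketch counts openat
       calls per thread. It reuses the kprobe:SyS_openat probe from the examples; the choice of probe and
       of map keys is only an assumption made for illustration:

           kprobe:SyS_openat
           {
               @[comm, pid, kpid] = count();
           }

       For a multi-threaded process, this should yield one entry per thread, all sharing the same pid.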

   kprobe and kretprobe
       These providers use the corresponding kernel features to instrument arbitrary instructions in the kernel.
       The  probe-definition may be either an address or a symbol name. When using a symbol name, glob expansion
       is performed allowing a single probe to be inserted at multiple locations. An offset relative to a symbol
       may also be specified for kprobes.

       Examples:

       •   kretprobe:schedule: Trace every time schedule returns.

       •   kprobe:SyS_*: Trace every time a syscall is made.

       •   kprobe:dev_hard_start_xmit+8: Trace function with offset.

       Shared variables:

       struct pt_regs *regs
              Hardware register contents from when the probe was  triggered.  This  matches  the  definition  in
              <sys/ptrace.h> on your system.

       u32 stack
              Stack trace ID of the current probe. This just returns an index into a separate map containing
              the actual instruction pointers. As a user though, you can think of this variable as holding  a
              string  containing  the  stack  trace  at  the  current location. Indeed print(stack) will produce
              exactly that.

              CAUTION: On some architectures (looking at you, ARM), capturing stack traces at  the  entry  of  a
              function,  before  the prologue has run, does not work. Setting your probe after the prologue will
              work around the issue (typically two instructions, or +8, on ARM).
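
       A minimal sketch of the workaround, assuming that the prologue of the probed function is two
       instructions long so that the +8 offset from the example above lands right after it:

           kprobe:dev_hard_start_xmit+8
           {
               @[stack] = count();
           }

       This would count how often each distinct stack trace leads to the probed location.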

       kprobe specific functions:

       arg0, arg1 ... argN:

       void *caller
              The program counter, as recorded in regs, at the time the probe was triggered. The default
              output format will resolve it to a symbolic name if one is available.

       kretprobe specific function:

       retval Return value of the probed function.

   tracepoint
       The tracepoint provider can instrument all stable tracepoints in the kernel. They are identified by their
       relative  path from the /sys/kernel/debug/tracing/events directory, where each leaf directory corresponds
       to a tracepoint.

       Examples:

       •   tracepoint:sched/sched_wakeup: Trace every time a process is awoken.

       •   tracepoint:irq/irq_handler_entry: Trace every time an interrupt is handled.

       Variables:

       struct <X> *data
              A struct is dynamically generated for each tracepoint by parsing its format file. For example, if
              the contents of the format file look like the following:

               name: tcp_send_reset
               ID: 1304
               format:
                    field:unsigned short common_type;  offset:0; size:2;   signed:0;
                    field:unsigned char common_flags;  offset:2; size:1;   signed:0;
                    field:unsigned char common_preempt_count;    offset:3; size:1;   signed:0;
                    field:int common_pid;    offset:4; size:4;   signed:1;

                    field:const void * skbaddr;   offset:8; size:8;   signed:0;
                    field:const void * skaddr;    offset:16;     size:8;   signed:0;
                    field:__u16 sport;  offset:24;     size:2;   signed:0;
                    field:__u16 dport;  offset:26;     size:2;   signed:0;
                    field:__u8 saddr[4];     offset:28;     size:4;   signed:0;
                    field:__u8 daddr[4];     offset:32;     size:4;   signed:0;
                    field:__u8 saddr_v6[16]; offset:36;     size:16;  signed:0;
                    field:__u8 daddr_v6[16]; offset:52;     size:16;  signed:0;

       Then data would point to a struct of the following type:

               struct data {
                   unsigned short common_type;
                   unsigned char common_flags;
                   unsigned char common_preempt_count;
                   int common_pid;

                   const void * skbaddr;
                   const void * skaddr;
                   __u16 sport;
                   __u16 dport;
                   __u8 saddr[4];
                   __u8 daddr[4];
                   __u8 saddr_v6[16];
                   __u8 daddr_v6[16];
               };
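
       Members could then be read through data much as in C. The sketch below counts resets per port
       pair; it assumes that the tracepoint lives in the tcp events directory and that the C-style ->
       operator is accepted, neither of which is stated above:

           tracepoint:tcp/tcp_send_reset
           {
               @[data->sport, data->dport] = count();
           }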

       Functions:

       char[N] dyn(void *address[, int size])
              Copy size bytes from a dynamic data pointer in data, i.e. a member marked with __data_loc. If size
              is omitted, the default string size determines the number of bytes to be copied.

   BEGIN and END
       These special providers fire at the beginning and at the end of the tracing session,  as  in  awk  and
       bpftrace. The names are case sensitive. They can be used to print messages or to fill maps with known
       information.
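
       A small sketch, assuming that the provider name alone forms the probe specification, as with awk's
       BEGIN and END patterns:

           BEGIN
           {
               printf("tracing started\n");
           }

           END
           {
               printf("tracing stopped\n");
           }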

   interval
       The interval provider is triggered at each given interval. Users specify a time and, optionally, a unit.
       If the unit is omitted, seconds are used. The supported units are:

       •   m: minutes

       •   s: seconds (default)

       •   ms: milliseconds

       •   us: microseconds

       •   ns: nanoseconds

       Examples:

       •   interval:1: Called once every second

       •   interval:500ms: Called once every 500 milliseconds
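
       For instance, a probe along these lines would print the wall clock time twice per second. The body
       is an arbitrary choice for illustration:

           interval:500ms
           {
               print(walltime);
           }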

   profile
       The profile provider supports profiling by allowing the user to specify how many times per second it
       will fire. Values from 1 to 1000 are supported, and the provider accepts two probe formats:

       •   profile:[N]hz: Profile on all CPUs N times per second

       •   profile:[C]:[N]hz: Profile on CPU C N times per second
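
       As a sketch, the following samples all CPUs 100 times per second and counts which process was
       running at each sample; the rate and the map keys are arbitrary choices for illustration:

           profile:100hz
           {
               @[comm, cpu] = count();
           }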

Return Value

       0      Program was successfully compiled and loaded into the kernel.

       Non-Zero
              An error occurred during compilation or during kernel setup.

See Also

       awk(1), dtrace(1), bpf(2)

                                                  February 2025                                           PLY(1)

Synopsis

       ply program-file
       ply program-text

Syntax

       The syntax is C-like in general, taking its inspiration from dtrace(1) and, by extension, from awk(1).

   Probes
       A program consists of one or more probes, which are analogous to  awk's  pattern-action  statements.  The
       syntax for a probe is as follows:

           provider:probe-definition ['/' predicate '/']
           {
                statement ';'
               [statement ';' ... ]
           }

       The  provider  selects which probe interface to use. See the PROVIDERS section for more information about
       each provider. It is then up to the provider to parse the probe-definition to determine the  point(s)  of
       instrumentation.

       When  tracing, it is often desirable to filter events to match some criteria. Because of this, ply allows
       you to provide a predicate, i.e. an expression that must evaluate to a non-zero value in  order  for  the
       probe to be executed.

       Then follows a block of statements that perform the actual information gathering.

       A provider may define a default probe clause to be used if the user does not supply one.
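
       Putting the pieces together, a sketch of a complete probe with a predicate might look as follows.
       It only fires for processes running as root (uid 0); the filter is chosen purely for illustration:

           kprobe:SyS_openat / uid == 0 /
           {
               print(comm, pid);
           }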

   Control of Flow
       Probes support basic conditional control of flow via an if-statement, which conforms to the same rules as
       C's equivalent:

           'if' '(' expr ')'
               statement ';' | block
           [else
               statement ';' | block]
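
       As an illustration, the following sketch separates reads that return zero from all other reads.
       The map keys are arbitrary labels:

           kretprobe:SyS_read
           {
               if (retval == 0)
                   @["zero"] = count();
               else
                   @["other"] = count();
           }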

       In  order  to  ensure  that  a  probe  will  have  a  finite run-time the kernel does not allow backwards
       branching. As a result, ply does not have any loop construct like for or while. A  simple  for  statement
       with  an  invariant  that is known at compile-time could be added later. In that case we could unroll the
       loop when generating BPF.

   Type System
       The type system is modeled after C. As such ply understands the difference between  signed  and  unsigned
       integers, the difference between a short and a long long, what separates an integer from a pointer, how a
       struct  is  laid  out  in memory and so on. It is not complete though, notably floating point numbers and
       unions are missing.

       Programs are statically typed, but all types are inferred automatically. Thus, the type system is  mostly
       hidden  from  the user. Plans are to expose more of it in the future by allowing casts, type declarations
       and so on.

       Numbers and string literals are specified in the same way as in C.

   Maps
       The primary way to extract information is to store it in a map, i.e. in a hash table.  Like  awk(1),  ply
       dynamically  creates  any  referenced maps and their key and value types are inferred from the context in
       which they are used. All maps are in the global scope and can thus be used both for  extracting  data  to
       the end-user, and for carrying data between probes. Map names follow the rules of identifiers from C.

           mapname[exprs]

       Data can be stored in a map by assigning a value to a given key:

           mapname[exprs] = expr

       The delete keyword can be used to remove an association from a map:

           delete mapname[exprs]

       You can also remove all elements in the map using the clear function.
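
       The following sketch carries a timestamp from a function's entry to its return in a map keyed by
       kpid, then deletes the entry once it has been used. The map name start is hypothetical:

           kprobe:SyS_read
           {
               start[kpid] = time;
           }

           kretprobe:SyS_read / start[kpid] /
           {
               @["ns"] = quantize(time - start[kpid]);
               delete start[kpid];
           }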

   Aggregations
       More  often  than  not,  looking  at  each  individual  datum from a trace is not nearly as helpful as an
       aggregation of the data. Therefore ply supports aggregating data at the source, thereby reducing  tracing
       overhead.  Aggregations  are  syntactically  similar to maps, indeed they are a kind of map, but they are
       distinguished by a leading '@'. Also, they can only be assigned  the  result  of  one  of  the  following
       aggregation functions:

       @agg[exprs] = count()
              Bump a counter.

       @agg[exprs] = sum(scalar-expr)
              Evaluates the argument and aggregates the result.

       @agg[exprs] = quantize(scalar-expr)
              Evaluates  the  argument and aggregates on the most significant bit of the result. In other words,
              it stores the distribution of the expression.
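
       For example, a quantize argument of 5 has bit 2 as its most significant bit and would therefore be
       counted together with all other values from 4 to 7. A sketch using all three functions, where the
       aggregation names are hypothetical:

           kretprobe:SyS_read
           {
               @calls[comm] = count();
               @bytes[comm] = sum(retval);
               @sizes[comm] = quantize(retval);
           }

       Failed reads return negative values, which this sketch makes no attempt to filter out.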
