1
0
mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-01-18 12:31:11 +00:00

1287 Commits

Author SHA1 Message Date
Leo Yan
0a75ba3e84 perf docs: Document building with Clang
Add example commands for building perf with Clang.

Since recent Android NDK releases use Clang as the default compiler, a
separate Android specific document is no longer needed; point to the
general build documentation instead.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20251006-perf_build_android_ndk-v3-9-4305590795b2@arm.com
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Cc: linux-riscv@lists.infradead.org
Cc: llvm@lists.linux.dev
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-10-06 16:49:25 -03:00
Ian Rogers
e444c2d4a2 perf check: Add libLLVM feature
Advertise when perf is built with the HAVE_LIBLLVM_SUPPORT option.

Committer testing:

  $ perf -vv | grep LLVM
               libLLVM: [ on  ]  # HAVE_LIBLLVM_SUPPORT
  $

And the form to use in scripts, notably the tools/perf/tests/shell/
'perf test' ones:

  $ perf check feature libllvm
               libLLVM: [ on  ]  # HAVE_LIBLLVM_SUPPORT
  $ perf check -q feature libllvm && echo LLVM is present
  LLVM is present
  $ perf check -q feature liballvm && echo ALLVM is present
  $

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Dr. David Alan Gilbert <linux@treblig.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-10-06 15:24:13 -03:00
Thomas Falcon
6b9c0261b3 perf record: Add ratio-to-prev term
Provide ratio-to-prev term which allows the user to
set the event sample period of two events corresponding
to a desired ratio.

If using on an Intel x86 platform with Auto Counter Reload support, also
set corresponding event's config2 attribute with a bitmask which
counters to reset and which counters to sample if the desired ratio is
met or exceeded.

On other platforms, only the sample period is affected by the
ratio-to-prev term.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-10-03 16:49:51 -03:00
Markus Heidelberg
a93b9ccb03 perf tools: Fix duplicated words in documentation and comments
- "the the"
- "in in"
- "a a"

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Markus Heidelberg <m.heidelberg@cab.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-10-01 09:44:02 -03:00
Ankur Arora
a8f0992998 perf bench mem: Add mmap() workloads
Add two mmap() workloads: one that eagerly populates a region and
another that demand faults it in.

The intent is to probe the memory subsytem performance incurred
by mmap().

  $ perf bench mem mmap -s 4gb -p 4kb -l 10 -f populate
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated map())
  # Copying 4gb bytes ...

       1.811691 GB/sec

  $ perf bench mem mmap -s 4gb -p 2mb -l 10 -f populate
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated mmap())
  # Copying 4gb bytes ...

      12.272017 GB/sec

  $ perf bench mem mmap -s 4gb -p 1gb -l 10 -f populate
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated mmap())
  # Copying 4gb bytes ...

      17.085927 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-19 12:43:59 -03:00
Ankur Arora
fd1d882c4c perf bench mem: Allow chunking on a memory region
There can be a significant gap in memset/memcpy performance depending
on the size of the region being operated on.

With chunk-size=4kb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      13.011655 GB/sec

With chunk-size=1gb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      21.936355 GB/sec

So, allow the user to specify the chunk-size.

The default value is identical to the total size of the region, which
preserves current behaviour.

Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-19 12:43:38 -03:00
Ankur Arora
7b6837e63a perf bench mem: Allow mapping of hugepages
Page sizes that can be selected: 4KB, 2MB, 1GB.

Both the reservation and node from which hugepages are allocated
from are expected to be addressed by the user.

An example of page-size selection:

  $ perf bench mem memset -s 4gb -p 2mb
  # Running 'mem/memset' benchmark:
  # function 'default' (Default memset() provided by glibc)
  # Copying 4gb bytes ...

        14.919194 GB/sec
  # function 'x86-64-unrolled' (unrolled memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

        11.514503 GB/sec
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

          12.600568 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-19 12:43:26 -03:00
Thomas Richter
817af72c05 perf tools: Update header documentation on BPF_PROG_INFO
Update the perf.data file format description on header section
HEADER_BPF_PROG_INFO.

The information is taken from process_bpf_prog_info() and
write_bpf_prog_info() from file util/header.c.

Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-19 12:14:30 -03:00
Namhyung Kim
ece3c7754f perf trace: Add --max-summary option
The --max-summary option is to limit the number of output lines for
syscall summary stats.  The max applies to each entries like thread and
cgroups.  For total summary, it will just print up to the given number.

For example,

  $ sudo perf trace -as --max-summary 3 sleep 0.1

   ThreadPoolServi (1011651), 114 events, 14.8%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait            38      0    95.589     0.000     2.515    11.153     28.98%
     futex                  9      0     0.040     0.002     0.004     0.014     28.63%
     read                  10      0     0.037     0.003     0.004     0.005      4.67%

   sleep (1050529), 250 events, 32.4%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     clock_nanosleep        1      0   100.156   100.156   100.156   100.156      0.00%
     execve                 4      3     1.020     0.005     0.255     0.989     95.93%
     openat                36     17     0.416     0.003     0.012     0.029     10.58%

   ...

And this is for per-cgroup summary using BPF.

  $ sudo perf trace -as --max-summary 3 --summary-mode=cgroup --bpf-summary sleep 0.1

   cgroup /user.slice/user-657345.slice/user@657345.service/session.slice/org.gnome.Shell@x11.service, 12 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     recvmsg                8      7     0.016     0.001     0.002     0.006     39.73%
     ppoll                  1      0     0.014     0.014     0.014     0.014      0.00%
     write                  2      0     0.010     0.002     0.005     0.008     61.02%

   cgroup /user.slice/user-657345.slice/session-4.scope, 73 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait             8      0    13.461     0.010     1.683    12.235     89.66%
     ioctl                 20      0     0.204     0.001     0.010     0.113     54.01%
     writev                11      0     0.164     0.004     0.015     0.042     20.34%

Reviewed-by: Howard Chu <howardchu95@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-19 12:14:29 -03:00
Ian Rogers
035c178930 perf parse-events: Add 'X' modifier to exclude an event from being regrouped
The function parse_events__sort_events_and_fix_groups is needed to fix
uncore events like:
```
$ perf stat -e '{data_read,data_write}' ...
```
so that the multiple uncore PMUs have a group each of data_read and
data_write events.

The same function will perform architecture sorting and group fixing,
in particular for Intel topdown/perf-metric events. Grouping multiple
perf metric events together causes perf_event_open to fail as the
group can only support one. This means command lines like:
```
$ perf stat -e 'slots,slots' ...
```
fail as the slots events are forced into a group together to try to
satisfy the perf-metric event constraints.

As the user may know better than
parse_events__sort_events_and_fix_groups add a 'X' modifier to skip
its regrouping behavior. This allows the following to succeed rather
than fail on the second slots event being opened:
```
$ perf stat -e 'slots,slots:X' -a sleep 1

 Performance counter stats for 'system wide':

     6,834,154,071      cpu_core/slots/                                                         (50.13%)
     5,548,629,453      cpu_core/slots/X                                                        (49.87%)

       1.002634606 seconds time elapsed
```

Closes: https://lore.kernel.org/lkml/20250822082233.1850417-1-dapeng1.mi@linux.intel.com/
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reported-by: Xudong Hao <xudong.hao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Falcon <thomas.falcon@intel.com>
Cc: Yoshihiro Furudera <fj5100bi@fujitsu.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-12 15:53:32 -03:00
James Clark
edf93f2a24 perf docs: Update SPE doc to include default instructions group
The instructions group is now generated by default so update the doc to
reflect this. Also explain the period/downsampling mechanism in more
detail.

Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <Ben.Gainey@arm.com>
Cc: George Wort <George.Wort@arm.com>
Cc: Graham Woodward <Graham.Woodward@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Williams <Michael.Williams@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-09 14:51:33 -03:00
Namhyung Kim
7dbe89ca3d perf annotate: Add --code-with-type support for TUI
Until now, the --code-with-type option is available only on stdio.
But it was an artifical limitation because of an implemention issue.

Implement the same logic in annotation_line__write() for stdio2/TUI
and remove the limitation and update the man page.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250816031635.25318-8-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-08-28 12:33:08 -03:00
Ian Rogers
53b00ff358 perf record: Make --buildid-mmap the default
Support for build IDs in mmap2 perf events has been present since
Linux v5.12:
https://lore.kernel.org/lkml/20210219194619.1780437-1-acme@kernel.org/
Build ID mmap events don't avoid the need to inject build IDs for DSO
touched by samples as the build ID cache is populated by perf
record. They can avoid some cases of symbol mis-resolution caused by
the file system changing from when a sample occurred and when the DSO
is sought.

Unlike the --buildid-mmap option, this chnage doesn't disable the
build ID cache but it does disable the processing of samples looking
for DSOs to inject build IDs for. To disable the build ID cache the -B
(--no-buildid) option should be used.

Making this option the default was raised on the list in:
https://lore.kernel.org/linux-perf-users/CAP-5=fXP7jN_QrGUcd55_QH5J-Y-FCaJ6=NaHVtyx0oyNh8_-Q@mail.gmail.com/

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-9-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:56 -07:00
Ian Rogers
bd741d80dc perf parse-events: Allow the cpu term to be a PMU or CPU range
On hybrid systems, events like msr/tsc/ will aggregate counts across
all CPUs. Often metrics only want a value like msr/tsc/ for the cores
on which the metric is being computed. Listing each CPU with terms
cpu=0,cpu=1.. is laborious and would need to be encoded for all
variations of a CPU model.

Allow the cpumask from a PMU to be an argument to the cpu term. For
example in the following the cpumask of the cstate_pkg PMU selects the
CPUs to count msr/tsc/ counter upon:
```
$ cat /sys/bus/event_source/devices/cstate_pkg/cpumask
0
$ perf stat -A -e 'msr/tsc,cpu=cstate_pkg/' -a sleep 0.1

 Performance counter stats for 'system wide':

CPU0          252,621,253      msr/tsc,cpu=cstate_pkg/

       0.101184092 seconds time elapsed
```

As the cpu term is now also allowed to be a string, allow it to encode
a range of CPUs (a list can't be supported as ',' is already a special
token).

The "event qualifiers" section of the `perf list` man page is updated
to detail the additional behavior.  The man page formatting is tidied
up in this section, as it was incorrectly appearing within the
"parameterized events" section.

Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-24 13:41:35 -07:00
Changbin Du
129f70bd60 perf: ftrace: add graph tracer options args/retval/retval-hex/retaddr
This change adds support for new funcgraph tracer options funcgraph-args,
funcgraph-retval, funcgraph-retval-hex and funcgraph-retaddr.

The new added options are:
  - args       : Show function arguments.
  - retval     : Show function return value.
  - retval-hex : Show function return value in hexadecimal format.
  - retaddr    : Show function return address.

 # ./perf ftrace -G vfs_write --graph-opts retval,retaddr
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
 5)               |  mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
 5)   0.188 us    |    local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
 5)               |    rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
 5)               |      _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
 5)   0.123 us    |        preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
 5)   0.128 us    |        local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
 5)   0.086 us    |        do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
 5)   0.845 us    |      } /* _raw_spin_lock_irqsave ret=0x292 */
 5)               |      _raw_spin_unlock_irqrestore() { /* <-rt_mutex_slowunlock+0x191/0x200 */
 5)   0.097 us    |        local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cff1f */
 5)   0.086 us    |        do_raw_spin_unlock(); /* <-_raw_spin_unlock_irqrestore+0x23/0x60 ret=0x1 */
 5)   0.104 us    |        preempt_count_sub(); /* <-_raw_spin_unlock_irqrestore+0x35/0x60 ret=0x0 */
 5)   0.726 us    |      } /* _raw_spin_unlock_irqrestore ret=0x80000000 */
 5)   1.881 us    |    } /* rt_mutex_slowunlock ret=0x0 */
 5)   2.931 us    |  } /* mutex_unlock ret=0x0 */

Signed-off-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250613114048.132336-1-changbin.du@huawei.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-22 17:47:22 -07:00
Namhyung Kim
8db1d77248 perf ftrace latency: Add -e option to measure time between two events
In addition to the function latency, it can measure events latencies.
Some kernel tracepoints are paired and it's menningful to measure how
long it takes between the two events.  The latency is tracked for the
same thread.

Currently it only uses BPF to do the work but it can be lifted later.
Instead of having separate a BPF program for each tracepoint, it only
uses generic 'event_begin' and 'event_end' programs to attach to any
(raw) tracepoints.

  $ sudo perf ftrace latency -a -b --hide-empty \
    -e i915_request_wait_begin,i915_request_wait_end -- sleep 1
  #   DURATION     |      COUNT | GRAPH                                |
     256 -  512 us |          4 | ######                               |
       2 -    4 ms |          2 | ###                                  |
       4 -    8 ms |         12 | ###################                  |
       8 -   16 ms |         10 | ################                     |

  # statistics  (in usec)
    total time:               194915
      avg time:                 6961
      max time:                12855
      min time:                  373
         count:                   28

Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250714052143.342851-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-14 22:51:58 -07:00
Chun-Tse Shao
aa497357c1 perf stat: Fix uncore aggregation number
Follow up:
lore.kernel.org/CAP-5=fVDF4-qYL1Lm7efgiHk7X=_nw_nEFMBZFMcsnOOJgX4Kg@mail.gmail.com/

The patch adds unit aggregation during evsel merge the aggregated uncore
counters. Change the name of the column to `ctrs` and `counters` for
json mode.

Tested on a 2-socket machine with SNC3, uncore_imc_[0-11] and
cpumask="0,120"
Before:
  perf stat -e clockticks -I 1000 --per-socket
  #           time socket cpus             counts unit events
       1.001085024 S0        1         9615386315      clockticks
       1.001085024 S1        1         9614287448      clockticks
  perf stat -e clockticks -I 1000 --per-node
  #           time node   cpus             counts unit events
       1.001029867 N0        1         3205726984      clockticks
       1.001029867 N1        1         3205444421      clockticks
       1.001029867 N2        1         3205234018      clockticks
       1.001029867 N3        1         3205224660      clockticks
       1.001029867 N4        1         3205207213      clockticks
       1.001029867 N5        1         3205528246      clockticks
After:
  perf stat -e clockticks -I 1000 --per-socket
  #           time socket ctrs             counts unit events
       1.001026071 S0       12         9619677996      clockticks
       1.001026071 S1       12         9618612614      clockticks
  perf stat -e clockticks -I 1000 --per-node
  #           time node   ctrs             counts unit events
       1.001027449 N0        4         3207251859      clockticks
       1.001027449 N1        4         3207315930      clockticks
       1.001027449 N2        4         3206981828      clockticks
       1.001027449 N3        4         3206566126      clockticks
       1.001027449 N4        4         3206032609      clockticks
       1.001027449 N5        4         3205651355      clockticks

Tested with JSON output linter:
  perf test "perf stat JSON output linter"
   94: perf stat JSON output linter                                    : Ok

Suggested-by: Ian Rogers <irogers@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Link: https://lore.kernel.org/r/20250627201818.479421-1-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-06-27 16:14:10 -07:00
Yuzhuo Jing
8e63fd1e00 tools: Remove libcrypto dependency
Remove all occurrence of libcrypto in the build system.

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250625202311.23244-5-ebiggers@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-06-26 10:51:41 -07:00
Namhyung Kim
c833e8cc4d Linux 6.16-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmhYZ9AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGPdcIAIR01BnZbU7knvkI
 QSMlJR4zl8IDIu+9P35v5jmoklJFqYtIGQUnNTxrWK0i/NGj6+FAmA8xgAzeJZ7d
 Kv0Qs0fkgwQWxFbbM9Xg9Vob8gxSZ6Qyo5ETz8UDupCNWyUPpMaRDLv8IRtF/19I
 YAgXJqCXC4LuJJNROSkk3wiLqg+CerzJ/m7GJtBQdCmbv6h87HETaeQpKVgEwTkR
 uEn99CDnOryovIYWoDaBbqLRF5AQr2JA6lqmQAr57wyjzuj1Ul2BNgLC/dlnuSQl
 G8kv0+Kf+l1X3UJy1w8V4lkSLyjpTko7QIgv3AVDLDvOobCtZglWD4vPJ1OGvw4m
 cHDrPg4=
 =thw2
 -----END PGP SIGNATURE-----

Merge tag 'v6.16-rc3' into perf-tools-next

To get the fixes in libbpf and perf tools.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-06-22 21:54:03 -07:00
Blake Jones
1d0654b7fd perf build: detect support for libbpf's emit_strings option
This creates a config option that detects libbpf's ability to display
character arrays as strings, which was just added to the BPF tree
(https://git.kernel.org/bpf/bpf-next/c/87c9c79a02b4).

To test this change, I built perf (from later in this patch set) with:

 - static libbpf (default, using source from kernel tree)
 - dynamic libbpf (LIBBPF_DYNAMIC=1 LIBBPF_INCLUDE=/usr/local/include)

For both the static and dynamic versions, I used headers with and without
the ".emit_strings" option.

I verified that of the four resulting binaries, the two with
".emit_strings" would successfully record BPF_METADATA events, and the two
without wouldn't.  All four binaries would successfully display
BPF_METADATA events, because the relevant bit of libbpf code is only used
during "perf record".

Signed-off-by: Blake Jones <blakejones@google.com>
Link: https://lore.kernel.org/r/20250612194939.162730-2-blakejones@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-06-20 14:48:14 -07:00
Namhyung Kim
bbeb1088a4 perf mem: Document new output fields (op, cache, mem, dtlb, snoop)
Update the documentation of the new fields with examples and caveats.

Also update the related documentation for AMD IBS.

Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250610005742.2173050-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16 14:05:10 -03:00
Howard Chu
46e34646ae perf trace: Remove --map-dump documentation
The --map-dump option was removed in 5e6da6be3082 ("perf trace: Migrate
BPF augmentation to use a skeleton"), this patch removes its remaining
documentation.

Fixes: 5e6da6be3082 ("perf trace: Migrate BPF augmentation to use a skeleton")
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Link: https://lore.kernel.org/r/20250601173252.717780-1-howardchu95@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-06-09 11:18:19 -07:00
Namhyung Kim
0df14c1f1e perf lock contention: Reject more than 10ms delays for safety
Delaying kernel operations can be dangerous and the kernel may kill
(non-sleepable) BPF programs running for long in the future.

Limit the max delay to 10ms and update the document about it.

  $ sudo ./perf lock con -abl -J 100000us@cgroup_mutex true
  lock delay is too long: 100000us (> 10ms)

   Usage: perf lock contention [<options>]

      -J, --inject-delay <TIME@FUNC>
                            Inject delays to specific locks

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250515181042.555189-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-31 08:45:24 -03:00
Ravi Bangoria
00a23c000e perf mem: Describe overhead calculation in brief
Unlike perf-report which uses sample period for overhead calculation,
perf-mem overhead is calculated using sample weight. Describe perf-mem
overhead calculation method in it's man page.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20250523222157.1259998-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-28 14:41:10 -03:00
Chun-Tse Shao
208c0e1683 perf record: Add 8-byte aligned event type PERF_RECORD_COMPRESSED2
The original PERF_RECORD_COMPRESS is not 8-byte aligned, which can cause
asan runtime error:

  # Build with asan
  $ make -C tools/perf O=/tmp/perf DEBUG=1 EXTRA_CFLAGS="-O0 -g -fno-omit-frame-pointer -fsanitize=undefined"
  # Test success with many asan runtime errors:
  $ /tmp/perf/perf test "Zstd perf.data compression/decompression" -vv
   83: Zstd perf.data compression/decompression:
  ...
  util/session.c:1959:13: runtime error: member access within misaligned address 0x7f69e3f99653 for type 'union perf_event', which requires 13 byte alignment
  0x7f69e3f99653: note: pointer points here
   d0  3a 50 69 44 00 00 00 00  00 08 00 bb 07 00 00 00  00 00 00 44 00 00 00 00  00 00 00 ff 07 00 00
                ^
  util/session.c:2163:22: runtime error: member access within misaligned address 0x7f69e3f99653 for type 'union perf_event', which requires 8 byte alignment
  0x7f69e3f99653: note: pointer points here
   d0  3a 50 69 44 00 00 00 00  00 08 00 bb 07 00 00 00  00 00 00 44 00 00 00 00  00 00 00 ff 07 00 00
                ^
  ...

Since there is no way to align compressed data in zstd compression, this
patch add a new event type `PERF_RECORD_COMPRESSED2`, which adds a field
`data_size` to specify the actual compressed data size.

The `header.size` contains the total record size, including the padding
at the end to make it 8-byte aligned.

Tested with `Zstd perf.data compression/decompression`

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303183646.327510-1-ctshao@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-16 17:31:40 -03:00
Namhyung Kim
ef60b8f572 perf trace: Support --summary-mode=cgroup
Add a new summary mode to collect stats for each cgroup.

  $ sudo ./perf trace -as --bpf-summary --summary-mode=cgroup -- sleep 1

   Summary of events:

   cgroup /user.slice/user-657345.slice/user@657345.service/session.slice/org.gnome.Shell@x11.service, 535 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     ppoll                 15      0   373.600     0.004    24.907   197.491     55.26%
     poll                  15      0     1.325     0.001     0.088     0.369     38.76%
     close                 66      0     0.567     0.007     0.009     0.026      3.55%
     write                150      0     0.471     0.001     0.003     0.010      3.29%
     recvmsg               94     83     0.290     0.000     0.003     0.037     16.39%
     ioctl                 26      0     0.237     0.001     0.009     0.096     50.13%
     timerfd_create        66      0     0.236     0.003     0.004     0.024      8.92%
     timerfd_settime       70      0     0.160     0.001     0.002     0.012      7.66%
     writev                10      0     0.118     0.001     0.012     0.019     18.17%
     read                   9      0     0.021     0.001     0.002     0.004     14.07%
     getpid                14      0     0.019     0.000     0.001     0.004     20.28%

   cgroup /system.slice/polkit.service, 94 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     ppoll                 22      0    19.811     0.000     0.900     9.273     63.88%
     write                 30      0     0.040     0.001     0.001     0.003     12.09%
     recvmsg               12      0     0.018     0.001     0.002     0.006     28.15%
     read                  18      0     0.013     0.000     0.001     0.003     21.99%
     poll                  12      0     0.006     0.000     0.001     0.001      4.48%

   cgroup /user.slice/user-657345.slice/user@657345.service/app.slice/app-org.gnome.Terminal.slice/gnome-terminal-server.service, 21 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     ppoll                  4      0    17.476     0.003     4.369    13.298     69.65%
     recvmsg               15     12     0.068     0.002     0.005     0.014     26.53%
     writev                 1      0     0.033     0.033     0.033     0.033      0.00%
     poll                   1      0     0.005     0.005     0.005     0.005      0.00%

   ...

It works only for --bpf-summary for now.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250501225337.928470-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-13 18:20:46 -03:00
Namhyung Kim
39922dc53c perf report: Add 'tgid' sort key
Sometimes we need to analyze the data in process level but current sort
keys only work on thread level.  Let's add 'tgid' sort key for that as
'pid' is already taken for thread.

This will look mostly the same, but it only uses tgid instead of tid.
Here's an example of a process with two threads (thloop).

  $ perf record -- perf test -w thloop

  $ perf report --stdio -s tgid,pid -H
  ...
  #
  #    Overhead  Tgid:Command / Pid:Command
  # ...........  ..........................
  #
     100.00%     2018407:perf
         50.34%     2018407:perf
         49.66%     2018409:perf

Suggested-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250509210421.197245-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-13 17:51:32 -03:00
Ian Rogers
255f5b6d06 perf parse-events: Add "cpu" term to set the CPU an event is recorded on
The -C option allows the CPUs for a list of events to be specified but
its not possible to set the CPU for a single event. Add a term to
allow this. The term isn't a general CPU list due to ',' already being
a special character in event parsing instead multiple cpu= terms may
be provided and they will be merged/unioned together.

An example of mixing different types of events counted on different CPUs:
```
$ perf stat -A -C 0,4-5,8 -e "instructions/cpu=0/,l1d-misses/cpu=4,cpu=5/,inst_retired.any/cpu=8/,cycles" -a sleep 0.1

 Performance counter stats for 'system wide':

CPU0            6,979,225      instructions/cpu=0/              #    0.89  insn per cycle
CPU4               75,138      cpu/l1d-misses/
CPU5            1,418,939      cpu/l1d-misses/
CPU8              797,553      cpu/inst_retired.any,cpu=8/
CPU0            7,845,302      cycles
CPU4            6,546,859      cycles
CPU5          185,915,438      cycles
CPU8            2,065,668      cycles

       0.112449242 seconds time elapsed
```

Committer testing:

  root@number:~# grep -m1 "model name" /proc/cpuinfo
  model name	: AMD Ryzen 9 9950X3D 16-Core Processor
  root@number:~# perf stat -A -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.1

   Performance counter stats for 'system wide':

  CPU0    2,398,351   instructions/cpu=0/    #  0.44  insn per cycle
  CPU0    2,398,152   instructions           #  0.44  insn per cycle
  CPU1    1,265,634   instructions           #  0.49  insn per cycle
  CPU2      606,087   instructions           #  0.50  insn per cycle
  CPU3    4,025,752   instructions           #  0.52  insn per cycle
  CPU4    4,236,810   instructions           #  0.53  insn per cycle
  CPU5    3,984,832   instructions           #  0.66  insn per cycle
  CPU6      434,132   instructions           #  0.44  insn per cycle
  CPU7       65,752   instructions           #  0.41  insn per cycle
  CPU8      459,083   instructions           #  0.48  insn per cycle
  CPU9    6,464,161   instructions           #  1.31  insn per cycle
  <SNIP>
  root@number:~# perf stat -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.

   Performance counter stats for 'system wide':

             144,822      instructions/cpu=0/              #    0.03  insn per cycle
           4,666,114      instructions                     #    0.93  insn per cycle
               2,583      l1d-misses
           4,993,633      cycles

         0.000868512 seconds time elapsed

  root@number:~#

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Weilin Wang <weilin.wang@intel.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20250403194337.40202-5-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12 14:23:19 -03:00
Adrian Hunter
352b088164 perf intel-pt: Do not default to recording all switch events
On systems with many CPUs, recording extra context switch events can be
excessive and unnecessary. Add perf config intel-pt.all-switch-events=false
to control the behaviour.

Example:

 # perf config intel-pt.all-switch-events=false
 # perf record -eintel_pt//u uname
 Linux
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.082 MB perf.data ]
 # perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c
       5 PERF_RECORD_SWITCH
 # perf config intel-pt.all-switch-events=true
 # perf record -eintel_pt//u uname
 Linux
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.102 MB perf.data ]
 # perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c
     180 PERF_RECORD_SWITCH_CPU_WIDE

Committer testing:

While doing a make -j28 allmodconfig:

  root@five:~# grep "model name" -m1 /proc/cpuinfo
  model name	: Intel(R) Core(TM) i7-14700K
  root@five:~#
  root@five:~# perf config intel-pt.all-switch-events=false
  root@five:~# perf record -e intel_pt//u uname
  Linux
  [ perf record: Woken up 2 times to write data ]
  [ perf record: Captured and wrote 0.019 MB perf.data ]
  root@five:~# perf report --stats | grep SWITCH_CPU_WIDE
  root@five:~#
  root@five:~# perf config intel-pt.all-switch-events=true
  root@five:~# perf record -e intel_pt//u uname
  Linux
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.047 MB perf.data ]
  root@five:~# perf report --stats | grep SWITCH_CPU_WIDE
       SWITCH_CPU_WIDE events:        542  (96.4%)
  root@five:~#

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250512093932.79854-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12 14:18:16 -03:00
Namhyung Kim
c42e219942 perf lock contention: Add -J/--inject-delay option
This is to slow down lock acquistion (on contention locks) deliberately.

A possible use case is to estimate impact on application performance by
optimization of kernel locking behavior.  By delaying the lock it can
simulate the worse condition as a control group, and then compare with
the current behavior as a optimized condition.

The syntax is 'time@function' and the time can have unit suffix like
"us" and "ms".  For example, I ran a simple test like below.

  $ sudo perf lock con -abl -L tasklist_lock -- \
    sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended   total wait     max wait     avg wait            address   symbol

          92      1.18 ms    199.54 us     12.79 us   ffffffff8a806080   tasklist_lock (rwlock)

The contention count was 92 and the average wait time was around 10 us.
But if I add 100 usec of delay to the tasklist_lock,

  $ sudo perf lock con -abl -L tasklist_lock -J 100us@tasklist_lock -- \
    sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended   total wait     max wait     avg wait            address   symbol

         190     15.67 ms    230.10 us     82.46 us   ffffffff8a806080   tasklist_lock (rwlock)

The contention count increased and the average wait time was up closed
to 100 usec.  If I increase the delay even more,

  $ sudo perf lock con -abl -L tasklist_lock -J 1ms@tasklist_lock -- \
    sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended   total wait     max wait     avg wait            address   symbol

        1002      2.80 s       3.01 ms      2.80 ms   ffffffff8a806080   tasklist_lock (rwlock)

Now every sleep process had contention and the wait time was more than 1
msec.  This is on my 4 CPU laptop so I guess one CPU has the lock while
other 3 are waiting for it mostly.

For simplicity, it only supports global locks for now.

Committer testing:

  root@number:~# grep -m1 'model name' /proc/cpuinfo
  model name : AMD Ryzen 9 9950X3D 16-Core Processor
  root@number:~# perf lock con -abl -L tasklist_lock -- sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended  total wait   max wait  avg wait           address  symbol

         142   453.85 us   25.39 us   3.20 us  ffffffffae808080  tasklist_lock (rwlock)
  root@number:~# perf lock con -abl -L tasklist_lock -J 100us@tasklist_lock -- sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended  total wait   max wait  avg wait           address  symbol

        1040     2.39 s     3.11 ms   2.30 ms  ffffffffae808080  tasklist_lock (rwlock)
  root@number:~# perf lock con -abl -L tasklist_lock -J 1ms@tasklist_lock -- sh -c 'for i in $(seq 1000); do sleep 1 & done; wait'
   contended  total wait   max wait  avg wait           address  symbol

        1025    24.72 s    31.01 ms  24.12 ms  ffffffffae808080  tasklist_lock (rwlock)
  root@number:~#

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250509171950.183591-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-09 14:32:15 -03:00
Howard Chu
9557c00076 perf record --off-cpu: Add --off-cpu-thresh option
Specify the threshold for dumping offcpu samples with --off-cpu-thresh,
the unit is milliseconds. Default value is 500ms.

Example:

  perf record --off-cpu --off-cpu-thresh 824

The example above collects direct off-cpu samples where the off-cpu time
is longer than 824ms.

Committer testing:

After commenting out the end off-cpu dump to have just the ones that are
added right after the task is scheduled back, and using a threshould of
1000ms, we see some periods (the 5th column, just before "offcpu-time"
in the 'perf script' output) that are over 1000.000.000 nanoseconds:

  root@number:~# perf record --off-cpu --off-cpu-thresh 10000
  ^C[ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 3.902 MB perf.data (34335 samples) ]
  root@number:~# perf script
<SNIP>
  Isolated Web Co   59932 [028] 63839.594437: 1000049427 offcpu-time:
             7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6)
             7fe63c78c04c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6)
             7fe63c78e928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6)
             5599974a9fe7 mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&)+0xe7 (/usr/lib64/fir>
                100000000 [unknown] ([unknown])

          swapper       0 [025] 63839.594459:     195724    cycles:P:  ffffffffac328270 read_tsc+0x0 ([kernel.kallsyms])
  Isolated Web Co   59932 [010] 63839.594466: 1000055278 offcpu-time:
             7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6)
             7fe63c78ba24 __syscall_cancel+0x14 (/usr/lib64/libc.so.6)
             7fe63c804c4e __poll+0x1e (/usr/lib64/libc.so.6)
             7fe633b0d1b8 PollWrapper(_GPollFD*, unsigned int, int) [clone .lto_priv.0]+0xf8 (/usr/lib64/firefox/libxul.so)
                10000002c [unknown] ([unknown])

          swapper       0 [027] 63839.594475:     134433    cycles:P:  ffffffffad4c45d9 irqentry_enter+0x19 ([kernel.kallsyms])
          swapper       0 [028] 63839.594499:     215838    cycles:P:  ffffffffac39199a switch_mm_irqs_off+0x10a ([kernel.kallsyms])
  MediaPD~oder #1 1407676 [027] 63839.594514:     134433    cycles:P:      7f982ef5e69f dct_IV(int*, int, int*)+0x24f (/usr/lib64/libfdk-aac.so.2.0.0)
          swapper       0 [024] 63839.594524:     267411    cycles:P:  ffffffffad4c6ee6 poll_idle+0x56 ([kernel.kallsyms])
  MediaSu~sor #75 1093827 [026] 63839.594555:     332652    cycles:P:      55be753ad030 moz_xmalloc+0x200 (/usr/lib64/firefox/firefox)
          swapper       0 [027] 63839.594616:     160548    cycles:P:  ffffffffad144840 menu_select+0x570 ([kernel.kallsyms])
  Isolated Web Co   14019 [027] 63839.595120: 1000050178 offcpu-time:
             7fc9537cc6c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6)
             7fc9537c104c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6)
             7fc9537c3928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6)
             7fc95372a3c8 pt_TimedWait+0xb8 (/usr/lib64/libnspr4.so)
             7fc95372a8d8 PR_WaitCondVar+0x68 (/usr/lib64/libnspr4.so)
             7fc94afb1f7c WatchdogMain(void*)+0xac (/usr/lib64/firefox/libxul.so)
             7fc947498660 [unknown] ([unknown])
             7fc9535fce88 [unknown] ([unknown])
             7fc94b620e60 WatchdogManager::~WatchdogManager()+0x0 (/usr/lib64/firefox/libxul.so)
          fff8548387f8b48 [unknown] ([unknown])

          swapper       0 [003] 63839.595712:     212948    cycles:P:  ffffffffacd5b865 acpi_os_read_port+0x55 ([kernel.kallsyms])
<SNIP>

Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Gautam Menghani <gautam@linux.ibm.com>
Tested-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-2-howardchu95@gmail.com
Link: https://lore.kernel.org/r/20250501022809.449767-10-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05 21:51:54 -03:00
Namhyung Kim
43a6446998 perf record: Add --sample-mem-info option
There's no way to enable PERF_SAMPLE_DATA_SRC without PERF_SAMPLE_ADDR
which brings a lot of overhead due to the number of MMAP[2] records.

Let's add a new option to enable this information separately.

Committer testing:

  # perf record -a --sample-mem-info
  ^C[ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 1.815 MB perf.data (2637 samples) ]
  #
  # perf evlist -v
  cycles:P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER|DATA_SRC, read_format: ID|LOST, disabled: 1, freq: 1, precise_ip: 2, sample_id_all: 1
  dummy:u: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER|DATA_SRC, read_format: ID|LOST, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
  #
  # perf report -D |& grep -w PERF_RECORD_SAMPLE -A3 -m1
  0 44675164447282 0x1a7590 [0x40]: PERF_RECORD_SAMPLE(IP, 0x4001): 107299/107299: 0xffffffffac4a5e11 period: 144 addr: 0
   . data_src: 0x229080142
   ... thread: perf:107299
   ...... dso: /lib/modules/6.15.0-rc4+/build/vmlinux
  #

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250430205548.789750-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-02 15:36:14 -03:00
Ravi Bangoria
fa1332a801 perf mem/c2c amd: Add ldlat support
'perf mem/c2c' uses IBS Op PMU on AMD platforms.

IBS Op PMU on Zen5 uarch has added support for Load Latency filtering.

Implement 'perf mem/c2c' --ldlat using IBS Op Load Latency filtering
capability.

Some subtle differences between AMD and other arch:

o --ldlat is disabled by default on AMD

o Supported values are 128 to 2048.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Joe Mario <jmario@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20250429035938.1301-4-ravi.bangoria@amd.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-04-29 22:30:46 -03:00
Ravi Bangoria
eeefc13c71 perf amd ibs: Add Load Latency bits in raw dump
IBS OP PMU on Zen5 supports Load Latency filtering. Decode and dump Load
Latency filtering related bits into perf script raw dump.

Also add oneliner example in the perf-amd-ibs man page.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Joe Mario <jmario@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20250429035938.1301-2-ravi.bangoria@amd.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-04-29 22:30:46 -03:00
Namhyung Kim
1bec43f523 perf trace: Implement syscall summary in BPF
When -s/--summary option is used, it doesn't need (augmented) arguments
of syscalls.  Let's skip the augmentation and load another small BPF
program to collect the statistics in the kernel instead of copying the
data to the ring-buffer to calculate the stats in userspace.  This will
be much more light-weight than the existing approach and remove any lost
events.

Let's add a new option --bpf-summary to control this behavior.  I cannot
make it default because there's no way to get e_machine in the BPF which
is needed for detecting different ABIs like 32-bit compat mode.

No functional changes intended except for no more LOST events. :)

  $ sudo ./perf trace -as --summary-mode=total --bpf-summary sleep 1

   Summary of events:

   total, 6194 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           561      0  4530.843     0.000     8.076   520.941     18.75%
     futex                693     45  4317.231     0.000     6.230   500.077     21.98%
     poll                 300      0  1040.109     0.000     3.467   120.928     17.02%
     clock_nanosleep        1      0  1000.172  1000.172  1000.172  1000.172      0.00%
     ppoll                360      0   872.386     0.001     2.423   253.275     41.91%
     epoll_pwait           14      0   384.349     0.001    27.453   380.002     98.79%
     pselect6              14      0   108.130     7.198     7.724     8.206      0.85%
     nanosleep             39      0    43.378     0.069     1.112    10.084     44.23%
     ...

Reviewed-by: Howard Chu <howardchu95@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250326044001.3503432-1-namhyung@kernel.org
[ Added fixup sent from Namhyung in response to my report to make it also dependent on CONFIG_TRACE ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-04-28 16:53:11 -03:00
Ian Rogers
f19306f065 perf stat: Add mean, min, max and last --tpebs-mode options
Add command line configuration option for how retirement latency
events are combined.

The default "mean" gives the average of retirement latency.

"min" or "max" give the smallest or largest retirment latency times
respectively.

"last" uses the last retirment latency sample's time.

Committer notes:

Enclose parse_tpebs_mode() under HAVE_ARCH_X86_64_SUPPORT to match the
ifdef block where it is used, fixing the build in systems like:

  20     5.60 debian:experimental-x-mips    : FAIL gcc version 14.2.0 (Debian 14.2.0-1)
    builtin-stat.c:2330:12: error: 'parse_tpebs_mode' defined but not used [-Werror=unused-function]
     2330 | static int parse_tpebs_mode(const struct option *opt, const char *str,
          |            ^~~~~~~~~~~~~~~~

Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Weilin Wang <weilin.wang@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Andreas Färber <afaerber@suse.de>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250414174134.3095492-15-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-04-25 12:31:36 -03:00
Yujie Liu
fa9bc517af perf script: Update brstack syntax documentation
The following commits added new fields/flags to the branch stack field
list:

commit 1f48989cdc7d ("perf script: Output branch sample type")
commit 6ade6c646035 ("perf script: Show branch speculation info")
commit 1e66dcff7b9b ("perf script: Add not taken event for branch stack")

Update brstack syntax documentation to be consistent with the latest
branch stack field list. Improve the descriptions to help users
interpret the fields accurately.

Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20250312072329.419020-1-yujie.liu@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-14 10:41:08 -07:00
Namhyung Kim
bbf006d6d1 perf annotate: Add --code-with-type option.
This option is to show data type info in the regular (code) annotation.
It tries to find data type for each (memory) instruction in the
function.  It'd be useful to see function-level memory access pattern
and also to debug the data type profiling result.

The output would be added at the end of the line and have "# data-type:"
prefix.

For now, it only works with --stdio mode for simplicity.  I can work on
enabling it for TUI later.

  $ perf annotate --stdio --code-with-type
   Percent |      Source code & Disassembly of vmlinux for cpu/mem-loads/ppk (253 samples, percent: local period)
  ---------------------------------------------------------------------------------------------------------------
           : 0                0xffffffff81baa000 <check_preemption_disabled>:
      0.00 :   ffffffff81baa000:        pushq   %r12              # data-type: (stack operation)
      0.00 :   ffffffff81baa002:        pushq   %rbp              # data-type: (stack operation)
      0.00 :   ffffffff81baa003:        pushq   %rbx              # data-type: (stack operation)
      0.00 :   ffffffff81baa004:        subq    $0x8, %rsp
     18.00 :   ffffffff81baa008:        movl    %gs:0x7e48893d(%rip), %ebx  # 0x3294c <pcpu_hot+0xc>              # data-type: struct pcpu_hot +0xc (cpu_number)
     12.58 :   ffffffff81baa00f:        movl    %gs:0x7e488932(%rip), %eax  # 0x32948 <pcpu_hot+0x8>              # data-type: struct pcpu_hot +0x8 (preempt_count)
      0.00 :   ffffffff81baa016:        testl   $0x7fffffff, %eax
      0.00 :   ffffffff81baa01b:        je      0xffffffff81baa02c <check_preemption_disabled+0x2c>
      0.00 :   ffffffff81baa01d:        addq    $0x8, %rsp
      0.00 :   ffffffff81baa021:        movl    %ebx, %eax
     14.19 :   ffffffff81baa023:        popq    %rbx              # data-type: (stack operation)
     18.86 :   ffffffff81baa024:        popq    %rbp              # data-type: (stack operation)
     12.10 :   ffffffff81baa025:        popq    %r12              # data-type: (stack operation)
     17.78 :   ffffffff81baa027:        jmp     0xffffffff81bc1170 <__x86_return_thunk>
      6.49 :   ffffffff81baa02c:        callq   *0xc9139e(%rip)  # 0xffffffff8283b3d0 <pv_ops+0xf0>               # data-type: (stack operation)
      0.00 :   ffffffff81baa032:        testb   $0x2, %ah
      0.00 :   ffffffff81baa035:        je      0xffffffff81baa01d <check_preemption_disabled+0x1d>
      0.00 :   ffffffff81baa037:        movq    %rdi, %rbp
      0.00 :   ffffffff81baa03a:        movq    %gs:0x32940, %rax         # data-type: struct pcpu_hot +0 (current_task)
      0.00 :   ffffffff81baa043:        testb   $0x4, 0x2f(%rax)          # data-type: struct task_struct +0x2f (flags)
      0.00 :   ffffffff81baa047:        je      0xffffffff81baa052 <check_preemption_disabled+0x52>
      0.00 :   ffffffff81baa049:        cmpl    $0x1, 0x3d0(%rax)         # data-type: struct task_struct +0x3d0 (nr_cpus_allowed)
      0.00 :   ffffffff81baa050:        je      0xffffffff81baa01d <check_preemption_disabled+0x1d>
      0.00 :   ffffffff81baa052:        movq    %gs:0x32940, %r12         # data-type: struct pcpu_hot +0 (current_task)
      0.00 :   ffffffff81baa05b:        cmpw    $0x0, 0x7f0(%r12)         # data-type: struct task_struct +0x7f0 (migration_disabled)
      0.00 :   ffffffff81baa065:        movq    %rsi, (%rsp)
      0.00 :   ffffffff81baa069:        jne     0xffffffff81baa01d <check_preemption_disabled+0x1d>
      0.00 :   ffffffff81baa06b:        movl    0xe8dd13(%rip), %eax  # 0xffffffff82a37d84 <system_state>         # data-type: enum system_states +0
      0.00 :   ffffffff81baa071:        testl   %eax, %eax
      0.00 :   ffffffff81baa073:        je      0xffffffff81baa01d <check_preemption_disabled+0x1d>
      0.00 :   ffffffff81baa075:        incl    %gs:0x7e4888cc(%rip)  # 0x32948 <pcpu_hot+0x8>            # data-type: struct pcpu_hot +0x8 (preempt_count)
      0.00 :   ffffffff81baa07c:        movq    $-0x7e14a100, %rdi
      0.00 :   ffffffff81baa083:        callq   0xffffffff81148c40 <__printk_ratelimit>           # data-type: (stack operation)
      0.00 :   ffffffff81baa088:        testl   %eax, %eax
      0.00 :   ffffffff81baa08a:        je      0xffffffff81baa0d5 <check_preemption_disabled+0xd5>
      0.00 :   ffffffff81baa08c:        movl    0x958(%r12), %r9d         # data-type: struct task_struct +0x958 (pid)
      0.00 :   ffffffff81baa094:        movq    (%rsp), %rdx              # data-type: char* +0
      0.00 :   ffffffff81baa098:        movq    %rbp, %rsi
      0.00 :   ffffffff81baa09b:        leaq    0xb88(%r12), %r8          # data-type: struct task_struct +0xb88 (comm)
      0.00 :   ffffffff81baa0a3:        movl    %gs:0x7e48889e(%rip), %ecx  # 0x32948 <pcpu_hot+0x8>              # data-type: struct pcpu_hot +0x8 (preempt_count)
      0.00 :   ffffffff81baa0aa:        andl    $0x7fffffff, %ecx
      0.00 :   ffffffff81baa0b0:        movq    $-0x7dd3cdf0, %rdi
      0.00 :   ffffffff81baa0b7:        subl    $0x1, %ecx
      0.00 :   ffffffff81baa0ba:        callq   0xffffffff81149340 <_printk>              # data-type: (stack operation)
      0.00 :   ffffffff81baa0bf:        movq    0x20(%rsp), %rsi
      0.00 :   ffffffff81baa0c4:        movq    $-0x7ddb8c7e, %rdi
      0.00 :   ffffffff81baa0cb:        callq   0xffffffff81149340 <_printk>              # data-type: (stack operation)
      0.00 :   ffffffff81baa0d0:        callq   0xffffffff81b7ab60 <dump_stack>           # data-type: (stack operation)
      0.00 :   ffffffff81baa0d5:        decl    %gs:0x7e48886c(%rip)  # 0x32948 <pcpu_hot+0x8>            # data-type: struct pcpu_hot +0x8 (preempt_count)
      0.00 :   ffffffff81baa0dc:        jmp     0xffffffff81baa01d <check_preemption_disabled+0x1d>

Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250310224925.799005-8-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-13 00:19:51 -07:00
Chun-Tse Shao
3c97e7b991 perf lock: Report owner stack in usermode
This patch parses `owner_lock_stat` into a RB tree, enabling ordered
reporting of owner lock statistics with stack traces. It also updates
the documentation for the `-o` option in contention mode, decouples `-o`
from `-t`, and issues a warning to inform users about the new behavior
of `-ov`.

Example output:
  $ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
  ...
   contended   total wait     max wait     avg wait         type   caller

         171      1.55 ms     20.26 us      9.06 us        mutex   pipe_read+0x57
                          0xffffffffac6318e7  pipe_read+0x57
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
          36    193.71 us     15.27 us      5.38 us        mutex   pipe_write+0x50
                          0xffffffffac631ee0  pipe_write+0x50
                          0xffffffffac6241db  vfs_write+0x3bb
                          0xffffffffac6244ab  ksys_write+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           4     51.22 us     16.47 us     12.80 us        mutex   do_epoll_wait+0x24d
                          0xffffffffac691f0d  do_epoll_wait+0x24d
                          0xffffffffac69249b  do_epoll_pwait.part.0+0xb
                          0xffffffffac693ba5  __x64_sys_epoll_pwait+0x95
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76

  === owner stack trace ===

           3     31.24 us     15.27 us     10.41 us        mutex   pipe_read+0x348
                          0xffffffffac631bd8  pipe_read+0x348
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
  ...

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Athira Rajeev <atrajeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250227003359.732948-5-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28 10:09:02 -08:00
James Clark
5c496f1d67 perf list: Document -v option deduplication feature
-v disables deduplication of similarly suffixed PMUs so add it to the
help and doc strings.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250226104111.564443-4-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 16:23:47 -08:00
Greg Kroah-Hartman
0cced76a02 perf tools: Fix up some comments and code to properly use the event_source bus
In sysfs, the perf events are all located in
/sys/bus/event_source/devices/ but some places ended up hard-coding the
location to be at the root of /sys/devices/ which could be very risky as
you do not exactly know what type of device you are accessing in sysfs
at that location.

So fix this all up by properly pointing everything at the bus device
list instead of the root of the sysfs devices/ tree.

Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/2025021955-implant-excavator-179d@gregkh
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 13:23:43 -08:00
Dmitry Vyukov
32ecca8d7a perf report: Add latency and parallelism profiling documentation
Describe latency and parallelism profiling, related flags, and differences
with the currently only supported CPU-consumption-centric profiling.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/r/a13f270ed33cedb03ce9ebf9ddbd064854ca0f19.1739437531.git.dvyukov@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18 14:04:32 -08:00
Dmitry Vyukov
2570c02c3a perf report: Add --latency flag
Add record/report --latency flag that allows to capture and show
latency-centric profiles rather than the default CPU-consumption-centric
profiles. For latency profiles record captures context switch events,
and report shows Latency as the first column.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/r/e9640464bcbc47dde2cb557003f421052ebc9eec.1739437531.git.dvyukov@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18 14:04:32 -08:00
Namhyung Kim
fc00897c8a perf trace: Add --summary-mode option
The --summary-mode option will select how to show the syscall summary at
the end.  By default, it'll show the summary for each thread and it's
the same as if --summary-mode=thread is passed.

The other option is to show total summary, which is --summary-mode=total.
I'd like to have this instead of a separate option like --total-summary
because we may want to add a new summary mode (by cgroup) later.

  $ sudo ./perf trace -as --summary-mode=total sleep 1

   Summary of events:

   total, 21580 events

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait          1305      0 14716.712     0.000    11.277   551.529      8.87%
     futex               1256     89 13331.197     0.000    10.614   733.722     15.49%
     poll                 669      0  6806.618     0.000    10.174   459.316     11.77%
     ppoll                220      0  3968.797     0.000    18.040   516.775     25.35%
     clock_nanosleep        1      0  1000.027  1000.027  1000.027  1000.027      0.00%
     epoll_pwait           21      0   592.783     0.000    28.228   522.293     88.29%
     nanosleep             16      0    60.515     0.000     3.782    10.123     33.33%
     ioctl                510      0     4.284     0.001     0.008     0.182      8.84%
     recvmsg             1434    775     3.497     0.001     0.002     0.174      6.37%
     write               1393      0     2.854     0.001     0.002     0.017      1.79%
     read                1063    100     2.236     0.000     0.002     0.083      5.11%
     ...

Reviewed-by: Howard Chu <howardchu95@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250205205443.1986408-5-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12 19:44:16 -08:00
Linus Torvalds
7685b334d1 perf-tools changes for v6.14
There are a lot of changes in the perf tools in this cycle.
 
 build
 -----
 * Use generic syscall table to generate syscall numbers on supported archs.
 * This also enables to get rid of libaudit which was used for syscall numbers.
 * Remove python2 support as it's deprecated for years.
 * Fix issues on static build with libzstd.
 
 perf record
 -----------
 * Intel-PT supports "aux-action" config term to pause or resume tracing in
   the aux-buffer.  Users can start the intel_pt event as "started-paused" and
   configure other events to control the Intel-PT tracing.
 
     # perf record --kcore -e intel_pt/aux-action=start-paused/   \
         -e syscalls:sys_enter_newuname/aux-action=resume/        \
         -e syscalls:sys_exit_newuname/aux-action=pause/ -- uname
 
   This requires the kernel support (which was added in v6.13).
 
 perf lock
 ---------
 * 'perf lock contention' command has an ability to symbolize locks in
   dynamically allocated objects using slab cache name when it runs with BPF.
   Those dynamic locks would have "&" prefix in the name to distinguish them
   from ordinary (static) locks.
 
     # perf lock con -abl -E 5 sleep 1
        contended   total wait     max wait     avg wait            address   symbol
 
                2      1.95 us      1.77 us       975 ns   ffff9d5e852d3498   &task_struct (mutex)
                1      1.18 us      1.18 us      1.18 us   ffff9d5e852d3538   &task_struct (mutex)
                4      1.12 us       354 ns       279 ns   ffff9d5e841ca800   &kmalloc-cg-512 (mutex)
                2       859 ns       617 ns       429 ns   ffffffffa41c3620   delayed_uprobe_lock (mutex)
                3       691 ns       388 ns       230 ns   ffffffffa41c0940   pack_mutex (mutex)
 
   This also requires the kernel/BPF support (which was added in v6.13).
 
 perf ftrace
 -----------
 * 'perf ftrace latency' command gets a couple of options to support linear
   buckets instead of exponential.  Also it's possible to specify max and
   min latency for the linear buckets.
 
     # perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100   \
         --min-latency=200 --max-latency=800 -- sleep 1
     #   DURATION     |      COUNT | GRAPH                                  |
          0 -  200 ns |        186 | ###                                    |
        200 -  300 ns |        256 | #####                                  |
        300 -  400 ns |        364 | #######                                |
        400 -  500 ns |        223 | ####                                   |
        500 -  600 ns |        111 | ##                                     |
        600 -  700 ns |         41 |                                        |
        700 -  800 ns |        141 | ##                                     |
        800 -  ... ns |        169 | ###                                    |
 
     # statistics  (in nsec)
       total time:              2162212
         avg time:                  967
         max time:                16817
         min time:                  132
            count:                 2236
 
 * As you can see in the above example, it nows shows the statistics at the
   end so that users can see the avg/max/min latencies easily.
 
 * 'perf ftrace profile' command has --graph-opts option like 'perf ftrace
   trace' so that it can control the tracing behaviors in the same way.
   For example, it can limit the function call depth or threshold.
 
 perf script
 -----------
 * Improve physical memory resolution in 'mem-phys-addr' script by parsing
   /proc/iomem file.
 
     # perf script mem-phys-addr -- find /
     ...
     Event: mem_inst_retired.all_loads:P
     Memory type                                    count  percentage
     ----------------------------------------  ----------  ----------
     100000000-85f7fffff : System RAM                8929        69.7
       547600000-54785d23f : Kernel data             1240         9.7
       546a00000-5474bdfff : Kernel rodata            490         3.8
       5480ce000-5485fffff : Kernel bss               121         0.9
     0-fff : Reserved                                3860        30.1
     100000-89c01fff : System RAM                      18         0.1
     8a22c000-8df6efff : System RAM                     5         0.0
 
 Others
 ------
 * 'perf test' gets --runs-per-test option to run the test cases repeatedly.
   This would be helpful to see if it's flaky.
 
 * Add 'parse_events' method to Python perf extension module, so that users
   can use the same event parsing logic in the python code.  One more step
   towards implementing perf tools in Python. :)
 
 * Support opening tracepoint events without libtraceevent.  This will be
   helpful if it won't use the tracing data like in 'perf stat'.
 
 * Update ARM Neoverse N2/V2 JSON events and metrics
 
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCZ5AgiQAKCRCMstVUGiXM
 g0WhAP43Dpfatrm1jicTyAogk5D/JrIMOgjGtrJJi5RXG/r0gwD8DSWFzLppS9xy
 KGtjLHrN6v6BqR4DCubdlZmRfh9Qjgg=
 =M0Kz
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf-tools updates from Namhyung Kim:
 "There are a lot of changes in the perf tools in this cycle.

  build:

   - Use generic syscall table to generate syscall numbers on supported
     archs

   - This also enables to get rid of libaudit which was used for syscall
     numbers

   - Remove python2 support as it's deprecated for years

   - Fix issues on static build with libzstd

  perf record:

   - Intel-PT supports "aux-action" config term to pause or resume
     tracing in the aux-buffer. Users can start the intel_pt event as
     "started-paused" and configure other events to control the Intel-PT
     tracing:

         # perf record --kcore -e intel_pt/aux-action=start-paused/   \
             -e syscalls:sys_enter_newuname/aux-action=resume/        \
             -e syscalls:sys_exit_newuname/aux-action=pause/ -- uname

     This requires kernel support (which was added in v6.13)

  perf lock:

   - 'perf lock contention' command has an ability to symbolize locks in
     dynamically allocated objects using slab cache name when it runs
     with BPF. Those dynamic locks would have "&" prefix in the name to
     distinguish them from ordinary (static) locks

        # perf lock con -abl -E 5 sleep 1
           contended   total wait     max wait     avg wait            address   symbol

                   2      1.95 us      1.77 us       975 ns   ffff9d5e852d3498   &task_struct (mutex)
                   1      1.18 us      1.18 us      1.18 us   ffff9d5e852d3538   &task_struct (mutex)
                   4      1.12 us       354 ns       279 ns   ffff9d5e841ca800   &kmalloc-cg-512 (mutex)
                   2       859 ns       617 ns       429 ns   ffffffffa41c3620   delayed_uprobe_lock (mutex)
                   3       691 ns       388 ns       230 ns   ffffffffa41c0940   pack_mutex (mutex)

     This also requires kernel/BPF support (which was added in v6.13)

  perf ftrace:

   - 'perf ftrace latency' command gets a couple of options to support
     linear buckets instead of exponential. Also it's possible to
     specify max and min latency for the linear buckets:

        # perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100   \
            --min-latency=200 --max-latency=800 -- sleep 1
        #   DURATION     |      COUNT | GRAPH                                  |
             0 -  200 ns |        186 | ###                                    |
           200 -  300 ns |        256 | #####                                  |
           300 -  400 ns |        364 | #######                                |
           400 -  500 ns |        223 | ####                                   |
           500 -  600 ns |        111 | ##                                     |
           600 -  700 ns |         41 |                                        |
           700 -  800 ns |        141 | ##                                     |
           800 -  ... ns |        169 | ###                                    |

        # statistics  (in nsec)
          total time:              2162212
            avg time:                  967
            max time:                16817
            min time:                  132
               count:                 2236

   - As you can see in the above example, it nows shows the statistics
     at the end so that users can see the avg/max/min latencies easily

   - 'perf ftrace profile' command has --graph-opts option like 'perf
     ftrace trace' so that it can control the tracing behaviors in the
     same way. For example, it can limit the function call depth or
     threshold

  perf script:

   - Improve physical memory resolution in 'mem-phys-addr' script by
     parsing /proc/iomem file

        # perf script mem-phys-addr -- find /
        ...
        Event: mem_inst_retired.all_loads:P
        Memory type                                    count  percentage
        ----------------------------------------  ----------  ----------
        100000000-85f7fffff : System RAM                8929        69.7
          547600000-54785d23f : Kernel data             1240         9.7
          546a00000-5474bdfff : Kernel rodata            490         3.8
          5480ce000-5485fffff : Kernel bss               121         0.9
        0-fff : Reserved                                3860        30.1
        100000-89c01fff : System RAM                      18         0.1
        8a22c000-8df6efff : System RAM                     5         0.0

  Others:

   - 'perf test' gets --runs-per-test option to run the test cases
     repeatedly. This would be helpful to see if it's flaky

   - Add 'parse_events' method to Python perf extension module, so that
     users can use the same event parsing logic in the python code. One
     more step towards implementing perf tools in Python. :)

   - Support opening tracepoint events without libtraceevent. This will
     be helpful if it won't use the tracing data like in 'perf stat'

   - Update ARM Neoverse N2/V2 JSON events and metrics"

* tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (176 commits)
  perf test: Update event_groups test to use instructions
  perf bench: Fix undefined behavior in cmpworker()
  perf annotate: Prefer passing evsel to evsel->core.idx
  perf lock: Rename fields in lock_type_table
  perf lock: Add percpu-rwsem for type filter
  perf lock: Fix parse_lock_type which only retrieve one lock flag
  perf lock: Fix return code for functions in __cmd_contention
  perf hist: Fix width calculation in hpp__fmt()
  perf hist: Fix bogus profiles when filters are enabled
  perf hist: Deduplicate cmp/sort/collapse code
  perf test: Improve verbose documentation
  perf test: Add a runs-per-test flag
  perf test: Fix parallel/sequential option documentation
  perf test: Send list output to stdout rather than stderr
  perf test: Rename functions and variables for better clarity
  perf tools: Expose quiet/verbose variables in Makefile.perf
  perf config: Add a function to set one variable in .perfconfig
  perf test perftool_testsuite: Return correct value for skipping
  perf test perftool_testsuite: Add missing description
  perf test record+probe_libc_inet_pton: Make test resilient
  ...
2025-01-24 05:45:40 -08:00
Chun-Tse Shao
e9188ae3cd perf lock: Add percpu-rwsem for type filter
percpu-rwsem was missing in man page. And for backward compatibility,
replace `pcpu-sem` with `percpu-rwsem` before parsing lock name.
Tested `./perf lock con -ab -Y pcpu-sem` and `./perf lock con -ab -Y
percpu-rwsem`

Fixes: 4f701063bfa2 ("perf lock contention: Show lock type with address")
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Cc: nick.forrington@arm.com
Link: https://lore.kernel.org/r/20250116235838.2769691-2-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-17 10:12:40 -08:00
Ian Rogers
4e38f2814f perf test: Improve verbose documentation
Add a little more detail on the output expectations for each verbose
level.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-16 11:01:03 -08:00
Ian Rogers
1c0d9816e9 perf test: Add a runs-per-test flag
To detect flakes it is useful to run tests more than once. Add a
runs-per-test flag that will run each test multiple times. Example
output:

```
$ perf test -r 3 lbr -v
122: perf record LBR tests                                           : Ok
122: perf record LBR tests                                           : Ok
122: perf record LBR tests                                           : Ok
```

Update the documentation for the runs-per-test option.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-16 11:01:03 -08:00
Ian Rogers
4dd8bc4bf5 perf test: Fix parallel/sequential option documentation
The parallel option was removed in commit 94d1a913bdc4 ("perf test:
Make parallel testing the default"). Update the sequential
documentation to reflect it isn't the default except for "exclusive"
tests.

Fixes: 94d1a913bdc4 ("perf test: Make parallel testing the default")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-01-16 11:01:03 -08:00
James Clark
ba113ecad8 perf docs: arm_spe: Document new discard mode
Document the flag along with PMU events to hint what it's used for and
give an example with other useful options to get minimal output.

Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250108142904.401139-3-james.clark@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-10 14:50:55 +00:00