Bug 2172568 - Perf has difficulty producing accurate function backtraces for nbdkit
Summary: Perf has difficulty producing accurate function backtraces for nbdkit
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel-tools
Version: 39
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Justin M. Forbes
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-22 15:05 UTC by William Cohen
Modified: 2023-08-16 08:14 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
The FP and DWARF flamegraphs and output of perf report (457.93 KB, application/gzip)
2023-02-22 15:05 UTC, William Cohen
New flamegraphs on rawhide machine. (644.36 KB, application/gzip)
2023-03-02 17:15 UTC, William Cohen

Description William Cohen 2023-02-22 15:05:17 UTC
Created attachment 1945746 [details]
The FP and DWARF flamegraphs and output of perf report

A couple of weeks ago I started talking with Richard WM Jones about his
experiences using the DWARF-based call-graph support for debugging a
qemu issue mentioned at:

https://listman.redhat.com/archives/libguestfs/2023-February/030581.html

With frame pointers now enabled in Fedora rawhide it is easier to
compare results between perf data generated with FP and with DWARF
unwinding.  There were some issues he noticed in the resulting flame
graphs:

-Fewer levels of calls in the DWARF version
-"[unknown]" and binary names as labels rather than actual function names in the DWARF version
-Detached frames

The smaller number of call levels may be due to the limited size of the
stack snapshot (8KB by default).  I initially thought that the DWARF
unwinding might be getting the unwinding wrong, but I also see
"[unknown]" and binary names in the resulting flamegraphs rather than
function names.
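
If the 8KB default is the limiting factor, a hedged follow-up would be
to raise the per-sample user stack dump size (perf caps it at 65528
bytes) and repeat the DWARF collection, for example:

perf record --call-graph=dwarf,65528 --output=qemu_dwarf_big_perf.data <same nbdkit/qemu-img command as below>

(The output file name here is only illustrative; the larger dumps make
the perf.data file correspondingly bigger.)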

System setup:

sudo dnf install -y perf flamegraph flamegraph-stackcollapse
sudo dnf install -y qemu-img nbdkit nbdkit-curl-plugin
sudo dnf debuginfo-install -y qemu-img nbdkit nbdkit-curl-plugin
rpm -q kernel perf flamegraph flamegraph-stackcollapse qemu-img nbdkit nbdkit-curl-plugin

Current machine setup has:

kernel-6.2.0-0.rc8.20230217gitec35307e18ba.60.fc39.x86_64
perf-6.2.0-0.rc8.git0.1.fc39.x86_64
flamegraph-1.0-13.20220917gitd9fcc27.fc38.noarch
flamegraph-stackcollapse-1.0-13.20220917gitd9fcc27.fc38.noarch
qemu-img-7.2.0-6.fc38.x86_64
nbdkit-1.33.8-1.fc38.x86_64
nbdkit-curl-plugin-1.33.8-1.fc38.x86_64


It takes on the order of an hour to collect the data for the FP backtraces because of the slow data link:

perf record --call-graph=fp --output=qemu_fp_perf.data  nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=qemu_fp_perf.data | stackcollapse-perf.pl > out.qemu_fp-folded
flamegraph.pl out.qemu_fp-folded > qemu_fp.svg
perf report --input qemu_fp_perf.data  > qemu_fp_perf.out


Similar data collection for the DWARF backtraces:

perf record --call-graph=dwarf --output=qemu_dwarf_perf.data  nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=qemu_dwarf_perf.data | stackcollapse-perf.pl > out.qemu_dwarf-folded
flamegraph.pl out.qemu_dwarf-folded > qemu_dwarf.svg
perf report --input qemu_dwarf_perf.data  > qemu_dwarf_perf.out

When looking through the qemu_*_perf.out files from perf report, it
looks like both stack backtrace methods (FP and DWARF) have issues
with this particular use case.

One of the things that might be confusing the FP unwinding is
qemu-img's use of coroutines and stack switching.  There are some
"[unknown]" entries below __start_context of the qemu-img binary on
the FP flame graph.

On both the FP and DWARF flame graphs, [nbdkit] and [libcurl.so.4.8.0]
are listed rather than actual function names.

The *.out and *.svg files are in nbdkit_perf_data.tar.gz.  The original perf record files are too large to attach to the bug.
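
If smaller perf.data files were needed for sharing, one hedged option
would be to lower the sampling rate at record time (fewer samples,
smaller file), for example:

perf record -F 99 --call-graph=fp --output=qemu_fp_perf.data <same nbdkit/qemu-img command as above>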

Comment 1 William Cohen 2023-03-02 17:14:27 UTC
Most of the unknown entries were caused by missing debuginfo on the
machine.  I reviewed which binaries had missing debuginfo, installed
it, and reran the experiments.  That eliminated many of the [unknown]
entries in the data.  However, there are still some areas where the
data is lacking.  The updated flamegraphs are in 20230301.tar.gz,
added to this bug.
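
One hedged way to list which DSOs actually had samples (and therefore
need debuginfo) is perf buildid-list against the earlier data file,
for example:

perf buildid-list --with-hits --input=qemu_fp_perf.data

Any binary listed there that still shows [unknown] frames in perf
report is a candidate for dnf debuginfo-install.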

Summary of the places where perf FP unwinding had issues:
-PLT entries
-libc syscalls and vdso code (also saw this issue with DWARF unwinding)
-libc exit() function not listed in the backtrace
-qemu coroutine code

PLT entries

When I started to look up which addresses were problematic, I did the
following to group things by address:

grep unknown qemu_fp_perf_script.out |sort | uniq -c |sort -k 2|more > qemu_fp_perf_unknowns.out

When looking through the data, I would see sections like the following in the output:

     93 	           12070 [unknown] (/usr/lib64/libcurl.so.4.8.0)
     51 	           120a0 [unknown] (/usr/lib64/libcurl.so.4.8.0)
      5 	           120a4 [unknown] (/usr/lib64/libcurl.so.4.8.0)
      7 	           120d0 [unknown] (/usr/lib64/libcurl.so.4.8.0)
      3 	           120d4 [unknown] (/usr/lib64/libcurl.so.4.8.0)
      1 	           12100 [unknown] (/usr/lib64/libcurl.so.4.8.0)

After verifying that the libcurl debuginfo was installed, I found that these addresses mapped back to PLT entries like:

0000000000012070 <SSL_get_error@plt>:
   12070:	f3 0f 1e fa          	endbr64
   12074:	ff 25 26 6f 09 00    	jmp    *0x96f26(%rip)        # a8fa0 <SSL_get_error.0>
   1207a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
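
For reference, that disassembly can be reproduced with objdump using
the offsets from the samples above, for example:

objdump -d --start-address=0x12070 --stop-address=0x12080 /usr/lib64/libcurl.so.4.8.0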

Looking at the original "perf script" output, I would see things like:

nbdkit  2274 10327.200181:     264642 cycles:u: 
	           12070 [unknown] (/usr/lib64/libcurl.so.4.8.0)
	           75c1e ssl_cf_recv.lto_priv.0+0x2e (/usr/lib64/libcurl.so.4.8.0)
	           5b8ae Curl_read+0x5e (/usr/lib64/libcurl.so.4.8.0)
	           6e99a Curl_readwrite+0x7ea (/usr/lib64/libcurl.so.4.8.0)
	           52ceb multi_runsingle+0xa2b (/usr/lib64/libcurl.so.4.8.0)
	           5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
	           290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
	            4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
	            9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	            d566 filter_pread+0x46 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	           11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
	           12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
	           14275 start_thread+0x65 (/usr/sbin/nbdkit)
	           8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
	          11307c clone3+0x2c (/usr/lib64/libc.so.6)



libc syscalls and vdso code

There look to be cases where the samples were taken right after
entering the kernel (maybe an effect of sampling skid).  In the sorted
output of qemu_fp_perf_unknowns.out I see the following entries, which
look like kernel addresses but don't actually seem to map to actual
kernel functions:

    760 	ffffffffb0000000 [unknown] ([unknown])
      6 	ffffffffb0000290 [unknown] ([unknown])
   1848 	ffffffffb00002c0 [unknown] ([unknown])
   8799 	ffffffffb00002c8 [unknown] ([unknown])
     14 	ffffffffb0000b60 [unknown] ([unknown])
    101 	ffffffffb0000e70 [unknown] ([unknown])

Searching through the "perf script" output, I see ffffffffb0000000 in entries like:

nbdkit  2274  8729.808538:      99918 cycles:u: 
	ffffffffb0000000 [unknown] ([unknown])
	             a49 __vdso_clock_gettime+0xc9 ([vdso])
	    7f576c001280 [unknown] ([unknown])
	    7f5771330750 [unknown] ([unknown])

and 

nbdkit  2274  8781.579889:      48946 cycles:u: 
	ffffffffb0000000 [unknown] ([unknown])
	          105a9d __poll+0x4d (/usr/lib64/libc.so.6)
	           58ee8 Curl_poll+0xa8 (/usr/lib64/libcurl.so.4.8.0)
	           4dc2a multi_wait.part.0+0x4aa (/usr/lib64/libcurl.so.4.8.0)
	    7f5771330668 [unknown] ([unknown])
	    7f576c001280 [unknown] ([unknown])

There are also cases where the FP unwinder was able to unwind
successfully from the same __vdso_clock_gettime user-space address:

nbdkit  2274  8881.165087:     421448 cycles:u: 
	ffffffffb0000000 [unknown] ([unknown])
	             a49 __vdso_clock_gettime+0xc9 ([vdso])
	           d6b0d clock_gettime@@GLIBC_2.17+0x1d (/usr/lib64/libc.so.6)
	           69ff9 allocate_conn.lto_priv.0+0xc9 (/usr/lib64/libcurl.so.4.8.0)
	           5295d multi_runsingle+0x69d (/usr/lib64/libcurl.so.4.8.0)
	           5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
	           290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
	            4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
	            9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	            d566 filter_pread+0x46 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	           11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
	           12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
	           14275 start_thread+0x65 (/usr/sbin/nbdkit)
	           8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
	          11307c clone3+0x2c (/usr/lib64/libc.so.6)

nbdkit  2274  8725.256890:     102341 cycles:u: 
	ffffffffb00002c0 [unknown] ([unknown])
	          105a9d __poll+0x4d (/usr/lib64/libc.so.6)
	           58ee8 Curl_poll+0xa8 (/usr/lib64/libcurl.so.4.8.0)
	           58fe7 Curl_socket_check+0x97 (/usr/lib64/libcurl.so.4.8.0)
	           6e99a Curl_readwrite+0x7ea (/usr/lib64/libcurl.so.4.8.0)
	           52ceb multi_runsingle+0xa2b (/usr/lib64/libcurl.so.4.8.0)
	           5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
	           290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
	            4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
	            9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	            d566 filter_pread+0x46 (/usr/sbin/nbdkit)
	            d40b backend_pread+0xfb (/usr/sbin/nbdkit)
	           11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
	           12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
	           14275 start_thread+0x65 (/usr/sbin/nbdkit)
	           8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
	          11307c clone3+0x2c (/usr/lib64/libc.so.6)
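
One hedged cross-check for the ffffffffb* addresses above would be to
compare them against the kernel's own symbol table (needs root, and
kptr_restrict relaxed enough to show real addresses):

sudo grep -w -e _stext -e entry_SYSCALL_64 /proc/kallsyms

If ffffffffb0000000 lines up with the start of kernel text, these
frames are kernel addresses sampled just after the syscall entry
rather than corrupted user-space values.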


libc exit() function not listed in backtrace

See samples like the following:

nbdkit  2270 11971.254333:     436407 cycles:u: 
	           9d735 cfree.5+0x25 (/usr/lib64/libc.so.6)
	          15cf18 OSSL_DECODER_free+0x48 (/usr/lib64/libcrypto.so.3.0.8)
	          1dd8bc impl_free+0x1c (/usr/lib64/libcrypto.so.3.0.8)
	          205b09 OPENSSL_sk_pop_free+0x39 (/usr/lib64/libcrypto.so.3.0.8)
	          1e6c3f alg_cleanup+0x2f (/usr/lib64/libcrypto.so.3.0.8)
	          1c0f54 sa_doall.lto_priv.0+0x134 (/usr/lib64/libcrypto.so.3.0.8)
	          1e6d21 ossl_method_store_free+0x31 (/usr/lib64/libcrypto.so.3.0.8)
	          1b9f0b CRYPTO_free_ex_data+0x15b (/usr/lib64/libcrypto.so.3.0.8)
	          1ba0a6 context_deinit.part.0+0x66 (/usr/lib64/libcrypto.so.3.0.8)
	          1bcc52 OPENSSL_cleanup+0x212 (/usr/lib64/libcrypto.so.3.0.8)
	           40246 __run_exit_handlers+0x166 (/usr/lib64/libc.so.6)
	           4038e [unknown] (/usr/lib64/libc.so.6)
	           27b51 __libc_start_call_main+0x81 (/usr/lib64/libc.so.6)
	           27c0b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
	            94d5 _start+0x25 (/usr/sbin/nbdkit)

This looks to be more an issue with the libc exit function: the call
to __run_exit_handlers is the last instruction in the region marked as
belonging to exit, so the (unused) return address on the stack falls
outside the exit function and is therefore listed as unknown.
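
A hedged way to confirm this locally is to disassemble exit from the
installed libc (the glibc debuginfo is already installed here) and
check that the call to __run_exit_handlers really is the final
instruction in the function:

gdb -batch -ex 'disassemble exit' /usr/lib64/libc.so.6 | tail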


qemu coroutine code

The qemu code uses setjmp and longjmp operations to implement its
coroutines.  In some cases I end up with backtraces like the
following, where FP unwinding gets confused:

qemu-img  2271  8724.707399:     121182 cycles:u: 
	           a9cff nbd_reply_chunk_iter_receive+0xbf (/usr/bin/qemu-img)
	           ab182 nbd_client_co_preadv+0x232 (/usr/bin/qemu-img)
	           9f466 bdrv_driver_preadv+0x106 (/usr/bin/qemu-img)
	           a2828 bdrv_aligned_preadv+0x348 (/usr/bin/qemu-img)
	           a2f5c bdrv_co_preadv_part+0x17c (/usr/bin/qemu-img)
	           c2f87 qcow2_co_preadv_task_entry+0x177 (/usr/bin/qemu-img)
	           c3ab8 qcow2_co_preadv_part+0x278 (/usr/bin/qemu-img)
	           a2828 bdrv_aligned_preadv+0x348 (/usr/bin/qemu-img)
	           a2f5c bdrv_co_preadv_part+0x17c (/usr/bin/qemu-img)
	           93313 blk_co_do_preadv_part+0x83 (/usr/bin/qemu-img)
	           9349c blk_co_pread+0x5c (/usr/bin/qemu-img)
	           59a4e convert_co_do_copy+0x47e (/usr/bin/qemu-img)
	          15da82 coroutine_trampoline+0x52 (/usr/bin/qemu-img)
	           55090 __start_context+0x0 (/usr/lib64/libc.so.6)
	    560a5ee22337 [unknown] ([unknown])
	               0 [unknown] ([unknown])

This looks to be one thing that the DWARF unwinding handles better: it
doesn't attempt to unwind past that point, and I don't see any
"[unknown]" entries under coroutine_trampoline in qemu_dwarf.svg.

Comment 2 William Cohen 2023-03-02 17:15:59 UTC
Created attachment 1947526 [details]
New flamegraphs on rawhide machine.

Comment 3 William Cohen 2023-03-22 20:08:55 UTC
Looking at the entry_SYSCALL_64 code
(https://elixir.bootlin.com/linux/v6.3-rc3/source/arch/x86/entry/entry_64.S#L87)
there appears to be a window where the pt_regs might not hold the
current values of the registers until after PUSH_AND_CLEAR_REGS.  Is
it possible that the perf sample is taken in the early part of
entry_SYSCALL_64, and the perf sampling code observes that the sample
is currently in the kernel and assumes that the pt_regs are valid when
they are not?  There is a comment about "IRQs are off" in
entry_SYSCALL_64.  Is perf using NMI?
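
One hedged way to check that last question on this machine: on x86 the
PMU overflow interrupt is normally delivered as an NMI, so the NMI and
PMI (performance monitoring interrupt) counters in /proc/interrupts
should climb while a perf record run is in progress:

grep -E 'NMI|PMI' /proc/interrupts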

Comment 4 William Cohen 2023-03-23 17:58:32 UTC
I have thought about other possible causes of the problem and
considered whether this might be a difference between perf collecting
data in a virtualized environment vs. a non-virtualized one.  The
initial experiments were on a qemu guest VM with a Fedora 37 host.
I ran the experiments on the same hardware (a Lenovo P53 laptop with
an Intel(R) Xeon(R) E-2276M CPU @ 2.80GHz) with updated RPMs:

On the Fedora 37 host:
$ rpm -q kernel qemu
kernel-6.2.7-200.fc37.x86_64
qemu-7.0.0-14.fc37.x86_64

On the rawhide guest:

kernel-6.3.0-0.rc3.20230322gita1effab7a3a3.31.fc39.x86_64

perf record --call-graph=fp --output=guest_qemu_fp_perf.data  nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=guest_qemu_fp_perf.data | stackcollapse-perf.pl > out.guest_qemu_fp-folded
flamegraph.pl out.guest_qemu_fp-folded > guest_qemu_fp.svg
perf report --input guest_qemu_fp_perf.data  > guest_qemu_fp_perf.out


I tried the same experiment on a physical machine, an Intel(R) Xeon(R)
CPU E7540, running the rawhide kernel

kernel-6.3.0-0.rc3.20230322gita1effab7a3a3.31.fc39.x86_64

perf record --call-graph=fp --output=raw_qemu_fp_perf.data  nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=raw_qemu_fp_perf.data | stackcollapse-perf.pl > out.raw_qemu_fp-folded
flamegraph.pl out.raw_qemu_fp-folded > raw_qemu_fp.svg
perf report --input raw_qemu_fp_perf.data  > raw_qemu_fp_perf.out


The data taken on the non-virtualized machine has saner call stacks
where __libc_recv and __poll are involved, as shown in the SVG files
in 20230323_data.tar.gz.
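
One additional hedged check when comparing the guest and bare-metal
runs is which PMU driver each kernel detected at boot, since the
virtualized PMU that KVM exposes to a guest differs from the
bare-metal one:

sudo dmesg | grep -i 'performance events'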

Comment 6 Fedora Release Engineering 2023-08-16 08:14:47 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

