Created attachment 1945746 [details]
The FP and DWARF flamegraphs and output of perf report

A couple of weeks ago I started talking with Richard WM Jones about his experiences using the DWARF-based call-graph support for debugging a qemu issue mentioned at:

https://listman.redhat.com/archives/libguestfs/2023-February/030581.html

With the current enablement of frame pointers in Fedora rawhide it is a bit easier to compare the results between perf data generated from FP and from DWARF unwinding. There were some issues he noticed on the resulting flame graphs:

- Fewer levels of calls in the DWARF version
- "[unknown]" and binary names labelling frames rather than actual functions in the DWARF version
- Detached frames

The smaller number of call levels may be due to the limited size of the stack snapshot (8KB by default). I initially thought that the DWARF unwinding might simply be getting the unwinding wrong, but I also see "[unknown]" and binary names rather than function names in the resulting flame graphs.

System setup:

sudo dnf install -y perf flamegraph flamegraph-stackcollapse
sudo dnf install -y qemu-img nbdkit nbdkit-curl-plugin
sudo dnf debuginfo-install -y qemu-img nbdkit nbdkit-curl-plugin
rpm -q kernel perf flamegraph flamegraph-stackcollapse qemu-img nbdkit nbdkit-curl-plugin

The current machine setup has:

kernel-6.2.0-0.rc8.20230217gitec35307e18ba.60.fc39.x86_64
perf-6.2.0-0.rc8.git0.1.fc39.x86_64
flamegraph-1.0-13.20220917gitd9fcc27.fc38.noarch
flamegraph-stackcollapse-1.0-13.20220917gitd9fcc27.fc38.noarch
qemu-img-7.2.0-6.fc38.x86_64
nbdkit-1.33.8-1.fc38.x86_64
nbdkit-curl-plugin-1.33.8-1.fc38.x86_64

It takes on the order of an hour to collect data because of the slow data link. Generating the FP backtraces:

perf record --call-graph=fp --output=qemu_fp_perf.data nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=qemu_fp_perf.data | stackcollapse-perf.pl > out.qemu_fp-folded
flamegraph.pl out.qemu_fp-folded > qemu_fp.svg
perf report --input qemu_fp_perf.data > qemu_fp_perf.out

Similar data collection for the DWARF backtraces:

perf record --call-graph=dwarf --output=qemu_dwarf_perf.data nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=qemu_dwarf_perf.data | stackcollapse-perf.pl > out.qemu_dwarf-folded
flamegraph.pl out.qemu_dwarf-folded > qemu_dwarf.svg
perf report --input qemu_dwarf_perf.data > qemu_dwarf_perf.out

When looking through the qemu_*_perf.out files from perf report, it looks like both stack backtrace methods (FP and DWARF) are having issues with this particular use case. One of the things that might be confusing the FP unwinding is qemu-img's use of coroutines and stack switching. There are some "[unknown]" entries below __start_context of the qemu-img binary on the FP flame graph. On both the FP and DWARF flame graphs I see [nbdkit] and [libcurl.so.4.8.0] listed rather than actual function names.

The *.out and *.svg files are in nbdkit_perf_data.tar.gz. The original perf record files are too large to attach to the bug.
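One knob that might help with the missing call levels in the DWARF flame graph is the per-sample user stack dump size: perf copies 8192 bytes by default and allows up to 65528. A minimal sketch of re-collecting with a larger snapshot (the output file name here is just an example; the nbdkit/qemu-img command is the same one as above):

perf record --call-graph=dwarf,65528 --output=qemu_dwarf_bigstack.data \
    nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img \
    --filter=multi-conn multi-conn-mode=unsafe timeout=2000 \
    --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '

This would at least rule the 8KB limit in or out as the cause of the truncated DWARF stacks; I have not run this variant here.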
Most of the unknown entries were caused by missing debuginfo on the machine. I reviewed which binaries had missing debuginfo, installed it, and reran the experiments. That eliminated many of the [unknown] entries in the data. However, there are still some areas where the data is lacking. The updated flamegraphs are in 20230301.tar.gz added to this bug.

Summary of the places where perf FP unwinding had issues:

- PLT entries
- libc syscalls and vdso code (also saw this issue with dwarf unwinding)
- libc exit() function not listed in the backtrace
- qemu coroutine code

PLT entries

To look up which addresses were problematic, I did the following to group things by address:

grep unknown qemu_fp_perf_script.out | sort | uniq -c | sort -k 2 > qemu_fp_perf_unknowns.out

When looking through that data I would see sections like the following in the output:

   93 12070 [unknown] (/usr/lib64/libcurl.so.4.8.0)
   51 120a0 [unknown] (/usr/lib64/libcurl.so.4.8.0)
    5 120a4 [unknown] (/usr/lib64/libcurl.so.4.8.0)
    7 120d0 [unknown] (/usr/lib64/libcurl.so.4.8.0)
    3 120d4 [unknown] (/usr/lib64/libcurl.so.4.8.0)
    1 12100 [unknown] (/usr/lib64/libcurl.so.4.8.0)

After verifying that the libcurl debuginfo was installed, I found that these addresses map back to PLT entries like:

0000000000012070 <SSL_get_error@plt>:
   12070:  f3 0f 1e fa           endbr64
   12074:  ff 25 26 6f 09 00     jmp    *0x96f26(%rip)        # a8fa0 <SSL_get_error.0>
   1207a:  66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)

Looking at the original "perf script" output I would see things like:

nbdkit 2274 10327.200181: 264642 cycles:u:
        12070 [unknown] (/usr/lib64/libcurl.so.4.8.0)
        75c1e ssl_cf_recv.lto_priv.0+0x2e (/usr/lib64/libcurl.so.4.8.0)
        5b8ae Curl_read+0x5e (/usr/lib64/libcurl.so.4.8.0)
        6e99a Curl_readwrite+0x7ea (/usr/lib64/libcurl.so.4.8.0)
        52ceb multi_runsingle+0xa2b (/usr/lib64/libcurl.so.4.8.0)
        5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
        290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
        4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
        9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        d566 filter_pread+0x46 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
        12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
        14275 start_thread+0x65 (/usr/sbin/nbdkit)
        8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
        11307c clone3+0x2c (/usr/lib64/libc.so.6)

libc syscalls and vdso code

There look to be cases where the samples were taken right after entering the kernel (maybe an effect of sampling skid).
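For reference, a quick way to confirm that one of these leaf addresses lands in the PLT is to disassemble just that range of the library (shown here for the 0x12070 offset from above; this assumes binutils objdump is installed):

objdump -d /usr/lib64/libcurl.so.4.8.0 --start-address=0x12070 --stop-address=0x12080

PLT stubs have no frame setup of their own, so a sample taken inside one gives the FP unwinder nothing to symbolize at the leaf, which matches the [unknown] entries above.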
In the sorted output of qemu_fp_perf_unknowns.out I see the following entries that look like kernel addresses, but don't actually seem to map to kernel functions:

    760 ffffffffb0000000 [unknown] ([unknown])
      6 ffffffffb0000290 [unknown] ([unknown])
   1848 ffffffffb00002c0 [unknown] ([unknown])
   8799 ffffffffb00002c8 [unknown] ([unknown])
     14 ffffffffb0000b60 [unknown] ([unknown])
    101 ffffffffb0000e70 [unknown] ([unknown])

Searching through the "perf script" output, ffffffffb0000000 shows up in entries like:

nbdkit 2274 8729.808538: 99918 cycles:u:
        ffffffffb0000000 [unknown] ([unknown])
        a49 __vdso_clock_gettime+0xc9 ([vdso])
        7f576c001280 [unknown] ([unknown])
        7f5771330750 [unknown] ([unknown])

and

nbdkit 2274 8781.579889: 48946 cycles:u:
        ffffffffb0000000 [unknown] ([unknown])
        105a9d __poll+0x4d (/usr/lib64/libc.so.6)
        58ee8 Curl_poll+0xa8 (/usr/lib64/libcurl.so.4.8.0)
        4dc2a multi_wait.part.0+0x4aa (/usr/lib64/libcurl.so.4.8.0)
        7f5771330668 [unknown] ([unknown])
        7f576c001280 [unknown] ([unknown])

There are also cases, for example with __vdso_clock_gettime and __poll, where the FP unwinder was able to unwind successfully from the same user-space addresses:

nbdkit 2274 8881.165087: 421448 cycles:u:
        ffffffffb0000000 [unknown] ([unknown])
        a49 __vdso_clock_gettime+0xc9 ([vdso])
        d6b0d clock_gettime@@GLIBC_2.17+0x1d (/usr/lib64/libc.so.6)
        69ff9 allocate_conn.lto_priv.0+0xc9 (/usr/lib64/libcurl.so.4.8.0)
        5295d multi_runsingle+0x69d (/usr/lib64/libcurl.so.4.8.0)
        5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
        290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
        4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
        9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        d566 filter_pread+0x46 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
        12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
        14275 start_thread+0x65 (/usr/sbin/nbdkit)
        8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
        11307c clone3+0x2c (/usr/lib64/libc.so.6)

nbdkit 2274 8725.256890: 102341 cycles:u:
        ffffffffb00002c0 [unknown] ([unknown])
        105a9d __poll+0x4d (/usr/lib64/libc.so.6)
        58ee8 Curl_poll+0xa8 (/usr/lib64/libcurl.so.4.8.0)
        58fe7 Curl_socket_check+0x97 (/usr/lib64/libcurl.so.4.8.0)
        6e99a Curl_readwrite+0x7ea (/usr/lib64/libcurl.so.4.8.0)
        52ceb multi_runsingle+0xa2b (/usr/lib64/libcurl.so.4.8.0)
        5572e curl_multi_perform+0xde (/usr/lib64/libcurl.so.4.8.0)
        290bb curl_easy_perform+0x19b (/usr/lib64/libcurl.so.4.8.0)
        4b71 curl_pread+0xc1 (/usr/lib64/nbdkit/plugins/nbdkit-curl-plugin.so)
        9e46 plugin_pread+0x66 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        d566 filter_pread+0x46 (/usr/sbin/nbdkit)
        d40b backend_pread+0xfb (/usr/sbin/nbdkit)
        11fe7 protocol_recv_request_send_reply+0x987 (/usr/sbin/nbdkit)
        12fc3 handle_single_connection+0x443 (/usr/sbin/nbdkit)
        14275 start_thread+0x65 (/usr/sbin/nbdkit)
        8cef7 start_thread+0x2f7 (/usr/lib64/libc.so.6)
        11307c clone3+0x2c (/usr/lib64/libc.so.6)
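As a sanity check on these kernel-looking addresses (not done above, just a suggestion), it may be worth comparing them against what the running kernel actually maps there, e.g. the start of kernel text and the syscall entry point; depending on kptr_restrict, root may be needed to see non-zero addresses:

sudo grep -E ' (_text|entry_SYSCALL_64)$' /proc/kallsyms

If ffffffffb0000000 matches the _text address, that would suggest these samples land at or near the kernel entry code before any user-visible symbol, consistent with the skid theory.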
libc exit() function not listed in the backtrace

I see samples like the following:

nbdkit 2270 11971.254333: 436407 cycles:u:
        9d735 cfree.5+0x25 (/usr/lib64/libc.so.6)
        15cf18 OSSL_DECODER_free+0x48 (/usr/lib64/libcrypto.so.3.0.8)
        1dd8bc impl_free+0x1c (/usr/lib64/libcrypto.so.3.0.8)
        205b09 OPENSSL_sk_pop_free+0x39 (/usr/lib64/libcrypto.so.3.0.8)
        1e6c3f alg_cleanup+0x2f (/usr/lib64/libcrypto.so.3.0.8)
        1c0f54 sa_doall.lto_priv.0+0x134 (/usr/lib64/libcrypto.so.3.0.8)
        1e6d21 ossl_method_store_free+0x31 (/usr/lib64/libcrypto.so.3.0.8)
        1b9f0b CRYPTO_free_ex_data+0x15b (/usr/lib64/libcrypto.so.3.0.8)
        1ba0a6 context_deinit.part.0+0x66 (/usr/lib64/libcrypto.so.3.0.8)
        1bcc52 OPENSSL_cleanup+0x212 (/usr/lib64/libcrypto.so.3.0.8)
        40246 __run_exit_handlers+0x166 (/usr/lib64/libc.so.6)
        4038e [unknown] (/usr/lib64/libc.so.6)
        27b51 __libc_start_call_main+0x81 (/usr/lib64/libc.so.6)
        27c0b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
        94d5 _start+0x25 (/usr/sbin/nbdkit)

This looks to be more of an issue with the libc exit function: the call to __run_exit_handlers is the last instruction in the region marked as belonging to exit, so the (unused) return address on the stack points just past the end of the exit function. Thus it is listed as unknown.

qemu coroutine code

The qemu code uses setjmp and longjmp operations to implement its coroutines. In some cases we end up with backtraces like the following, where FP unwinding gets confused:

qemu-img 2271 8724.707399: 121182 cycles:u:
        a9cff nbd_reply_chunk_iter_receive+0xbf (/usr/bin/qemu-img)
        ab182 nbd_client_co_preadv+0x232 (/usr/bin/qemu-img)
        9f466 bdrv_driver_preadv+0x106 (/usr/bin/qemu-img)
        a2828 bdrv_aligned_preadv+0x348 (/usr/bin/qemu-img)
        a2f5c bdrv_co_preadv_part+0x17c (/usr/bin/qemu-img)
        c2f87 qcow2_co_preadv_task_entry+0x177 (/usr/bin/qemu-img)
        c3ab8 qcow2_co_preadv_part+0x278 (/usr/bin/qemu-img)
        a2828 bdrv_aligned_preadv+0x348 (/usr/bin/qemu-img)
        a2f5c bdrv_co_preadv_part+0x17c (/usr/bin/qemu-img)
        93313 blk_co_do_preadv_part+0x83 (/usr/bin/qemu-img)
        9349c blk_co_pread+0x5c (/usr/bin/qemu-img)
        59a4e convert_co_do_copy+0x47e (/usr/bin/qemu-img)
        15da82 coroutine_trampoline+0x52 (/usr/bin/qemu-img)
        55090 __start_context+0x0 (/usr/lib64/libc.so.6)
        560a5ee22337 [unknown] ([unknown])
        0 [unknown] ([unknown])

This looks to be one thing that the DWARF unwinding handles better: it doesn't attempt to unwind past that point, and I don't see any "[unknown]" entries under coroutine_trampoline on qemu_dwarf.svg.
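One way to sanity-check the exit() explanation (a hedged suggestion, not something done above) is to disassemble exit from the installed glibc and confirm that the call to __run_exit_handlers is the final instruction before the next symbol starts, so the pushed return address falls outside the symbol's range:

gdb -batch -ex 'disassemble exit' /usr/lib64/libc.so.6

The end address printed by gdb can then be compared against the 4038e offset shown as [unknown] in the backtrace above.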
Created attachment 1947526 [details] New flamegraphs on rawhide machine.
Looking at the entry_SYSCALL_64 code (https://elixir.bootlin.com/linux/v6.3-rc3/source/arch/x86/entry/entry_64.S#L87), there appears to be a window where pt_regs does not yet hold the current values of the registers until after PUSH_AND_CLEAR_REGS. Is it possible that the perf sample occurs in the early part of entry_SYSCALL_64, and the perf sampling code, observing that the sample is currently in the kernel, assumes that pt_regs is valid when it actually is not? There is a comment about "IRQs are off" in entry_SYSCALL_64. Is perf using NMIs?
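On the NMI question: my understanding is that on x86 the PMU overflow interrupt that drives perf sampling is delivered as an NMI, so it can fire even in the "IRQs are off" window. A rough way to see this on a running system (the scratch file name is just an example; perf record -a may need root or a relaxed perf_event_paranoid):

grep -E 'NMI:|PMI:' /proc/interrupts          # note current NMI / performance monitoring interrupt counts
perf record -a -o /tmp/nmi_check.data -- sleep 1
grep -E 'NMI:|PMI:' /proc/interrupts          # the counts should have increased while sampling
rm -f /tmp/nmi_check.data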
I have thought about other possible causes of the problem and considered whether this might be a difference between perf collecting data in a virtualized environment versus a non-virtualized one. The initial experiments were on a qemu guest VM running on a Fedora 37 host. I reran the experiments on the same hardware (a Lenovo P53 laptop with an Intel(R) Xeon(R) E-2276M CPU @ 2.80GHz) with updated RPMs.

On the Fedora 37 host:

$ rpm -q kernel qemu
kernel-6.2.7-200.fc37.x86_64
qemu-7.0.0-14.fc37.x86_64

On the rawhide guest:

kernel-6.3.0-0.rc3.20230322gita1effab7a3a3.31.fc39.x86_64

perf record --call-graph=fp --output=guest_qemu_fp_perf.data nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=guest_qemu_fp_perf.data | stackcollapse-perf.pl > out.guest_qemu_fp-folded
flamegraph.pl out.guest_qemu_fp-folded > guest_qemu_fp.svg
perf report --input guest_qemu_fp_perf.data > guest_qemu_fp_perf.out

I tried the same experiment on a physical machine, an Intel(R) Xeon(R) CPU E7540, running the rawhide kernel kernel-6.3.0-0.rc3.20230322gita1effab7a3a3.31.fc39.x86_64:

perf record --call-graph=fp --output=raw_qemu_fp_perf.data nbdkit -r curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --filter=multi-conn multi-conn-mode=unsafe timeout=2000 --run ' qemu-img convert -W -m 16 -p -f qcow2 $uri -O raw jammy-server-cloudimg-amd64.img.raw '
perf script --input=raw_qemu_fp_perf.data | stackcollapse-perf.pl > out.raw_qemu_fp-folded
flamegraph.pl out.raw_qemu_fp-folded > raw_qemu_fp.svg
perf report --input raw_qemu_fp_perf.data > raw_qemu_fp_perf.out

The data taken on the non-virtualized machine has saner call stacks where __libc_recv and __poll are involved, as shown in the svg files in 20230323_data.tar.gz.
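To help separate the virtualization variable a bit further, a small hedged check is whether the guest is actually counting hardware cycles rather than silently falling back to a software clock event, since a missing or paravirtualized PMU would change what perf can sample:

# Inside the guest: confirm hardware 'cycles' events are counted.
perf stat -e cycles,instructions -- sleep 1
# Events reported as "<not supported>" would indicate the guest has no usable vPMU.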
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle. Changing version to 39.