Bug 1567356 - RHEL7.5 - backtraces no longer work with systemtap but worked in RHEL7.4 - adding 'nokaslr' on kernel commandline makes them work again
Summary: RHEL7.5 - backtraces no longer work with systemtap but worked in RHEL7.4 - ad...
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemtap   
(Show other bugs)
Version: 7.5
Hardware: Unspecified
OS: Unspecified
urgent
unspecified
Target Milestone: rc
: ---
Assignee: Frank Ch. Eigler
QA Contact: Martin Cermak
Vladimír Slávik
URL:
Whiteboard:
Keywords: Regression, ZStream
Depends On:
Blocks: 1572501
TreeView+ depends on / blocked
 
Reported: 2018-04-13 20:46 UTC by Dave Wysochanski
Modified: 2018-10-30 10:47 UTC (History)
11 users (show)

Fixed In Version: systemtap-3.2-8.el7.src.rpm
Doc Type: Bug Fix
Doc Text:
See 1572501 for z-stream doc text.
Story Points: ---
Clone Of:
: 1572501 (view as bug list)
Environment:
Last Closed: 2018-10-30 10:46:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:3168 None None None 2018-10-30 10:47 UTC
Red Hat Knowledge Base (Solution) 3489011 None None None 2018-06-18 14:40 UTC

Description Dave Wysochanski 2018-04-13 20:46:11 UTC
Description of problem:
Simple stap to print or save a backtrace no longer works.


Version-Release number of selected component (if applicable):
3.10.0-862.el7.x86_64


How reproducible:
everytime


Steps to Reproduce:
stap -v --all-modules -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }' &

Actual results:
RHEL7u5:
[root@rhel7u5-node2 1487722]# stap -v --all-modules -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }'
Pass 1: parsed user script and 472 library scripts using 240080virt/42004res/3456shr/38628data kb, in 750usr/80sys/924real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 281220virt/84048res/4296shr/79768data kb, in 1340usr/600sys/2038real ms.
Pass 3: using cached /root/.systemtap/cache/f3/stap_f3db6d8728e4fa909d7eb9d142e542ca_3375.c
Pass 4: using cached /root/.systemtap/cache/f3/stap_f3db6d8728e4fa909d7eb9d142e542ca_3375.ko
Pass 5: starting run.
 0xffffffffa2ff7240 : kmem_cache_alloc+0x0/0x1f0 [kernel]
 0x0 (inexact)
Pass 5: run completed in 20usr/90sys/409real ms.


Expected results:
something similar to RHEL7u4

root@rhel7u4-node2 ~]# stap -v --all-modules -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }' &
[1] 14004
[root@rhel7u4-node2 ~]# Pass 1: parsed user script and 468 library scripts using 226760virt/39824res/3304shr/36648data kb, in 660usr/40sys/841real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 265860virt/79784res/4120shr/75748data kb, in 1040usr/190sys/1374real ms.
Pass 3: translated to C into "/tmp/stapIdLqrH/stap_4d16c620401778e2ae617e659e8cabb0_3464_src.c" using 279860virt/85788res/4556shr/89748data kb, in 5050usr/170sys/5273real ms.
Pass 4: compiled C into "stap_4d16c620401778e2ae617e659e8cabb0_3464.ko" in 12920usr/1010sys/14481real ms.
Pass 5: starting run.
 0xffffffff811df510 : kmem_cache_alloc+0x0/0x1e0 [kernel]
 0xffffffff810348ce : arch_dup_task_struct+0x4e/0x130 [kernel]
 0xffffffff81085144 : copy_process+0x214/0x1970 [kernel]
 0xffffffff81086a51 : do_fork+0x91/0x320 [kernel]
 0xffffffff81086d06 : kernel_thread+0x26/0x30 [kernel]
 0xffffffff810a4ad7 : wait_for_helper+0x77/0xb0 [kernel]
 0xffffffff816b4f18 : ret_from_fork+0x58/0x90 [kernel]
Pass 5: run completed in 30usr/60sys/358real ms.



Additional info:

Comment 2 Dave Wysochanski 2018-04-13 21:13:52 UTC
This looks like a casualty of kaslr being enabled by default.

If I boot with 'nokaslr' the backtraces are ok again:

[root@rhel7u5-node2 ~]# !186
stap -v --all-modules -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }'
Pass 1: parsed user script and 472 library scripts using 240076virt/42000res/3456shr/38624data kb, in 750usr/70sys/911real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 281224virt/84040res/4296shr/79772data kb, in 890usr/540sys/2163real ms.
Pass 3: translated to C into "/tmp/stappqNV2S/stap_43e81a9b193918462a43c57525217ba7_3325_src.c" using 294372virt/91084res/4720shr/92920data kb, in 4520usr/490sys/5158real ms.
Pass 4: compiled C into "stap_43e81a9b193918462a43c57525217ba7_3325.ko" in 13940usr/1940sys/16000real ms.
Pass 5: starting run.
WARNING: DWARF expression stack underflow in CFI
 0xffffffff811f7240 : kmem_cache_alloc+0x0/0x1f0 [kernel]
 0xffffffff815da62a : __build_skb+0x2a/0x1b0 [kernel]
 0xffffffff815da919 : __netdev_alloc_skb+0xc9/0x110 [kernel]
 0xffffffffc00e624e : page_to_skb+0x4e/0x1f0 [virtio_net]
 0xffffffffc00e8529 : virtnet_poll+0x2c9/0x750 [virtio_net]
 0xffffffff815ec3af : net_rx_action+0x26f/0x390 [kernel]
 0xffffffff8109a945 : __do_softirq+0xf5/0x280 [kernel]
 0xffffffff81722d2c : call_softirq+0x1c/0x30 [kernel]
 0xffffffff8102d625 : do_softirq+0x65/0xa0 [kernel]
 0xffffffff8109acc5 : irq_exit+0x105/0x110 [kernel]
 0xffffffff81723fc6 : __irqentry_text_start+0x56/0xf0 [kernel]
 0xffffffff81716362 : ret_from_intr+0x0/0x15 [kernel]
 0x0 (inexact)
Pass 5: run completed in 20usr/140sys/766real ms.
[root@rhel7u5-node2 ~]# stap -v --all-modules -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }'
Pass 1: parsed user script and 472 library scripts using 240072virt/42000res/3456shr/38620data kb, in 730usr/50sys/781real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 281220virt/84036res/4296shr/79768data kb, in 1220usr/340sys/1562real ms.
Pass 3: using cached /root/.systemtap/cache/43/stap_43e81a9b193918462a43c57525217ba7_3325.c
Pass 4: using cached /root/.systemtap/cache/43/stap_43e81a9b193918462a43c57525217ba7_3325.ko
Pass 5: starting run.
WARNING: Missing unwind data for a module, rerun with 'stap -d (unknown; retry with -DDEBUG_UNWIND)'
 0xffffffff811f7240 : kmem_cache_alloc+0x0/0x1f0 [kernel]
 0xffffffff810c19a6 : prepare_creds+0x26/0x1c0 [kernel]
 0xffffffff81218f9c : SyS_faccessat+0x4c/0x230 [kernel]
 0xffffffff81219198 : SyS_access+0x18/0x20 [kernel]
 0xffffffff8171f7d5 : system_call_fastpath+0x1c/0x21 [kernel]
 0x7fd357bd6cb7
 0x0 (inexact)
Pass 5: run completed in 20usr/100sys/398real ms.
[root@rhel7u5-node2 ~]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap console=ttyS0,115200 nokaslr

Comment 3 Mark Wielaard 2018-04-13 21:46:52 UTC
(In reply to Dave Wysochanski from comment #2)
> This looks like a casualty of kaslr being enabled by default.

Thanks for that quick diagnosis.

For reference this refers to "KASLR fully supported and enabled by default":
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.5_release_notes/new_features_kernel#BZ1491226

Comment 4 Mark Wielaard 2018-04-14 11:48:07 UTC
There are two bugs here. First when using the stap DWARF unwinder looking up the kernel address in the unwind table fails. The problem seems to be that the unwind table itself is constructed "offline" from the addresses on disk, but then at runtime there is no correction for the kernel load address (this might have to be adjusted in a couple more places). Secondly when falling back on the builtin kernel unwinder stap fails because it is looking for dump_trace, but that has been replaced by the new "unwind" one in the new kernel.

Comment 5 Mark Wielaard 2018-04-17 20:07:15 UTC
As a start of a workaround/fix I added a new kernel fallback unwinder:
https://sourceware.org/git/?p=systemtap.git;a=commit;h=553b6df07c9b7ab30ed468a6a4374cbdf73d1c0d

This patch has been added to the fedora systemtap build for f28:
http://koji.fedoraproject.org/koji/buildinfo?buildID=1072496

There is also a copr build for testing on f27, centos/rhel7 x86_64:
https://copr.fedorainfracloud.org/coprs/mjw/systemtap-fallback-unwind/

With EPEL yum-plugin-copr it can be enabled and installed with:
$ yum install yum-plugin-copr
$ yum copr enable mjw/systemtap-fallback-unwind
$ yum upgrade systemtap

The result is of course totally unsupported, untested and might totally break your system. But feedback on how well it works would be appreciated.

The above doesn't fix the stap DWARF unwinder, but the fallback looks pretty good.

Comment 6 Mark Wielaard 2018-04-18 15:49:19 UTC
I have another fix, this time for the kernel DWARF unwinder:
https://sourceware.org/git/?p=systemtap.git;a=commit;h=17ee540dd61113fe4f557f191db3480db875cca1

Both fixes together should resolve this issue.

I integrated it into the fedora 28 package and the copr package
(including the bug fix for Bug 1566422) for testing
(see comment #5 for install instructions)

Comment 8 Dave Wysochanski 2018-04-24 21:29:45 UTC
Thanks for the quick fix!  Just testing updated systemtap looks good:

Original:
[root@rhel7u5-node1 ~]# rpm -qa | grep systemtap
systemtap-devel-3.2-4.el7.x86_64
systemtap-runtime-3.2-4.el7.x86_64
systemtap-3.2-4.el7.x86_64
systemtap-client-3.2-4.el7.x86_64
[root@rhel7u5-node1 ~]# stap -v -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }' 
Pass 1: parsed user script and 472 library scripts using 240076virt/41992res/3452shr/38624data kb, in 710usr/50sys/763real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 281216virt/84036res/4296shr/79764data kb, in 1240usr/270sys/1512real ms.
Pass 3: using cached /root/.systemtap/cache/b9/stap_b95ff1718ccfdfde276cd3c92c4bb869_1292.c
Pass 4: using cached /root/.systemtap/cache/b9/stap_b95ff1718ccfdfde276cd3c92c4bb869_1292.ko
Pass 5: starting run.
 0xffffffffa4df7240 : kmem_cache_alloc+0x0/0x1f0 [kernel]
 0x0 (inexact)
Pass 5: run completed in 10usr/70sys/365real ms.


Fixed:
[root@rhel7u5-node1 ~]# stap -v -e 'probe kernel.function("kmem_cache_alloc") { print_backtrace(); exit() }' 
Pass 1: parsed user script and 472 library scripts using 240072virt/41992res/3452shr/38620data kb, in 720usr/60sys/775real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 281212virt/84040res/4300shr/79760data kb, in 1280usr/250sys/1529real ms.
Pass 3: translated to C into "/tmp/stapV4eb1t/stap_0eafdab18e2cf4df683a97ec5f0e4f60_1292_src.c" using 281212virt/84368res/4628shr/79760data kb, in 750usr/170sys/921real ms.
Pass 4: compiled C into "stap_0eafdab18e2cf4df683a97ec5f0e4f60_1292.ko" in 26700usr/4000sys/30861real ms.
Pass 5: starting run.
WARNING: Missing unwind data for a module, rerun with 'stap -d (unknown; retry with -DDEBUG_UNWIND)'
 0xffffffffa4df7240 : kmem_cache_alloc+0x0/0x1f0 [kernel]
 0xffffffffa4cc19a6 : prepare_creds+0x26/0x1c0 [kernel]
 0xffffffffa4e18f9c : SyS_faccessat+0x4c/0x230 [kernel]
 0xffffffffa4e19198 : SyS_access+0x18/0x20 [kernel]
 0xffffffffa531f7d5 : system_call_fastpath+0x1c/0x21 [kernel]
 0x7f96efdf6cb7
Pass 5: run completed in 30usr/70sys/363real ms.

Comment 21 errata-xmlrpc 2018-10-30 10:46:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3168


Note You need to log in before you can comment on or make changes to this bug.