Testcase for Bug 920444 - [FJ6.4 Bug] regressed in rhel-8.1.0: ======= 8.1 Server x86_64 # cat nfs.proc.commit_done.stp probe nfs.proc.commit_done{ printf("nfs.proc.commit_done server_ip: %s\n", ip_ntop(server_ip)); } 8.1 Server x86_64 # stap -p4 nfs.proc.commit_done.stp semantic error: failed to retrieve location attribute for 'task' [man error::dwarf]: identifier '$task' at /usr/share/systemtap/tapset/linux/nfs_proc.stp:1393:36 dieoffset: 0x7fd56 from /usr/lib/debug/lib/modules/4.18.0-86.el8.x86_64/kernel/fs/nfs/nfsv3.ko.debug function: nfs3_commit_done at fs/nfs/nfs3proc.c:847:2 inlined by nfs3_commit_done at fs/nfs/nfs3proc.c:840:12 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ in expansion of macro: operator '@nfs_data' at /usr/share/systemtap/tapset/linux/nfs_proc.stpm:16:12 source: ( get_ip(&@nfs_data->task) ) ^ in expansion of macro: operator '@_nfs_data_server_ip' at /usr/share/systemtap/tapset/linux/nfs_proc.stp:1393:15 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ semantic error: failed to retrieve location attribute for 'task' [man error::dwarf]: identifier '$task' at :1393:36 dieoffset: 0x7fd56 from /usr/lib/debug/lib/modules/4.18.0-86.el8.x86_64/kernel/fs/nfs/nfsv3.ko.debug function: nfs3_commit_done at fs/nfs/nfs3proc.c:847:2 inlined by nfs3_commit_done at fs/nfs/nfs3proc.c:840:12 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ in expansion of macro: operator '@nfs_data' at /usr/share/systemtap/tapset/linux/nfs_proc.stpm:16:12 source: ( get_ip(&@nfs_data->task) ) ^ in expansion of macro: operator '@_nfs_data_server_ip' at /usr/share/systemtap/tapset/linux/nfs_proc.stp:1393:15 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ semantic error: unresolved type : identifier '$task' at :1393:36 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ in expansion of macro: operator '@nfs_data' at /usr/share/systemtap/tapset/linux/nfs_proc.stpm:16:12 source: ( get_ip(&@nfs_data->task) ) ^ in expansion of macro: operator '@_nfs_data_server_ip' at /usr/share/systemtap/tapset/linux/nfs_proc.stp:1393:15 source: server_ip = @_nfs_data_server_ip($task->tk_calldata) ^ Pass 2: analysis failed. [man error::pass2] (1) 8.1 Server x86_64 # rpm -q systemtap systemtap-4.1-1.el8.x86_64 8.1 Server x86_64 # 8.1 Server x86_64 # uname -r 4.18.0-86.el8.x86_64 8.1 Server x86_64 # =======
Historically this testcase was passing with systemtap-4.0-1 on top of 4.18.0-0.rc7.3.el8+7. Upstream systemtap-4.2 (6494d4c3c) doesn't have a fix for this.
Checked the kernel -84 -> -85 diffs. Nothing relevant to NFS or compiler flags. Next thing to check may be the compiler version that built those rpms.
If you have any old RHEL8 vm, just installing a pre- and post -85 kernel, with their -devel and -debuginfo, should allow you to do: stap -r 4.18.0-86.el8.x86_64 -p4 -e 'probe module("nfsv3").function("nfs3_commit_done") { println($task) }' stap -r 4.18.0-83.el8.x86_64 -p4 -e 'probe module("nfsv3").function("nfs3_commit_done") { println($task) }' You will not need to reboot. Alternately, unpack the debuginfo RPMs for pre and post 85 kernels, find the nfsv3.ko.debug files, and compare debuginfo for that inlined function via eu-readelf -w etc.
Thanks. I can kind-of see the differences in the debug records for nfs3_commit_done. Not sure what they all mean yet, but I've at least got a place where I can start looking. There's a couple of relevant BZs for changes in the debug symbols we generate. Anyway, will continue poking tomorrow.
This looks like an intentional change. In gcc-8.2.x and earlier we have something like this: [ 2d11e] subprogram abbrev: 187 abstract_origin (ref4) [ 2a3b4] low_pc (addr) .text+0x00000000000002e0 <nfs3_commit_done> high_pc (data8) 92 (.text+0x000000000000033c) frame_base (exprloc) [ 0] call_frame_cfa sibling (ref4) [ 2d1ab] [ 2d13a] formal_parameter abbrev: 95 abstract_origin (ref4) [ 2a3c6] location (sec_offset) location list [ 1ed0] [ 2d143] formal_parameter abbrev: 95 abstract_origin (ref4) [ 2a3d3] location (sec_offset) location list [ 1f2c] [ 2d14c] inlined_subroutine abbrev: 50 abstract_origin (ref4) [ 2c856] entry_pc (addr) .text+0x0000000000000304 <nfs3_commit_done+0x24> ranges (sec_offset) range list [ 200] call_file (data1) nfs3proc.c (1) call_line (data2) 845 call_column (data1) 6 sibling (ref4) [ 2d17d] [ 2d165] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2c873] [ 2d16a] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2c867] [ 2d16f] GNU_call_site abbrev: 21 low_pc (addr) .text+0x0000000000000328 <nfs3_commit_done+0x48> abstract_origin (ref4) [ 2d064] [ 2d17d] lexical_block abbrev: 43 abstract_origin (ref4) [ 2a3b4] low_pc (addr) .text+0x000000000000030d <nfs3_commit_done+0x2d> high_pc (data8) 14 (.text+0x000000000000031b <nfs3_commit_done+0x3b>) [ 2d192] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2a3c6] [ 2d197] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2a3d3] [ 2d19c] GNU_call_site abbrev: 21 low_pc (addr) .text+0x0000000000000319 <nfs3_commit_done+0x39> abstract_origin (ref4) [ 2d6cf] Of particular interest is the lexical block at 2d17d which then has two parameters, one of which is the task parameter we're interested in. I can't immediately think of a case where a lexical block should have parameters, Richi found it odd as well, but couldn't find anything which would make that strictly invalid. After this upstream change form Richi: commit c0bce4a7a8657012b377a576d3a1b6dbd5d5ef99 Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4> Date: Mon Oct 29 14:13:56 2018 +0000 2018-10-29 Richard Biener <rguenther> Backport from mainline 2018-09-26 Richard Biener <rguenther> PR debug/87428 PR debug/87362 * tree-inline.c (expand_call_inline): When the location of the call is UNKNOWN_LOCATION use DECL_SOURCE_LOCATION or BUILTINS_LOCATION for the BLOCK_SOURCE_LOCATION of the inserted BLOCK to make inlined_function_outer_scope_p recognize it. * dwarf2out.c (add_call_src_coords_attributes): Do not add coords for reserved locations. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gcc-8-branch@265590 138bc75d-0d04-0410-961f-82ee72b054a4 The records for nfs3_commit_done looks like this: [ 2d11e] subprogram abbrev: 187 abstract_origin (ref4) [ 2a3b4] low_pc (addr) .text+0x00000000000002e0 <nfs3_commit_done> high_pc (data8) 92 (.text+0x000000000000033c) frame_base (exprloc) [ 0] call_frame_cfa sibling (ref4) [ 2d1b0] [ 2d13a] formal_parameter abbrev: 95 abstract_origin (ref4) [ 2a3c6] location (sec_offset) location list [ 1ed0] [ 2d143] formal_parameter abbrev: 95 abstract_origin (ref4) [ 2a3d3] location (sec_offset) location list [ 1f2c] [ 2d14c] inlined_subroutine abbrev: 50 abstract_origin (ref4) [ 2c856] entry_pc (addr) .text+0x0000000000000304 <nfs3_commit_done+0x24> ranges (sec_offset) range list [ 200] call_file (data1) nfs3proc.c (1) call_line (data2) 845 call_column (data1) 6 sibling (ref4) [ 2d17d] [ 2d165] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2c873] [ 2d16a] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2c867] [ 2d16f] GNU_call_site abbrev: 21 low_pc (addr) .text+0x0000000000000328 <nfs3_commit_done+0x48> abstract_origin (ref4) [ 2d064] [ 2d17d] inlined_subroutine abbrev: 141 abstract_origin (ref4) [ 2a3b4] low_pc (addr) .text+0x000000000000030d <nfs3_commit_done+0x2d> high_pc (data8) 14 (.text+0x000000000000031b <nfs3_commit_done+0x3b>) call_file (data1) nfs3proc.c (1) call_line (data2) 840 call_column (data1) 12 [ 2d197] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2a3c6] [ 2d19c] formal_parameter abbrev: 25 abstract_origin (ref4) [ 2a3d3] [ 2d1a1] GNU_call_site abbrev: 21 low_pc (addr) .text+0x0000000000000319 <nfs3_commit_done+0x39> abstract_origin (ref4) [ 2d6e4] In particular note how the lexical block at 2d17d has changed into an inlined_subroutine with two formal parameters, one of which is the "task" parameter we want. That seems more sensible, but seems to be running afoul of stap. I'm not familiar enough with the stap code and its dwarf reader, but I think the next step is to dig into stap and see how it handles this change. Note this seems to relate to partial inlining of nfs3_async_handle_jukebox, so it might all change again with the introduction of IEPM. Anyway, I think we need to coordinate with someone from the stap team here. While the updated debug records seem more sensible to me, I'm working from several assumptions about what the stap code is trying to do and how it might be responding to changes in the debug info.
Thanks for looking into it, Jeff! I believe your reference to the lexical block is not on the mark because the systemtap probes aim for the function entry point (the low_pc), not some internal point. And in the 8.2.x case, the parameter list, 0x2d13a and 0x2d143 have location. In the 8.3.x case, this function (also) got inlined, and the inlined copy of the subroutine does not have a location for these parameters. Your second readelf didn't include the inlined copy which the stap error message's die# would identify. For example, running 4.18.0-105.el8 (the freshest at the moment): running the same stap sript: % stap -p4 -e 'probe module("nfsv3").function("nfs3_commit_done") { println($task) }' semantic error: failed to retrieve location attribute for 'task' [man error::dwarf]: identifier '$task' at <input>:1:62 dieoffset: 0x8068a from /usr/lib/debug/lib/modules/4.18.0-105.el8.x86_64/kernel/fs/nfs/nfsv3.ko.debug function: nfs3_commit_done at fs/nfs/nfs3proc.c:847:2 inlined by nfs3_commit_done at fs/nfs/nfs3proc.c:840:12 source: probe module("nfsv3").function("nfs3_commit_done") { println($task) } % eu-readelf /usr/lib/debug/lib/modules/4.18.0-105.el8.x86_64/kernel/fs/nfs/nfsv3.ko.debug .... [ 80667] inlined_subroutine abbrev: 166 abstract_origin (ref4) [ 7d6db] entry_pc (addr) .text+0x00000000000004dd <nfs3_commit_done+0x2d> GNU_entry_view (data1) 3 low_pc (addr) .text+0x00000000000004dd <nfs3_commit_done+0x2d> high_pc (data8) 14 (.text+0x00000000000004eb <nfs3_commit_done+0x3b>) call_file (data1) nfs3proc.c (1) call_line (data2) 840 call_column (data1) 12 [ 8068a] formal_parameter abbrev: 26 abstract_origin (ref4) [ 7d6ed] [ 8068f] formal_parameter abbrev: 26 abstract_origin (ref4) [ 7d6fa] [ 80694] GNU_call_site abbrev: 21 low_pc (addr) .text+0x00000000000004e9 <nfs3_commit_done+0x39> abstract_origin (ref4) [ 80c0d] .... So it appears as though gcc's inlining preferences have changed, and its debuginfo generation has not kept up, just between 8.2.1 and 8.3.1.
I'm going to have to do another dive into this, meaning it's going to miss RHEL 8.1.
This is function splitting in action, nfs3_commit_done is split by fnsplit pass into the small nfs3_commit_done and nfs3_commit_done.part.0 which does just the (cold) nfs_refresh_inode call. Before fnsplit, we already have # DEBUG task => NULL # DEBUG inode => NULL before that call, will need to track down exactly where that comes from. Anyway, later on we inline nfs3_commit_done.part.0 back into nfs3_commit_done, and before Richi's change we emitted clearly invalid DWARF (DW_TAG_lexical_block shouldn't have DW_TAG_formal_parameter children), but from systemtap POV perhaps a better thing. As I have been saying for years, we should have some way to express in DWARF through some extension the fact that we have partially outlined functions and have a way to mark partial inlining parts, OpenMP parallel/task/target/teams outlined regions, OpenACC regions etc., so tools like systemtap or gdb could figure out that if a user wants to stop or insert something at the start of a function, the start of the rest of partial inlinined function is certainly not that, that is somewhere deep in the middle of a function. And/or, if we are inlining the cold part of a partially inlined function back into itself, we should perhaps try harder to undo the mess fnsplit created and make sure we don't represent it as another inlined function.
And the rest is just kernel's fault, KBUILD_CFLAGS += $(call cc-option, -fno-var-tracking-assignments) is ridiculous, if the kernel wants to have a workaround for each gcc wrong-code bug that ever existed by disabling the corresponding option, it would need to disable pretty much all options, including -O2 or -O0. I see https://lore.kernel.org/patchwork/patch/563362/ but don't find that in the latest kernel, nor in the RHEL8 kernel. I mean, if kernel wants to work-around a GCC 4.9 bug, it could perhaps do that for GCC 4.9 only, or say GCC 4.5-4.9 where 5/6/7/8 minor releases could be affected, but GCC5+ doesn't have this bug and the debug info quality loss is significant.
(In reply to Jakub Jelinek from comment #22) > And the rest is just kernel's fault, > KBUILD_CFLAGS += $(call cc-option, -fno-var-tracking-assignments) > is ridiculous, if the kernel wants to have a workaround for each gcc > wrong-code bug that ever existed by disabling the corresponding option, it > would need to disable pretty much all options, including -O2 or -O0. > I see https://lore.kernel.org/patchwork/patch/563362/ but don't find that in > the latest kernel, nor in the RHEL8 kernel. This being tracked (for the RHEL8 kernel) as bug #1785287
This is being tracked in PR92983.