Bug 1811208

Summary: unable to set scope: set: gdb cannot find text block for address: st_create
Product: Red Hat Enterprise Linux 8 Reporter: John Pittman <jpittman>
Component: crashAssignee: Dave Anderson <anderson>
Status: CLOSED NOTABUG QA Contact: Emma Wu <xiawu>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 8.1CC: cye, djeffery, dwysocha, ruyang
Target Milestone: rc   
Target Release: 8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-26 14:53:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1674330    

Description John Pittman 2020-03-06 20:36:10 UTC
Description of problem:

      UPTIME: 01:03:59
LOAD AVERAGE: 6.11, 2.31, 1.11
       TASKS: 3029
     RELEASE: 3.10.0-862.14.4.el7.x86_64
     VERSION: #1 SMP Fri Sep 21 09:07:21 UTC 2018
     MACHINE: x86_64  (2793 Mhz)
      MEMORY: 1023.8 GB

unable to set scope: set: gdb cannot find text block for address: st_create

crash> sym st_create
ffffffffc080e270 (t) st_create [dm_service_time]
 
crash> sym ffffffffc080e270
ffffffffc080e270 (t) st_create [dm_service_time] 

crash> mod | grep dm_service_time
ffffffffc0810080  dm_service_time          12889  /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/md/dm-service-time.ko.debug 

crash> set scope ffffffffc080e270
set: gdb cannot find text block for address: ffffffffc080e270

crash> set scope st_create
set: gdb cannot find text block for address: st_create

Version-Release number of selected component (if applicable):

crash-7.2.7-2.el8.x86_64

How reproducible:

Every time in this core

Comment 2 Dave Anderson 2020-03-06 21:13:09 UTC
As I recall, that's always been an issue with gdb and module symbols, some work, some (most?) don't...

Comment 3 Dave Anderson 2020-03-06 23:07:22 UTC
> As I recall, that's always been an issue with gdb and module symbols, some work, some (most?) don't...

Ah, yes, it's coming back...

When a module object file is loaded into the crash session, its data structures that 
that contain its "text blocks" are searched to find the one that fits for the target address.
But the data structure text block values are not  kernel virtual addresses, but rather contain
the offset values of the symbols in the object.  (i.e. if you do a "nm -Bn object.ko.debug")

But there was this fix that went into crash-6.0.1, back in 2012:

         - Implemented a new "mod -r" option, which will pass the "-readnow"
           flag to the embedded gdb module, which will override the two-stage
           strategy that it uses for reading symbol tables from module object
           files.  If the crash session was invoked with the "--readnow" flag,
           then the same override will occur automatically.  It should be noted
           that doing will increase the virtual and resident memory set size.
           (anderson)

With respect to the "two-stage strategy" referenced above, which updates the offset values 
into kernel virtual addresses, I don't know how that happens in the normal course of events.
I believe I may have been able to force the second stage occur for a module, but I don't
recall what I did.

Anyway, the only foolproof way I could force transition was by using the more
time-consuming --readnow flag, either by: 

 (1) invoke crash with the --readnow flag  (which uses --readnow on the vmlinux file and any modules that get loaded)
 (2) use the "mod -r" flag                 (which uses --readnow on modules)

For example, there's a queue_io() function in both the kernel proper and
the dm_mod module:

  crash> sym queue_io
  ffffffffbc4721a0 (t) queue_io /usr/src/debug/kernel-3.10.0-957.27.2.el7/linux-3.10.0-957.27.2.el7.x86_64/fs/fs-writeback.c: 341
  ffffffffc0235600 (t) queue_io [dm_mod]
  crash>

If I try to set the module's version, it fails:
 
  crash> set scope ffffffffc0235600
  set: attempting to find/load "dm_mod" module debuginfo
       MODULE       NAME                             SIZE  OBJECT FILE
  ffffffffc024b880  dm_mod                         124461  /usr/lib/debug/usr/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/md/dm-mod.ko.debug 
  set: gdb cannot find text block for address: ffffffffc0235600
  crash> 

If I unload the module, and then reload it adding the -r option:

  crash> mod -D
  crash> mod -sr dm_mod /usr/lib/debug/usr/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/md/dm-mod.ko.debug
       MODULE       NAME                             SIZE  OBJECT FILE
  ffffffffc024b880  dm_mod                         124461  /usr/lib/debug/usr/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/md/dm-mod.ko.debug 
  crash> set scope ffffffffc0235600
  scope: ffffffffc0235600 (queue_io)
  crash>

You can also load them all with --readnow with "mod -rS".

Does that work for you?

Comment 5 Dave Anderson 2020-03-13 16:10:06 UTC
Given that this extension module requires loading with --readnow in order to set scope,
using "mod -r" is preferable:

$ retrace-server-interact 929057675 crash

... [ cut ] ...

crash> set scope st_create
set: gdb cannot find text block for address: st_create
crash> mod -d dm_service_time
crash> mod -sr dm_service_time /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/md/dm-service-time.ko.debug
     MODULE       NAME                      SIZE  OBJECT FILE
ffffffffc0810080  dm_service_time          12889  /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/md/dm-service-time.ko.debug 
crash> set scope st_create
scope: ffffffffc080e270 (st_create)
crash> 

Loading all modules with --readnow by default results in a huge waste of memory,
so I'd like to close this BZ as NOTABUG.

Comment 6 John Pittman 2020-03-13 16:20:31 UTC
Sorry for the wait; I've been meaning to come back to this.  Thanks a lot for looking.  Unloading and reloading --readnow did work for me.  Just curious, how much memory does it waste to load everything --readnow?  Assuming it's way too much to do by default?

Comment 7 Dave Anderson 2020-03-13 18:18:10 UTC
It depends upon the module set, but just check the RSS of a
crash utility session with "mod -S" vs. "mod -rS".

In any case, there's nothing preventing your retrace-server
script from using "mod -rS" by default on galvatron.  I just
prefer that we don't force it on everybody.

Comment 8 John Pittman 2020-03-13 18:28:35 UTC
David, sorry to interrupt.  Could I solicit your opinion on this bz?  Do you see this causing any issues?  Is it overkill to load with 'mod -rS'?  I see these failures from time to time when we have to set scope in the pykdump scripts.  Thanks for any info.

Comment 9 Dave Anderson 2020-03-13 19:40:49 UTC
FWIW, here's a quick example on an 8GB x86_64 beaker machine with the following module set:
  
  crash> mod
       MODULE       NAME                      SIZE  OBJECT FILE
  ffffffffc0331bc0  dm_mod                  151552  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc033e180  dm_log                   20480  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0348040  dm_region_hash           20480  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0353280  dm_mirror                28672  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc035b280  ip_tables                28672  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc037c280  video                    45056  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0385600  wmi                      32768  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0391100  sg                       40960  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0396040  libcrc32c                16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc039c200  crc32c_intel             24576  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03a7580  libahci                  40960  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03b7900  ahci                     40960  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03c3000  fb_sys_fops              16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03c8000  sysimgblt                16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03cd000  syscopyarea              16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03d6040  sysfillrect              16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc03ff6c0  drm_kms_helper          212992  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0410000  i2c_algo_bit             16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0416000  intel_gtt                20480  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0455840  e1000e                  282624  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0466140  sr_mod                   28672  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc049f9c0  libata                  270336  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc04b1000  mxm_wmi                  16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc04b92c0  lpc_ich                  28672  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc04be540  dell_smm_hwmon           16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc052c980  drm                     536576  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc055be00  ttm                     114688  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0744940  nouveau                2220032  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0788080  snd_intel_dspcfg         24576  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc078d540  dcdbas                   16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0794080  snd_hwdep                16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc07a9380  snd                      94208  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc07b3000  iTCO_vendor_support      16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc07be000  mei_me                   45056  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc07c4400  ghash_clmulni_intel      16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc07c9040  soundcore                16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc09c4f80  i915                   2330624  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a1d600  snd_pcm                 110592  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a34380  snd_hda_core             90112  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a49680  snd_hda_codec_hdmi       69632  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a59000  snd_hda_intel            40960  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a694c0  sd_mod                   53248  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a6f000  ledtrig_audio            16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0a7e280  cdrom                    65536  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0ba5800  xfs                    1519616  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0bfc500  i2c_i801                 28672  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c01140  mei_wdt                  16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c08880  intel_rapl_perf          20480  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c0eb80  intel_cstate             20480  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c2bc40  mei                     110592  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c38100  snd_timer                36864  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c576c0  snd_hda_codec           151552  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c64100  iTCO_wdt                 16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c69100  pcspkr                   16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c6e200  crct10dif_pclmul         16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c73200  crc32_pclmul             16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c78300  intel_rapl_msr           16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c7d1c0  intel_powerclamp         16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c82180  coretemp                 16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0c8e140  intel_wmi_thunderbolt    16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0ca4580  snd_hda_codec_generic    86016  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0cab040  irqbypass                16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0cb2640  intel_rapl_common        24576  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0cde800  intel_uncore            143360  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0cfe5c0  snd_hda_codec_realtek   122880  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0d9ebc0  kvm                     786432  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0e06880  kvm_intel               294912  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0e181c0  x86_pkg_temp_thermal     16384  (not loaded)  [CONFIG_KALLSYMS]
  ffffffffc0ebea80  sunrpc                  454656  (not loaded)  [CONFIG_KALLSYMS]
  crash> 
  
After "mod -S", here's the "ps aux" output:

  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root       50598 19.5  5.9 501020 472644 pts/0   S+   15:14   0:12 crash
  ...

After "mod -rS", note the increase in %MEM and RSS:

  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root       50598 16.4 27.9 2232508 2205552 pts/0 S+   15:14   0:28 crash
  ...

If my math is correct, the RSS goes from ~461MB to well over 2GB.

Comment 10 Dave Wysochanski 2020-03-25 21:07:30 UTC
(In reply to John Pittman from comment #8)
> David, sorry to interrupt.  Could I solicit your opinion on this bz?  Do you
> see this causing any issues?  Is it overkill to load with 'mod -rS'?  I see
> these failures from time to time when we have to set scope in the pykdump
> scripts.  Thanks for any info.

Yes the memory usage would be an issue as Dave A points out and especially on our production system.  In just the one example he indicates a 4X's increase in RSS and given the workaround, I see a fairly large cost to benefit ratio especially considering there's easy workarounds.

The problem on our retrace-server is people often leave crash sessions open for a very long time and forget about them and this change looks like it would consume a lot of RAM.  Unless there is a way to identify such module conflicts easily in which case we could modify retrace-server to have a more surgical "symbol load" crashrc file and so the impact would not be as great.  Today we have this:
mod -S /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-957.21.3.el7.x86_64 > /dev/null

We could potentially do:
mod -s <non-problem module>
...
mod -rs <problem module>

Comment 11 John Pittman 2020-03-26 12:47:25 UTC
Dave W. thanks for confirming.  Dave A. thanks for your time and assistance on this.  I'll try and configure the scripts to unload and reload if they see an issue.  Thanks again, you can close.

Comment 12 Dave Anderson 2020-03-26 14:19:21 UTC
Thanks John and Dave.

I'm definitely rusty with respect to actually analyzing real-life vmcores,
but I wonder what percentage of crash dumps have panic backtraces that 
actually pass through module code, and by extension require the loading
of that module's debuginfo data? (not to mention the full slate of *all*
modules) 

You could parse the panic task backtrace for any entries that contain 
a [module] name string tagged on to the end of a stack frame:   

  crash> bt | grep "#" | grep "]$"
   #7 [ffff88000904ee88] scsi_io_completion at ffffffffa005f556 [scsi_mod]
  crash>

But then extend the bt line to do some kind of awk extraction of the "scsi_mod"
string, and dump it into a file, which could then be used as an argument
input file:

  crash> bt | grep "#" | grep "]$" | awk magic-extraction > input
  crash> mod -rs < input 

Just a thought...