Bug 2188671 - bt takes many minutes (27 minutes) on some pids in some vmcores
Summary: bt takes many minutes (27 minutes) on some pids in some vmcores
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: crash
Version: 39
Hardware: Unspecified
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: ltao
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-04-21 14:49 UTC by Dave Wysochanski
Modified: 2024-02-09 01:24 UTC (History)
8 users (show)

Fixed In Version: crash-8.0.4-2.fc39
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-09 01:24:56 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FC-812 0 None None None 2023-04-21 14:52:07 UTC

Description Dave Wysochanski 2023-04-21 14:49:45 UTC
This is a confirmed regression in the latest upstream:
2c7310a Replace lseek/read into pread for kcore and vmcore reading

vs RHEL8's crash-7.3.1-5.el8.x86_64

I have not 

Reproducible: Always

Steps to Reproduce:
1. Start crash on affected vmcore
2. Run 'bt PID' where PID is one that is slow

Actual Results:  
bt time is:
real	27m58.509s
user	27m54.682s
sys	0m1.218s


Vs rhel8's crash:
real	0m23.950s
user	0m23.308s
sys	0m0.590s


Expected Results:  
Similar 'bt' time to execute as RHEL8's crash for the same vmcore and same pid

I can share the specific vmcore to any Red Hat employee but cannot share publicly due to customer data concerns.

I think this affects only certain vmcores, but it is a problem as it's showing up on our production vmcore system.  Though I don't know how many vmcores it affects there, my guess is it's a small percentage (less than 10%) or someone would have reported it more widely.

I'll see if I can determine how far back the problem reproduces, if this is a 8.x branch problem (since it's not seen in RHEL8's 7.x build) or if it was introduced recently.

Comment 3 Dave Wysochanski 2023-04-21 14:58:07 UTC
From comment #0, I meant to finish this:
"I have not ..."

I have not narrowed this down further or run gdb on it to understand why it takes so long.  Simple tests I'm running now is to see if it reproduces if I build 8.0.1 or 8.0.2 and go from there.  Not sure if someone has some ideas on it or if it's been reported before.

Comment 4 Dave Wysochanski 2023-04-21 15:37:00 UTC
GOOD: crash-8.0.0-ec568e2 
crash-8.0.0-ec568e2> quit

real	0m9.500s
user	0m8.748s
sys	0m0.695s

Comment 5 Dave Wysochanski 2023-04-21 15:37:55 UTC
BAD: crash-8.0.2-f1cd581
crash-8.0.2-f1cd581> quit

real	28m1.614s
user	27m57.931s
sys	0m1.006s

Comment 7 Dave Wysochanski 2023-04-21 16:15:52 UTC
Looks like this is a regression sometime after the 8.0.1 release (first regression seen in 8.0.2 release).

PASS: 8.0.1
FAIL: 8.0.2 and above

crash-8.0.1-2d19346> quit

real	0m6.416s
user	0m5.882s
sys	0m0.477s

Comment 8 Dave Wysochanski 2023-04-21 17:49:45 UTC
I am bisecting this now.

Comment 9 Dave Wysochanski 2023-04-21 21:11:29 UTC
$ git bisect bad
2145b2bb79c59aa25c5155a8f9851554d1813fb9 is the first bad commit
commit 2145b2bb79c59aa25c5155a8f9851554d1813fb9
Author: Tao Liu <ltao>
Date:   Wed Aug 31 11:54:13 2022 +0800

    Let gdb get kernel module symbols info from crash
    
    Gdb will try to resolve an address to its corresponding symbol name such as
    when printing a structure. It works fine for kernel symbols, because gdb can
    find them through vmlinux. However as for kernel modules symbols, crash
    resolves them by dig into "struct module", which gdb don't know. As a result,
    gdb fails to translate a kernel module address to its symbol name without
    "mod -s|-S" options. For example we can reproduce the issue as follows.
    
        crash> timer
        ....
        4331308176       336  ffff94ea24240860  ffffffffc03762c0  <estimation_timer>
        ....
        crash> sym 0xffffffffc03762c0
        ffffffffc03762c0 (t) estimation_timer [ip_vs]
    
    Before patch:
        crash> timer_list ffff94ea24240860
        struct timer_list {
          ....
          function = 0xffffffffc03762c0,
          ....
        }
    
    After patch:
        crash> timer_list ffff94ea24240860
        struct timer_list {
          ....
          function = 0xffffffffc03762c0 <estimation_timer>,
          ....
        }
    
    In this patch, we add an interface for gdb, when gdb trying to build kernel
    module's address symbolic, the info can be get from crash.
    
    Signed-off-by: Tao Liu <ltao>

 defs.h          |  2 ++
 gdb-10.2.patch  | 35 +++++++++++++++++++++++++++++++++++
 gdb_interface.c | 12 ++++++++++++
 3 files changed, 49 insertions(+)
$ git bisect log
git bisect start
# good: [2d193468e5fe7ee1c6be4c73083cc5ef8d922b74] crash-8.0.0 -> crash-8.0.1
git bisect good 2d193468e5fe7ee1c6be4c73083cc5ef8d922b74
# bad: [f1cd581d1c4afa5b8ffdfaa6a3ea9f545fe4ec91] crash-8.0.1 -> crash-8.0.2
git bisect bad f1cd581d1c4afa5b8ffdfaa6a3ea9f545fe4ec91
# bad: [f1cd581d1c4afa5b8ffdfaa6a3ea9f545fe4ec91] crash-8.0.1 -> crash-8.0.2
git bisect bad f1cd581d1c4afa5b8ffdfaa6a3ea9f545fe4ec91
# good: [cdd57e8b16aba2f5714673368d6dbc7565d59841] ppc64: handle backtrace when CPU is in an emergency stack
git bisect good cdd57e8b16aba2f5714673368d6dbc7565d59841
# good: [cdd57e8b16aba2f5714673368d6dbc7565d59841] ppc64: handle backtrace when CPU is in an emergency stack
git bisect good cdd57e8b16aba2f5714673368d6dbc7565d59841
# good: [9cbfea67eb4f094d47cd841b73ddbbdbe6b58696] Fix "task -R" by adding end identifier for union in task_struct
git bisect good 9cbfea67eb4f094d47cd841b73ddbbdbe6b58696
# bad: [4ea3a806d11f000f2eb1ddc72c2b7a543e319f64] Fix for the invalid linux_banner pointer issue
git bisect bad 4ea3a806d11f000f2eb1ddc72c2b7a543e319f64
# bad: [5b9d3e98cda9d99f3277aabec30d076e62cc5e71] Add debian/ubuntu vmlinux location to default search dirs
git bisect bad 5b9d3e98cda9d99f3277aabec30d076e62cc5e71
# bad: [3ed9ec5c8d09cffac9772abbf54214125ade9127] x86_64: Correct the identifier when locating the call instruction
git bisect bad 3ed9ec5c8d09cffac9772abbf54214125ade9127
# bad: [2145b2bb79c59aa25c5155a8f9851554d1813fb9] Let gdb get kernel module symbols info from crash
git bisect bad 2145b2bb79c59aa25c5155a8f9851554d1813fb9
# first bad commit: [2145b2bb79c59aa25c5155a8f9851554d1813fb9] Let gdb get kernel module symbols info from crash

Comment 10 Dave Wysochanski 2023-04-21 21:30:52 UTC
Tao, can you take a look?  Note the vmcore in question has out of tree modules.  So at a high-level, there may be something wrong there and the reason this corner case may need a follow-on patch to your original one.

Comment 11 ltao 2023-04-23 02:39:47 UTC
Hi Dave,

Sure, I will take a look into this.

Thanks,
Tao Liu

Comment 12 Fedora Release Engineering 2023-08-16 08:09:03 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

Comment 14 Fedora Update System 2024-01-31 06:59:56 UTC
FEDORA-2024-190d76044a has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2024-190d76044a

Comment 15 Fedora Update System 2024-02-01 02:45:57 UTC
FEDORA-2024-190d76044a has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-190d76044a`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-190d76044a

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 16 Fedora Update System 2024-02-09 01:24:56 UTC
FEDORA-2024-190d76044a (crash-8.0.4-2.fc39) has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.