Description of problem:

It looks like the backtracer is potentially mishandling part of the stack.

crash> bt 17813
PID: 17813  TASK: ffff9f8c438f9080  CPU: 3  COMMAND: "jsivm"
 #0 [ffff9f8e3f2c8e48] crash_nmi_callback at ffffffffa4858567
 #1 [ffff9f8e3f2c8e58] nmi_handle at ffffffffa4f8d93c
 #2 [ffff9f8e3f2c8eb0] do_nmi at ffffffffa4f8db5d
 #3 [ffff9f8e3f2c8ef0] end_repeat_nmi at ffffffffa4f8cd9c
    [exception RIP: prune_super+9]
    RIP: ffffffffa4a50869  RSP: ffff9f70347a3868  RFLAGS: 00200206
    RAX: ffffffffa4a50860  RBX: 0000000000000400  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff9f70347a3998  RDI: ffff9f6fc5fe4bb0
    RBP: ffff9f70347a3868   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: fff00000f0000000  R12: 000000000000001d
    R13: 0000000000000000  R14: ffff9f6fc5fe4bb0  R15: ffff9f70347a3998
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff9f70347a3868] prune_super at ffffffffa4a50869
 #5 [ffff9f70347a3890] vmpressure at ffffffffa4a42b71
 #6 [ffff9f70347a3918] do_try_to_free_pages at ffffffffa49d4d8a
 #7 [ffff9f70347a3990] try_to_free_pages at ffffffffa49d4fdc
 #8 [ffff9f70347a3a28] free_more_memory at ffffffffa4a85d5e
 #9 [ffff9f70347a3a60] __getblk at ffffffffa4a870cb
#10 [ffff9f70347a3ac0] __breadahead at ffffffffa4a88b71
#11 [ffff9f70347a3ae0] __ext4_get_inode_loc at ffffffffc04b63fa [ext4]
#12 [ffff9f70347a3b48] ext4_get_inode_loc at ffffffffc04b894d [ext4]
#13 [ffff9f70347a3b58] ext4_reserve_inode_write at ffffffffc04ba716 [ext4]
#14 [ffff9f70347a3b88] ext4_mark_inode_dirty at ffffffffc04ba7e3 [ext4]
#15 [ffff9f70347a3be8] ext4_dirty_inode at ffffffffc04be0d0 [ext4]
#16 [ffff9f70347a3c08] __mark_inode_dirty at ffffffffa4a7f2bd
#17 [ffff9f70347a3c40] update_time at ffffffffa4a6bb99
#18 [ffff9f70347a3c70] file_update_time at ffffffffa4a6bc80
#19 [ffff9f70347a3cb0] ext4_page_mkwrite at ffffffffc04be30d [ext4]
#20 [ffff9f70347a3d08] do_page_mkwrite at ffffffffa49ee18a
#21 [ffff9f70347a3d88] do_shared_fault at ffffffffa49ee826
#22 [ffff9f70347a3de8] handle_mm_fault at ffffffffa49f58b3
#23 [ffff9f70347a3eb0] __do_page_fault at ffffffffa4f90653
#24 [ffff9f70347a3f20] do_page_fault at ffffffffa4f90975
#25 [ffff9f70347a3f50] page_fault at ffffffffa4f8c778
    RIP: 00000000ecaf1e7d  RSP: 00000000a9bfe290  RFLAGS: 00210286
    RAX: 0000000002eeec30  RBX: 00000000ebf05a10  RCX: 00000000aaa00000
    RDX: 0000000000000000  RSI: 000000000e2b8188  RDI: 0000000000000000
    RBP: 00000000a9bfe298   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0023  SS: 002b

In this stack, there shouldn't be a call to vmpressure. It's a stale value.
It should be a call to shrink_slab, which does have a value on the stack.
(Maybe crash was confused by catching prune_super while it was still doing its
initial pushes to the stack?)

So this part of the backtrace should have looked like:

    [exception RIP: prune_super+9]
    RIP: ffffffffa4a50869  RSP: ffff9f70347a3868  RFLAGS: 00200206
    RAX: ffffffffa4a50860  RBX: 0000000000000400  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff9f70347a3998  RDI: ffff9f6fc5fe4bb0
    RBP: ffff9f70347a3868   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: fff00000f0000000  R12: 000000000000001d
    R13: 0000000000000000  R14: ffff9f6fc5fe4bb0  R15: ffff9f70347a3998
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff9f70347a3868] prune_super at ffffffffa4a50869
 #5 [ffff9f70347a3870] shrink_slab at ffffffffa49d19de    <---- shrink_slab, not vmpressure
 #6 [ffff9f70347a3918] do_try_to_free_pages at ffffffffa49d4d8a
 #7 [ffff9f70347a3990] try_to_free_pages at ffffffffa49d4fdc
 #8 [ffff9f70347a3a28] free_more_memory at ffffffffa4a85d5e
...

Above data provided by djeffery.

Version-Release number of selected component (if applicable):
crash-7.2.8-7.el8.x86_64

Issue also reproducible with latest upstream.
The expectation is a correct translation of the backtrace.
> In this stack, there shouldn't be a call to vmpressure. It's a stale value.
> It should be a call to shrink_slab, which does have a value on the stack.
> (maybe crash was confused by catching prune_super while still doing initial
> pushes to stack?) So this part of the backtrace should have looked like:

vmpressure() can be invoked from the reclaim path. How did you confirm that
this shouldn't be a call to vmpressure()? Could you please provide more
details? Thanks.
Hi Lijiang, we can see the correct sequence in the assembly:

crash> dis -rl ffffffffa49d4d8a
0xffffffffa49d49c0 <do_try_to_free_pages>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa49d49c5 <do_try_to_free_pages+5>:    push   %rbp
0xffffffffa49d49c6 <do_try_to_free_pages+6>:    mov    %rsp,%rbp
0xffffffffa49d49c9 <do_try_to_free_pages+9>:    push   %r15
0xffffffffa49d49cb <do_try_to_free_pages+11>:   mov    %rsi,%r15
0xffffffffa49d49ce <do_try_to_free_pages+14>:   push   %r14
....
/usr/src/debug/kernel-3.10.0-1160.24.1.el7/linux-3.10.0-1160.24.1.el7.x86_64/mm/vmscan.c: 2501
0xffffffffa49d4d63 <do_try_to_free_pages+931>:  callq  0xffffffffa49de760 <next_zones_zonelist>
0xffffffffa49d4d68 <do_try_to_free_pages+936>:  mov    %rax,%r13
0xffffffffa49d4d6b <do_try_to_free_pages+939>:  mov    -0x48(%rbp),%rax
0xffffffffa49d4d6f <do_try_to_free_pages+943>:  test   %rax,%rax
0xffffffffa49d4d72 <do_try_to_free_pages+946>:  jne    0xffffffffa49d4d20 <do_try_to_free_pages+864>
0xffffffffa49d4d74 <do_try_to_free_pages+948>:  mov    %r12,%rax
0xffffffffa49d4d77 <do_try_to_free_pages+951>:  mov    -0x58(%rbp),%r12
/usr/src/debug/kernel-3.10.0-1160.24.1.el7/linux-3.10.0-1160.24.1.el7.x86_64/mm/vmscan.c: 2509
0xffffffffa49d4d7b <do_try_to_free_pages+955>:  mov    (%r15),%rsi
0xffffffffa49d4d7e <do_try_to_free_pages+958>:  mov    -0x68(%rbp),%rdi
0xffffffffa49d4d82 <do_try_to_free_pages+962>:  mov    %rax,%rdx
0xffffffffa49d4d85 <do_try_to_free_pages+965>:  callq  0xffffffffa49d1930 <shrink_slab>   <=======
/usr/src/debug/kernel-3.10.0-1160.24.1.el7/linux-3.10.0-1160.24.1.el7.x86_64/mm/vmscan.c: 2510
0xffffffffa49d4d8a <do_try_to_free_pages+970>:  mov    -0x60(%rbp),%rdx

2509                 shrink_slab(shrink, sc->nr_scanned, lru_pages);
2510                 if (reclaim_state) {
2511                         sc->nr_reclaimed += reclaim_state->reclaimed_slab;
2512                         reclaim_state->reclaimed_slab = 0;
2513                 }
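As a quick sanity check of the disassembly above (plain Python arithmetic, not crash internals): the return address recorded in frame #6, do_try_to_free_pages+970, is exactly the instruction following the 5-byte direct callq to shrink_slab at +965, so do_try_to_free_pages was in the middle of a shrink_slab call, not a vmpressure call:

```python
# Addresses taken from the "dis -rl" output above.
call_insn = 0xffffffffa49d4d85   # callq 0xffffffffa49d1930 <shrink_slab>
call_len  = 5                    # a direct near call (e8 rel32) is 5 bytes
ret_addr  = 0xffffffffa49d4d8a   # return address shown in frame #6 of bt

# The address pushed by "callq" is the next instruction after the call:
assert call_insn + call_len == ret_addr
print(hex(call_insn + call_len))  # -> 0xffffffffa49d4d8a
```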
In addition to the return address for do_try_to_free_pages showing shrink_slab
should have been called, vmpressure can't end up in a shrink callback like
prune_super without something like a hardware error. The return address was
for calling memcg_to_vmpressure, which is a very small function which doesn't
call anything else.

0xffffffffa4a42b6c <vmpressure+28>:     callq  0xffffffffa4a3ca50 <memcg_to_vmpressure>
0xffffffffa4a42b71 <vmpressure+33>:     and    $0xca,%r12d
0xffffffffa4a42b78 <vmpressure+40>:     je     0xffffffffa4a42b7f <vmpressure+47>

Looking at the raw stack relative from the exception frame's RSP to where
prune_super was interrupted:

0xffffffffa4a50860 <prune_super>:       nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa4a50865 <prune_super+5>:     push   %rbp
0xffffffffa4a50866 <prune_super+6>:     mov    %rsp,%rbp
0xffffffffa4a50869 <prune_super+9>:     push   %r14

crash> rd ffff9f70347a3868 -s 16
ffff9f70347a3868:  ffff9f70347a3910 shrink_slab+174
ffff9f70347a3878:  ffff9f703879b800 00000050017a3878
ffff9f70347a3888:  ffff9f70347a38b8 vmpressure+33
ffff9f70347a3898:  000000000000101e 0000000000000080
ffff9f70347a38a8:  ffff9f6fc53df0f0 000000000000003a
ffff9f70347a38b8:  0000000000000000 0000000000000000
ffff9f70347a38c8:  000000000000101d 0000000000000020
ffff9f70347a38d8:  000000000000001d 000000009e5dda76

prune_super had only executed a single push to modify the stack pointer by 8
bytes. This aligns with the location of the shrink_slab return address. The
vmpressure return address would need 40 bytes worth of pushes or subtractions
to RSP to be the return point for prune_super.

And the values saved on the stack match shrink_slab(). E.g. the scan_control
address ffff9f70347a39a8 was used by do_try_to_free_pages() in its register
R15. It is preserved on the stack at a location as expected with a call to
shrink_slab() using "push %r15". If the vmpressure backtrace was correct, the
scan_control address should be in the exception frame's R15, which it is not.
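The frame arithmetic can be checked directly (plain Python, addresses taken from the exception frame and the rd output above):

```python
# RSP in the prune_super exception frame.
RSP = 0xffff9f70347a3868

# What the "rd" output shows at the relevant stack slots.
slots = {
    0xffff9f70347a3868: "saved rbp",        # pushed by prune_super's "push %rbp"
    0xffff9f70347a3870: "shrink_slab+174",  # live return address
    0xffff9f70347a3890: "vmpressure+33",    # stale leftover value
}

# prune_super had executed exactly one 8-byte push, so its caller's return
# address must sit one slot above RSP:
print(slots[RSP + 8])                 # -> shrink_slab+174

# vmpressure+33 sits 0x28 (40) bytes above RSP; prune_super would have had
# to move RSP by 40 bytes for that to be its return point, and it had not:
print(hex(0xffff9f70347a3890 - RSP))  # -> 0x28
```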
So there are multiple parts of the backtrace which align and produce sane results if shrink_slab() is part of the backtrace instead of vmpressure(). vmpressure() was a stale value and does not appear to belong.
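For what it's worth, a toy model of the suspected failure mode (this is NOT crash's actual unwinder; the symbol table and scan logic here are invented purely for illustration): a backtracer that scans stack words for text-looking values can latch onto a stale return address if it starts scanning past the live one:

```python
# Hypothetical symbol lookup for the two text addresses seen in this stack.
TEXT = {
    0xffffffffa49d19de: "shrink_slab+174",
    0xffffffffa4a42b71: "vmpressure+33",
}

# Words at/above the exception RSP, copied from the rd output in comment 4.
stack = [
    0xffff9f70347a3910,   # saved rbp pushed by prune_super
    0xffffffffa49d19de,   # live return address into shrink_slab
    0xffff9f703879b800,
    0x00000050017a3878,
    0xffff9f70347a38b8,
    0xffffffffa4a42b71,   # stale vmpressure+33 left over from earlier activity
]

def first_text_word(words, start):
    """Return the first kernel-text-looking word at or after index start."""
    for w in words[start:]:
        if w in TEXT:
            return TEXT[w]
    return None

print(first_text_word(stack, 1))   # correct frame: shrink_slab+174
print(first_text_word(stack, 2))   # skipping it finds stale vmpressure+33
```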
Thank you for the detailed explanation, John and David.

But I don't think that this is a bug, because some stack data can be contained
within a frame. For this case, the "bt -f/-F" command can display all stack
data contained in a frame. Would you mind trying it?

crash> bt -f 17813
PID: 17813  TASK: ffff9f8c438f9080  CPU: 3  COMMAND: "jsivm"
 #0 [ffff9f8e3f2c8e48] crash_nmi_callback at ffffffffa4858567
    ffff9f8e3f2c8e50: ffff9f8e3f2c8ea8 ffffffffa4f8d93c
 #1 [ffff9f8e3f2c8e58] nmi_handle at ffffffffa4f8d93c
    ffff9f8e3f2c8e60: 0000000000000000 ffff9f8e3f2c8ef8
    ffff9f8e3f2c8e70: ffffffffa54332e0 000000009e5dda76
    ffff9f8e3f2c8e80: ffff9f8e3f2c8ef8 ffff9f70347a3fd8
    ffff9f8e3f2c8e90: 0000000000000000 ffff9f70347a3fd8
    ffff9f8e3f2c8ea0: 00000000ffffffff ffff9f8e3f2c8ee8
    ffff9f8e3f2c8eb0: ffffffffa4f8db5d
 #2 [ffff9f8e3f2c8eb0] do_nmi at ffffffffa4f8db5d
    ffff9f8e3f2c8eb8: 0000000000000000 0000000000000001
    ffff9f8e3f2c8ec8: 00007f445e85d198 00000000ffffffff
    ffff9f8e3f2c8ed8: 00000016ed1d4000 ffff9f70347a3998
    ffff9f8e3f2c8ee8: ffff9f70347a3868 ffffffffa4f8cd9c
 #3 [ffff9f8e3f2c8ef0] end_repeat_nmi at ffffffffa4f8cd9c
    [exception RIP: prune_super+9]
    RIP: ffffffffa4a50869  RSP: ffff9f70347a3868  RFLAGS: 00200206
    RAX: ffffffffa4a50860  RBX: 0000000000000400  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff9f70347a3998  RDI: ffff9f6fc5fe4bb0
    RBP: ffff9f70347a3868   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: fff00000f0000000  R12: 000000000000001d
    R13: 0000000000000000  R14: ffff9f6fc5fe4bb0  R15: ffff9f70347a3998
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff9f70347a3868] prune_super at ffffffffa4a50869
    ffff9f70347a3870: ffffffffa49d19de ffff9f703879b800
    ffff9f70347a3880: 00000050017a3878 ffff9f70347a38b8
    ffff9f70347a3890: ffffffffa4a42b71
...
crash> bt -F 17813
PID: 17813  TASK: ffff9f8c438f9080  CPU: 3  COMMAND: "jsivm"
 #0 [ffff9f8e3f2c8e48] crash_nmi_callback at ffffffffa4858567
    ffff9f8e3f2c8e50: ffff9f8e3f2c8ea8 nmi_handle+140
 #1 [ffff9f8e3f2c8e58] nmi_handle at ffffffffa4f8d93c
    ffff9f8e3f2c8e60: 0000000000000000 ffff9f8e3f2c8ef8
    ffff9f8e3f2c8e70: crash_nmi_callback_na.35510 000000009e5dda76
    ffff9f8e3f2c8e80: ffff9f8e3f2c8ef8 ffff9f70347a3fd8
    ffff9f8e3f2c8e90: 0000000000000000 ffff9f70347a3fd8
    ffff9f8e3f2c8ea0: 00000000ffffffff ffff9f8e3f2c8ee8
    ffff9f8e3f2c8eb0: do_nmi+349
 #2 [ffff9f8e3f2c8eb0] do_nmi at ffffffffa4f8db5d
    ffff9f8e3f2c8eb8: 0000000000000000 0000000000000001
    ffff9f8e3f2c8ec8: 00007f445e85d198 00000000ffffffff
    ffff9f8e3f2c8ed8: 00000016ed1d4000 ffff9f70347a3998
    ffff9f8e3f2c8ee8: ffff9f70347a3868 end_repeat_nmi+30
 #3 [ffff9f8e3f2c8ef0] end_repeat_nmi at ffffffffa4f8cd9c
    [exception RIP: prune_super+9]
    RIP: ffffffffa4a50869  RSP: ffff9f70347a3868  RFLAGS: 00200206
    RAX: ffffffffa4a50860  RBX: 0000000000000400  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff9f70347a3998  RDI: ffff9f6fc5fe4bb0
    RBP: ffff9f70347a3868   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: fff00000f0000000  R12: 000000000000001d
    R13: 0000000000000000  R14: ffff9f6fc5fe4bb0  R15: ffff9f70347a3998
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff9f70347a3868] prune_super at ffffffffa4a50869
    ffff9f70347a3870: shrink_slab+174 [kmalloc-1024]
    ffff9f70347a3880: 00000050017a3878 ffff9f70347a38b8
    ffff9f70347a3890: vmpressure+33
 #5 [ffff9f70347a3890] vmpressure at ffffffffa4a42b71
...
Thanks again for looking, Lijiang. I see the same thing as you do, but I don't
understand why it is not a bug. Based on comment 4, the vmpressure frame is
incorrect, yet we still see it in both bt -f and bt -F. Could you comment
specifically on the data provided in comment 4?