Bug 1916589

Summary: watchdog: use nmi registers snapshot in hardlockup handler
Product: Red Hat Enterprise Linux 7 Reporter: Christian Horn <chorn>
Component: kernel Assignee: Prarit Bhargava <prarit>
kernel sub component: Platform Enablement QA Contact: Rachel Sibley <rasibley>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: urgent CC: dvacek, jreznik, mmilgram, nmurray, prarit, rasibley, rvr
Version: 7.9 Keywords: Triaged, ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-3.10.0-1160.19.1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-16 13:54:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1916612    
Attachments:
Description Flags
RHEL PATCH 1/1 none

Description Christian Horn 2021-01-15 07:52:01 UTC
Description of problem:
One of our partners has seen several crash dumps in which the crash notes regs were missing, which affects the ability to root-cause the issue from kdump.

Version-Release number of selected component (if applicable):
all rhel7 kernels affected, rhel8 is not affected

How reproducible:
unknown

Additional info:
It took a while to realize why this is happening, but we believe we have found the cause. This investigation was done out-of-band from the original customer calls; it concerns the crash dump inconsistency, not the cause of the panic.

There is a bug in “kernel/watchdog.c”: watchdog_overflow_callback() overrides the NMI regs argument passed to it by declaring a local variable of the same name:

/* Callback function for perf event subsystem */
static void watchdog_overflow_callback(struct perf_event *event,
                 struct perf_sample_data *data,
                 struct pt_regs *regs)
{
...
        if (is_hardlockup()) {
               int this_cpu = smp_processor_id();
               struct pt_regs *regs = get_irq_regs();      <<<<< HERE
               ...

This bug has been fixed upstream by removing this single line; see:
https://github.com/torvalds/linux/commit/4d1f0fb096aedea7bb5489af93498a82e467c480#diff-bcebb2b2d89ecc04ae073c76a55631c86a560b5d44a1ab066bc2fa2c5bc8fd62

If we get an NMI in regular kernel context (i.e. not while servicing an IRQ), get_irq_regs() may return a NULL pointer, which is then passed to nmi_panic() and on to crash_save_cpu(). That causes a page fault, and the processor stalls without saving the crash notes regs. Removing the line marked HERE above would prevent this case of missing crash notes regs.
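
The shadowing described above can be illustrated with a small userspace sketch. This is not kernel code: struct pt_regs, get_irq_regs(), and the two regs_for_panic_*() helpers below are hypothetical stand-ins written only to show how an inner declaration hides the regs parameter, and how the caller's pointer survives once that declaration is removed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct pt_regs {
    unsigned long ip;
};

/* Stand-in for the per-CPU pointer that get_irq_regs() reads.
 * It stays NULL here to model "the NMI did not interrupt an IRQ". */
static struct pt_regs *saved_irq_regs = NULL;

static struct pt_regs *get_irq_regs(void)
{
    return saved_irq_regs;
}

/* Buggy shape: the inner declaration shadows the regs parameter,
 * so the NMI snapshot passed by the caller is never used. */
static struct pt_regs *regs_for_panic_buggy(int hardlockup, struct pt_regs *regs)
{
    if (hardlockup) {
        struct pt_regs *regs = get_irq_regs(); /* hides the parameter */
        return regs;
    }
    return NULL;
}

/* Fixed shape: the shadowing line is gone, so the caller's
 * NMI snapshot is what would reach the panic path. */
static struct pt_regs *regs_for_panic_fixed(int hardlockup, struct pt_regs *regs)
{
    if (hardlockup)
        return regs;
    return NULL;
}

/* Small self-check that exercises both variants. */
void shadowing_demo(void)
{
    struct pt_regs nmi_regs = { 0x1234UL };

    /* Outside IRQ context the buggy variant yields NULL. */
    assert(regs_for_panic_buggy(1, &nmi_regs) == NULL);
    /* The fixed variant keeps the snapshot passed by the caller. */
    assert(regs_for_panic_fixed(1, &nmi_regs) == &nmi_regs);
}
```

Calling shadowing_demo() from a main() is enough to run the check. Building with -Wshadow makes GCC and Clang warn about the inner declaration in the buggy variant.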

Comment 11 Christian Horn 2021-01-22 00:15:39 UTC
Our partner thinks the problem was introduced in 7.3 when implementing all cpu backtrace:

https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc

3.10.0-327.90.2 is still ok
3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

Comment 12 Prarit Bhargava 2021-01-25 12:48:58 UTC
(In reply to Christian Horn from comment #11)
> Our partner thinks the problem was introduced in 7.3 when implementing all
> cpu backtrace:
> 
> https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc
> 
> 3.10.0-327.90.2 is still ok
> 3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

Yes, that's exactly what I found.  This code was erroneously added back in 7.3.  I will be including that information in the POST today.

P.

Comment 13 Prarit Bhargava 2021-01-25 13:33:46 UTC
Created attachment 1750515
RHEL PATCH 1/1

Comment 14 Augusto Caringi 2021-02-10 23:01:02 UTC
Patch(es) committed on kernel-3.10.0-1160.19.1.el7

Comment 23 errata-xmlrpc 2021-03-16 13:54:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0856