Bug 1916589 - watchdog: use nmi registers snapshot in hardlockup handler
Summary: watchdog: use nmi registers snapshot in hardlockup handler
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.9
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Prarit Bhargava
QA Contact: Rachel Sibley
URL:
Whiteboard:
Depends On:
Blocks: 1916612
 
Reported: 2021-01-15 07:52 UTC by Christian Horn
Modified: 2021-03-16 13:55 UTC (History)

Fixed In Version: kernel-3.10.0-1160.19.1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-16 13:54:41 UTC
Target Upstream Version:
Embargoed:


Attachments
RHEL PATCH 1/1 (2.50 KB, text/plain)
2021-01-25 13:33 UTC, Prarit Bhargava


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 5684331 | 0 | None | None | None | 2021-01-18 00:27:18 UTC

Description Christian Horn 2021-01-15 07:52:01 UTC
Description of problem:
One of our partners has seen several crash dumps in which the crash notes regs were missing, which hampers root-cause analysis from the kdump.

Version-Release number of selected component (if applicable):
All RHEL 7 kernels are affected; RHEL 8 is not affected.

How reproducible:
unknown

Additional info:
It took a while to work out why this happens, but we believe we have found the cause. The investigation was done out-of-band from the original customer calls; this report is about the crash dump inconsistency, not about the cause of the panic.

There is a bug in “kernel/watchdog.c”: watchdog_overflow_callback() overwrites (shadows) the NMI regs argument passed to it by doing:

/* Callback function for perf event subsystem */
static void watchdog_overflow_callback(struct perf_event *event,
                 struct perf_sample_data *data,
                 struct pt_regs *regs)
{
...
        if (is_hardlockup()) {
                int this_cpu = smp_processor_id();
                struct pt_regs *regs = get_irq_regs();      <<<<< HERE
                ...

This bug has been fixed upstream by removing this single line; see:
https://github.com/torvalds/linux/commit/4d1f0fb096aedea7bb5489af93498a82e467c480#diff-bcebb2b2d89ecc04ae073c76a55631c86a560b5d44a1ab066bc2fa2c5bc8fd62

If we get an NMI in regular kernel context (e.g. not in IRQ context), get_irq_regs() may return a NULL pointer, which is then passed to nmi_panic() and on to crash_save_cpu(). That causes a page fault, and the processor ends up stalling without saving the crash notes regs. Removing the HERE line above would eliminate one more case in which the crash notes regs go missing.
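
The shadowing described above can be reproduced in a small userspace model. This is only a sketch, not kernel code: struct pt_regs is reduced to a single field, and irq_regs_model, get_irq_regs_model(), regs_seen_buggy(), regs_seen_fixed() and demo_shadowing() are hypothetical names invented for illustration. It shows only which pointer the hardlockup branch would hand on, with and without the extra declaration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct pt_regs (assumption). */
struct pt_regs {
	unsigned long ip;
};

/* Models the saved IRQ regs pointer: NULL when no IRQ is being handled. */
static struct pt_regs *irq_regs_model = NULL;

/* Hypothetical stand-in for get_irq_regs(). */
struct pt_regs *get_irq_regs_model(void)
{
	return irq_regs_model;
}

/*
 * Buggy shape: the inner declaration shadows the 'regs' parameter, so the
 * NMI snapshot passed by the caller is ignored inside the branch.
 * Returns the pointer that would be handed on to the panic path.
 */
struct pt_regs *regs_seen_buggy(struct pt_regs *regs, int hardlockup)
{
	if (hardlockup) {
		struct pt_regs *regs = get_irq_regs_model(); /* shadows the argument */
		return regs;
	}
	return regs;
}

/* Fixed shape: the shadowing line is gone, so the NMI snapshot is used. */
struct pt_regs *regs_seen_fixed(struct pt_regs *regs, int hardlockup)
{
	(void)hardlockup;
	return regs;
}

/* Walks through the "NMI outside IRQ context" case. */
void demo_shadowing(void)
{
	struct pt_regs nmi = { .ip = 0x1234 };

	irq_regs_model = NULL;                       /* not in IRQ context */
	assert(regs_seen_buggy(&nmi, 1) == NULL);    /* NMI snapshot lost */
	assert(regs_seen_fixed(&nmi, 1) == &nmi);    /* NMI snapshot kept */
}
```

Calling demo_shadowing() from a test harness exercises both variants. In the model, the buggy variant yields NULL whenever no IRQ regs are saved, which corresponds to the pointer that later faults in the crash path; the fixed variant always passes on the snapshot it was given.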

Comment 11 Christian Horn 2021-01-22 00:15:39 UTC
Our partner thinks the problem was introduced in 7.3 when the all-CPU backtrace was implemented:

https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc

3.10.0-327.90.2 is still ok
3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

Comment 12 Prarit Bhargava 2021-01-25 12:48:58 UTC
(In reply to Christian Horn from comment #11)
> Our partner thinks the problem was introduced in 7.3 when implementing all
> cpu backtrace:
> 
> https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc
> 
> 3.10.0-327.90.2 is still ok
> 3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

Yes, that's exactly what I found.  This code was erroneously added back in 7.3.  I will be including that information in the POST today.

P.

Comment 13 Prarit Bhargava 2021-01-25 13:33:46 UTC
Created attachment 1750515 [details]
RHEL PATCH 1/1

Comment 14 Augusto Caringi 2021-02-10 23:01:02 UTC
Patch(es) committed on kernel-3.10.0-1160.19.1.el7

Comment 23 errata-xmlrpc 2021-03-16 13:54:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0856

