Bug 1916589
| Summary: | watchdog: use nmi registers snapshot in hardlockup handler | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Christian Horn <chorn> | ||||
| Component: | kernel | Assignee: | Prarit Bhargava <prarit> | ||||
| kernel sub component: | Platform Enablement | QA Contact: | Rachel Sibley <rasibley> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | medium | ||||||
| Priority: | urgent | CC: | dvacek, jreznik, mmilgram, nmurray, prarit, rasibley, rvr | ||||
| Version: | 7.9 | Keywords: | Triaged, ZStream | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | kernel-3.10.0-1160.19.1.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-03-16 13:54:41 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1916612 | ||||||
| Attachments: |
Our partner thinks the problem was introduced in 7.3 when implementing all cpu backtrace:

https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc

3.10.0-327.90.2 is still ok
3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

(In reply to Christian Horn from comment #11)
> Our partner thinks the problem was introduced in 7.3 when implementing all
> cpu backtrace:
>
> https://github.com/torvalds/linux/commit/55537871ef666b4153fd1ef8782e4a13fee142cc
>
> 3.10.0-327.90.2 is still ok
> 3.10.0-514 has the bad code (the struct pt_regs *regs = get_irq_regs();)

Yes, that's exactly what I found. This code was erroneously added back in 7.3. I will be including that information in the POST today.

P.

Created attachment 1750515
RHEL PATCH 1/1
Patch(es) committed on kernel-3.10.0-1160.19.1.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0856

Description of problem:
One of our partners has seen several crash dumps in which the crash notes regs were missing, which hampers root-causing the issue from kdump.

Version-Release number of selected component (if applicable):
All RHEL 7 kernels are affected; RHEL 8 is not affected.

How reproducible:
unknown

Additional info:
It took a while to work out why this happens, but we believe we have found the cause. This investigation was done out-of-band from the original customer calls; it concerns the crash dump inconsistency, not the cause of the panic.

There is a bug in "kernel/watchdog.c" where watchdog_overflow_callback() overrides the NMI regs argument passed to it:

    /* Callback function for perf event subsystem */
    static void watchdog_overflow_callback(struct perf_event *event,
                     struct perf_sample_data *data,
                     struct pt_regs *regs)
    {
    ...
        if (is_hardlockup()) {
            int this_cpu = smp_processor_id();
            struct pt_regs *regs = get_irq_regs();    <<<<< HERE
    ...

This bug has been fixed upstream by removing this single line, see:

https://github.com/torvalds/linux/commit/4d1f0fb096aedea7bb5489af93498a82e467c480#diff-bcebb2b2d89ecc04ae073c76a55631c86a560b5d44a1ab066bc2fa2c5bc8fd62

If we get an NMI in regular kernel context (e.g. not in an IRQ), get_irq_regs() may return a NULL pointer, which is then passed through nmi_panic() on to crash_save_cpu(). That causes a page fault, and the processor stalls without saving its crash notes regs. Removing the line marked HERE above would prevent running into another case where crash notes regs are missing.
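
For illustration only, the shadowing effect can be reproduced in a small userspace sketch. This is not kernel code: struct pt_regs is reduced to one field, and get_irq_regs_stub(), irq_regs, regs_seen_buggy(), regs_seen_fixed() and check_demo() are hypothetical names invented for this example. It assumes the stub returns NULL outside IRQ context, as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct pt_regs (assumption). */
struct pt_regs {
    unsigned long ip;
};

/* Hypothetical stand-in for the saved IRQ regs; NULL means "not in IRQ". */
static const struct pt_regs *irq_regs = NULL;

/* Hypothetical stub mimicking get_irq_regs() in userspace. */
static const struct pt_regs *get_irq_regs_stub(void)
{
    return irq_regs;
}

/* Buggy shape: the inner declaration shadows the regs parameter. */
static const struct pt_regs *regs_seen_buggy(const struct pt_regs *regs,
                                             int hardlockup)
{
    if (hardlockup) {
        /* Same pattern as the line marked HERE in the report. */
        const struct pt_regs *regs = get_irq_regs_stub();
        return regs; /* the NMI snapshot passed by the caller is lost */
    }
    return regs;
}

/* Fixed shape: the shadowing line is gone, so the parameter is used. */
static const struct pt_regs *regs_seen_fixed(const struct pt_regs *regs,
                                             int hardlockup)
{
    if (hardlockup) {
        return regs; /* the NMI snapshot passed by the caller */
    }
    return regs;
}

/* Self-check: simulate an NMI arriving outside IRQ context. */
static void check_demo(void)
{
    struct pt_regs nmi_snapshot = { 0x1234UL };

    irq_regs = NULL; /* regular kernel context, no IRQ regs saved */
    assert(regs_seen_buggy(&nmi_snapshot, 1) == NULL);
    assert(regs_seen_fixed(&nmi_snapshot, 1) == &nmi_snapshot);
}
```

In the buggy shape, the lockup path ends up holding NULL even though the caller supplied a valid snapshot. With the shadowing line removed, the snapshot reaches the lockup path unchanged. Compiling with -Wshadow should flag the inner declaration.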