Bug 506694

Summary: kdump hangs up if INIT is received while kdump is starting
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: ia64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Target Milestone: rc
Reporter: Veaceslav Falico <vfalico>
Assignee: Neil Horman <nhorman>
QA Contact: Han Pingtian <phan>
CC: anderson, moshiro, ofourdan, peterm, phan, qcai, syeghiay, tindoh
Fixed In Version: kernel-2.6.18-215.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 20:49:03 UTC
Bug Blocks: 499522, 527955, 533192, 591850
Attachments:
 updated version of patch with missing chunks (flags: none)
 updated patch with fixed chunk (flags: none)
 new patch with args->signr check added back (flags: none)
 backport of requested commits (flags: none)

Description Veaceslav Falico 2009-06-18 11:55:54 UTC
kdump hangs up if INIT is received while kdump is starting.
Two cases are considered.

1. Kdump starts as follows when an INIT interrupt is raised.
 (1) 1st INIT is generated
 (2) SAL masks interrupt, and calls OS INIT handlers.
 (3) OS INIT handlers are executed.
 (4) 2nd kernel boots.
 (5) INIT mask is released in the initialization of 2nd kernel.
 (6) OS INIT handlers of the 2nd kernel are registered in SAL.
 (7) 2nd-kernel collects the dump.

 If a 2nd INIT is received anywhere from step (2) through step (6),
 kdump hangs.

2. Kdump starts as follows when the system panics.
 (1) A panic occurs.
 (2) Panic processing runs.
 (3) 2nd-kernel boots.
 (4) The INIT mask is released in the initialization of 2nd-kernel.
 (5) INIT handler of 2nd-kernel is registered in SAL.
 (6) 2nd-kernel boots and collects the dump.

 If an INIT is received anywhere from step (3) through step (5), kdump hangs.
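
The two start sequences and their hang windows can be written down as a small model. This is only a sketch of the behaviour reported above, before any fix: the step texts and window bounds are taken from the description, while the names `CASE_INIT`, `CASE_PANIC` and `init_causes_hang` are hypothetical and do not correspond to any kernel symbol.

```python
# Toy model of the two kdump start sequences in this report.
# Assumption: the hang windows are exactly the step ranges the reporter
# lists; nothing here is derived from kernel source.

CASE_INIT = {
    "steps": [
        "1st INIT is generated",
        "SAL masks interrupt, and calls OS INIT handlers",
        "OS INIT handlers are executed",
        "2nd kernel boots",
        "INIT mask is released in the initialization of 2nd kernel",
        "OS INIT handlers of the 2nd kernel are registered in SAL",
        "2nd kernel collects the dump",
    ],
    "hang_window": (2, 6),  # a 2nd INIT in steps (2)..(6) hangs kdump
}

CASE_PANIC = {
    "steps": [
        "panic occurs",
        "panic processing works",
        "2nd kernel boots",
        "INIT mask is released in the initialization of 2nd kernel",
        "INIT handler of 2nd kernel is registered in SAL",
        "2nd kernel boots and collects the dump",
    ],
    "hang_window": (3, 5),  # an INIT in steps (3)..(5) hangs kdump
}


def init_causes_hang(case, step):
    """Return True if an INIT arriving during the 1-based step is reported to hang kdump."""
    if not 1 <= step <= len(case["steps"]):
        raise ValueError(f"step {step} is outside 1..{len(case['steps'])}")
    first, last = case["hang_window"]
    return first <= step <= last


if __name__ == "__main__":
    for name, case in (("INIT", CASE_INIT), ("panic", CASE_PANIC)):
        for number, text in enumerate(case["steps"], start=1):
            verdict = "hang" if init_causes_hang(case, number) else "ok"
            print(f"{name} case, step ({number}) {text}: {verdict}")
```

Keeping the windows as data makes it easy to compare a test run (at which step the button was pressed) against what the report actually claims.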

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: RHEL5
Release Number: 3
Architecture: ia64
Kernel Version: 2.6.18-128.el5
Related Package Version: kexec-tools-1.102pre-56.el5
Related Middleware / Application: none

Drivers or hardware or architecture dependency:
ia64

How reproducible:
Sometimes.

Steps to Reproduce:
1) Generate an INIT interrupt using the INIT button.
Generate an INIT interrupt using the INIT button immediately after the 1st
INIT interrupt.
Generate an INIT interrupt using the INIT button immediately after a panic.

Actual Results:
Kdump hangs.

Expected Results:
Kdump runs to completion.

Summary of actions taken to resolve issue:
System reset.

Location of diagnostic data:
None.

Hardware configuration:
Model: PRIMEQUEST 520A
CPU Info: Intel Itanium2
Memory Info: 32GB
Hardware Component Information: None.
Configuration Info: None.

Fujitsu has posted the fix upstream: http://lkml.org/lkml/2009/6/18/34

Comment 3 Neil Horman 2009-06-18 15:04:50 UTC
Looks like you might have missed a few chunks in patches 4 and 6.  I'll fix it up shortly.  Assuming this gets:

1) accepted upstream
2) approved for 5.4
3) Tested on 5.4 by Fujitsu

I'll post this as soon as the ACKs come in and testing is done.

Comment 4 Neil Horman 2009-06-18 22:53:38 UTC
Created attachment 348578 [details]
updated version of patch with missing chunks

Comment 5 Neil Horman 2009-06-19 00:13:16 UTC
Created attachment 348585 [details]
updated patch with fixed chunk

Comment 6 Neil Horman 2009-06-19 00:48:13 UTC
 http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852502
Please test this build ASAP.  It needs to be validated in the next few days to have any chance of making it into 5.4

Comment 8 Neil Horman 2009-06-19 11:03:28 UTC
Created attachment 348634 [details]
new patch with args->signr check added back

Yeah, you're right: looking at the patch set and upstream code again, we need the signr check added back.  New patch attached correcting that.  Here's the build below.  Please have Fujitsu test it and verify its functionality ASAP, this weekend if at all possible.
 http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852840

Comment 16 Neil Horman 2009-07-09 17:30:52 UTC
Version 2 of the patchset should apply cleanly to RHEL5 if you'd like to test it early and expedite the inclusion process should it get accepted upstream:
http://lists.infradead.org/pipermail/kexec/2009-July/003422.html

Comment 23 Veaceslav Falico 2009-12-02 18:06:10 UTC
0cced40e7c58b1105aef3ca446da7b158a18a9a6
5959906ee9dee602a46e49c868a7e543e050d605
1726b0883dd08636705ea55d577eb0ec314ba427
68cb14c7c46d9204ba451a534f15a8bc12c88e28
6cc3efcdf01cf874ffe770919395918a3ee9365b
07a6a4ae827b54cec4c1b1d92bed1cc9176b45ec
4295ab34883d2070b1145e14f4619478e9788807

Here are the upstream fixes for it.

Comment 24 Neil Horman 2009-12-03 21:41:45 UTC
Created attachment 375925 [details]
backport of requested commits

Here's a backport of the requested commits.  I'll have a build for you to test soon.

Comment 25 Neil Horman 2009-12-03 21:45:55 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2123601

There's your build.  Please test and confirm that it works as expected, thanks!

Comment 27 Neil Horman 2009-12-04 14:34:49 UTC
Please remember to clear needinfo when you update a bz, or I likely won't see it.

Yeah, the patch broke. It looks pretty simple; I'll try to get to it today.

Comment 28 Neil Horman 2009-12-04 17:10:57 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2128975

There you go, new build to test.

Comment 29 Neil Horman 2009-12-04 18:30:09 UTC
dang, failed again, I'll look at it shortly.

Comment 30 Neil Horman 2009-12-04 20:21:45 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2129625

There, that should complete.

Comment 32 Neil Horman 2009-12-16 11:53:43 UTC
Any thoughts as to why this might be happening?  From the log, it looks like this was a panic unrelated to the original issue.  Does this occur consistently, or only if you try to reproduce the original problem by asserting an INIT during kdump startup?

Comment 37 Neil Horman 2010-02-03 16:37:42 UTC
I'll try to get to this today.

Comment 38 Neil Horman 2010-02-03 19:54:25 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2242169

Here's a new build for you to test.

Comment 50 Jarod Wilson 2010-03-03 15:43:43 UTC
in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 56 Jarod Wilson 2010-03-10 17:01:29 UTC
in kernel-2.6.18-192.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 57 Jarod Wilson 2010-03-10 19:46:46 UTC
(In reply to comment #56)
> in kernel-2.6.18-192.el5
> You can download this test kernel from http://people.redhat.com/jwilson/el5
> 
> Please update the appropriate value in the Verified field
> (cf_verified) to indicate this fix has been successfully
> verified. Include a comment with verification details.    

The above was the revert of the patch, due to a regression it introduced. Moving bug back to ASSIGNED.

Comment 65 Neil Horman 2010-07-27 12:58:39 UTC
The parts of these patches that I understand seem reasonable to me.  What's the upstream status on these changes?

Comment 70 Jarod Wilson 2010-09-03 19:05:24 UTC
in kernel-2.6.18-215.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 74 Neil Horman 2010-10-19 14:53:43 UTC
Um, this seems like a different problem.  Checking the last few lines of the backtrace:

Call Trace:
 [<a0000001006769e0>] schedule+0x1e20/0x2100
                                sp=e000000016d3fc70 bsp=e000000016d39140
 [<a0000001000ab170>] worker_thread+0x170/0x240
                                sp=e000000016d3fd00 bsp=e000000016d39110
 [<a0000001000b3210>] kthread+0x230/0x2c0
                                sp=e000000016d3fd50 bsp=e000000016d390c8
 [<a000000100012210>] kernel_thread_helper+0x30/0x60
                                sp=e000000016d3fe30 bsp=e000000016d390a0
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e000000016d3fe30 bsp=e000000016d390a0
kdump_init_notifier: kdump not configured

This appears to be a boot of the normal kernel (not the kdump kernel) and you've hung it up by sending a series of NMI's to it, which I think is expected.  If you wanted to claim this problem was reproduced, you would need to configure kdump, crash the kernel and hang the kdump kernel by issuing a series of NMI's to prevent kdump from starting.
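
For working through long console logs like this one, the frames and the notifier message can be pulled out mechanically. The sketch below assumes only the line shapes visible in the excerpt above (`[<address>] symbol+0xoffset/0xsize` and the literal `kdump_init_notifier` message); `parse_frames` and `has_notifier_message` are hypothetical helper names, and finding the message says nothing by itself about which kernel was running.

```python
import re

# Matches frame lines such as " [<a0000001006769e0>] schedule+0x1e20/0x2100".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+([A-Za-z_]\w*)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)

# Literal message quoted in the log excerpt above.
NOTIFIER_MESSAGE = "kdump_init_notifier: kdump not configured"


def parse_frames(log_text):
    """Return (address, symbol, offset, size) tuples for every frame line found."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        address, symbol, offset, size = match.groups()
        frames.append((int(address, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


def has_notifier_message(log_text):
    """Return True if the kdump_init_notifier message appears anywhere in the log."""
    return NOTIFIER_MESSAGE in log_text


# Sample copied from the excerpt above (first two frames plus the message).
SAMPLE_LOG = """\
 [<a0000001006769e0>] schedule+0x1e20/0x2100
                                sp=e000000016d3fc70 bsp=e000000016d39140
 [<a0000001000ab170>] worker_thread+0x170/0x240
                                sp=e000000016d3fd00 bsp=e000000016d39110
kdump_init_notifier: kdump not configured
"""

if __name__ == "__main__":
    for address, symbol, offset, size in parse_frames(SAMPLE_LOG):
        print(f"{symbol} at {address:#x} (+{offset:#x} of {size:#x})")
    print("notifier message present:", has_notifier_message(SAMPLE_LOG))
```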

Comment 75 Han Pingtian 2010-10-20 03:18:35 UTC
(In reply to comment #74)
> Um, this seems like a different problem.  Checking the last few lines of the
> backtrace:
> 
> Call Trace:
>  [<a0000001006769e0>] schedule+0x1e20/0x2100
>                                 sp=e000000016d3fc70 bsp=e000000016d39140
>  [<a0000001000ab170>] worker_thread+0x170/0x240
>                                 sp=e000000016d3fd00 bsp=e000000016d39110
>  [<a0000001000b3210>] kthread+0x230/0x2c0
>                                 sp=e000000016d3fd50 bsp=e000000016d390c8
>  [<a000000100012210>] kernel_thread_helper+0x30/0x60
>                                 sp=e000000016d3fe30 bsp=e000000016d390a0
>  [<a0000001000090c0>] start_kernel_thread+0x20/0x40
>                                 sp=e000000016d3fe30 bsp=e000000016d390a0
> kdump_init_notifier: kdump not configured
> 
> This appears to be a boot of the normal kernel (not the kdump kernel) and
No. This was a kdump kernel booting after the INIT button was pressed once. The hang was caused by pressing the button several more times, about a second after the first press. I think this can be seen from the head of the log:

Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Kernel 2.6.18-215.el5 on an ia64

intel-s6e5231-01.rhts.eng.nay.redhat.com login: Linux version 2.6.18-215.el5
(mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat
4.1.2-48)) #1 SMP Tue Aug 31 22:19:41 EDT 2010
Ignoring memory below 128MB
Ignoring memory above 384MB
EFI v1.10 by INTEL: SALsystab=0x7e964370 ACPI 2.0=0x7edd7000 SMBIOS=0xf8020
HCDP=0x7ed9e1a0
booting generic kernel on platform dig
PCDP: v1 at 0x7ed9e1a0
Explicit "console="; ignoring PCDP
ACPI: Unable to map DSDT
Number of logical nodes in system = 1
Number of memory chunks in system = 3

> you've hung it up by sending a series of NMI's to it, which I think is
> expected.  If you wanted to claim this problem was reproduced, you would need
> to configure kdump, crash the kernel and hang the kdump kernel by issuing a
> series of NMI's to prevent kdump from starting.

Comment 76 Neil Horman 2010-10-28 17:34:44 UTC
First of all, no, I don't see how you can determine this is a kdump boot from the head of the log.  I don't see any indicator in which the log entries above diverge from a normal boot.  But that's not relevant: if you're sure that you've issued a kdump, then you have, and we can work under that supposition.

Looking at this, I'm still not sure you've actually reproduced anything that we can do anything about.  Looking at your log above, this line jumps out:
kdump_init_notifier: kdump not configured

The fact that kdump isn't configured is irrelevant.  If we're in a kdump kernel then that is expected. What is interesting is how that printk gets presented.  That printk is in kdump_init_notifier, which is a notifier function that gets registered from machine_crash_setup.  machine_crash_setup is an initcall (level 1), which means that it gets called early, right after the init process is forked, which in turn is right after the kernel has finished initialization.  The fact that you are getting that printk indicates that you received an NMI (from your pressing of the INIT button) _after_ this step in the original report above:

(5) INIT handler of 2nd-kernel is registered in SAL.

The original bug report calls for the receipt of an NMI _prior_ to that step.  Once the INIT handlers are registered, if we get an NMI, the system crashes, which is what is expected, and what you got (as is evidenced by the dumping out of all the backtraces in comment 73).  If your expectation was that the system would never crash while capturing a vmcore while pressing the INIT button, that's incorrect.  The only way we can avoid that is by never registering INIT handlers, which would prevent the INIT button from ever working, which is not something we want to do, as it helps in debugging kdump issues where the system hangs, and is not something done in normal production work.
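
The ordering argument above can be illustrated with a toy notifier chain. This is a sketch of the reasoning only, not kernel code: `NotifierChain` and `boot_and_press_init` are hypothetical names, the registration step merely stands in for what the comment says machine_crash_setup does, and the returned string is the message quoted from the log.

```python
class NotifierChain:
    """Minimal stand-in for a notifier chain: callbacks run only once registered."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call(self, event):
        # Deliver the event to every callback registered so far.
        return [callback(event) for callback in self._callbacks]


def kdump_init_notifier(event):
    # Toy version: the real function lives in the kernel; here it only
    # returns the message seen in the log.
    return "kdump_init_notifier: kdump not configured"


def boot_and_press_init(press_before_registration):
    """Simulate one boot with a single INIT, before or after registration.

    Returns the list of messages the notifier produced.
    """
    chain = NotifierChain()
    messages = []

    if press_before_registration:
        # Nothing is registered yet, so the notifier cannot print anything.
        messages += chain.call("INIT")

    # Stand-in for the level 1 initcall registering the notifier.
    chain.register(kdump_init_notifier)

    if not press_before_registration:
        # Registered: the notifier runs and its message shows up in the log.
        messages += chain.call("INIT")

    return messages


if __name__ == "__main__":
    print("INIT before registration:", boot_and_press_init(True))
    print("INIT after registration: ", boot_and_press_init(False))
```

In this model the message can only appear when the INIT arrives after registration, which is the inference drawn from the log in this comment.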

Not sure how you want to handle this, but I really don't think you have reproduced anything we can act upon here.  Moving back to ON_QA.

Comment 77 Han Pingtian 2010-11-03 07:34:20 UTC
Set verified status based on comment https://bugzilla.redhat.com/show_bug.cgi?id=506694#c71.

Comment 79 errata-xmlrpc 2011-01-13 20:49:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html