kdump hangs if an INIT is received while kdump is starting. Two cases are considered.

1. Kdump starts as follows when an INIT interrupt is raised.
(1) The 1st INIT is generated.
(2) SAL masks the interrupt and calls the OS INIT handlers.
(3) The OS INIT handlers are executed.
(4) The 2nd kernel boots.
(5) The INIT mask is released during initialization of the 2nd kernel.
(6) The OS INIT handlers of the 2nd kernel are registered in SAL.
(7) The 2nd kernel collects the dump.
If a 2nd INIT is received during steps (2) through (6), kdump hangs.

2. Kdump starts as follows when the system panics.
(1) A panic occurs.
(2) Panic processing runs.
(3) The 2nd kernel boots.
(4) The INIT mask is released during initialization of the 2nd kernel.
(5) The INIT handler of the 2nd kernel is registered in SAL.
(6) The 2nd kernel boots and collects the dump.
If an INIT is received during steps (3) through (5), kdump hangs.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: RHEL5
Release Number: 3
Architecture: ia64
Kernel Version: 2.6.18-128.el5
Related Package Version: kexec-tools-1.102pre-56.el5
Related Middleware / Application: none

Drivers or hardware or architecture dependency: ia64

How reproducible: Sometimes.

Steps to Reproduce:
1) Generate an INIT interrupt by using the INIT button.
   Generate an INIT interrupt by using the INIT button immediately after the 1st INIT interrupt.
   Generate an INIT interrupt by using the INIT button immediately after a panic.

Actual Results: Kdump hangs.

Expected Results: Kdump runs to completion.

Summary of actions taken to resolve issue: System reset.

Location of diagnostic data: None.

Hardware configuration:
Model: PRIMEQUEST 520A
CPU Info: Intel Itanium2
Memory Info: 32GB
Hardware Component Information: None.
Configuration Info: None.

Fujitsu has posted the fix upstream: http://lkml.org/lkml/2009/6/18/34
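The two hang windows described in the report can be written down as a small model. This is only an illustrative sketch, not kernel code: the step numbers are the ones from the two sequences in the report, and the names `VULNERABLE_STEPS` and `second_init_hangs` are hypothetical.

```python
# Illustrative model of the reported hang windows; not kernel code.
# Step numbers follow the two startup sequences in the report.
# The names used here are hypothetical and exist only for this sketch.

VULNERABLE_STEPS = {
    # Case 1: kdump started by a 1st INIT; a 2nd INIT in steps (2)..(6) hangs.
    "init": range(2, 7),
    # Case 2: kdump started by a panic; an INIT in steps (3)..(5) hangs.
    "panic": range(3, 6),
}


def second_init_hangs(trigger: str, step: int) -> bool:
    """Return True if an INIT arriving during `step` falls in the reported window."""
    return step in VULNERABLE_STEPS[trigger]
```

For example, `second_init_hangs("init", 4)` is True (the 2nd kernel is booting), while `second_init_hangs("panic", 6)` is False because the 2nd kernel's INIT handler is already registered by then.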
Looks like you might have missed a few chunks in patches 4 and 6. I'll fix it up shortly. Assuming this gets:
1) accepted upstream
2) approved for 5.4
3) tested on 5.4 by Fujitsu
I'll post this as soon as the ACKs come in and testing is done.
Created attachment 348578 [details] updated version of patch with missing chunks
Created attachment 348585 [details] updated patch with fixed chunk
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852502 Please test this build ASAP. It needs to be validated in the next few days to have any chance of making it into 5.4
Created attachment 348634 [details] new patch with args->signr check added back

Yeah, you're right. Looking at the patch set and the upstream code again, we need the signr check added back. New patch attached correcting that. Here's the build below. Please have Fujitsu test it and verify its functionality ASAP, this weekend if at all possible. http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852840
Version 2 of the patchset should apply cleanly to RHEL5 if you'd like to test it early and expedite the inclusion process, should it get accepted upstream: http://lists.infradead.org/pipermail/kexec/2009-July/003422.html
0cced40e7c58b1105aef3ca446da7b158a18a9a6 5959906ee9dee602a46e49c868a7e543e050d605 1726b0883dd08636705ea55d577eb0ec314ba427 68cb14c7c46d9204ba451a534f15a8bc12c88e28 6cc3efcdf01cf874ffe770919395918a3ee9365b 07a6a4ae827b54cec4c1b1d92bed1cc9176b45ec 4295ab34883d2070b1145e14f4619478e9788807 Here are the upstream fixes for it.
Created attachment 375925 [details] backport of requested commits

Here's a backport of the requested commits. I'll have a build for you to test soon.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2123601 There's your build. Please test and confirm that it works as expected, thanks!
Please remember to clear needinfo when you update a bz, or I likely won't see it. Yeah, the patch broke. It looks pretty simple; I'll try to get to it today.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2128975 There you go, new build to test.
dang, failed again, I'll look at it shortly.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2129625 There, that should complete.
Any thoughts as to why this might be happening? It looks from the log like this was a panic, unrelated to the original issue. Does this occur consistently, or only if you try to reproduce the original problem by asserting an INIT during kdump startup?
I'll try to get to this today.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2242169 Here's a new build for you to test.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2243192
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2245151
in kernel-2.6.18-191.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
in kernel-2.6.18-192.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
(In reply to comment #56) > in kernel-2.6.18-192.el5 > You can download this test kernel from http://people.redhat.com/jwilson/el5 > > Please update the appropriate value in the Verified field > (cf_verified) to indicate this fix has been successfully > verified. Include a comment with verification details. The above was the revert of the patch, due to a regression it introduced. Moving bug back to ASSIGNED.
The parts of these patches that I understand seem reasonable to me. What's the upstream status of these changes?
in kernel-2.6.18-215.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Um, this seems like a different problem. Checking the last few lines of the backtrace:

Call Trace:
 [<a0000001006769e0>] schedule+0x1e20/0x2100
                                sp=e000000016d3fc70 bsp=e000000016d39140
 [<a0000001000ab170>] worker_thread+0x170/0x240
                                sp=e000000016d3fd00 bsp=e000000016d39110
 [<a0000001000b3210>] kthread+0x230/0x2c0
                                sp=e000000016d3fd50 bsp=e000000016d390c8
 [<a000000100012210>] kernel_thread_helper+0x30/0x60
                                sp=e000000016d3fe30 bsp=e000000016d390a0
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e000000016d3fe30 bsp=e000000016d390a0
kdump_init_notifier: kdump not configured

This appears to be a boot of the normal kernel (not the kdump kernel), and you've hung it by sending a series of NMIs to it, which I think is expected. If you wanted to claim this problem was reproduced, you would need to configure kdump, crash the kernel, and hang the kdump kernel by issuing a series of NMIs to prevent kdump from starting.
(In reply to comment #74)
> Um, this seems like a different problem. Checking the last few lines of the
> backtrace:
>
> Call Trace:
> [<a0000001006769e0>] schedule+0x1e20/0x2100
> sp=e000000016d3fc70 bsp=e000000016d39140
> [<a0000001000ab170>] worker_thread+0x170/0x240
> sp=e000000016d3fd00 bsp=e000000016d39110
> [<a0000001000b3210>] kthread+0x230/0x2c0
> sp=e000000016d3fd50 bsp=e000000016d390c8
> [<a000000100012210>] kernel_thread_helper+0x30/0x60
> sp=e000000016d3fe30 bsp=e000000016d390a0
> [<a0000001000090c0>] start_kernel_thread+0x20/0x40
> sp=e000000016d3fe30 bsp=e000000016d390a0
> kdump_init_notifier: kdump not configured
>
> This appears to be a boot of the normal kernel (not the kdump kernel) and

No. This was a kdump kernel, booted by pressing the INIT button once. The hang was caused by pressing the button several more times, about a second after the first press. I think this can be seen from the head of the log:

Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Kernel 2.6.18-215.el5 on an ia64

intel-s6e5231-01.rhts.eng.nay.redhat.com login: Linux version 2.6.18-215.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Aug 31 22:19:41 EDT 2010
Ignoring memory below 128MB
Ignoring memory above 384MB
EFI v1.10 by INTEL: SALsystab=0x7e964370 ACPI 2.0=0x7edd7000 SMBIOS=0xf8020 HCDP=0x7ed9e1a0
booting generic kernel on platform dig
PCDP: v1 at 0x7ed9e1a0
Explicit "console="; ignoring PCDP
ACPI: Unable to map DSDT
Number of logical nodes in system = 1
Number of memory chunks in system = 3

> you've hung it up by sending a series of NMI's to it, which I think is
> expected. If you wanted to claim this problem was reproduced, you would need
> to configure kdump, crash the kernel and hang the kdump kernel by issuing a
> series of NMI's to prevent kdump from starting.
First of all, no, I don't see how you can determine this is a kdump boot from the head of the log. I don't see any indicator in which the log entries above diverge from a normal boot. But that's not relevant: if you're sure that you've triggered a kdump, then you have, and we can work under that supposition.

Looking at this, I'm still not sure you've actually reproduced anything that we can do anything about. Looking at your log above, this line jumps out:

kdump_init_notifier: kdump not configured

The fact that kdump isn't configured is irrelevant. If we're in a kdump kernel, then that is expected. What is interesting is how that printk gets emitted. That printk is in kdump_init_notifier, a notifier function that gets registered from machine_crash_setup. machine_crash_setup is an initcall (level 1), which means that it gets called early, right after the init process is forked, which in turn is right after the kernel has finished initialization. The fact that you are getting that printk indicates that you received an NMI (from your pressing of the INIT button) _after_ this step in the original report above:

(5) INIT handler of 2nd-kernel is registered in SAL.

The original bug report calls for the receipt of an NMI _prior_ to that step. Once the INIT handlers are registered, if we get an NMI, the system crashes, which is what is expected, and what you got (as is evidenced by the dumping out of all the backtraces in comment 73).

If your expectation was that the system would never crash when the INIT button is pressed while a vmcore is being captured, that's incorrect. The only way we could avoid that is by never registering INIT handlers, which would prevent the INIT button from ever working. That is not something we want to do: the INIT button helps in debugging kdump issues where the system hangs, and pressing it during capture is not something done in normal production work.

Not sure how you want to handle this, but I really don't think you have reproduced anything we can act upon here.
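The reasoning above can be turned into a rough log-triage check. This is a minimal sketch that assumes only what this comment states: the "kdump_init_notifier:" printk can appear only once the notifier has been registered, so its presence in a console log means the INIT arrived after the handler registration step, outside the window in the original report. The helper name `init_reached_os_handler` is hypothetical.

```python
# Rough triage helper; assumes the console log is available as plain text.
# The function name is hypothetical and exists only for this sketch.

NOTIFIER_TAG = "kdump_init_notifier:"


def init_reached_os_handler(console_log: str) -> bool:
    """Return True if any log line carries the kdump_init_notifier printk.

    A True result suggests the INIT was delivered after the kernel had
    registered its INIT handlers, which is not the reported hang window.
    """
    return any(NOTIFIER_TAG in line for line in console_log.splitlines())
```

A False result does not by itself prove the original hang was reproduced; it only means this particular marker is absent from the log.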
Moving back in to ON_QA
Set verified status based on comment https://bugzilla.redhat.com/show_bug.cgi?id=506694#c71.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html