Bug 506694
| Summary: | kdump hangs up if INIT is received while kdump is starting | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Veaceslav Falico <vfalico> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED ERRATA | QA Contact: | Han Pingtian <phan> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.3 | CC: | anderson, moshiro, ofourdan, peterm, phan, qcai, syeghiay, tindoh |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | ia64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.18-215.el5 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-13 20:49:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 499522, 527955, 533192, 591850 | | |
| Attachments: | | | |

Description
Veaceslav Falico
2009-06-18 11:55:54 UTC
Looks like you might have missed a few chunks in patches 4 and 6. I'll fix it up shortly.

Assuming this gets:
1) accepted upstream
2) approved for 5.4
3) tested on 5.4 by Fujitsu

I'll post this as soon as the ACKs come in and testing is done.

Created attachment 348578
updated version of patch with missing chunks

Created attachment 348585
updated patch with fixed chunk

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852502
Please test this build ASAP. It needs to be validated in the next few days to have any chance of making it into 5.4.

Created attachment 348634
new patch with args->signr check added back

Yeah, you're right. Looking at the patch set and upstream code again, we need the signr check added back. New patch attached correcting that. Here's the build below. Please have Fujitsu test it and verify its functionality ASAP, this weekend if at all possible.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1852840

Version 2 of the patchset should apply cleanly to RHEL5 if you'd like to test it early and expedite the inclusion process should it get accepted upstream:
http://lists.infradead.org/pipermail/kexec/2009-July/003422.html

0cced40e7c58b1105aef3ca446da7b158a18a9a6
5959906ee9dee602a46e49c868a7e543e050d605
1726b0883dd08636705ea55d577eb0ec314ba427
68cb14c7c46d9204ba451a534f15a8bc12c88e28
6cc3efcdf01cf874ffe770919395918a3ee9365b
07a6a4ae827b54cec4c1b1d92bed1cc9176b45ec
4295ab34883d2070b1145e14f4619478e9788807
Here are the upstream fixes for it.

Created attachment 375925
backport of requested commits

Here's a backport of the requested commits. I'll have a build for you to test soon.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2123601
There's your build. Please test and confirm that it works as expected, thanks!

Please remember to clear needinfo when you update a bz, or I likely won't see it.

Yeah, the patch broke. It looks pretty simple; I'll try to get to it today.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2128975
There you go, new build to test.

Dang, failed again. I'll look at it shortly.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2129625
There, that should complete.

Any thoughts as to why this might be happening? It looks from the log like this was a panic, unrelated to the original issue. Does this occur consistently, or only if you try to reproduce the original problem by asserting an INIT during kdump startup?

I'll try to get to this today.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2242169
Here's a new build for you to test.

in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

in kernel-2.6.18-192.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

(In reply to comment #56)
> in kernel-2.6.18-192.el5
> You can download this test kernel from http://people.redhat.com/jwilson/el5
>
> Please update the appropriate value in the Verified field
> (cf_verified) to indicate this fix has been successfully
> verified. Include a comment with verification details.

The above was the revert of the patch, due to a regression it introduced. Moving bug back to ASSIGNED.

The parts of these patches that I understand seem reasonable to me. What's the upstream status on these changes?

in kernel-2.6.18-215.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Um, this seems like a different problem. Checking the last few lines of the backtrace:

all Trace:
 [<a0000001006769e0>] schedule+0x1e20/0x2100
                                sp=e000000016d3fc70 bsp=e000000016d39140
 [<a0000001000ab170>] worker_thread+0x170/0x240
                                sp=e000000016d3fd00 bsp=e000000016d39110
 [<a0000001000b3210>] kthread+0x230/0x2c0
                                sp=e000000016d3fd50 bsp=e000000016d390c8
 [<a000000100012210>] kernel_thread_helper+0x30/0x60
                                sp=e000000016d3fe30 bsp=e000000016d390a0
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e000000016d3fe30 bsp=e000000016d390a0
kdump_init_notifier: kdump not configured

This appears to be a boot of the normal kernel (not the kdump kernel) and you've hung it up by sending a series of NMI's to it, which I think is expected. If you wanted to claim this problem was reproduced, you would need to configure kdump, crash the kernel and hang the kdump kernel by issuing a series of NMI's to prevent kdump from starting.

(In reply to comment #74)
> Um, this seems like a different problem. Checking the last few lines of the
> backtrace:
>
> all Trace:
> [<a0000001006769e0>] schedule+0x1e20/0x2100
> sp=e000000016d3fc70 bsp=e000000016d39140
> [<a0000001000ab170>] worker_thread+0x170/0x240
> sp=e000000016d3fd00 bsp=e000000016d39110
> [<a0000001000b3210>] kthread+0x230/0x2c0
> sp=e000000016d3fd50 bsp=e000000016d390c8
> [<a000000100012210>] kernel_thread_helper+0x30/0x60
> sp=e000000016d3fe30 bsp=e000000016d390a0
> [<a0000001000090c0>] start_kernel_thread+0x20/0x40
> sp=e000000016d3fe30 bsp=e000000016d390a0
> kdump_init_notifier: kdump not configured
>
> This appears to be a boot of the normal kernel (not the kdump kernel) and

No. This was booting a kdump kernel by pressing the INIT button once. The hang was caused by pressing the button several times, starting about a second after the first press.
I think this can be seen by looking at the head of the log:

Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Kernel 2.6.18-215.el5 on an ia64

intel-s6e5231-01.rhts.eng.nay.redhat.com login: Linux version 2.6.18-215.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Aug 31 22:19:41 EDT 2010
Ignoring memory below 128MB
Ignoring memory above 384MB
EFI v1.10 by INTEL: SALsystab=0x7e964370 ACPI 2.0=0x7edd7000 SMBIOS=0xf8020 HCDP=0x7ed9e1a0
booting generic kernel on platform dig
PCDP: v1 at 0x7ed9e1a0
Explicit "console="; ignoring PCDP
ACPI: Unable to map DSDT
Number of logical nodes in system = 1
Number of memory chunks in system = 3

> you've hung it up by sending a series of NMI's to it, which I think is
> expected. If you wanted to claim this problem was reproduced, you would need
> to configure kdump, crash the kernel and hang the kdump kernel by issuing a
> series of NMI's to prevent kdump from starting.

First of all, no, I don't see how you can determine this is a kdump boot from the head of the log. I don't see any point at which the log entries above diverge from a normal boot. But that's not relevant: if you're sure that you've issued a kdump, then you have, and we can work under that supposition.

Looking at this, I'm still not sure you've actually reproduced anything that we can do anything about. Looking at your log above, this line jumps out:

kdump_init_notifier: kdump not configured

The fact that kdump isn't configured is irrelevant. If we're in a kdump kernel then that is expected. What is interesting is how that printk gets presented. That printk is in kdump_init_notifier, a notifier function that gets registered from machine_crash_setup. machine_crash_setup is an initcall (level 1), which means that it gets called early, right after the init process is forked, which in turn is right after the kernel has finished initialization.

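As a rough illustration of the ordering argument above, here is a toy Python model (not kernel code). The function names kdump_init_notifier and machine_crash_setup come from the comment; the NotifierChain class, the simulate helper, and all signatures are hypothetical stand-ins, and the real notifier has additional conditions that this sketch ignores. It only shows that the message can appear if, and only if, the notifier was already registered when the INIT arrived.

```python
# Toy model only: names mirror the functions mentioned in the comment,
# but the class, helper, and signatures below are hypothetical.

class NotifierChain:
    """Minimal stand-in for a kernel notifier chain."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call_chain(self, event):
        # Run every registered callback and collect what each one "prints".
        return [callback(event) for callback in self._callbacks]


def kdump_init_notifier(event):
    # Stand-in for the printk seen in the log.
    return "kdump_init_notifier: kdump not configured"


def machine_crash_setup(chain):
    # Stand-in for the level-1 initcall that registers the notifier.
    chain.register(kdump_init_notifier)


def simulate(init_arrives_before_initcall):
    """Return the messages produced when a single INIT is delivered."""
    chain = NotifierChain()
    if init_arrives_before_initcall:
        messages = chain.call_chain("INIT")  # nothing registered yet
        machine_crash_setup(chain)
    else:
        machine_crash_setup(chain)
        messages = chain.call_chain("INIT")  # notifier runs, message appears
    return messages


if __name__ == "__main__":
    print(simulate(init_arrives_before_initcall=True))
    print(simulate(init_arrives_before_initcall=False))
```

In this simplified picture, seeing the "kdump not configured" line in a log implies the INIT was delivered after the initcall had run.
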
The fact that you are getting that printk indicates that you received an NMI (from your pressing of the INIT button) _after_ this step in the original report above:

(5) INIT handler of 2nd-kernel is registered in SAL.

The original bug report calls for the receipt of an NMI _prior_ to that step. Once the INIT handlers are registered, if we get an NMI, the system crashes, which is what is expected, and what you got (as is evidenced by the dumping out of all the backtraces in comment 73).

If your expectation was that the system would never crash while capturing a vmcore while pressing the INIT button, that's incorrect. The only way we can avoid that is by never registering INIT handlers, which would prevent the INIT button from ever working. That is not something we want to do, as the INIT button helps in debugging kdump issues where the system hangs, and pressing it is not something done in normal production work.

Not sure how you want to handle this, but I really don't think you have reproduced anything we can act upon here. Moving back in to ON_QA

Set verified status based on comment https://bugzilla.redhat.com/show_bug.cgi?id=506694#c71.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html