Bug 351521
Summary: | kexec / kdump not working on 2.6.18-8.1.15.el5 (and before)
---|---
Product: | Red Hat Enterprise Linux 5
Component: | kexec-tools
Version: | 5.0
Hardware: | x86_64
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | high
Priority: | low
Reporter: | John Haverty <zeio>
Assignee: | Neil Horman <nhorman>
QA Contact: | Red Hat Kernel QE team <kernel-qe>
CC: | jburke, qcai
Fixed In Version: | 1.101-194.4
Doc Type: | Bug Fix
Last Closed: | 2008-03-26 00:00:49 UTC

Description
John Haverty
2007-10-24 23:48:48 UTC
Created attachment 236771
lspci of system
Try adding reset_devices to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump. If that doesn't help, please send in a sysreport of your system. Thanks!

I put the following in /etc/sysconfig/kdump:

    KDUMP_COMMANDLINE_APPEND="reset_devices irqpoll maxcpus=1"

I added reset_devices to what was there already. Then I performed "touch /etc/kdump.conf" and restarted the kdump daemon, making sure the initrd image was up to date. Then upon

    echo "c" > /proc/sysrq-trigger

the same panic occurs:

    EDAC k8 MC0: GART TLB error: transaction type(generic), cache level(generic)
    EDAC k8 MC0: extended error code: GART error
    Kernel panic - not syncing: MC0: processor context corrupt

I will try to attach the sysreport soon [need to clear with manager]. Are there any other things I can try to mitigate this? I noted that the kernel in /var/log/messages completely boots, and only after a complete boot and execution of several of the init scripts does it panic. Is there a theory as to why it waits so long to panic? Is there a theory as to why the vmcore-incomplete would always be 133M?

    78624 -r-------- 1 root root 138870784 Oct 25 13:32 vmcore-incomplete

Hmm, that should have fixed the problem. We specifically added code to fix up the corrupt processor context in kernel 2.6.18-34.el5. What kernel are you running? As for the size of the vmcore file and the time of the crash, they are unrelated. This error can occur at any time and is asynchronous to the behavior of the rest of the system.

I'm running 2.6.18-8.1.15.el5 x86_64 from the update package kernel-2.6.18-8.1.15.el5.x86_64.rpm (that came in with https://rhn.redhat.com/errata/RHSA-2007-0940.html ). But the problem was happening before, and it doesn't happen on 32-bit.

Of course it will happen prior to 2.6.18-8.1.5; we didn't add the code to recognize the reset_devices option to the source tree until after that kernel shipped. And IIRC the code that causes this panic is only present on 64-bit kernels.
If you grab the 5.1 beta release (which should be post 2.6.18-34 at this point) and use the reset_devices option, the problem should go away.

I didn't catch that (the version numbers are long). I can try the 5.1 beta kernel and see if the issue goes away. Is there a direct link to the 5.1 beta kernel for x86_64? I can't find it on RHN or on ftp.redhat.com. I can try that kernel on this setup and see if that makes the problem go away.

John,
You should be able to use the following kernel to test with.
http://people.redhat.com/dzickus/el5/53.el5/x86_64/kernel-2.6.18-53.el5.x86_64.rpm
Jeff

With kernel-2.6.18-53.el5.x86_64, with KDUMP_COMMANDLINE_APPEND="reset_devices irqpoll maxcpus=1" set in /etc/sysconfig/kdump, the box instantly reboots back to BIOS when echo "c" > /proc/sysrq-trigger is issued. It does the same thing, an instant reboot, with KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1" set in kdump.

I'm sorry, this sounds completely different from what you were getting before. Are you saying that you no longer get this message:

    Kernel panic - not syncing: MC0: processor context corrupt

but now instead effectively soft reset the box on a sysrq-c?

Yes. The old behavior was booting the kexec-ed kernel, getting a long way through the init scripts, and starting to dump the vmcore. The new behavior, with the new kernel on the same OS and the same hardware, is an instant (soft) reset. This reset on kexec happened both with and without "reset_devices" in place in /etc/sysconfig/kdump.

Was there anything logged prior to the soft reset? Something perhaps regarding kexec failing to load? Also, can you check /sys/kernel/kexec_crash_loaded prior to issuing the sysrq-c to crash the system? Also, can you provide a sysreport of this system?

I clean out the log files before doing the sysrq-c, then sync the disk. The logs simply end at the time of sysrq-c; nothing further is logged until the version banner of the normal kernel booting.
As for /sys/kernel/kexec_crash_loaded, it has a "1" inside, see:

    # pwd
    /sys/kernel
    # find . -type f -print -exec cat {} \;
    ./kexec_crash_loaded
    1
    ./kexec_loaded
    0
    ./uevent_helper
    ./uevent_seqnum
    698

Does this look correct?

Hmm, yeah. That's no good. Given that we get an immediate reset (without any apparent console messages to give us a clue as to why), bisection is probably going to be our best bet. Please try out this kernel:
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-40.el5.x86_64.rpm
and tell me if it returns you to your original behavior. If it does, I think I have a strong candidate for the patch that caused the change.

Report on 2.6.18-40.el5, from kernel-2.6.18-40.el5.x86_64.rpm: This kernel also showed the instant reset behavior upon sysrq-c. Nothing was logged in the /var/log directory after the sysrq-c was issued; the next lines were the kernel booting. When this "instant reset" issue manifests, nothing is created in /var/crash (whereas on the -15 kernel, a vmcore-incomplete will appear there). I also noticed that lm_sensors gave a "General parse error", but that is unrelated to this (submitted as an FYI). Reverting back to 2.6.18-8.1.15.el5 and will await another kernel to try.

Thanks for the update. OK, let's try this kernel:
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-34.el5.x86_64.rpm
I'm not seeing many suspects in the intervening kernels that indicate they might be problematic. If this kernel doesn't return us to the original behavior, we may have a legitimate error that's causing a panic, and will have to resort to some EDAC tuning to avoid the issue.

kernel-2.6.18-34.el5.x86_64.rpm also did the instant reset with nothing logged in /var/log or in /var/crash. I reverted to 2.6.18-8.1.15.el5 and tested sysrq-c; it did exec but panicked as before. It seems the newer kernels won't kexec properly on this platform (HDAMA-I).

OK, so this reset behavior seems to be independent of the reset_devices option.
Next step is to try this kernel please:
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-20.el5.x86_64.rpm
Thanks!

2.6.18-20.el5.x86_64 hangs on boot. Output:

    <SNIP>
    CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
    CPU: L2 Cache: 1024K (64 bytes/line)
    CPU 0/0 -> Node 0
    CPU: Physical Processor ID: 0
    CPU: Processor Core ID: 0
    SMP alternatives: switching to UP code
    ACPI: Core revision 20060707
    Using local APIC timer interrupts.
    result 12528700
    Detected 12.528 MHz APIC timer.
    SMP alternatives: switching to SMP code
    Booting processor 1/2 APIC 0x1
    Initializing CPU#1
    Calibrating delay using timer specific routine.. 4810.22 BogoMIPS (lpj=2405113)
    CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
    CPU: L2 Cache: 1024K (64 bytes/line)
    CPU 1/1 -> Node 0
    CPU: Physical Processor ID: 0
    CPU: Processor Core ID: 1
    Dual Core AMD Opteron(tm) Processor 280 stepping 02
    CPU 1: Syncing TSC to CPU 0.
    CPU 1: synchronized TSC with CPU 0 (last diff 2 cycles, maxerr 540 cycles)
    Brought up 2 CPUs
    testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
    Disabling vsyscall due to use of PM timer
    time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
    time.c: Detected 2405.508 MHz processor.
    HANG.

(Interestingly, the screen blanking still works in this state.) I'll try passing some options to the kernel and tweaking the BIOS from its default settings (I generally always use the default settings) with regard to NMI, and see if I can get this kernel booting.

Created attachment 253801
Panic output when "nosmp"
panic output from kernel booted as nosmp
I tried booting that kernel, 2.6.18-20.el5.x86_64, with "nosmp" at the end:

    Command line: ro root=LABEL=/ crashkernel=128M@16M console=ttyS0,115200n8 console=tty0 nosmp

And the kernel panics. In SMP mode, above, it would hang, but in UP mode it panics. Attached is the output of that panic. (See above, 253801.)

Ugh. OK, so this system seems like it has never worked with kexec and is failing in various and sundry ways depending on the kernel version. So much for bisecting. Above, were you also passing noapic at the same time that you were passing nosmp? I would suggest that you try removing noapic if you were passing that. If the hang occurs, then you may well be seeing a problem that we're looking at on some other x86_64 systems.

I'm pretty sure I passed only "nosmp", not noapic along with that. Do you want me to try both "nosmp noapic" and "nosmp"? I'm fairly certain that when I passed "nosmp" it was the only extra parameter.

If you could try with noapic added please, that would be helpful. I was just reading about a similar set of problems that has been worked around by not using the APIC:
https://bugzilla.redhat.com/show_bug.cgi?id=205479

OK; which kernel version do you want me to try this on? And do you want UP "nosmp noapic" and SMP "noapic" to be tested? I believe I did try noapic before, but I will do it again and better document it for completeness.

Just noapic will be fine, thanks.

Long time between updates, sorry. Still not working with noapic.

OK, so interestingly, having waited may solve your problem here. Looking back over your stack trace to try to find the specific problem, I note that it occurs during PCI probing of SATA devices. I was going to look through that routine for problems that might result in an oops, only to find that in the latest kernel, that entire part of libata has been rewritten to fix several probing issues.
I think perhaps the patch with those changes solved this problem, and we just didn't have the patch available last time we looked. It's now available in the latest kernel, -79.el5. Can you try with that kernel please? It should fix your SATA issue here.

kernel-2.6.18-79.el5: is this in the channel yet? I'm not seeing it in the channel. Is it something that would be located at people.redhat.com?

Yeah, sorry, I'll place the latest x86_64 kernel on my people page for you:
http://people.redhat.com/nhorman
You'll find a link referenced by the bz number there shortly. Thanks.

Hello; thanks for the new kernel to try.

    Linux localhost.localdomain 2.6.18-81.el5 #1 SMP Tue Feb 12 21:26:39 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
    # cat /etc/redhat-release
    Red Hat Enterprise Linux Server release 5 (Tikanga)

This new kernel did not resolve the issue for us.

Can you provide the stack trace of the failure?

How would I do that? The last thing I see is:

    EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
    EDAC k8 MC0: extended error code: GART error
    Kernel panic - not syncing : MC0: processor context corrupt

How do I get the kernel to give more information output to the console on panic? Note: This is happening on two different motherboards with two different BIOS types here.

If it's not displaying on the console, you may have to connect a serial console to get the stack trace on your system. Although, looking at it, it seems to me that you've at some point removed the reset_devices option from your kdump sysconfig file. This is the code responsible for what you're seeing on the screen:

    if (regs->nbsh & BIT(25)) {
        if (reset_devices == 0)
            panic("MC%d: processor context corrupt", mci->mc_idx);
        else
            k8_mc_printk(mci, KERN_CRIT, "processor context corrupt\n");
    }

Note you'll only get the panic if you don't have reset_devices in your /etc/sysconfig/kdump file. Can you please make sure that you have it in place and try the test again? Thanks!
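
The two checks asked for in this thread (is reset_devices present in KDUMP_COMMANDLINE_APPEND, and is a crash kernel loaded) can be scripted before triggering sysrq-c. The following is only a minimal sketch and not part of kexec-tools: the function names kdump_append_options and crash_kernel_loaded are hypothetical, and it assumes /etc/sysconfig/kdump contains simple shell-style KEY="value" assignments like the ones quoted in this report.

```python
import shlex


def kdump_append_options(sysconfig_text):
    """Return the options listed in KDUMP_COMMANDLINE_APPEND.

    Assumption: the file uses plain KEY="value" lines. The last
    assignment wins, as it would when a shell sources the file.
    """
    options = []
    for line in sysconfig_text.splitlines():
        try:
            # shlex strips the quotes and ignores '#' comments.
            tokens = shlex.split(line, comments=True)
        except ValueError:
            continue  # skip lines with unbalanced quotes
        if len(tokens) == 1 and tokens[0].startswith("KDUMP_COMMANDLINE_APPEND="):
            options = tokens[0].split("=", 1)[1].split()
    return options


def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the sysfs flag reads "1" (a crash kernel is loaded)."""
    with open(path) as flag:
        return flag.read().strip() == "1"


# Demonstration on inline text shaped like the file quoted in this report.
sample = 'KDUMP_COMMANDLINE=""\nKDUMP_COMMANDLINE_APPEND="reset_devices irqpoll maxcpus=1"\n'
print("reset_devices" in kdump_append_options(sample))
```

On a real system the text would come from reading /etc/sysconfig/kdump, and crash_kernel_loaded() would be called with its default path. A missing reset_devices here corresponds to the panic branch in the code quoted above.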
OK, but in the package kexec-tools-1.101-194.4.el5.src.rpm, the contents include the example config files for kdump.conf and kdump (.i386 and .x86_64), and they don't contain reset_devices. I can add it to the command line. I have a serial console hooked up, and it displays the same thing as the console with no stack trace. Will add reset_devices and try again.

So I put back reset_devices. The last lines out of the serial console are:

    SELinux: Disabled at runtime.
    audit(1206476170.953:2): selinux=0 auid=4294967295
    EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
    EDAC k8 MC0: extended error code: GART error
    EDAC k8 MC0: processor context corrupt

The command line used when that kernel booted was:

    Command line: ro root=/dev/sda7 console=ttyS0,115200n8 console=tty0 reset_devices irqpoll maxcpus=1 memmap=exactmap memmap=640K@0KK

Kernel: 2.6.18-53.1.14.el5. I'll retry with reset_devices=1.

You probably want to get rid of the console=tty0 option for your kdump kernel.

I have a working configuration:

#1: /etc/modprobe.conf

    install k8temp /bin/true
    install k8_edac /bin/true

- k8temp causes sensors to fail with a general parse error.
- k8_edac: by suppressing this module I don't get the "EDAC k8 MC0:" messages anymore.

#2: kdump.conf is default (nothing uncommented).

#3: kdump in /etc/sysconfig/ is as follows:

    KDUMP_KERNELVER=""
    KDUMP_COMMANDLINE=""
    KDUMP_COMMANDLINE_APPEND="reset_devices irqpoll maxcpus=1"
    KEXEC_ARGS=" --args-linux"
    KDUMP_BOOTDIR="/boot"
    KDUMP_IMG="vmlinuz"
    KDUMP_IMG_EXT=""

(Note: reset_devices was not in the RPM when I did an rpm2cpio|cpio -ivd on that package and checked the contents, so I don't know why it wasn't there.)

With the configuration like this, the vmcore file is dumped:

    /var/crash :
    4 drwxr-xr-x 2 root root 4096 Mar 25 14:07 2008-03-25-14:05/
    3840140 -r-------- 1 root root 4030506072 Mar 25 14:07 vmcore
    3.7G -r-------- 1 root root 3.8G Mar 25 14:07 vmcore

Side question: How do I get dual console output?
What I noticed is that everything works properly when specifying two consoles like that (both on the kernel command line), except the init scripts' output always goes to tty0 if console=tty0 is anywhere in the command line. So, how do we make all this information fit into the "generic" configuration? Or should I just use these parameters going forward? Should I test to see if this would work with the older/original kernel (2.6.18-8.1.15.el5)?

Another note, the original boot command line is:

#4:

    ro root=/dev/sda7 console=tty0 console=ttyS0,115200n8 crashkernel=256M@16M

And it seems the last console specified in the list is the one that becomes /dev/console. The kernel itself will output to both places, but init and the rest will use /dev/console, which maps to the last console= on the kernel command line.

OK, so this is working now, good. Regarding your dual console output question, your notes are correct: the kernel will output to both places, but userspace only outputs to the last console specified, and you need to take that into consideration when setting up your command line.

Regarding making a generic config: I'm afraid there is no sufficiently "generic" kdump setup that works for all systems. The one available is the most generic that I've been able to come up with so far that works for most systems in my experience. As it happens, reset_devices has been sufficiently useful that in the 5.2 update it's part of the default config. Glad it's all working for you now.
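
The console behaviour described in the last two comments can be illustrated with a short sketch. This only models the rule as stated in this thread (kernel messages go to every console= entry, while /dev/console follows the last one); it is not kernel code, and the function name console_targets is hypothetical.

```python
def console_targets(cmdline):
    """Split a kernel command line into its console= devices.

    Returns (all_consoles, dev_console). Per the thread, the kernel
    prints to every listed console, while init and other userspace
    output goes to /dev/console, which follows the last console= entry.
    """
    consoles = [
        # Drop serial options such as ",115200n8" and keep the device name.
        arg.split("=", 1)[1].split(",", 1)[0]
        for arg in cmdline.split()
        if arg.startswith("console=")
    ]
    dev_console = consoles[-1] if consoles else None
    return consoles, dev_console


# Command line #4 quoted above: the serial port is listed last.
boot = "ro root=/dev/sda7 console=tty0 console=ttyS0,115200n8 crashkernel=256M@16M"
print(console_targets(boot))
```

Swapping the order of the two console= options moves userspace output to tty0, which matches the earlier observation that init script output went to tty0 when console=tty0 was listed last.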