Description of problem:
When attempting a kexec reboot, either manually or via a panic-triggered kdump, the ES7000/ONE hangs in the new kernel right after printing the "Memory: 32839688k/33685504k available" line.

How reproducible:
Using RHEL 5, configure a kexec reboot for either the kexec command or kdump, and then trigger the kexec (either kexec -e or alt-sysrq-c). The new kernel will hang.

Steps to Reproduce:
1. Set up the kexec kernel
2. Trigger the kexec (either kexec -e or alt-sysrq-c)
3. Wait

Actual results:
System hang.

Expected results:
Successful reboot.

Additional info:
The problem has been tracked to old code in the io_apic.c file. Inside disable_IO_APIC(), the obsolete 4-bit field physical.physical_dest was used. As of the xAPIC spec (for Xeon) this field was expanded to 8 bits. The old code cuts the upper 4 bits off the APIC ID, and on the ES7000 this causes the timer interrupt to fail on any cell above cell 0 (the cell number ends up being the top 4 bits of the APIC ID).

I have patched this in the upstream kernel with the patch titled [PATCH 2.6.19.2 1/1] kexec: update IO-APIC dest field to 8-bit for xAPIC, which is attached. We would appreciate it if this patch could be applied to the RHEL 5 kernel.
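To make the truncation described above concrete, here is a minimal sketch in plain C. It assumes a simplified model of the destination word (24 reserved bits followed by the destination bits); the struct and function names are hypothetical and are not the kernel's own definitions. The example APIC ID 0x21 is likewise an assumption, chosen so that the top 4 bits (the cell number, per the report) are non-zero.

```c
#include <assert.h>

/* Simplified model of the old destination word: only 4 destination bits. */
struct old_dest_word {
    unsigned int reserved_1    : 24;
    unsigned int physical_dest : 4;   /* obsolete 4-bit field */
    unsigned int reserved_2    : 4;
};

/* Simplified model of the xAPIC-style destination word: 8 destination bits. */
struct new_dest_word {
    unsigned int reserved : 24;
    unsigned int dest     : 8;
};

/* Store an APIC ID through the 4-bit field and read back what survived. */
static unsigned int program_old(unsigned int apic_id)
{
    struct old_dest_word w = { 0 };
    w.physical_dest = apic_id;        /* upper 4 bits are silently dropped */
    return w.physical_dest;
}

/* Store an APIC ID through the 8-bit field and read back what survived. */
static unsigned int program_new(unsigned int apic_id)
{
    struct new_dest_word w = { 0 };
    w.dest = apic_id;                 /* all 8 bits are kept */
    return w.dest;
}

/* Hypothetical self-check: cell 2, CPU 1 encoded as APIC ID 0x21. */
static void check_truncation(void)
{
    assert(program_old(0x21) == 0x01);  /* cell number lost: looks like cell 0 */
    assert(program_new(0x21) == 0x21);  /* cell number preserved */
    assert(program_old(0x05) == 0x05);  /* IDs in cell 0 were never affected */
}
```

Under these assumptions, any APIC ID below 0x10 survives the old code unchanged, which would be consistent with the failure only showing up on cells above cell 0.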
Created attachment 146542
[PATCH 2.6.19.2 1/1] kexec: update IO-APIC dest field to 8-bit for xAPIC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I have no familiarity with this code, but the change to the IO_APIC_route_entry structure looks to be a RHEL5 KABI-breaker.
Not to mention that the patch doesn't apply:

arch/x86_64/kernel/io_apic.c.rej:

***************
*** 847,853 ****
      if (vector < 0)
          continue;
-     entry.dest.logical.logical_dest = cpu_mask_to_apicid(mask);
      entry.vector = vector;
      ioapic_register_intr(irq, vector, IOAPIC_AUTO);
--- 847,853 ----
      if (vector < 0)
          continue;
+     entry.dest = cpu_mask_to_apicid(mask);
      entry.vector = vector;
      ioapic_register_intr(irq, vector, IOAPIC_AUTO);
***************
*** 1077,1094 ****
  printk(KERN_DEBUG ".... IRQ redirection table:\n");
- printk(KERN_DEBUG " NR Log Phy Mask Trig IRR Pol"
-     " Stat Dest Deli Vect: \n");
  for (i = 0; i <= reg_01.bits.entries; i++) {
      struct IO_APIC_route_entry entry;
      entry = ioapic_read_entry(apic, i);
-     printk(KERN_DEBUG " %02x %03X %02X ", i,
-         entry.dest.logical.logical_dest,
-         entry.dest.physical.physical_dest
      );
      printk("%1d %1d %1d %1d %1d %1d %1d %02X\n",
--- 1077,1093 ----
  printk(KERN_DEBUG ".... IRQ redirection table:\n");
+ printk(KERN_DEBUG " NR Dst Mask Trig IRR Pol"
+     " Stat Dmod Deli Vect: \n");
  for (i = 0; i <= reg_01.bits.entries; i++) {
      struct IO_APIC_route_entry entry;
      entry = ioapic_read_entry(apic, i);
+     printk(KERN_DEBUG " %02x %03X ", i,
+         entry.dest
      );
      printk("%1d %1d %1d %1d %1d %1d %1d %02X\n",
***************
*** 1350,1357 ****
  entry.dest_mode = 0; /* Physical */
  entry.delivery_mode = dest_ExtINT; /* ExtInt */
  entry.vector = 0;
- entry.dest.logical.logical_dest =
-     GET_APIC_ID(apic_read(APIC_ID));
  /*
   * Add it to the IO-APIC irq-routing table:
--- 1349,1355 ----
  entry.dest_mode = 0; /* Physical */
  entry.delivery_mode = dest_ExtINT; /* ExtInt */
  entry.vector = 0;
+ entry.dest = GET_APIC_ID(apic_read(APIC_ID));
  /*
   * Add it to the IO-APIC irq-routing table:
***************
*** 2257,2263 ****
  entry.delivery_mode = INT_DELIVERY_MODE;
  entry.dest_mode = INT_DEST_MODE;
- entry.dest.logical.logical_dest = cpu_mask_to_apicid(mask);
  entry.trigger = triggering;
  entry.polarity = polarity;
  entry.mask = 1; /* Disabled (masked) */
--- 2255,2261 ----
  entry.delivery_mode = INT_DELIVERY_MODE;
  entry.dest_mode = INT_DEST_MODE;
+ entry.dest = cpu_mask_to_apicid(mask);
  entry.trigger = triggering;
  entry.polarity = polarity;
  entry.mask = 1; /* Disabled (masked) */
The patch is the upstream patch that went against 2.6.19.2 and was approved. I can generate a backported patch for a specific 2.6.18 RHEL 5 kernel if that will address the problem with applying it. I'm not sure I understand or agree with the KABI issue. The entry.dest field hasn't moved so any binaries that access entry.dest in the old incorrect way won't be broken, though they will continue to be incompatible with any ES7000 larger than a single cell.
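The claim that the destination bits have not moved can be checked with a small stand-alone sketch. It assumes a simplified two-word model of the route entry and a compiler that packs bit-fields the way GCC does on x86_64; the struct names are hypothetical and this is not the kernel header itself. It only speaks to binary layout, not to symbol checksums.

```c
#include <assert.h>
#include <string.h>

/* Simplified model of the entry before the patch: dest is a union. */
struct old_route_entry {
    unsigned int low_word;            /* vector, delivery mode, mask, ... */
    union {
        struct {
            unsigned int reserved_1    : 24;
            unsigned int physical_dest : 4;
            unsigned int reserved_2    : 4;
        } physical;
        struct {
            unsigned int reserved_1   : 24;
            unsigned int logical_dest : 8;
        } logical;
    } dest;
};

/* Simplified model of the entry after the patch: dest is a flat 8-bit field. */
struct new_route_entry {
    unsigned int low_word;
    unsigned int reserved_3 : 24;
    unsigned int dest       : 8;
};

/* The overall size must not change between the two definitions. */
_Static_assert(sizeof(struct old_route_entry) == sizeof(struct new_route_entry),
               "route entry size changed");

/* Returns 1 if writing the same APIC ID through the old 8-bit logical
 * field and the new flat field produces byte-identical entries. */
static int same_layout(unsigned int apic_id)
{
    struct old_route_entry o;
    struct new_route_entry n;

    memset(&o, 0, sizeof(o));
    memset(&n, 0, sizeof(n));
    o.dest.logical.logical_dest = apic_id;
    n.dest = apic_id;
    return memcmp(&o, &n, sizeof(o)) == 0;
}

/* Hypothetical self-check using an example APIC ID with a non-zero cell. */
static void check_layout(void)
{
    assert(same_layout(0x21));
}
```

If this model holds, code that wrote the old logical_dest field keeps producing the same bits, while code that wrote physical_dest keeps truncating to 4 bits.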
(In reply to comment #7)
> The patch is the upstream patch that went against 2.6.19.2 and was approved. I
> can generate a backported patch for a specific 2.6.18 RHEL 5 kernel if that
> will address the problem with applying it.

Thanks Ben -- please use the 2.6.18-45.el5 kernel. I've placed the kernel's src.rpm here:

http://people.redhat.com/anderson/BZ_224373

Also, can you post a link to the LKML post and/or the git commit ID of the upstream patch?

> I'm not sure I understand or agree with the KABI issue. The entry.dest field
> hasn't moved so any binaries that access entry.dest in the old incorrect way
> won't be broken, though they will continue to be incompatible with any ES7000
> larger than a single cell.

Yes, but if the modified data structure is referenced as an argument to any EXPORT_XXX function or data variable, the checksum calculation for that function or variable will change. I'm not sure yet myself, so I first need to be able to pass a kernel with your patch applied through our build system, which will choke on any KABI issues. If it fails, we'll need to work around it somehow.

If you look in the kernel source tree I've provided for "#ifndef __GENKSYMS__" references, you'll see how we typically work around situations where the data layout is basically the same, but names/types have changed, members have been added, or whatever. The kernel gets built with __GENKSYMS__ turned off so that it picks up the patches, but __GENKSYMS__ is turned on when genksyms is run.
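For readers unfamiliar with the "#ifndef __GENKSYMS__" pattern mentioned above, the following is a rough sketch of how it could look if applied to this change. The struct and function names are hypothetical, the layout is a simplified model, and nothing here is taken from the RHEL 5 tree; it only illustrates the idea that the compiler sees the new declaration while genksyms sees the original one.

```c
#include <assert.h>

/* Hypothetical stand-in for the upper word of the route entry. */
struct route_entry_hi {
#ifndef __GENKSYMS__
    /* Normal compile: __GENKSYMS__ is not defined, so the patched,
     * flat 8-bit destination field is what actually gets built. */
    unsigned int reserved_3 : 24;
    unsigned int dest       : 8;
#else
    /* genksyms run: __GENKSYMS__ is defined, so the checksum is computed
     * from the original declaration and exported symbols keep their CRCs. */
    union {
        struct {
            unsigned int reserved_1    : 24;
            unsigned int physical_dest : 4;
            unsigned int reserved_2    : 4;
        } physical;
        struct {
            unsigned int reserved_1   : 24;
            unsigned int logical_dest : 8;
        } logical;
    } dest;
#endif
};

/* Code is written against the new declaration only. */
static unsigned int stored_dest(unsigned int apic_id)
{
    struct route_entry_hi e = { 0 };
    e.dest = apic_id;
    return e.dest;
}

/* Hypothetical self-check with an example 8-bit APIC ID. */
static void check_stored_dest(void)
{
    assert(stored_dest(0x21) == 0x21);
}
```

The trick is only safe when both declarations occupy the same bytes, which is the precondition described in the comment above.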
I patched 2.6.19.2, it went through some discussion, and then it went into 2.6.21; if you check the changelog for that release you should find it. Here's the other information you wanted:

commit ee4eff6ff6cbfc8ce38131058a18802bf6206879
Author: Benjamin Romer <benjamin.romer>
Date: Tue Feb 13 13:26:25 2007 +0100

    [PATCH] x86-64: update IO-APIC dest field to 8-bit for xAPIC

I've pulled down that kernel source RPM, and will start moving my patch right now. :)
Created attachment 190441
IO-APIC patch for RHEL 5, 2.6.18

A backported version of my IO-APIC patch.
The build failed, long before any KABI checks were done. Changes are also required for the analogous function(s) in io_apic-xen.c:

arch/x86_64/kernel/io_apic-xen.c: In function 'setup_IO_APIC_irqs':
arch/x86_64/kernel/io_apic-xen.c:959: error: request for member 'logical' in something not a structure or union
arch/x86_64/kernel/io_apic-xen.c:977: error: request for member 'logical' in something not a structure or union
arch/x86_64/kernel/io_apic-xen.c: In function 'io_apic_set_pci_routing':
arch/x86_64/kernel/io_apic-xen.c:2201: error: request for member 'logical' in something not a structure or union
make[1]: *** [arch/x86_64/kernel/io_apic-xen.o] Error 1
make: *** [arch/x86_64/kernel] Error 2
All right, let me fix those and I'll generate another patch file.
Created attachment 191881
fixed patch for RHEL 5 2.6.18

Here's the fix for the xen file. Sorry about missing it the first time!
OK, the build is underway -- let's see how the KABI issue shakes out... Thanks, Dave
The build completed with no KABI issues.

Can you please test/verify the two test kernels here:

http://people.redhat.com/anderson/BZ_224373

(i.e., kernel-2.6.18-45.el5.bz224373.2.x86_64.rpm and kernel-xen-2.6.18-45.el5.bz224373.2.x86_64.rpm)

The debuginfo rpms and the src.rpm are also there if you want them.

Thanks,
Dave
Great! I'll install these on an ES7000 and do some kexec testing. I'll get back to you as soon as possible. :)
Using Dave's kernel, kernel-2.6.18-45.el5.bz224373.2.x86_64.rpm, I was able to boot into the kexec kernel on the ES7000 we have here. This is what I did:

kexec -l /boot/vmlinuz-2.6.18-45.el5.bz224373.2 --initrd=/boot/initrd-2.6.18-45.el5.bz224373.2.img --command-line="ro root=/dev/VolGroup00/LogVol00 crashkernel=128M@16M console=ttyS0,115200 kexecboot"
kexec -e
Created attachment 192911
kernel-2.6.18-45.el5.bz224373.2 x86_64 kexec boot log
I was able to get it to work here as well. :)
in 2.6.18-58.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
I apologise greatly for the long delay in testing the kernel update from #26. I was not able to get it to kexec without the lpj parameter. Without that parameter, the kernel behaves as it did before my patch, hanging right after the "Memory:" line during boot. Could you please verify that my patch is in the kernel?
The patch has been in place since 2.6.18-57.el5, in the patch named:

linux-2.6-x86_64-update-IO-APIC-dest-field-to-8-bit-for-xAPI.patch

If you look at the kernel's src.rpm from Don's tree, you can see it:

$ rpm2cpio kernel-2.6.18-58.el5.src.rpm | cpio -t | grep dest-field
linux-2.6-x86_64-update-IO-APIC-dest-field-to-8-bit-for-xAPI.patch
173156 blocks
$
Please disregard my last comment; I made a typo in menu.lst that was screwing up my test. It's working fine! :)
Whew...
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1--available now on partners.redhat.com.

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

Thank you
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3--available now on partners.redhat.com.

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

Thank you
Looks good! We've tested it and it's working. Thanks! :)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html