Bug 224373 - kexec or kdump hangs on ES7000/ONE
Summary: kexec or kdump hangs on ES7000/ONE
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 246139 296411 372911 420521 422431 422441 442922
Reported: 2007-01-25 14:09 UTC by Ben Romer
Modified: 2018-10-19 21:09 UTC (History)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:40:56 UTC
Target Upstream Version:
Embargoed:


Attachments
[PATCH 2.6.19.2 1/1] kexec: update IO-APIC dest field to 8-bit for xAPIC (4.51 KB, patch)
2007-01-25 14:09 UTC, Ben Romer
IO-APIC patch for RHEL 5, 2.6.18 (3.45 KB, patch)
2007-09-07 20:01 UTC, Ben Romer
fixed patch for RHEL 5 2.6.18 (6.22 KB, patch)
2007-09-10 19:19 UTC, Ben Romer
kernel-2.6.18-45.el5.bz224373.2 x86_64 kexec boot log (105.66 KB, text/plain)
2007-09-11 19:19 UTC, Jeff Burke


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0314 0 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5.2 2008-05-20 18:43:34 UTC

Description Ben Romer 2007-01-25 14:09:09 UTC
Description of problem:
When attempting a kexec reboot, either manually or via a panic-triggered kdump,
the ES7000/ONE hangs while booting the new kernel, right after printing the
"Memory: 32839688k/33685504k available" line.

How reproducible:
Using RHEL 5, configure a kexec reboot for either the kexec command or kdump,
and then trigger the kexec (either kexec -e or alt-sysrq-c). The new kernel will
hang.

Steps to Reproduce:
1. Set up kexec kernel 
2. trigger the kexec (either kexec -e or alt-sysrq-c)
3. wait
  
Actual results:
System hang.

Expected results:
Successful reboot.

Additional info:
The problem has been tracked to old code in the io_apic.c file. Inside of
disable_IO_APIC(), the obsolete 4-bit field physical.physical_dest was used. As
of the xAPIC spec (for Xeon) this field was expanded to 8 bits. The old code
cuts the upper 4 bits off of the APIC ID, and on the ES7000 this causes the
timer interrupt to fail on any cell above cell 0 (the cell number ends up being
the top 4 bits of the APIC ID).
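
To illustrate the truncation described above, here is a minimal, self-contained
C sketch. It is not taken from the attached patch: the struct and function
names are hypothetical, and 0x12 is an assumed example APIC ID whose top 4 bits
(the cell number) are 1. It only shows what happens when an 8-bit ID is stored
into a 4-bit bitfield versus an 8-bit one.

```c
/*
 * Hypothetical stand-in for the destination word of an IO-APIC route
 * entry, using the pre-xAPIC layout with a 4-bit physical destination.
 */
struct old_dest_word {
	unsigned int reserved_1    : 24;
	unsigned int physical_dest : 4;
	unsigned int reserved_2    : 4;
};

/* The same word with the full 8-bit destination field (xAPIC). */
struct new_dest_word {
	unsigned int reserved : 24;
	unsigned int dest     : 8;
};

/* Returns the APIC ID as it survives a store to the 4-bit field. */
unsigned int store_old_dest(unsigned int apic_id)
{
	struct old_dest_word w = {0};

	w.physical_dest = apic_id;	/* upper 4 bits are silently dropped */
	return w.physical_dest;
}

/* Returns the APIC ID as it survives a store to the 8-bit field. */
unsigned int store_new_dest(unsigned int apic_id)
{
	struct new_dest_word w = {0};

	w.dest = apic_id;		/* all 8 bits are kept */
	return w.dest;
}
```

With these definitions, store_old_dest(0x12) yields 0x2 while
store_new_dest(0x12) yields 0x12. For IDs below 0x10 (cell 0) the two agree,
which is consistent with only cells above cell 0 being affected.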

I have patched this in the upstream kernel with the attached patch, titled
"[PATCH 2.6.19.2 1/1] kexec: update IO-APIC dest field to 8-bit for xAPIC".

We would appreciate it if this patch could be applied to the RHEL 5 kernel.

Comment 1 Ben Romer 2007-01-25 14:09:09 UTC
Created attachment 146542
[PATCH 2.6.19.2 1/1] kexec: update IO-APIC dest field to 8-bit for xAPIC

Comment 2 RHEL Program Management 2007-04-25 22:04:28 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Dave Anderson 2007-09-06 18:37:57 UTC
I have no familiarity with this code, but the change to the
IO_APIC_route_entry structure looks to be a RHEL5 KABI-breaker.


Comment 6 Dave Anderson 2007-09-06 20:07:21 UTC
Not to mention that the patch doesn't apply:

arch/x86_64/kernel/io_apic.c.rej:

***************
*** 847,853 ****
  			if (vector < 0)
  				continue;
  
- 			entry.dest.logical.logical_dest = cpu_mask_to_apicid(mask);
  			entry.vector = vector;
  
  			ioapic_register_intr(irq, vector, IOAPIC_AUTO);
--- 847,853 ----
  			if (vector < 0)
  				continue;
  
+ 			entry.dest = cpu_mask_to_apicid(mask);
  			entry.vector = vector;
  
  			ioapic_register_intr(irq, vector, IOAPIC_AUTO);
***************
*** 1077,1094 ****
  
  	printk(KERN_DEBUG ".... IRQ redirection table:\n");
  
- 	printk(KERN_DEBUG " NR Log Phy Mask Trig IRR Pol"
- 			  " Stat Dest Deli Vect:   \n");
  
  	for (i = 0; i <= reg_01.bits.entries; i++) {
  		struct IO_APIC_route_entry entry;
  
  		entry = ioapic_read_entry(apic, i);
  
- 		printk(KERN_DEBUG " %02x %03X %02X  ",
  			i,
- 			entry.dest.logical.logical_dest,
- 			entry.dest.physical.physical_dest
  		);
  
  		printk("%1d    %1d    %1d   %1d   %1d    %1d    %1d    %02X\n",
--- 1077,1093 ----
  
  	printk(KERN_DEBUG ".... IRQ redirection table:\n");
  
+ 	printk(KERN_DEBUG " NR Dst Mask Trig IRR Pol"
+ 			  " Stat Dmod Deli Vect:   \n");
  
  	for (i = 0; i <= reg_01.bits.entries; i++) {
  		struct IO_APIC_route_entry entry;
  
  		entry = ioapic_read_entry(apic, i);
  
+ 		printk(KERN_DEBUG " %02x %03X ",
  			i,
+ 			entry.dest
  		);
  
  		printk("%1d    %1d    %1d   %1d   %1d    %1d    %1d    %02X\n",
***************
*** 1350,1357 ****
  		entry.dest_mode       = 0; /* Physical */
  		entry.delivery_mode   = dest_ExtINT; /* ExtInt */
  		entry.vector          = 0;
- 		entry.dest.logical.logical_dest =
- 					GET_APIC_ID(apic_read(APIC_ID));
  
  		/*
  		 * Add it to the IO-APIC irq-routing table:
--- 1349,1355 ----
  		entry.dest_mode       = 0; /* Physical */
  		entry.delivery_mode   = dest_ExtINT; /* ExtInt */
  		entry.vector          = 0;
+ 		entry.dest          = GET_APIC_ID(apic_read(APIC_ID));
  
  		/*
  		 * Add it to the IO-APIC irq-routing table:
***************
*** 2257,2263 ****
  
  	entry.delivery_mode = INT_DELIVERY_MODE;
  	entry.dest_mode = INT_DEST_MODE;
- 	entry.dest.logical.logical_dest = cpu_mask_to_apicid(mask);
  	entry.trigger = triggering;
  	entry.polarity = polarity;
  	entry.mask = 1;					 /* Disabled (masked) */
--- 2255,2261 ----
  
  	entry.delivery_mode = INT_DELIVERY_MODE;
  	entry.dest_mode = INT_DEST_MODE;
+ 	entry.dest = cpu_mask_to_apicid(mask);
  	entry.trigger = triggering;
  	entry.polarity = polarity;
  	entry.mask = 1;					 /* Disabled (masked) */



Comment 7 Ben Romer 2007-09-07 12:29:36 UTC
The patch is the upstream patch that went against 2.6.19.2 and was approved. I
can generate a backported patch for a specific 2.6.18 RHEL 5 kernel if that will
address the problem with applying it.

I'm not sure I understand or agree with the KABI issue. The entry.dest field
hasn't moved so any binaries that access entry.dest in the old incorrect way
won't be broken, though they will continue to be incompatible with any ES7000
larger than a single cell. 
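
The layout argument can be sketched roughly as follows. This is an
illustration only, not code from the patch: the two structs are hypothetical,
simplified versions of the destination word before and after the change, and
the sketch assumes the compiler lays out both bitfield declarations the same
way (as GCC does on x86_64). A value written through the new 8-bit dest field
is read back unchanged through the old logical_dest member, because both
occupy the same 8 bits.

```c
#include <string.h>

/* Old layout: a union of 4-bit physical and 8-bit logical views. */
struct old_route_dest {
	union {
		struct {
			unsigned int reserved_1    : 24;
			unsigned int physical_dest : 4;
			unsigned int reserved_2    : 4;
		} physical;
		struct {
			unsigned int reserved_1   : 24;
			unsigned int logical_dest : 8;
		} logical;
	} dest;
};

/* New layout: a single 8-bit destination in the same position. */
struct new_route_dest {
	unsigned int reserved : 24;
	unsigned int dest     : 8;
};

_Static_assert(sizeof(struct old_route_dest) == sizeof(struct new_route_dest),
	       "both layouts must have the same size");

/* Writes apic_id with the new layout, reads it back with the old one. */
unsigned int read_via_old_logical(unsigned int apic_id)
{
	struct new_route_dest n = {0};
	struct old_route_dest o;

	n.dest = apic_id;
	memcpy(&o, &n, sizeof(o));	/* reinterpret the same bytes */
	return o.dest.logical.logical_dest;
}
```

Under those assumptions, read_via_old_logical(0x12) returns 0x12: the field
keeps its position and size, and only the way the source code names it
changes.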

Comment 8 Dave Anderson 2007-09-07 13:11:03 UTC
(In reply to comment #7)
> The patch is the upstream patch that went against 2.6.19.2 and was approved. I
> can generate a backported patch for a specific 2.6.18 RHEL 5 kernel if that
> will address the problem with applying it.

Thanks Ben -- please use the 2.6.18-45.el5 kernel.  I've placed the
kernel's src.rpm here:

  http://people.redhat.com/anderson/BZ_224373

Also, can you post a link to the LKML thread and/or the git commit ID
of the upstream patch?

> I'm not sure I understand or agree with the KABI issue. The entry.dest field
> hasn't moved so any binaries that access entry.dest in the old incorrect way
> won't be broken, though they will continue to be incompatible with any ES7000
> larger than a single cell. 

Yes, but if the modified data structure is referenced as an argument
to any EXPORT_XXX function or data variable, the checksum calculation
for that function or variable will change.  I'm not sure yet myself,
so I first need to be able to pass a kernel with your patch applied
through our build system, which will choke on any KABI issues.  If it
fails, we'll need to work around it somehow.  If you look in the kernel
source tree I've provided for "#ifndef __GENKSYMS__" references, you'll
see how we typically work around situations where the data layout is
basically the same, but names/types have changed, members added, or
whatever.  The kernel gets built with __GENKSYMS__ turned off so that
it picks up the patches, but is turned on when genksyms is run.
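
A minimal sketch of the "#ifndef __GENKSYMS__" pattern described above,
applied to a simplified route entry. This is an assumption-laden illustration,
not the code that went into the RHEL 5 kernel: the struct name and reserved
field names are hypothetical, and whether this bug needs such a guard at all
is not yet known at this point in the thread. The compiler sees the new
declaration, while genksyms (run with __GENKSYMS__ defined) would see the
original one, so symbol checksums would not change.

```c
struct example_route_entry {
	unsigned int vector          : 8;
	unsigned int delivery_mode   : 3;
	unsigned int dest_mode       : 1;
	unsigned int delivery_status : 1;
	unsigned int polarity        : 1;
	unsigned int irr             : 1;
	unsigned int trigger         : 1;
	unsigned int mask            : 1;
	unsigned int reserved_1      : 15;

#ifndef __GENKSYMS__
	/* What the compiler sees: the new 8-bit destination field. */
	unsigned int reserved_2      : 24;
	unsigned int dest            : 8;
#else
	/* What genksyms sees: the original declaration, same size. */
	union {
		struct {
			unsigned int reserved_1    : 24;
			unsigned int physical_dest : 4;
			unsigned int reserved_2    : 4;
		} physical;
		struct {
			unsigned int reserved_1   : 24;
			unsigned int logical_dest : 8;
		} logical;
	} dest;
#endif
};

/* The guard is only safe if the overall size stays the same. */
_Static_assert(sizeof(struct example_route_entry) == 8,
	       "route entry must stay two 32-bit words");

/* Stores an APIC ID in the entry and returns what was kept. */
unsigned int example_set_dest(unsigned int apic_id)
{
	struct example_route_entry e = {0};

	e.dest = apic_id;
	return e.dest;
}
```

The design choice here is that the guard hides a source-level rename from the
checksum tool only; it does nothing to protect against a change in the actual
binary layout, which is why the size check matters.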





Comment 9 Ben Romer 2007-09-07 13:51:43 UTC
I patched 2.6.19.2, it went through some discussion, and then it went into
2.6.21; if you check the changelog for that release you should find it. Here's
the other information you wanted:

commit ee4eff6ff6cbfc8ce38131058a18802bf6206879
Author: Benjamin Romer <benjamin.romer>
Date:   Tue Feb 13 13:26:25 2007 +0100

    [PATCH] x86-64: update IO-APIC dest field to 8-bit for xAPIC

I've pulled down that kernel source RPM, and will start moving my patch right
now. :)

Comment 10 Ben Romer 2007-09-07 20:01:42 UTC
Created attachment 190441
IO-APIC patch for RHEL 5, 2.6.18

A backported version of my IO-APIC patch.

Comment 11 Dave Anderson 2007-09-10 12:33:53 UTC
The build failed, long before any KABI checks were done.
Changes are also required for the analogous function(s) in io_apic-xen.c:

arch/x86_64/kernel/io_apic-xen.c: In function 'setup_IO_APIC_irqs':
arch/x86_64/kernel/io_apic-xen.c:959: error: request for member 'logical' in
something not a structure or union
arch/x86_64/kernel/io_apic-xen.c:977: error: request for member 'logical' in
something not a structure or union
arch/x86_64/kernel/io_apic-xen.c: In function 'io_apic_set_pci_routing':
arch/x86_64/kernel/io_apic-xen.c:2201: error: request for member 'logical' in
something not a structure or union
make[1]: *** [arch/x86_64/kernel/io_apic-xen.o] Error 1
make: *** [arch/x86_64/kernel] Error 2


Comment 12 Ben Romer 2007-09-10 12:54:51 UTC
All right, let me fix those and I'll generate another patch file. 

Comment 13 Ben Romer 2007-09-10 19:19:29 UTC
Created attachment 191881
fixed patch for RHEL 5 2.6.18

Here's the fix for the xen file. Sorry about missing it the first time!

Comment 14 Dave Anderson 2007-09-10 20:49:08 UTC
OK, the build is underway -- let's see how the KABI issue shakes out...

Thanks,
  Dave


Comment 15 Dave Anderson 2007-09-11 14:59:07 UTC
The build completed with no KABI issues.   

Can you please test/verify the two test kernels here:

  http://people.redhat.com/anderson/BZ_224373 

(i.e., kernel-2.6.18-45.el5.bz224373.2.x86_64.rpm and 
kernel-xen-2.6.18-45.el5.bz224373.2.x86_64.rpm)

The debuginfo rpms and the src.rpm are also there if you
want them.

Thanks,
  Dave


 

Comment 16 Ben Romer 2007-09-11 16:19:03 UTC
Great! I'll install these on an ES7000 and do some kexec testing. I'll get back
to you as soon as possible. :)

Comment 17 Jeff Burke 2007-09-11 19:18:22 UTC
Using Dave's kernel, kernel-2.6.18-45.el5.bz224373.2.x86_64.rpm, I was able to
boot into the kexec kernel on the ES7000 we have here.

This is what I did:
kexec -l /boot/vmlinuz-2.6.18-45.el5.bz224373.2 \
  --initrd=/boot/initrd-2.6.18-45.el5.bz224373.2.img \
  --command-line="ro root=/dev/VolGroup00/LogVol00 crashkernel=128M@16M console=ttyS0,115200 kexecboot"

kexec -e



Comment 18 Jeff Burke 2007-09-11 19:19:10 UTC
Created attachment 192911
kernel-2.6.18-45.el5.bz224373.2 x86_64 kexec boot log

Comment 19 Ben Romer 2007-09-12 16:21:06 UTC
I was able to get it to work here as well. :)

Comment 26 Don Zickus 2007-11-29 17:07:11 UTC
The patch is in 2.6.18-58.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 27 Ben Romer 2007-12-13 17:47:32 UTC
I apologise greatly for the long delay in testing the kernel update from #26. I
was not able to get it to kexec without the lpj parameter - without it, the
kernel is behaving as it was before without my patch, hanging right after the
"Memory:" line during boot.

Could you please verify that my patch is in the kernel?

Comment 28 Dave Anderson 2007-12-13 18:25:01 UTC
The patch has been in place since 2.6.18-57.el5, in the patch named:

  linux-2.6-x86_64-update-IO-APIC-dest-field-to-8-bit-for-xAPI.patch

If you look at the kernel's src.rpm from Don's tree, you can see it:

  $ rpm2cpio kernel-2.6.18-58.el5.src.rpm | cpio -t | grep dest-field
  linux-2.6-x86_64-update-IO-APIC-dest-field-to-8-bit-for-xAPI.patch
  173156 blocks
  $




Comment 29 Ben Romer 2007-12-13 18:41:47 UTC
Please disregard my last comment, I made a typo in menu.lst that was screwing up
my test. It's working fine! :)

Comment 30 Dave Anderson 2007-12-13 18:47:07 UTC
Whew...

Comment 32 John Poelstra 2008-03-21 03:59:14 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot1--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you

Comment 33 John Poelstra 2008-04-02 21:39:59 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot3--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you


Comment 34 Ben Romer 2008-04-07 16:38:54 UTC
Looks good! We've tested it and it's working. Thanks! :)

Comment 40 errata-xmlrpc 2008-05-21 14:40:56 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


