Bug 243312

Summary: [RHEL5.1 IA64 Xen] kernel BUG at arch/ia64/kernel/irq_ia64.c:481!
Product: Red Hat Enterprise Linux 5
Reporter: Jarod Wilson <jarod>
Component: kernel
Assignee: Aron Griffis <agriffis>
Status: CLOSED CURRENTRELEASE
QA Contact: Martin Jenner <mjenner>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.0
CC: armbru, ktokunag, luyu, prarit, xen-maint
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: ia64
OS: Linux
Fixed In Version: kernel-xen-2.6.18-32.el5
Doc Type: Bug Fix
Last Closed: 2007-06-29 15:20:29 UTC
Bug Depends On: 241674
Attachments:
Console boot log from failed xen 3.1-based kernel-xen boot (flags: none)

Description Jarod Wilson 2007-06-08 15:02:51 UTC
New bug for the second crash uncovered by the xen 3.1.0 rebase, which occurs
even with the getcpu patch removed from the build. See the attachment for the
full console boot log.

+++ This bug was initially created as a clone of Bug #241674 +++

Description of problem:
Recent RHEL5 xen kernels fail to boot on at least some ia64 hardware that was
previously functional. An hp zx2000 that works with the 5.0 GA kernel encounters
"Unable to handle kernel paging request at virtual address 006eb92000000000"
followed by "Unable to handle kernel NULL pointer dereference (address
0000000000000000)" on boot with the -20 kernel. See attachment for full console
dump.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-20.el5 (ia64)

-- Additional comment from jwilson on 2007-05-29 10:31 EST --
Created an attachment (id=155593)
Console log from failed 2.6.18-20.el5xen ia64 boot


-- Additional comment from pm-rhel on 2007-05-29 10:44 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

-- Additional comment from jwilson on 2007-06-06 16:38 EST --
kernel-xen-2.6.18-17.el5.ia64 boots fine, kernel-xen-2.6.18-18.el5.ia64 blows up
similarly to -20 and -21. Next up, to look at the relevant changes between the
two...

-- Additional comment from jwilson on 2007-06-06 18:01 EST --
Best guess at possible culprits thus far:

[serial] panic in check_modem_status on 8250 (Norm Murray) [238394]
[misc] getcpu system call (luyu) [233046]
[mm] NULL current->mm dereference in grab_swap_token causes oops (Jerome Marchand) [231639]

I've got a test kernel based on -21 minus those three patches building now, will
see if my guesses hold any water in the morning...

-- Additional comment from pm-rhel on 2007-06-06 18:08 EST --
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

-- Additional comment from jwilson on 2007-06-07 10:17 EST --
Yep, it's definitely one of those three patches. Now on to figuring out exactly
which one...

-- Additional comment from jwilson on 2007-06-07 12:00 EST --
The culprit appears to be "[misc] getcpu system call (luyu ) [233046]" (bug
233046). Which of course is a rather large patch, so it'll take some effort to
figure out exactly what the cause is within that patch...
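
The narrowing-down described in the comments above (confirm that removing all
three suspect patches fixes the boot, then find which single patch is
responsible) can be sketched as a small bisection. This is only an
illustration, not the procedure actually used: `find_culprit` and
`boots_without` are hypothetical names, the patch labels are shorthand for the
three changelog entries, and in practice `boots_without` would mean building a
test kernel with the given patches dropped and booting it on the affected
hardware. The sketch also assumes exactly one patch is at fault.

```python
from typing import Callable, List, Sequence


def find_culprit(
    suspects: Sequence[str],
    boots_without: Callable[[List[str]], bool],
) -> str:
    """Return the one suspect patch whose removal lets the kernel boot.

    boots_without(removed) is a hypothetical callback: it should report
    whether a kernel built with the `removed` patches dropped boots.
    """
    candidates = list(suspects)
    # Sanity check: dropping every suspect must fix the boot, otherwise
    # the culprit is somewhere else entirely.
    if not boots_without(candidates):
        raise ValueError("culprit is not among the suspect patches")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If dropping only this half fixes the boot, the culprit is in it;
        # otherwise it must be in the remaining candidates.
        if boots_without(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    suspects = ["serial-8250", "getcpu", "grab_swap_token"]
    # Stand-in for real build-and-boot tests: pretend getcpu is at fault.
    print(find_culprit(suspects, lambda removed: "getcpu" in removed))
```

With three suspects this needs at most three test kernels: one with all
suspects removed, then up to two more to single out the patch.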

-- Additional comment from jwilson on 2007-06-07 17:41 EST --
It's definitely the getcpu syscall patch, but nothing obvious jumps out as being
the cause of the boot failures. Best guess is that the greatly increased size of
the syscall table may cause a page table overlap or some such thing that xen
doesn't handle cleanly. Punting back to Luming Yu who submitted the patch in the
first place... Any ideas?

-- Additional comment from jwilson on 2007-06-08 01:08 EST --
Fun. The xen 3.1.0 rebased bits from Gerd fail to boot even with that patch
removed, in a fairly similar-looking fashion (but definitely different --
there's an actual line that says
"kernel BUG at arch/ia64/kernel/irq_ia64.c:481!"). The following upstream
changeset appears potentially relevant to this one:

http://www.mail-archive.com/xen-ia64-devel@lists.xensource.com/msg05946.html

Doesn't apply cleanly at the moment, seems parts of it have already been
cherry-picked...

Should probably file a new bug for this new issue, shouldn't I?... (will do in
the morn)

Comment 1 Jarod Wilson 2007-06-08 15:02:51 UTC
Created attachment 156579 [details]
Console boot log from failed xen 3.1-based kernel-xen boot

Comment 2 RHEL Program Management 2007-06-08 15:04:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 RHEL Program Management 2007-06-08 15:07:09 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 4 Jarod Wilson 2007-06-11 21:42:50 UTC
Kei, do you by chance have any ideas here? Is there a specific patch in
Fujitsu's patch series that might address this?

Comment 5 Keiichiro Tokunaga 2007-06-12 15:33:34 UTC
(In reply to comment #4)
> Kei, do you by chance have any ideas here? Is there a specific patch in
> Fujitsu's patch series that might address this?

(I posted the same comments in bz241674.)

There is only one specific patch in my patch set posted to rh-kernel: BZ242989,
which changes the dom0 interface versions so that dom0 can work with the
xen-3.1 bits.

Compared to Gerd's original patch set, my patch set has some additional csets:
14347, 13429, and 13454.

I have not tried 2.6.18.el5.kraxel.4xen on PRIMEQUEST yet.  I am willing to
try it, but I cannot use the box right now because its firmware is being
upgraded.  I will post the results here once the upgrade is done.

Comment 6 Jarod Wilson 2007-06-29 15:20:29 UTC
This was fixed in recent kernels; I no longer see this issue with 2.6.18-29.el5
or so and later.