Bug 459869

Summary: Badness from xenU kernel.
Product: Red Hat Enterprise Linux 4
Reporter: Russell Coker <russell>
Component: kernel
Assignee: Thomas Graf <tgraf>
Status: CLOSED NEXTRELEASE
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 4.7
CC: bhavna.sarathy, clalance, conny.seidel, frank.arnold, jamundso, john.haxby, rkhan
Target Milestone: rc   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Doc Type: Bug Fix
Last Closed: 2009-01-13 14:30:57 UTC
Attachments:
Description      Flags
Proposed patch   none
submitted patch  none

Description Russell Coker 2008-08-23 10:19:06 UTC
The following occurs in the kernel message log when running kernel 2.6.9-78.0.1.ELxenU on one of my virtual machines.  The Dom0 runs 2.6.18-92.1.10.el5xen.  The problem occurs when I have one or two VCPUs assigned to the DomU (I have not tested with more than two).  The problem does not occur on other RHEL4 DomUs on the same Xen server; I don't know why.

Badness in local_bh_enable at kernel/softirq.c:141
 [<c0121170>] local_bh_enable+0x3f/0x62
 [<c02177cd>] skb_checksum+0x133/0x25e
 [<c0250efe>] udp_poll+0x66/0x113
 [<c02135f5>] sock_poll+0x19/0x1d
 [<c016d182>] do_select+0x190/0x2c7
 [<c016ce91>] __pollwait+0x0/0x9b
 [<c0144ab4>] __kmalloc+0x56/0xd3
 [<c016d5b8>] sys_select+0x2e7/0x45c
 [<c016bf1a>] sys_fcntl64+0x78/0x7f
 [<c010740f>] syscall_call+0x7/0xb

Comment 1 Frank Arnold 2008-09-29 12:38:57 UTC
We see nearly the same message sporadically getting logged while running stress tests with RHEL4u7 32-bit SMP and 32-bit PAE HVM guests on upstream Xen with the PV network driver (xen-vnif) enabled.

Badness in local_bh_enable at kernel/softirq.c:141
 [<c0126e1d>] local_bh_enable+0x34/0x57
 [<c0287d01>] skb_checksum+0x136/0x260
 [<c02c1a66>] udp_poll+0x5a/0x105
 [<c0283c74>] sock_poll+0x12/0x14
 [<c016d6d9>] do_select+0x196/0x2c6
 [<c016d409>] __pollwait+0x0/0x95
 [<c016dafc>] sys_select+0x2e0/0x43a
 [<c01265f5>] sys_gettimeofday+0x53/0xac
 [<c02e09db>] syscall_call+0x7/0xb

Comment 2 john.haxby@oracle.com 2008-12-16 16:09:29 UTC
Created attachment 327119 [details]
Proposed patch

The problem is that .../net/ipv4/udp.c udp_poll() acquires the wrong spinlock to protect its critical section.  This patch uses the correct spinlock (the same spinlock that the RHEL5 kernel uses).  The backported patch that included udp_poll() mistakenly picked up the wrong code.
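
For readers unfamiliar with this warning, here is a rough user-space sketch of how the wrong lock variant could produce it. It rests on assumptions this report does not confirm: that the backported code took the receive-queue lock with an interrupt-disabling variant, that the RHEL5 code uses a bottom-half variant, and that the checksum path briefly disables and re-enables bottom halves (which is where the trace shows local_bh_enable being called). All names below (cpu_model, model_udp_poll and so on) are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of per-CPU state; NOT real kernel code. */
struct cpu_model {
    bool irqs_disabled;    /* true while hard interrupts are masked */
    int  bh_disable_depth; /* nesting count of bottom-half disables */
    int  warnings;         /* number of "Badness" reports triggered */
};

static void model_local_bh_disable(struct cpu_model *cpu)
{
    cpu->bh_disable_depth++;
}

/* Assumed behaviour: re-enabling bottom halves while interrupts
 * are masked is what gets reported as "Badness in local_bh_enable". */
static void model_local_bh_enable(struct cpu_model *cpu)
{
    assert(cpu->bh_disable_depth > 0);
    cpu->bh_disable_depth--;
    if (cpu->irqs_disabled)
        cpu->warnings++;
}

/* Stand-in for the checksum step seen in the trace: assumed to
 * disable and re-enable bottom halves around its work. */
static void model_checksum(struct cpu_model *cpu)
{
    model_local_bh_disable(cpu);
    model_local_bh_enable(cpu);
}

/* Models the critical section in the poll path.
 * use_irq_lock = true  : assumed "wrong" interrupt-disabling lock.
 * use_irq_lock = false : assumed "correct" bottom-half lock.
 * Returns the number of warnings raised. */
static int model_udp_poll(bool use_irq_lock)
{
    struct cpu_model cpu = { false, 0, 0 };

    if (use_irq_lock)
        cpu.irqs_disabled = true;      /* lock: mask interrupts */
    else
        model_local_bh_disable(&cpu);  /* lock: disable bottom halves */

    model_checksum(&cpu);

    if (use_irq_lock)
        cpu.irqs_disabled = false;     /* unlock: unmask interrupts */
    else
        model_local_bh_enable(&cpu);   /* unlock: re-enable bottom halves */

    return cpu.warnings;
}
```

In this model, calling model_udp_poll(true) raises one warning and model_udp_poll(false) raises none, which would match the errors disappearing once the lock is changed. The attached patch is the authority on what the real code does.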

Comment 3 john.haxby@oracle.com 2008-12-16 16:11:08 UTC
I should add that this patch has been used in anger for some little while and the problem has not recurred.

Comment 4 Chris Lalancette 2008-12-16 16:19:19 UTC
OK, the patch looks totally reasonable, and seems to be upstream (in RHEL-5 at least).  I'm going to re-assign this to the regular kernel team, since this doesn't seem to be a virt-specific issue.

Chris Lalancette

Comment 5 john.haxby@oracle.com 2008-12-17 13:36:13 UTC
It's also a lot easier to trigger the problem than I first thought.  One of my xen 4.7 guests threw this error a lot trying to use NFS on servers in California (I'm in the UK).   A kernel built with the patch immediately quashed the errors.

Comment 6 Linda Wang 2009-01-13 14:30:57 UTC

*** This bug has been marked as a duplicate of bug 459185 ***

Comment 7 john.haxby@oracle.com 2009-01-13 15:26:47 UTC
Nice of you to close this as a duplicate of a bug that we can't see! :-)

Do you think you could either re-open this one and close bug 459185 as a duplicate of this one or change the visibility of 459185?

Comment 8 Thomas Graf 2009-01-13 15:53:58 UTC
Created attachment 328874 [details]
submitted patch

Bug 459185 includes the following patch which also covers this bug.

Comment 9 john.haxby@oracle.com 2009-05-03 12:31:41 UTC
Is there any chance of a fix being released for this soon? I have one 4.7 machine that reports this error dozens of times a day.

Comment 10 Chris Lalancette 2009-05-04 07:38:44 UTC
(In reply to comment #9)
> Is there any chance of a fix being released for this soon? I have one 4.7
> machine that reports this error dozens of times a day.

You can get updated RPMS with this patch in it from here:

http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/

This will be part of 4.8.  If you need an officially supported fix before that, please go through your friendly support channel and ask for this to be added to z-stream (no guarantees that it will, but we can't do anything here in bugzilla).

In the future, you'll probably want to use bz 459185 to get more attention, since this BZ has been closed as a dup of that one.

Chris Lalancette

Comment 11 Frank Arnold 2009-05-04 10:56:50 UTC
(In reply to comment #10)
> In the future, you'll probably want to use bz 459185 to get more attention,
> since this BZ has been closed as a dup of that one.

Chris, see comment #7. Bug 459185 is restricted and we're still not authorized to write or even look into this bug report. But it's fixed in 4.8, that's true.

Comment 12 Jerry Amundson 2009-08-11 14:58:31 UTC
Glad to see it's nearly the end for this bad boy. It's been a long, annoying road.