Bug 459869 - Badness from xenU kernel.
Summary: Badness from xenU kernel.
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Thomas Graf
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-08-23 10:19 UTC by Russell Coker
Modified: 2014-06-18 08:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-13 14:30:57 UTC
Target Upstream Version:
Embargoed:


Attachments
Proposed patch (735 bytes, patch)
2008-12-16 16:09 UTC, john.haxby@oracle.com
submitted patch (1.96 KB, patch)
2009-01-13 15:53 UTC, Thomas Graf


Links
System ID Private Priority Status Summary Last Updated
CentOS 3763 0 None None None Never

Description Russell Coker 2008-08-23 10:19:06 UTC
The following appears in the kernel message log when running kernel 2.6.9-78.0.1.ELxenU on one of my virtual machines. The Dom0 runs 2.6.18-92.1.10.el5xen. The problem occurs when I have one or two VCPUs assigned to the DomU (I have not tested with more than two). It does not occur on other RHEL4 DomUs on the same Xen server; I don't know why.

Badness in local_bh_enable at kernel/softirq.c:141
 [<c0121170>] local_bh_enable+0x3f/0x62
 [<c02177cd>] skb_checksum+0x133/0x25e
 [<c0250efe>] udp_poll+0x66/0x113
 [<c02135f5>] sock_poll+0x19/0x1d
 [<c016d182>] do_select+0x190/0x2c7
 [<c016ce91>] __pollwait+0x0/0x9b
 [<c0144ab4>] __kmalloc+0x56/0xd3
 [<c016d5b8>] sys_select+0x2e7/0x45c
 [<c016bf1a>] sys_fcntl64+0x78/0x7f
 [<c010740f>] syscall_call+0x7/0xb

Comment 1 Frank Arnold 2008-09-29 12:38:57 UTC
We see nearly the same message sporadically getting logged while running stress tests with RHEL4u7 32-bit SMP and 32-bit PAE HVM guests on upstream Xen with the PV network driver (xen-vnif) enabled.

Badness in local_bh_enable at kernel/softirq.c:141
 [<c0126e1d>] local_bh_enable+0x34/0x57
 [<c0287d01>] skb_checksum+0x136/0x260
 [<c02c1a66>] udp_poll+0x5a/0x105
 [<c0283c74>] sock_poll+0x12/0x14
 [<c016d6d9>] do_select+0x196/0x2c6
 [<c016d409>] __pollwait+0x0/0x95
 [<c016dafc>] sys_select+0x2e0/0x43a
 [<c01265f5>] sys_gettimeofday+0x53/0xac
 [<c02e09db>] syscall_call+0x7/0xb
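
The two traces can be compared mechanically. The following is a minimal sketch, assuming the `[<address>] symbol+0xoffset/0xsize` frame layout seen in both traces; the helper names `parse_frames` and `common_prefix` are hypothetical and exist only for this illustration.

```python
import re

# Frame layout assumed from the traces in this report, e.g.
#  [<c0121170>] local_bh_enable+0x3f/0x62
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frames(trace):
    """Return (address, symbol, offset, size) tuples, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            addr, symbol, offset, size = match.groups()
            frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


def common_prefix(frames_a, frames_b):
    """Symbols shared by both traces, counted from the innermost frame."""
    shared = []
    for (_, sym_a, _, _), (_, sym_b, _, _) in zip(frames_a, frames_b):
        if sym_a != sym_b:
            break
        shared.append(sym_a)
    return shared


# Trace from the description (2.6.9-78.0.1.ELxenU guest).
XENU_TRACE = """
 [<c0121170>] local_bh_enable+0x3f/0x62
 [<c02177cd>] skb_checksum+0x133/0x25e
 [<c0250efe>] udp_poll+0x66/0x113
 [<c02135f5>] sock_poll+0x19/0x1d
 [<c016d182>] do_select+0x190/0x2c7
 [<c016ce91>] __pollwait+0x0/0x9b
 [<c0144ab4>] __kmalloc+0x56/0xd3
 [<c016d5b8>] sys_select+0x2e7/0x45c
 [<c016bf1a>] sys_fcntl64+0x78/0x7f
 [<c010740f>] syscall_call+0x7/0xb
"""

# Trace from comment 1 (RHEL4u7 HVM guest with xen-vnif).
HVM_TRACE = """
 [<c0126e1d>] local_bh_enable+0x34/0x57
 [<c0287d01>] skb_checksum+0x136/0x260
 [<c02c1a66>] udp_poll+0x5a/0x105
 [<c0283c74>] sock_poll+0x12/0x14
 [<c016d6d9>] do_select+0x196/0x2c6
 [<c016d409>] __pollwait+0x0/0x95
 [<c016dafc>] sys_select+0x2e0/0x43a
 [<c01265f5>] sys_gettimeofday+0x53/0xac
 [<c02e09db>] syscall_call+0x7/0xb
"""

if __name__ == "__main__":
    shared = common_prefix(parse_frames(XENU_TRACE), parse_frames(HVM_TRACE))
    print(" <- ".join(shared))
```

Run on the two traces, the shared frames run from `local_bh_enable` through `__pollwait`, which is consistent with both reports hitting the warning on the same `udp_poll` path under `select`.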

Comment 2 john.haxby@oracle.com 2008-12-16 16:09:29 UTC
Created attachment 327119 [details]
Proposed patch

The problem is that .../net/ipv4/udp.c udp_poll() acquires the wrong spinlock to protect its critical section.  This patch uses the correct spinlock (the same spinlock that the RHEL5 kernel uses).  The backported patch that included udp_poll() mistakenly picked up the wrong code.

Comment 3 john.haxby@oracle.com 2008-12-16 16:11:08 UTC
I should add that this patch has been used in anger for some little while and the problem has not recurred.

Comment 4 Chris Lalancette 2008-12-16 16:19:19 UTC
OK, the patch looks totally reasonable, and seems to be upstream (in RHEL-5 at least).  I'm going to re-assign this to the regular kernel team, since this doesn't seem to be a virt-specific issue.

Chris Lalancette

Comment 5 john.haxby@oracle.com 2008-12-17 13:36:13 UTC
It's also a lot easier to trigger the problem than I first thought. One of my 4.7 Xen guests threw this error a lot while trying to use NFS on servers in California (I'm in the UK). A kernel built with the patch immediately quashed the errors.

Comment 6 Linda Wang 2009-01-13 14:30:57 UTC

*** This bug has been marked as a duplicate of bug 459185 ***

Comment 7 john.haxby@oracle.com 2009-01-13 15:26:47 UTC
Nice of you to close this as a duplicate of a bug that we can't see! :-)

Do you think you could either re-open this one and close bug 459185 as a duplicate of this one or change the visibility of 459185?

Comment 8 Thomas Graf 2009-01-13 15:53:58 UTC
Created attachment 328874 [details]
submitted patch

Bug 459185 includes the following patch which also covers this bug.

Comment 9 john.haxby@oracle.com 2009-05-03 12:31:41 UTC
Is there any chance of a fix being released for this soon? I have one 4.7 machine that reports this error dozens of times a day.

Comment 10 Chris Lalancette 2009-05-04 07:38:44 UTC
(In reply to comment #9)
> Is there any chance of a fix being released for this soon? I have one 4.7
> machine that reports this error dozens of times a day.

You can get updated RPMS with this patch in it from here:

http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/

This will be part of 4.8.  If you need an officially supported fix before that, please go through your friendly support channel and ask for this to be added to z-stream (no guarantees that it will, but we can't do anything here in bugzilla).

In the future, you'll probably want to use bz 459185 to get more attention, since this BZ has been closed as a dup of that one.

Chris Lalancette

Comment 11 Frank Arnold 2009-05-04 10:56:50 UTC
(In reply to comment #10)
> In the future, you'll probably want to use bz 459185 to get more attention,
> since this BZ has been closed as a dup of that one.

Chris, see comment #7. Bug 459185 is restricted and we're still not authorized to write or even look into this bug report. But it's fixed in 4.8, that's true.

Comment 12 Jerry Amundson 2009-08-11 14:58:31 UTC
Glad to see it's nearly the end for this bad boy. It's been a long, annoying road.

