Bug 459876

Summary: network hangs and BUG() message at boot with -105.el5debug kernel
Product: Red Hat Enterprise Linux 5
Reporter: Jeff Layton <jlayton>
Component: kernel
Assignee: Don Dutile (Red Hat) <ddutile>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.3
CC: steved, syeghiay
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 20:09:52 UTC
Type: ---
Attachments:
Description | Flags
dmesg output from -105.el5debug | none
dmesg output from -105.el5 | none
xen config file for my rhel5 guest | none
rhel5 xml file for xen guest | none
dmesg output from kernel booted with debug patch 1 | none
/proc/cpuinfo from dom0 | none
serial console output from kernel with xenbus_probe.patch | none

Description Jeff Layton 2008-08-23 12:11:07 UTC
I'm running a 2.6.18-105.el5debug kernel variant on an x86_64 FV xen guest. I see occasional hangs of TCP connections with this kernel (mostly ssh sessions). I also see this BUG() message pop up at boot time:

BUG: warning at kernel/lockdep.c:1963/trace_hardirqs_on() (Not tainted)

Call Trace:
 <IRQ>  [<ffffffff801c5168>] evtchn_interrupt+0xce/0x152
 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
 [<ffffffff80011542>] handle_IRQ_event+0x22/0x56
 [<ffffffff800c1bda>] __do_IRQ+0xa5/0x105
 [<ffffffff80012aa2>] __do_softirq+0x95/0xf3
 [<ffffffff80070827>] do_IRQ+0xf6/0x104
 [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8006eebc>] default_idle+0x0/0x69
 [<ffffffff80060652>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8006eef8>] default_idle+0x3c/0x69
 [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8004b4e7>] cpu_idle+0x9a/0xbd
 [<ffffffff80458824>] start_kernel+0x243/0x248
 [<ffffffff8045822f>] _sinittext+0x22f/0x236

...this message and the hung connections did not occur with the -104 kernel. I built a -105 kernel that didn't contain the xen-related patches here:

- [xen] PV:  config file changes (Don Dutile ) [442991]
- [xen] PV: Makefile and Kconfig additions (Don Dutile ) [442991]
- [xen] PV: add subsystem (Don Dutile ) [442991]
- [xen] PV: shared used header file changes (Don Dutile ) [442991]
- [xen] PV: shared use of xenbus, netfront, blkfront (Don Dutile ) [442991]
- [xen] avoid dom0 hang when tearing down domains (Chris Lalancette ) [347161]
- [xen] ia64: SMP-unsafe with XENMEM_add_to_physmap on HVM (Tetsu Yamamoto ) [457137]

...and also reverted the config-generic-* changes that were introduced in this rev. With that, the BUG() message goes away and the networking seems to be more stable. Chris L asked this:

> Out of curiosity, are you just using the default network interface, or are
> you actually using the PV-on-HVM drivers?

I believe I'm just using the default network interface, but I see some extra messages in dmesg at boot time with the -105 kernels:

netfront: Initialising virtual ethernet driver.
netfront: device eth1 has copying receive path.

...and it looks like the network interfaces are reordered somehow. I haven't paid attention to the details of this, but I can provide access to the host if it would be helpful (it's a xen guest and has serial console set up, etc).

Let me know if you need more info or access to the host.
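
One way to answer the question above about which driver backs each interface is to read the sysfs driver link. This is only a minimal sketch: it assumes the interface exposes a `device/driver` symlink under /sys/class/net, which may not hold for every device type or kernel, and the helper names `driver_from_link` and `driver_name` are made up for illustration.

```python
import os


def driver_from_link(target):
    """Return the driver name from a sysfs driver symlink target."""
    # The link target ends in .../drivers/<name>; keep only the last component.
    return os.path.basename(target.rstrip("/"))


def driver_name(iface, sysfs_root="/sys/class/net"):
    """Return the driver bound to iface, or None if it cannot be determined."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        return driver_from_link(os.readlink(link))
    except OSError:
        # No such interface, or no device/driver link (e.g. loopback).
        return None
```

Calling `driver_name("eth0")` and `driver_name("eth1")` on the guest would show whether the emulated NIC or the PV driver is behind each name, assuming the links are present.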

Comment 1 Jeff Layton 2008-08-26 17:58:18 UTC
Created attachment 315021 [details]
dmesg output from -105.el5debug

Comment 2 Jeff Layton 2008-08-26 17:58:58 UTC
Created attachment 315022 [details]
dmesg output from -105.el5

Comment 3 Jeff Layton 2008-08-26 18:34:02 UTC
Created attachment 315024 [details]
xen config file for my rhel5 guest

I usually start guests from libvirt, so not sure how much this resembles the actual guest.

Comment 4 Jeff Layton 2008-08-26 18:35:06 UTC
Created attachment 315025 [details]
rhel5 xml file for xen guest

Created by:

virsh dumpxml rhel5

Comment 5 Jeff Layton 2008-08-26 18:48:14 UTC
Tried booting the rhel5 guest using the config file in comment #3 and doing a "xm create rhel5". It panicked:

INIT: version 2.86 booting
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
		Welcome to Red Hat Enterprise Linux Server
		Press 'I' to enter interactive startup.
Setting clock  (utc): Tue Aug 26 14:45:46 EDT 2008 [  OK  ]
Starting udev: ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/dev.c:3298
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/cpuid/cpu0/dev
CPU 0 
Modules linked in: xen_vnif serio_raw dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 260, comm: xenwatch Not tainted 2.6.18-105.el5.jtltest.46debug #1
RIP: 0010:[<ffffffff80229ded>]  [<ffffffff80229ded>] free_netdev+0x1e/0x3e
RSP: 0000:ffff81001f8bde08  EFLAGS: 00010293
RAX: 0000000000000001 RBX: fffffffffffffffe RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81001d7d8000
RBP: fffffffffffffffe R08: 0000000000000002 R09: 0000000000000001
R10: ffff81001d7d8780 R11: ffff81001f8bed00 R12: ffff81001d7d8680
R13: ffff81001f88b138 R14: 0000000000000000 R15: ffff81001d7d8000
FS:  0000000000000000(0000) GS:ffffffff8041e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b5f39f2f000 CR3: 000000001d475000 CR4: 00000000000006e0
Process xenwatch (pid: 260, threadinfo ffff81001f8bc000, task ffff81001f8bed00)
Stack:  ffffffff880c6290 ffff81001f88b138 ffff81001d7d8680 ffff81001f8bde40
 0000000200000000 ffff81001fd72198 ffff81001d7da0b0 ffffffff880cb570
 0000000000000282 ffff810001bd1a90 0000000100000001 ffffffff800a2a37
Call Trace:
 [<ffffffff880c6290>] :xen_vnif:backend_changed+0x7b8/0x7ef
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff801c7a80>] xenwatch_thread+0x0/0x140
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff801c6eb2>] xenwatch_handle_callback+0x15/0x48
 [<ffffffff801c7ba7>] xenwatch_thread+0x127/0x140
 [<ffffffff800a2c4e>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff800347b3>] kthread+0xfe/0x132
 [<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff800346b5>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11


Code: 0f 0b 68 88 58 2d 80 c2 e2 0c c7 87 28 04 00 00 04 00 00 00 
RIP  [<ffffffff80229ded>] free_netdev+0x1e/0x3e
 RSP <ffff81001f8bde08>
 <0>Kernel panic - not syncing: Fatal exception

Comment 6 Jeff Layton 2008-08-27 00:24:51 UTC
Created attachment 315060 [details]
dmesg output from kernel booted with debug patch 1

Comment 7 Jeff Layton 2008-08-27 20:40:23 UTC
Created attachment 315138 [details]
/proc/cpuinfo from dom0

Comment 8 Jeff Layton 2008-08-27 21:17:23 UTC
I've got a vmcore from my domU running 105.el5debug, so let me know where you want me to upload it.

Also, I think I might have some idea of what this is related to. Sometime in the distant past, I set the MTU on eth0 in this guest to 1400. Unfortunately, I don't remember the exact reason that I did that...

My suspicion is that when kudzu detected the new driver, it created this new ifcfg-eth? file without a MTU=1400 in it. I just rebooted the host and set it up to use xen_vnif and let it default to MTU=1500. It seems like that is making these sessions hang more frequently.

So this may actually have nothing at all to do with the PV on HVM stuff and may be more closely related to this more generic problem where Xen apparently needs the guests to run with a smaller MTU.

I can't confirm this 100% yet, but so far I haven't disproved it either.
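
A rough way to check the ifcfg theory is to compare the MTU configured in each ifcfg file with the MTU the interface is actually running. The sketch below is illustrative only: `ifcfg_mtu` and `running_mtu` are hypothetical helper names, and it assumes the usual RHEL location /etc/sysconfig/network-scripts/ifcfg-<iface> and a readable /sys/class/net/<iface>/mtu attribute.

```python
import re

# ifcfg files are sourced as shell, so MTU is a plain VAR=value assignment.
_MTU_LINE = re.compile(r"""^\s*MTU=["']?(\d+)["']?\s*(#.*)?$""")


def ifcfg_mtu(text):
    """Return the MTU set in ifcfg-style text, or None if there is no MTU= line."""
    mtu = None
    for line in text.splitlines():
        match = _MTU_LINE.match(line)
        if match:
            # A later assignment overrides an earlier one, as in shell.
            mtu = int(match.group(1))
    return mtu


def running_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU currently set on iface, or None if it cannot be read."""
    try:
        with open(f"{sysfs_root}/{iface}/mtu") as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None
```

If `ifcfg_mtu()` returns None for the newly created ifcfg file while the old one returns 1400, that would be consistent with kudzu having dropped the setting.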

Comment 9 Jeff Layton 2008-08-27 21:21:39 UTC
...and just after I posted this, my ssh session hung while the MTU was set to 1400. So this may not be the problem after all...

Comment 11 Jeff Layton 2008-08-28 18:51:37 UTC
Created attachment 315281 [details]
serial console output from kernel with xenbus_probe.patch

This console output is from -105.el5debug kernel with xenbus_probe.patch

Comment 13 RHEL Program Management 2008-10-17 21:27:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 Don Zickus 2008-10-29 16:18:00 UTC
in kernel-2.6.18-121.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 18 errata-xmlrpc 2009-01-20 20:09:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html