Bug 459876
Summary: | network hangs and BUG() message at boot with -105.el5debug kernel | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Jeff Layton <jlayton> |
Component: | kernel | Assignee: | Don Dutile (Red Hat) <ddutile> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5.3 | CC: | steved, syeghiay |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-01-20 20:09:52 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Created attachment 315021 [details]
dmesg output from -105.el5debug
Created attachment 315022 [details]
dmesg output from -105.el5
Created attachment 315024 [details]
xen config file for my rhel5 guest
I usually start guests from libvirt, so not sure how much this resembles the actual guest.
Created attachment 315025 [details]
rhel5 xml file for xen guest
Created by:
virsh dumpxml rhel5
Tried booting the rhel5 guest using the config file in comment #3 and doing a "xm create rhel5". It panicked:

INIT: version 2.86 booting
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
Welcome to Red Hat Enterprise Linux Server
Press 'I' to enter interactive startup.
Setting clock (utc): Tue Aug 26 14:45:46 EDT 2008 [ OK ]
Starting udev: ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/dev.c:3298
invalid opcode: 0000 [1] SMP
last sysfs file: /class/cpuid/cpu0/dev
CPU 0
Modules linked in: xen_vnif serio_raw dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 260, comm: xenwatch Not tainted 2.6.18-105.el5.jtltest.46debug #1
RIP: 0010:[<ffffffff80229ded>] [<ffffffff80229ded>] free_netdev+0x1e/0x3e
RSP: 0000:ffff81001f8bde08 EFLAGS: 00010293
RAX: 0000000000000001 RBX: fffffffffffffffe RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81001d7d8000
RBP: fffffffffffffffe R08: 0000000000000002 R09: 0000000000000001
R10: ffff81001d7d8780 R11: ffff81001f8bed00 R12: ffff81001d7d8680
R13: ffff81001f88b138 R14: 0000000000000000 R15: ffff81001d7d8000
FS: 0000000000000000(0000) GS:ffffffff8041e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b5f39f2f000 CR3: 000000001d475000 CR4: 00000000000006e0
Process xenwatch (pid: 260, threadinfo ffff81001f8bc000, task ffff81001f8bed00)
Stack: ffffffff880c6290 ffff81001f88b138 ffff81001d7d8680 ffff81001f8bde40
 0000000200000000 ffff81001fd72198 ffff81001d7da0b0 ffffffff880cb570
 0000000000000282 ffff810001bd1a90 0000000100000001 ffffffff800a2a37
Call Trace:
 [<ffffffff880c6290>] :xen_vnif:backend_changed+0x7b8/0x7ef
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff801c7a80>] xenwatch_thread+0x0/0x140
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff801c6eb2>] xenwatch_handle_callback+0x15/0x48
 [<ffffffff801c7ba7>] xenwatch_thread+0x127/0x140
 [<ffffffff800a2c4e>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a2a37>] keventd_create_kthread+0x0/0xc9
 [<ffffffff800347b3>] kthread+0xfe/0x132
 [<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff800346b5>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11
Code: 0f 0b 68 88 58 2d 80 c2 e2 0c c7 87 28 04 00 00 04 00 00 00
RIP [<ffffffff80229ded>] free_netdev+0x1e/0x3e
 RSP <ffff81001f8bde08>
 <0>Kernel panic - not syncing: Fatal exception

Created attachment 315060 [details]
dmesg output from kernel booted with debug patch 1
Created attachment 315138 [details]
/proc/cpuinfo from dom0
I've got a vmcore from my domU running 105.el5debug, so let me know where you want me to upload it.

Also, I think I might have some idea of what this is related to. Sometime in the distant past, I set the MTU on eth0 in this guest to 1400. Unfortunately, I don't remember the exact reason that I did that...

My suspicion is that when kudzu detected the new driver, it created this new ifcfg-eth? file without an MTU=1400 in it. I just rebooted the host and set it up to use xen_vnif and let it default to MTU=1500. It seems like that is making these sessions hang more frequently.

So this may actually have nothing at all to do with the PV on HVM stuff and may be more closely related to this more generic problem where Xen apparently needs the guests to run with a smaller MTU. I can't yet confirm this 100%, but so far I haven't disproved it either.

...and just after I posted this, my ssh session hung while the MTU was set to 1400. So this may not be the problem after all...

Created attachment 315281 [details]
serial console output from kernel with xenbus_probe.patch
This console output is from -105.el5debug kernel with xenbus_probe.patch
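
One way to check the MTU theory from the earlier comment is to compare what the ifcfg file requests with what the interface is actually running. The sketch below is only an illustration, not something taken from this report: the function names are hypothetical, and it assumes the usual RHEL layout of /etc/sysconfig/network-scripts/ifcfg-<iface> plus the kernel's /sys/class/net/<iface>/mtu file.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def parse_ifcfg_mtu(text: str) -> Optional[int]:
    """Return the MTU set in ifcfg-style text, or None if no MTU= line exists.

    Simplification: only plain `MTU=<digits>` lines (optionally quoted) are
    recognised. The last such line wins, as it would when the file is sourced.
    """
    mtu = None
    for line in text.splitlines():
        match = re.fullmatch(r"MTU=[\"']?(\d+)[\"']?", line.strip())
        if match:
            mtu = int(match.group(1))
    return mtu


def read_live_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the MTU the kernel currently reports for `iface`."""
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def compare_mtu(
    iface: str,
    ifcfg_dir: str = "/etc/sysconfig/network-scripts",  # assumed RHEL path
    sysfs_root: str = "/sys/class/net",
) -> Tuple[Optional[int], int]:
    """Return (configured MTU or None, live MTU) for one interface."""
    cfg = Path(ifcfg_dir, f"ifcfg-{iface}")
    configured = parse_ifcfg_mtu(cfg.read_text()) if cfg.exists() else None
    return configured, read_live_mtu(iface, sysfs_root)
```

A result such as `(None, 1500)` for the newly detected interface would match the suspicion that the regenerated ifcfg file lost its MTU=1400 line. It would not by itself explain the hangs, since one was also seen with the MTU at 1400.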
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-121.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html
I'm running a 2.6.18-105.el5debug kernel variant on an x86_64 FV xen guest. I see occasional hangs of TCP connections with this kernel (mostly ssh sessions). I also see this BUG() message pop up at boot time:

BUG: warning at kernel/lockdep.c:1963/trace_hardirqs_on() (Not tainted)
Call Trace:
 <IRQ> [<ffffffff801c5168>] evtchn_interrupt+0xce/0x152
 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
 [<ffffffff80011542>] handle_IRQ_event+0x22/0x56
 [<ffffffff800c1bda>] __do_IRQ+0xa5/0x105
 [<ffffffff80012aa2>] __do_softirq+0x95/0xf3
 [<ffffffff80070827>] do_IRQ+0xf6/0x104
 [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8006eebc>] default_idle+0x0/0x69
 [<ffffffff80060652>] ret_from_intr+0x0/0xf
 <EOI> [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8006eef8>] default_idle+0x3c/0x69
 [<ffffffff8006eef6>] default_idle+0x3a/0x69
 [<ffffffff8004b4e7>] cpu_idle+0x9a/0xbd
 [<ffffffff80458824>] start_kernel+0x243/0x248
 [<ffffffff8045822f>] _sinittext+0x22f/0x236

...this message and the hung connections did not occur with the -104 kernel. I built a -105 kernel that didn't contain the xen-related patches here:

- [xen] PV: config file changes (Don Dutile) [442991]
- [xen] PV: Makefile and Kconfig additions (Don Dutile) [442991]
- [xen] PV: add subsystem (Don Dutile) [442991]
- [xen] PV: shared used header file changes (Don Dutile) [442991]
- [xen] PV: shared use of xenbus, netfront, blkfront (Don Dutile) [442991]
- [xen] avoid dom0 hang when tearing down domains (Chris Lalancette) [347161]
- [xen] ia64: SMP-unsafe with XENMEM_add_to_physmap on HVM (Tetsu Yamamoto) [457137]

...and also reverted the config-generic-* changes that were introduced in this rev. With that, the BUG() pop goes away and the networking seems to be more stable.

Chris L asked this:

> Out of curiosity, are you just using the default network interface, or are
> you actually using the PV-on-HVM drivers?
I believe I'm just using the default network interface, but I see some extra messages in dmesg at boot time with the -105 kernels:

netfront: Initialising virtual ethernet driver.
netfront: device eth1 has copying receive path.

...and it looks like the network interfaces are reordered somehow. I haven't paid attention to the details of this, but I can provide access to the host if it would be helpful (it's a xen guest and has serial console set up, etc).

Let me know if you need more info or access to the host.
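
When comparing traces like the ones in this report, it can help to pull the frames out of the console text mechanically, since pasted output often loses its line breaks. The following is a minimal sketch under that assumption. `parse_frames` and `Frame` are hypothetical names, and the regular expression only targets the `[<address>] symbol+0xOFF/0xSIZE` form seen in these 2.6.18 traces, including the optional `:module:` prefix.

```python
import re
from typing import List, NamedTuple, Optional

# Matches e.g. "[<ffffffff880c6290>] :xen_vnif:backend_changed+0x7b8/0x7ef"
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+"                   # address in [<...>]
    r"(?::(\w+):)?"                           # optional :module: prefix
    r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"  # symbol+offset/size
)


class Frame(NamedTuple):
    address: int
    module: Optional[str]  # None for symbols built into the kernel image
    symbol: str
    offset: int
    size: int


def parse_frames(text: str) -> List[Frame]:
    """Extract stack frames from oops text, whether or not it kept newlines."""
    return [
        Frame(int(addr, 16), module or None, symbol, int(off, 16), int(size, 16))
        for addr, module, symbol, off, size in FRAME_RE.findall(text)
    ]
```

Feeding it the panic from the "xm create" attempt would, for example, make it easy to list the frames that belong to a module (here only the `xen_vnif` one) separately from core kernel symbols.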