Description of problem:
network-attaching and detaching too fast will lead to domU crashes. The underlying network device hits a BUG.

How reproducible:
Always

Steps to Reproduce:
1. in dom0, for i in $(seq 1000); do xm network-attach <domid>; xm network-detach <domid> $i; done
2. wait some iterations
3. watch domU crash

Actual results:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/dev.c:3341
invalid opcode: 0000 [1] SMP
last sysfs file: /class/net/eth2/address
CPU 0
Modules linked in: xennet ipv6 dm_multipath parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 9, comm: xenwatch Not tainted 2.6.18-1.2857.8.2.fc6_glommerxen #1
RIP: e030:[<ffffffff803f6234>] [<ffffffff803f6234>] unregister_netdevice+0x6e/0x215
RSP: e02b:ffff8800013f1de0 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff880001a28000 RCX: ffff8800013f1c00
RDX: 0000000000000000 RSI: 0000000000000056 RDI: ffffffff80521de0
RBP: ffffffff80395148 R08: 000000000000000a R09: 0000000000000005
R10: ffffffff8046cdfc R11: ffffffff88174aa7 R12: ffff880000035c80
R13: 0000000000000000 R14: ffff880000035c70 R15: ffff880001a28000
FS: 00002aaaaaabddb0(0000) GS:ffffffff80593000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process xenwatch (pid: 9, threadinfo ffff8800013f0000, task ffff8800013ec7d0)
Stack: ffff880000035c70 ffff880001a28000 ffffffff80395148 ffffffff803f63ec
 ffff88000212c540 ffffffff881752d7 ffff880002081400 ffff880001a28580
 0000000000000000 ffffffff8020b44c
Call Trace:
 [<ffffffff80395148>] xenwatch_thread+0x0/0x13e
 [<ffffffff803f63ec>] unregister_netdev+0x11/0x17
 [<ffffffff881752d7>] :xennet:backend_changed+0x830/0x865
 [<ffffffff8020b44c>] kfree+0x15/0xbc
 [<ffffffff80393867>] xenbus_read_driver_state+0x26/0x36
 [<ffffffff80395148>] xenwatch_thread+0x0/0x13e
 [<ffffffff80295d9f>] keventd_create_kthread+0x0/0x66
 [<ffffffff803945a1>] xenwatch_handle_callback+0x15/0x48
 [<ffffffff8039526d>] xenwatch_thread+0x125/0x13e
 [<ffffffff80295f53>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80295d9f>] keventd_create_kthread+0x0/0x66
 [<ffffffff8023321d>] kthread+0xfe/0x132
 [<ffffffff8025dec8>] child_rip+0xa/0x12
 [<ffffffff80295d9f>] keventd_create_kthread+0x0/0x66
 [<ffffffff8025fbf3>] thread_return+0x0/0xfb
 [<ffffffff8023311f>] kthread+0x0/0x132
 [<ffffffff8025debe>] child_rip+0x0/0x12

Code: 0f 0b 68 c5 0c 49 80 c2 0d 0d f6 83 98 00 00 00 01 74 08 48
RIP [<ffffffff803f6234>] unregister_netdevice+0x6e/0x215
 RSP <ffff8800013f1de0>

Expected results:
attach, detach, attach, detach...
After a lot of research, the cause turns out to be xenbus state changes being delivered twice to the frontend. It sees XenbusStateClosing twice, disconnects itself twice, etc. On the second pass, its internal state is no longer valid, and the BUG() is hit. I have not yet determined the reason behind the double delivery.
This seems like a bit of a corner case. In addition, without some leads on a solution, I recommend this be deferred to 5.1.
Created attachment 143657 [details]
Upstream proposal

Both Keir and Ewan confirm that, although undesirable, it is perfectly legal for messages to be delivered twice. So the fix becomes simpler than the path I was taking (trying to figure out why the message was being delivered twice so that it could be delivered only once).
Created attachment 143685 [details]
upstream commit

This is what was committed upstream.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
in 2.6.18-16.el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html