The server crashes when Xen networking is set up on servers that use mode 6 bonding.

The servers in question are all HP blades. Each blade has two physical network interfaces that are bonded together in mode 6 (adaptive load balancing). This all works fine without Xen.

Each server is connected to a number of different VLANs. The 'bond0' interface uses the chassis network switches' default VLAN, 172.30.220.0/24. There is also an additional VLAN-tagged interface, bond0.2175, connected to VLAN 161.2.175.0/24.

The 'network-script' parameter of xend-config.sxp is set to a wrapper script that calls 'network-bridge'. This script runs the following commands when xend starts:

  ./network-bridge start vifnum=0 bridge=xenbr0 netdev=bond0
  ./network-bridge start vifnum=1 bridge=xenbr2175 netdev=bond0.2175

The crash is triggered by running `network-bridge` against bond0. If the bond is reconfigured to use mode 1, the same wrapper script works fine. The same wrapper script has also been used without any issues on other servers that are configured with mode 4 bonding (LACP). The problem only occurs when network-bridge is used with an untagged bond0 interface in mode 5 or 6.

Steps to reproduce:

1. Configure a server with bonding mode=6 (or 5).
2. In xend-config.sxp, set network-script to /bin/true.
3. Reboot the server.
4. Run /etc/xen/scripts/network-bridge start vifnum=0 netdev=bond0

A crash dump is available at megatron.gsslab.rdu.redhat.com:/cores/20091014052831/work/

 #0 [ffff8800181d7c90] crash_kexec at ffffffff802a572b
 #1 [ffff8800181d7d50] panic at ffffffff8028cac1
 #2 [ffff8800181d7e40] softlockup_tick at ffffffff802b257b
 #3 [ffff8800181d7e80] timer_interrupt at ffffffff802701b6
 #4 [ffff8800181d7ed0] handle_IRQ_event at ffffffff802114ad
 #5 [ffff8800181d7f00] __do_IRQ at ffffffff802b28b4
 #6 [ffff8800181d7f40] do_IRQ at ffffffff8026df32
 #7 [ffff8800181d7f60] evtchn_do_upcall at ffffffff803aed9d
 #8 [ffff8800181d7fb0] do_hypervisor_callback at ffffffff802608d6
--- <IRQ stack> ---
 #9 [ffff880364d03c98] do_hypervisor_callback at ffffffff802608d6
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff80262071  RSP: ffff880364d03d40  RFLAGS: 00000206
    RAX: ffff880364d03fd8  RBX: ffff88039a591530  RCX: 0000000000000004
    RDX: 00000000000002c0  RSI: ffff8803c7e5cc00  RDI: ffff88039a591530
    RBP: 0000000000000000   R8: 0000000000000002   R9: 0000000000000002
    R10: ffff8803682e9380  R11: ffffffff882c1872  R12: 0000000000000000
    R13: ffff88039a5911a8  R14: 0000000000000002  R15: ffff88039a591000
    ORIG_RAX: fffffffffffffedc  CS: e030  SS: e02b
#10 [ffff880364d03d40] _write_lock_bh at ffffffff80264a09
#11 [ffff880364d03d50] bond_alb_set_mac_address at ffffffff88635bfe
#12 [ffff880364d03dc0] dev_set_mac_address at ffffffff804193fd
#13 [ffff880364d03de0] dev_ioctl at ffffffff8041b5e4
#14 [ffff880364d03e90] sock_ioctl at ffffffff80411220
#15 [ffff880364d03eb0] do_ioctl at ffffffff80243e61
#16 [ffff880364d03ed0] vfs_ioctl at ffffffff802316f3
#17 [ffff880364d03f40] sys_ioctl at ffffffff8024e5c7
#18 [ffff880364d03f80] tracesys at ffffffff802602f9 (via system_call)

It looks as if the bond_alb_set_mac_address() routine is trying to acquire a write lock and never getting it, which trips the soft lockup check and panics the kernel. But all the other processors are idle, so the lock should be available.
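Since the report ties the crash to bonding modes 5 and 6, it may help to confirm which mode a host is actually running before trying the reproducer. The following is only a rough sketch: the function name is_affected_bond_mode is made up for illustration, and it assumes the usual "Bonding Mode:" line in /proc/net/bonding/<dev>, where mode 5 is reported as "transmit load balancing" and mode 6 as "adaptive load balancing".

```sh
#!/bin/sh
# Hypothetical helper (not part of any Xen or bonding package).
# Reads a bonding status file in the /proc/net/bonding/<dev> format and
# reports whether the bond is in mode 5 (TLB) or mode 6 (ALB).
is_affected_bond_mode() {
    status_file=$1
    # The "Bonding Mode:" line names the mode in words, not by number.
    mode=$(sed -n 's/^Bonding Mode: *//p' "$status_file")
    case "$mode" in
        "transmit load balancing"|"adaptive load balancing")
            echo "affected: $mode"
            return 0
            ;;
        *)
            echo "not affected: ${mode:-unknown}"
            return 1
            ;;
    esac
}
```

On a host with the bond configured, this would be called as is_affected_bond_mode /proc/net/bonding/bond0. It only checks the mode; it does not check whether the interface is VLAN-tagged.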
I'm not sure if that's relevant here, but the network-bridge script is known not to work well with bonding devices. Could you try reconfiguring xend to use network-bridge-bonding instead? When doing so, please remove the bridge=xenbr* arguments; the bridge will have the same name as the original netdev.
It's actually the other way around: because the network interfaces are VLANs (over bonds), network-bridge-bonding doesn't work properly, which is why we use network-bridge. We found that previously in other tickets, and I can dig up the details if needed, but my understanding is that the problem in this bugzilla doesn't really depend on which script is used to configure the interfaces.
Ah, OK, I missed that somehow. Anyway, moving to the kernel-xen component.
That lock should be available, but it appears that the bond0 thread is probably holding it for some reason. Can you run a debug kernel and see if lockdep throws any more useful messages?
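When testing with the debug kernel, something along these lines could be used to pull lockdep report headers out of a saved kernel log. It is a sketch only: the function name lockdep_headers is made up, and the three header strings are the ones lockdep commonly prints, so a particular kernel build may word them differently.

```sh
#!/bin/sh
# Hypothetical helper: print lockdep report header lines, with line
# numbers, from a saved kernel log (for example, the output of dmesg).
# Assumption: the kernel uses the usual lockdep header wording.
lockdep_headers() {
    grep -n -E 'possible recursive locking detected|possible circular locking dependency detected|inconsistent lock state' "$1"
}
```

For example, after reproducing with the debug kernel, run dmesg > /tmp/dmesg.txt and then lockdep_headers /tmp/dmesg.txt. The function returns non-zero when no report is found, and the line numbers show where in the log to read the full report.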
This still happens with -177.
Created attachment 376995
0001-bonding-fix-alb-mode-locking-regression.patch

This patch should resolve the problem; it did in my testing. I will include it in my test kernels and update here when more are available.
Were you able to test this patch? I have not done new test kernels recently so this has not made it into them.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days