Bug 533496

Summary: xen server crashes when used with network bonding modes 5 or 6
Product: Red Hat Enterprise Linux 5
Reporter: Bill Braswell <bbraswel>
Component: kernel-xen
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: high
Version: 5.4
CC: agospoda, clalance, cward, drjones, jplans, jwest, llim, pep, peterm, qcai, richard.f.dawson, rlerch, tao, xen-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 587273 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:08:11 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 587273
Attachments:
Description Flags
0001-bonding-fix-alb-mode-locking-regression.patch none

Description Bill Braswell 2009-11-06 21:30:45 UTC
The server crashes with Xen networking on some servers that use mode 6 bonding.  The servers in question are all HP Blades. Each blade has two physical network interfaces that are bonded together in mode 6 (adaptive load balancing). This all works fine without Xen.

Each server is connected to a number of different VLANs. The 'bond0' interface uses the chassis' network switches' default VLAN of 172.30.220.0/24.  We also have an additional VLAN-tagged interface bond0.2175 connected to VLAN 161.2.175.0/24.

The 'network-script' parameter of xend-config.sxp is set to call a wrapper script called 'network-bridge'.
This script runs the following commands when xend starts:

./network-bridge start vifnum=0 bridge=xenbr0 netdev=bond0
./network-bridge start vifnum=1 bridge=xenbr2175 netdev=bond0.2175
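
For illustration, a wrapper of this kind could look roughly like the sketch below. This is not the reporter's actual script: the function name run_bridges and the NETWORK_BRIDGE variable are hypothetical, and the default path is assumed from the /etc/xen/scripts/network-bridge location used in the reproduction steps.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; not the script from this report.
# NETWORK_BRIDGE and run_bridges are illustrative names (assumptions).
NETWORK_BRIDGE="${NETWORK_BRIDGE:-/etc/xen/scripts/network-bridge}"

run_bridges() {
    # $1 is the action handed to the wrapper (for example "start").
    action="$1"
    # One bridge on the untagged bond, one on the VLAN-tagged interface.
    "$NETWORK_BRIDGE" "$action" vifnum=0 bridge=xenbr0 netdev=bond0
    "$NETWORK_BRIDGE" "$action" vifnum=1 bridge=xenbr2175 netdev=bond0.2175
}

# Only act when an action argument was supplied.
if [ "$#" -gt 0 ]; then
    run_bridges "$1"
fi
```

Setting NETWORK_BRIDGE=echo before calling run_bridges prints the two command lines instead of running them, which is a safe way to check the arguments on a machine without Xen.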

The crash is triggered by running `network-bridge` against bond0. If the bond is reconfigured to use mode 1, the same wrapper script works fine. The same wrapper script has also been used without any issues on other servers that are configured with mode 4 bonding (LACP).

This problem only occurs if network-bridge is used with a non-tagged mode 5 or 6 bond0 interface.
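
A quick way to tell whether a given host matches this condition is to read the mode that the bonding driver reports under /proc/net/bonding. The sketch below assumes the driver's usual "Bonding Mode:" status line; the helper name bond_mode_affected and its optional second argument (an alternate status file, useful for testing) are hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the bond runs in mode 5 (balance-tlb) or mode 6
# (balance-alb), the two modes this report identifies as affected.
# bond_mode_affected is a hypothetical helper name (assumption).
bond_mode_affected() {
    bond="$1"
    # Optional second argument overrides the status file path.
    status_file="${2:-/proc/net/bonding/$bond}"
    # The driver prints a line such as
    #   "Bonding Mode: adaptive load balancing"
    mode=$(sed -n 's/^Bonding Mode: //p' "$status_file")
    case "$mode" in
        "transmit load balancing"*|"adaptive load balancing"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `bond_mode_affected bond0 && echo "bond0 uses mode 5 or 6"` prints the message only on a host whose bond0 is in one of the two affected modes.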

Steps to reproduce
Configure a server with bonding mode=6 (or 5).
In xend-config.sxp, set network-script to /bin/true.
Reboot the server.
Run /etc/xen/scripts/network-bridge start vifnum=0 netdev=bond0

A crash dump is available at megatron.gsslab.rdu.redhat.com:/cores/20091014052831/work/

 #0 [ffff8800181d7c90] crash_kexec at ffffffff802a572b
 #1 [ffff8800181d7d50] panic at ffffffff8028cac1
 #2 [ffff8800181d7e40] softlockup_tick at ffffffff802b257b
 #3 [ffff8800181d7e80] timer_interrupt at ffffffff802701b6
 #4 [ffff8800181d7ed0] handle_IRQ_event at ffffffff802114ad
 #5 [ffff8800181d7f00] __do_IRQ at ffffffff802b28b4
 #6 [ffff8800181d7f40] do_IRQ at ffffffff8026df32
 #7 [ffff8800181d7f60] evtchn_do_upcall at ffffffff803aed9d
 #8 [ffff8800181d7fb0] do_hypervisor_callback at ffffffff802608d6
--- <IRQ stack> ---
 #9 [ffff880364d03c98] do_hypervisor_callback at ffffffff802608d6
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff80262071  RSP: ffff880364d03d40  RFLAGS: 00000206
    RAX: ffff880364d03fd8  RBX: ffff88039a591530  RCX: 0000000000000004
    RDX: 00000000000002c0  RSI: ffff8803c7e5cc00  RDI: ffff88039a591530
    RBP: 0000000000000000   R8: 0000000000000002   R9: 0000000000000002
    R10: ffff8803682e9380  R11: ffffffff882c1872  R12: 0000000000000000
    R13: ffff88039a5911a8  R14: 0000000000000002  R15: ffff88039a591000
    ORIG_RAX: fffffffffffffedc  CS: e030  SS: e02b
#10 [ffff880364d03d40] _write_lock_bh at ffffffff80264a09
#11 [ffff880364d03d50] bond_alb_set_mac_address at ffffffff88635bfe
#12 [ffff880364d03dc0] dev_set_mac_address at ffffffff804193fd
#13 [ffff880364d03de0] dev_ioctl at ffffffff8041b5e4
#14 [ffff880364d03e90] sock_ioctl at ffffffff80411220
#15 [ffff880364d03eb0] do_ioctl at ffffffff80243e61
#16 [ffff880364d03ed0] vfs_ioctl at ffffffff802316f3
#17 [ffff880364d03f40] sys_ioctl at ffffffff8024e5c7
#18 [ffff880364d03f80] tracesys at ffffffff802602f9 (via system_call)


It looks as if the bond_alb_set_mac_address() routine is spinning in __write_lock_failed while trying to acquire a write lock, until the soft lockup detector panics the system.  But all the other processors are idle, so the lock should be available.

Comment 1 Jiri Denemark 2009-11-09 10:09:05 UTC
I'm not sure if that's relevant here, but the network-bridge script is known not to work well with bonding devices. Could you try reconfiguring xend to use network-bridge-bonding instead? When doing so, please remove the bridge=xenbr* argument; the bridge will then have the same name as the original netdev.

Comment 2 Josep 'Pep' Turro Mauri 2009-11-09 10:29:04 UTC
It's actually the other way around: because the network interfaces are VLANs (over bonds) network-bridge-bonding doesn't work properly. This is why we use network-bridge.

We have previously found that in other tickets and I can dig up the details if needed, but my understanding is that the problem in this bugzilla doesn't really depend on which script is used to configure the interfaces.

Comment 3 Jiri Denemark 2009-11-09 10:41:11 UTC
Ah, ok, I missed that somehow... anyway, moving to the kernel-xen component.

Comment 6 Andy Gospodarek 2009-12-01 01:07:47 UTC
That lock should be available, but it appears that the bond0 thread is probably holding it for some reason.  Can you run a debug kernel and see if lockdep throws any more useful messages?

Comment 10 Andy Gospodarek 2009-12-08 17:04:47 UTC
This still happens with -177.

Comment 11 Andy Gospodarek 2009-12-08 20:02:26 UTC
Created attachment 376995 [details]
0001-bonding-fix-alb-mode-locking-regression.patch

This patch should resolve the problem.  It did in my testing.  I will include it in my test kernels and update here when more are available.

Comment 12 Andy Gospodarek 2010-01-20 15:35:03 UTC
Were you able to test this patch?

I have not done new test kernels recently so this has not made it into them.

Comment 28 errata-xmlrpc 2010-03-30 07:08:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 32 Red Hat Bugzilla 2023-09-14 01:18:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days