Bug 533496 - xen server crashes when used with network bonding modes 5 or 6
Summary: xen server crashes when used with network bonding modes 5 or 6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 587273
Reported: 2009-11-06 21:30 UTC by Bill Braswell
Modified: 2023-09-14 01:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 587273 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:08:11 UTC
Target Upstream Version:
Embargoed:


Attachments
0001-bonding-fix-alb-mode-locking-regression.patch (1.89 KB, patch)
2009-12-08 20:02 UTC, Andy Gospodarek


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2010:0178 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update | 2010-03-29 12:18:21 UTC

Description Bill Braswell 2009-11-06 21:30:45 UTC
The server crashes with Xen networking on some servers that use mode 6 bonding.  The servers in question are all HP Blades. Each blade has two physical network interfaces that are bonded together in mode 6 (adaptive load balancing). This all works fine without Xen.

Each server is connected to a number of different VLANs. The 'bond0' interface uses the chassis' network switches' default VLAN of 172.30.220.0/24.  We also have an additional VLAN-tagged interface bond0.2175 connected to VLAN 161.2.175.0/24.

The 'network-script' parameter of xend-config.sxp is set to call a wrapper script called 'network-bridge'.
This script runs the following commands when xend starts:

./network-bridge start vifnum=0 bridge=xenbr0 netdev=bond0
./network-bridge start vifnum=1 bridge=xenbr2175 netdev=bond0.2175

The crash is triggered by running `network-bridge` against bond0. If reconfigured to use mode 1, the same wrapper script works fine. The same wrapper script has been used on other servers that are configured with mode 4 bonding (LACP) without any issues.

This problem only occurs if network-bridge is used with a non-tagged mode 5 or 6 bond0 interface.
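As a rough aid for checking whether a host matches the combination described above, here is a minimal Python sketch. It assumes the bonding driver exposes its mode through sysfs as text such as "balance-alb 6" in /sys/class/net/<bond>/bonding/mode, and it treats any interface name containing a dot as VLAN-tagged. The function names are hypothetical and are not part of any Xen or Red Hat tool.

```python
import os

# Mode numbers and names used by the Linux bonding driver.
BONDING_MODES = {
    0: "balance-rr",
    1: "active-backup",
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",
    5: "balance-tlb",
    6: "balance-alb",
}

# Modes this report identifies as triggering the crash.
AFFECTED_MODES = {5, 6}


def parse_bonding_mode(text):
    """Parse sysfs mode text such as 'balance-alb 6' into (name, number)."""
    name, number = text.split()
    return name, int(number)


def is_affected(mode_number, vlan_tagged):
    """True for the case in this report: an untagged bond in mode 5 or 6."""
    return mode_number in AFFECTED_MODES and not vlan_tagged


def check_bond(netdev="bond0", sysfs_root="/sys/class/net"):
    """Read the bond's mode from sysfs (assumed path) and classify it."""
    path = os.path.join(sysfs_root, netdev, "bonding", "mode")
    with open(path) as handle:
        name, number = parse_bonding_mode(handle.read())
    # Assumption: a dot in the name (e.g. bond0.2175) means a VLAN interface.
    return name, number, is_affected(number, vlan_tagged="." in netdev)
```

Calling check_bond("bond0") on a host configured as in this report would be expected to classify the interface as affected; the mode 1 and mode 4 hosts mentioned above would not be.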

Steps to reproduce:
1. Configure a server with bonding mode=6 (or 5).
2. In xend-config.sxp, set network-script to /bin/true.
3. Reboot the server.
4. Run /etc/xen/scripts/network-bridge start vifnum=0 netdev=bond0

A crash dump is available at megatron.gsslab.rdu.redhat.com:/cores/20091014052831/work/

 #0 [ffff8800181d7c90] crash_kexec at ffffffff802a572b
 #1 [ffff8800181d7d50] panic at ffffffff8028cac1
 #2 [ffff8800181d7e40] softlockup_tick at ffffffff802b257b
 #3 [ffff8800181d7e80] timer_interrupt at ffffffff802701b6
 #4 [ffff8800181d7ed0] handle_IRQ_event at ffffffff802114ad
 #5 [ffff8800181d7f00] __do_IRQ at ffffffff802b28b4
 #6 [ffff8800181d7f40] do_IRQ at ffffffff8026df32
 #7 [ffff8800181d7f60] evtchn_do_upcall at ffffffff803aed9d
 #8 [ffff8800181d7fb0] do_hypervisor_callback at ffffffff802608d6
--- <IRQ stack> ---
 #9 [ffff880364d03c98] do_hypervisor_callback at ffffffff802608d6
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff80262071  RSP: ffff880364d03d40  RFLAGS: 00000206
    RAX: ffff880364d03fd8  RBX: ffff88039a591530  RCX: 0000000000000004
    RDX: 00000000000002c0  RSI: ffff8803c7e5cc00  RDI: ffff88039a591530
    RBP: 0000000000000000   R8: 0000000000000002   R9: 0000000000000002
    R10: ffff8803682e9380  R11: ffffffff882c1872  R12: 0000000000000000
    R13: ffff88039a5911a8  R14: 0000000000000002  R15: ffff88039a591000
    ORIG_RAX: fffffffffffffedc  CS: e030  SS: e02b
#10 [ffff880364d03d40] _write_lock_bh at ffffffff80264a09
#11 [ffff880364d03d50] bond_alb_set_mac_address at ffffffff88635bfe
#12 [ffff880364d03dc0] dev_set_mac_address at ffffffff804193fd
#13 [ffff880364d03de0] dev_ioctl at ffffffff8041b5e4
#14 [ffff880364d03e90] sock_ioctl at ffffffff80411220
#15 [ffff880364d03eb0] do_ioctl at ffffffff80243e61
#16 [ffff880364d03ed0] vfs_ioctl at ffffffff802316f3
#17 [ffff880364d03f40] sys_ioctl at ffffffff8024e5c7
#18 [ffff880364d03f80] tracesys at ffffffff802602f9 (via system_call)


It looks as if the bond_alb_set_mac_address() routine is trying to acquire a write lock and failing. But all the other processors are idle, so the lock should be available.
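The report does not state the root cause beyond what the trace shows: a write-lock attempt that never succeeds until softlockup_tick calls panic. One generic way this can happen while every other CPU is idle is that a hold on the same lock was taken earlier and never dropped. The toy Python model below illustrates only that general pattern. ToyRWLock and set_mac_with_leaked_hold are hypothetical names, this is not kernel code, and it is not claimed to be what the attached patch changes.

```python
class ToyRWLock:
    """Toy single-threaded model of a reader-writer lock (not kernel code)."""

    def __init__(self):
        self.readers = 0      # number of outstanding read holds
        self.writer = False   # whether a writer currently holds the lock

    def read_lock(self):
        if self.writer:
            raise RuntimeError("toy model: a writer already holds the lock")
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1

    def try_write_lock(self):
        """Succeed only when nobody holds the lock; never blocks."""
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

    def write_unlock(self):
        self.writer = False


def set_mac_with_leaked_hold(lock, attempts=3):
    """Hypothetical path: a hold taken earlier is still outstanding
    when the write lock is requested, so every attempt fails."""
    lock.read_lock()  # hold that is never released before the write attempt
    return [lock.try_write_lock() for _ in range(attempts)]
```

In this model every write attempt fails no matter how many times it is retried, because the outstanding hold can only be dropped by the same path that is now waiting. A real spinning write lock would loop instead of returning False, which would be consistent with the soft lockup seen in the trace.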

Comment 1 Jiri Denemark 2009-11-09 10:09:05 UTC
I'm not sure if that's relevant here, but network-bridge script is known not to work well with bonding devices, could you try reconfiguring xend to use network-bridge-bonding instead? When doing so, please remove the bridge=xenbr*; the bridge would have the same name as the original netdev.

Comment 2 Josep 'Pep' Turro Mauri 2009-11-09 10:29:04 UTC
It's actually the other way around: because the network interfaces are VLANs (over bonds) network-bridge-bonding doesn't work properly. This is why we use network-bridge.

We have previously found that in other tickets and I can dig the details if needed, but my understanding is that the problem on this bugzilla doesn't really depend on which script they use to configure them.

Comment 3 Jiri Denemark 2009-11-09 10:41:11 UTC
Ah, ok, I missed that somehow... anyway, moving to kernel-xen component

Comment 6 Andy Gospodarek 2009-12-01 01:07:47 UTC
That lock should be available, but it appears that the bond0 thread is probably holding it for some reason.  Can you run a debug kernel and see if lockdep throws any more useful messages?

Comment 10 Andy Gospodarek 2009-12-08 17:04:47 UTC
This still happens with -177.

Comment 11 Andy Gospodarek 2009-12-08 20:02:26 UTC
Created attachment 376995
0001-bonding-fix-alb-mode-locking-regression.patch

This patch should resolve the problem.  It did with my testing.  I will include it in my test kernels and update here when more are available.

Comment 12 Andy Gospodarek 2010-01-20 15:35:03 UTC
Were you able to test this patch?

I have not done new test kernels recently so this has not made it into them.

Comment 28 errata-xmlrpc 2010-03-30 07:08:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 32 Red Hat Bugzilla 2023-09-14 01:18:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

