Bug 533496

Summary:

xen server crashes when used with network bonding modes 5 or 6

Product:

Red Hat Enterprise Linux 5

Reporter:

Bill Braswell <bbraswel>

Component:

kernel-xen

Assignee:

Andy Gospodarek <agospoda>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.4

CC:

agospoda, clalance, cward, drjones, jplans, jwest, llim, pep, peterm, qcai, richard.f.dawson, rlerch, tao, xen-maint

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

587273

Environment:

Last Closed:

2010-03-30 07:08:11 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

587273

Attachments:

Description	Flags
0001-bonding-fix-alb-mode-locking-regression.patch	none

Description Bill Braswell 2009-11-06 21:30:45 UTC

The server crashes with Xen networking on some servers that use mode 6 bonding.  The servers in question are all HP Blades. Each blade has two physical network interfaces that are bonded together in mode 6 (adaptive load balancing). This all works fine without Xen.

Each server is connected to a number of different VLANs. The 'bond0' interface uses the chassis' network switches' default VLAN of 172.30.220.0/24.  We also have an additional VLAN-tagged interface bond0.2175 connected to VLAN 161.2.175.0/24.

The 'network-script' parameter of xend-config.sxp is set to call a wrapper script called 'network-bridge'.
This script runs the following commands when xend starts:

./network-bridge start vifnum=0 bridge=xenbr0 netdev=bond0
./network-bridge start vifnum=1 bridge=xenbr2175 netdev=bond0.2175

The crash is triggered by running `network-bridge` against bond0. If reconfigured to use mode 1, the same wrapper script works fine. The same wrapper script has been used on other servers that are configured with mode 4 bonding (LACP) without any issues.

This problem only occurs if network-bridge is used with a non-tagged mode 5 or 6 bond0 interface.

Steps to reproduce:
1. Configure a server with bonding mode=6 (or 5).
2. In xend-config.sxp, set network-script to /bin/true.
3. Reboot the server.
4. Run /etc/xen/scripts/network-bridge start vifnum=0 netdev=bond0
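
Before running the last step, it can help to confirm which bonding mode the interface is actually in. The following is a minimal sketch, not part of the original report: the function names `bond_mode` and `is_affected_mode` are made up for illustration, and it assumes the bonding driver exposes the mode through sysfs as a string such as `balance-alb 6`.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the Xen scripts.

# Print the bonding mode string of an interface, e.g. "balance-alb 6".
# $1 = bond interface name, $2 = sysfs net directory (assumed default below).
bond_mode() {
    cat "${2:-/sys/class/net}/$1/bonding/mode" 2>/dev/null
}

# Return 0 if the mode string is one of the modes named in this report
# (mode 5 = balance-tlb, mode 6 = balance-alb), 1 otherwise.
is_affected_mode() {
    case "$1" in
        balance-tlb*|balance-alb*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_affected_mode "$(bond_mode bond0)" && echo "bond0 uses mode 5 or 6"` prints a message only when bond0 is in one of the two modes that trigger the crash.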

A crash dump is available at megatron.gsslab.rdu.redhat.com:/cores/20091014052831/work/

 #0 [ffff8800181d7c90] crash_kexec at ffffffff802a572b
 #1 [ffff8800181d7d50] panic at ffffffff8028cac1
 #2 [ffff8800181d7e40] softlockup_tick at ffffffff802b257b
 #3 [ffff8800181d7e80] timer_interrupt at ffffffff802701b6
 #4 [ffff8800181d7ed0] handle_IRQ_event at ffffffff802114ad
 #5 [ffff8800181d7f00] __do_IRQ at ffffffff802b28b4
 #6 [ffff8800181d7f40] do_IRQ at ffffffff8026df32
 #7 [ffff8800181d7f60] evtchn_do_upcall at ffffffff803aed9d
 #8 [ffff8800181d7fb0] do_hypervisor_callback at ffffffff802608d6
--- <IRQ stack> ---
 #9 [ffff880364d03c98] do_hypervisor_callback at ffffffff802608d6
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff80262071  RSP: ffff880364d03d40  RFLAGS: 00000206
    RAX: ffff880364d03fd8  RBX: ffff88039a591530  RCX: 0000000000000004
    RDX: 00000000000002c0  RSI: ffff8803c7e5cc00  RDI: ffff88039a591530
    RBP: 0000000000000000   R8: 0000000000000002   R9: 0000000000000002
    R10: ffff8803682e9380  R11: ffffffff882c1872  R12: 0000000000000000
    R13: ffff88039a5911a8  R14: 0000000000000002  R15: ffff88039a591000
    ORIG_RAX: fffffffffffffedc  CS: e030  SS: e02b
#10 [ffff880364d03d40] _write_lock_bh at ffffffff80264a09
#11 [ffff880364d03d50] bond_alb_set_mac_address at ffffffff88635bfe
#12 [ffff880364d03dc0] dev_set_mac_address at ffffffff804193fd
#13 [ffff880364d03de0] dev_ioctl at ffffffff8041b5e4
#14 [ffff880364d03e90] sock_ioctl at ffffffff80411220
#15 [ffff880364d03eb0] do_ioctl at ffffffff80243e61
#16 [ffff880364d03ed0] vfs_ioctl at ffffffff802316f3
#17 [ffff880364d03f40] sys_ioctl at ffffffff8024e5c7
#18 [ffff880364d03f80] tracesys at ffffffff802602f9 (via system_call)


It looks as if the bond_alb_set_mac_address() routine is trying to acquire a lock and failing, even though all the other processors are idle, so the lock should be available.

Comment 1 Jiri Denemark 2009-11-09 10:09:05 UTC

I'm not sure if that's relevant here, but the network-bridge script is known not to work well with bonding devices. Could you try reconfiguring xend to use network-bridge-bonding instead? When doing so, please remove the bridge=xenbr* arguments; the bridge will then have the same name as the original netdev.

Comment 2 Josep 'Pep' Turro Mauri 2009-11-09 10:29:04 UTC

It's actually the other way around: because the network interfaces are VLANs (over bonds), network-bridge-bonding doesn't work properly. This is why we use network-bridge.

We have previously found that in other tickets and I can dig up the details if needed, but my understanding is that the problem in this bugzilla doesn't really depend on which script is used to configure the bridges.

Comment 3 Jiri Denemark 2009-11-09 10:41:11 UTC

Ah, ok, I missed that somehow... anyway, moving to the kernel-xen component.

Comment 6 Andy Gospodarek 2009-12-01 01:07:47 UTC

That lock should be available, but it appears that the bond0 thread is probably holding it for some reason.  Can you run a debug kernel and see if lockdep throws any more useful messages?

Comment 10 Andy Gospodarek 2009-12-08 17:04:47 UTC

This still happens with -177.

Comment 11 Andy Gospodarek 2009-12-08 20:02:26 UTC

Created attachment 376995
0001-bonding-fix-alb-mode-locking-regression.patch

This patch should resolve the problem.  It did in my testing.  I will include it in my test kernels and update here when more are available.

Comment 12 Andy Gospodarek 2010-01-20 15:35:03 UTC

Were you able to test this patch?

I have not built new test kernels recently, so this patch has not made it into them yet.

Comment 28 errata-xmlrpc 2010-03-30 07:08:11 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 32 Red Hat Bugzilla 2023-09-14 01:18:38 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.