Bug 653828 - bonding failover in every monitor interval with virtio-net driver
Summary: bonding failover in every monitor interval with virtio-net driver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jiri Olsa
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 653832
 
Reported: 2010-11-16 09:25 UTC by Mark Wu
Modified: 2018-11-14 16:33 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 653832 (view as bug list)
Environment:
Last Closed: 2011-07-21 10:20:08 UTC
Target Upstream Version:
Embargoed:


Attachments
update timestamp of rx and tx in virtio-net driver (776 bytes, application/octet-stream)
2010-11-16 09:28 UTC, Mark Wu


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description Mark Wu 2010-11-16 09:25:48 UTC
Description of problem:


Version-Release number of selected component (if applicable):
guest OS:
kernel-2.6.18-194
host:
KVM; the host version is not relevant.

How reproducible:
100%

Steps to Reproduce:
1. create a virtual guest with two virtio network interfaces
2. configure a bonding interface with the option BONDING_OPTS="mode=active-backup arp_interval=2000 arp_ip_target=192.168.122.1" 

  
Actual results:
Bonding fails over on every ARP monitor inspection:

Nov  4 15:56:09 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:11 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:13 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:13 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:15 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:17 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:17 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:19 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:21 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:21 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:23 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:25 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:25 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:27 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:29 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:29 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:31 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:33 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:33 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:35 localhost kernel: bonding: bond0: link status definitely up for interface eth1.


Expected results:
Bonding works correctly with the virtio-net driver.

Additional info:
Because neither ethtool nor mii-tool can detect the link state of a virtio interface, the ARP monitor is used to detect link failure. Link detection does not make much sense for a virtio interface anyway.

This issue is caused by the virtio driver not updating the rx and tx timestamps.
The current active slave cannot pass the following check, because trans_start is never updated:
                /*
                 * Active slave is down if:
                 * - more than 2*delta since transmitting OR
                 * - (more than 2*delta since receive AND
                 *    the bond has an IP address)
                 */
                if ((slave->state == BOND_STATE_ACTIVE) &&
                    (time_after_eq(jiffies, slave->dev->trans_start +
                                    2 * delta_in_ticks) ||
                      (time_after_eq(jiffies, slave_last_rx(bond, slave)
                                     + 2 * delta_in_ticks)))) {
                        slave->new_link = BOND_LINK_DOWN;
                        commit++;
                }
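
As a rough illustration of why a stale trans_start trips this check, the condition can be pulled out into a small user-space function. This is only a sketch: ticks_after_eq() and active_slave_looks_down() are hypothetical names, plain tick counters stand in for jiffies, and the slave->state test is left out.

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is at or after b" on a free-running tick counter.
 * Hypothetical helper that follows the same idea as the kernel's
 * time_after_eq(): compare via the signed difference.
 */
static bool ticks_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/*
 * Hypothetical stand-alone version of the "active slave is down" test
 * quoted above. All arguments are tick counts; delta is the ARP
 * monitor interval in ticks.
 */
static bool active_slave_looks_down(unsigned long now,
                                    unsigned long trans_start,
                                    unsigned long last_rx,
                                    unsigned long delta)
{
    /* Down if the last transmit OR the last receive is too old. */
    return ticks_after_eq(now, trans_start + 2 * delta) ||
           ticks_after_eq(now, last_rx + 2 * delta);
}
```

With a trans_start that is never refreshed, the first comparison becomes true once 2*delta ticks have passed, no matter how recent last_rx is.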

Although virtio-net does not update last_rx either, last_rx is still updated by the following code:

static inline int skb_bond_should_drop(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;
        struct net_device *master = dev->master;

        if (master) {
                if (master->priv_flags & IFF_MASTER_ARPMON)
                        dev->last_rx = jiffies;
        ...

So the failed interface can come back up at the next ARP inspection and then be chosen at the next failover, because this check only looks at last_rx:
                if (slave->link != BOND_LINK_UP) {
                        if (time_before_eq(jiffies, slave_last_rx(bond, slave) +
                                           delta_in_ticks)) {
                                slave->new_link = BOND_LINK_UP;
                                commit++;
                        }

I then updated the rx and tx timestamps in the virtio driver, and the patched virtio driver works fine with bonding.
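
The interaction of the two checks can be sketched as a small user-space simulation. This is a simplified model under stated assumptions, not the bonding driver's code: struct sim_slave, ticks_after_eq() and count_failovers() are hypothetical names, last_rx is assumed to be refreshed on every slave before each pass, and any other conditions the real monitor applies are ignored.

```c
#include <stdbool.h>

#define NSLAVES 2

/* Hypothetical, simplified stand-in for the per-slave state the monitor reads. */
struct sim_slave {
    bool link_up;              /* link state as seen by the ARP monitor */
    unsigned long trans_start; /* last tx timestamp recorded by the driver */
    unsigned long last_rx;     /* last rx timestamp */
};

/* Wrap-safe "a is at or after b", same idea as the kernel's time_after_eq(). */
static bool ticks_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/*
 * Run `passes` monitor passes, one every `delta` ticks, and return how
 * often the active slave changed. `driver_updates_tx` models whether the
 * NIC driver refreshes trans_start when it transmits.
 */
static int count_failovers(int passes, unsigned long delta, bool driver_updates_tx)
{
    struct sim_slave s[NSLAVES] = { { true, 0, 0 }, { true, 0, 0 } };
    int active = 0;
    int failovers = 0;

    for (int pass = 1; pass <= passes; pass++) {
        unsigned long now = (unsigned long)pass * delta;

        /* Assumed traffic since the last pass: last_rx is refreshed on
         * every slave; trans_start only if the driver cooperates. */
        for (int i = 0; i < NSLAVES; i++)
            s[i].last_rx = now;
        if (driver_updates_tx)
            s[active].trans_start = now;

        /* A down slave comes back as soon as last_rx is recent enough. */
        for (int i = 0; i < NSLAVES; i++)
            if (!s[i].link_up && ticks_after_eq(s[i].last_rx + delta, now))
                s[i].link_up = true;

        /* The active slave is declared down on stale tx OR stale rx. */
        if (ticks_after_eq(now, s[active].trans_start + 2 * delta) ||
            ticks_after_eq(now, s[active].last_rx + 2 * delta)) {
            int other = (active + 1) % NSLAVES;

            s[active].link_up = false;
            if (s[other].link_up) {
                active = other;
                failovers++;
            }
        }
    }
    return failovers;
}
```

In this model, count_failovers(10, 2000, false) returns 9 (a failover on every pass from the second one on), while count_failovers(10, 2000, true) returns 0. The exact cadence differs from the log above because the model leaves out everything except the two quoted checks.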

Comment 1 Mark Wu 2010-11-16 09:28:22 UTC
Created attachment 460793
update timestamp of rx and tx in virtio-net driver

Comment 2 Dor Laor 2010-11-16 10:22:53 UTC
Is it a Windows guest? You opened the report against virtio-win and not kvm.
We would rather fix this on a RHEL 6 host, so please retest on RHEL 6.

Comment 3 Mark Wu 2010-11-17 01:10:13 UTC
Dor,
    It should be a bug in the virtio-net driver. I am not sure how the component became virtio-win; I am sorry for the confusion. According to the code, it should also affect RHEL 6. Anyway, I will retest it on a RHEL 6 guest.

Comment 4 Jiri Olsa 2011-01-04 14:40:19 UTC
Thanks for the patch. The last_rx part is not needed, as last_rx is updated
in skb_bond_should_drop, while trans_start really is omitted and needs to be
taken care of.

Please test the following kernel; it fixes the issue for me.
http://people.redhat.com/jolsa/653828/

I'll post the change soon, thanks.

Comment 6 RHEL Program Management 2011-02-01 16:54:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Jarod Wilson 2011-02-09 14:56:52 UTC
in kernel-2.6.18-243.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 10 Qunfang Zhang 2011-05-24 02:39:17 UTC
Reproduced with guest kernel-2.6.18-238.el5.
1. Boot the guest with 2 virtio NICs:
/usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate now -name rhel5-verify -smp 2,cores=2 -k en-us -m 1G -boot c -net nic,vlan=1,macaddr=00:1a:2a:42:29:10,model=virtio -net tap,vlan=1,script=/etc/qemu-ifup,downscript=no -net nic,vlan=2,macaddr=00:1a:2a:42:25:12,model=virtio -net tap,vlan=2,script=/etc/qemu-ifup,downscript=no   -drive file=/media/rhel5.6-64.qcow2,media=disk,if=virtio,cache=off,boot=on,format=qcow2,werror=stop -cpu qemu64,+sse2 -M rhel5.6.0 -notify all -balloon none -monitor stdio -vnc :10
2. Configure bond0 over eth0 and eth1.
3. Restart the guest network.

Result: 
dmesg in guest:
bonding: bond0: link status definitely down for interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: link status definitely down for interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: link status definitely down for interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: link status definitely down for interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: link status definitely down for interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: link status definitely up for interface eth1.
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.


Verified as passing with guest kernel-2.6.18-262.el5.
Tested with the same steps. In the guest dmesg:
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
bonding: bond0: Warning: failed to get speed and duplex from eth1, assumed to be 100Mb/sec and Full.
bonding: bond0: enslaving eth1 as a backup interface with an up link.

Then I ran set_link down on one network interface and pinged arp_ip_target; the ping succeeded.
After set_link down on the second network interface, arp_ip_target could not be pinged.
After set_link up on one network interface, arp_ip_target could be pinged again.

So this bug is verified as fixed.

Comment 11 juzhang 2011-05-24 02:51:34 UTC
According to comment 10, setting this issue as verified.

Comment 12 errata-xmlrpc 2011-07-21 10:20:08 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

