Bug 653832

Summary: bonding failover in every monitor interval with virtio-net driver
Product: Red Hat Enterprise Linux 6 Reporter: Mark Wu <dwu>
Component: kernel Assignee: Jiri Olsa <jolsa>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: jolsa, nhorman, tgraf, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 653828 Environment:
Last Closed: 2011-01-04 14:47:26 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 653828    
Bug Blocks:    

Description Mark Wu 2010-11-16 09:29:19 UTC
+++ This bug was initially created as a clone of Bug #653828 +++

Description of problem:


Version-Release number of selected component (if applicable):
guest OS:
kernel-2.6.18-194
host:
The KVM host and its version are not relevant.

How reproducible:
100%

Steps to Reproduce:
1. create a virtual guest with two virtio network interfaces
2. configure a bonding interface with the option BONDING_OPTS="mode=active-backup arp_interval=2000 arp_ip_target=192.168.122.1" 

  
Actual results:
Bonding fails over on every ARP monitor inspection:

Nov  4 15:56:09 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:11 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:13 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:13 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:15 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:17 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:17 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:19 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:21 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:21 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:23 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:25 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:25 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:27 localhost kernel: bonding: bond0: link status definitely up for interface eth1.
Nov  4 15:56:29 localhost kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Nov  4 15:56:29 localhost kernel: bonding: bond0: making interface eth1 the new active one.
Nov  4 15:56:31 localhost kernel: bonding: bond0: link status definitely up for interface eth0.
Nov  4 15:56:33 localhost kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Nov  4 15:56:33 localhost kernel: bonding: bond0: making interface eth0 the new active one.
Nov  4 15:56:35 localhost kernel: bonding: bond0: link status definitely up for interface eth1.


Expected results:
Bonding works correctly with the virtio-net driver.

Additional info:
 Because neither ethtool nor mii-tool can detect the link state of a virtio interface, the ARP monitor is used to detect link failure. Link-state detection does not make much sense for a virtio interface.

This issue is caused by the virtio driver not updating the rx and tx timestamps.
The current active slave fails the following check because trans_start is never updated:
                /*
                 * Active slave is down if:
                 * - more than 2*delta since transmitting OR
                 * - (more than 2*delta since receive AND
                 *    the bond has an IP address)
                 */
                if ((slave->state == BOND_STATE_ACTIVE) &&
                    (time_after_eq(jiffies, slave->dev->trans_start +
                                    2 * delta_in_ticks) ||
                      (time_after_eq(jiffies, slave_last_rx(bond, slave)
                                     + 2 * delta_in_ticks)))) {
                        slave->new_link = BOND_LINK_DOWN;
                        commit++;
                }
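
To make the failure mode concrete, here is a minimal user-space sketch of the check above. It is not kernel code: the helper names time_after_eq_ul and active_slave_looks_down are hypothetical, and the jiffies values are plain unsigned long arguments. The comparison mirrors the kernel's wraparound-safe time_after_eq() idiom (a signed difference), and the sketch assumes that a driver which never stamps trans_start leaves it at an old value.

```c
#include <stdbool.h>

/*
 * User-space stand-in for the kernel's time_after_eq(a, b):
 * true when a is at or after b, even across counter wraparound.
 */
static bool time_after_eq_ul(unsigned long a, unsigned long b)
{
        return (long)(a - b) >= 0;
}

/*
 * Hypothetical helper mirroring the quoted condition: the active slave
 * is considered down when either the last transmit or the last receive
 * is at least 2 * delta ticks old.
 */
static bool active_slave_looks_down(unsigned long now,
                                    unsigned long trans_start,
                                    unsigned long last_rx,
                                    unsigned long delta)
{
        return time_after_eq_ul(now, trans_start + 2 * delta) ||
               time_after_eq_ul(now, last_rx + 2 * delta);
}
```

With now = 1000, delta = 200, a fresh last_rx of 1000 and a trans_start stuck at 0, the helper returns true: the receive side is healthy, but the stale transmit timestamp alone is enough to mark the active slave down.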

Although virtio-net does not update last_rx either, last_rx is updated by the following code:

static inline int skb_bond_should_drop(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;
        struct net_device *master = dev->master;

        if (master) {
                if (master->priv_flags & IFF_MASTER_ARPMON)
                        dev->last_rx = jiffies;
        ...

So the failed interface can come back up on the next ARP inspection and then be chosen on the next failover, because this check looks only at last_rx:
                if (slave->link != BOND_LINK_UP) {
                        if (time_before_eq(jiffies, slave_last_rx(bond, slave) +
                                           delta_in_ticks)) {
                                slave->new_link = BOND_LINK_UP;
                                commit++;
                        }
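
The recovery check can be sketched in user space the same way. Again this is only an illustration: time_before_eq_ul and down_slave_looks_up are hypothetical names, and the timestamps are passed in as plain unsigned long values rather than read from a real slave. The point is the asymmetry: this path never consults trans_start, so a slave whose last_rx is kept fresh passes it regardless of its transmit timestamp.

```c
#include <stdbool.h>

/*
 * User-space stand-in for the kernel's time_before_eq(a, b):
 * true when a is at or before b, even across counter wraparound.
 */
static bool time_before_eq_ul(unsigned long a, unsigned long b)
{
        return (long)(b - a) >= 0;
}

/*
 * Hypothetical helper mirroring the quoted condition: a slave that is
 * not up is brought back up when it received something within the
 * last delta ticks. trans_start plays no part in this decision.
 */
static bool down_slave_looks_up(unsigned long now,
                                unsigned long last_rx,
                                unsigned long delta)
{
        return time_before_eq_ul(now, last_rx + delta);
}
```

With now = 1000, delta = 200 and last_rx = 1000 the helper returns true, while a last_rx of 700 returns false. A slave with a fresh last_rx and a stale trans_start therefore recovers through this check and then fails the active-slave check once it is selected, which matches the alternating pattern in the log.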

I then updated the rx and tx timestamps in the virtio driver, and the patched driver works correctly with bonding.

--- Additional comment from dwu on 2010-11-16 04:28:22 EST ---

Created attachment 460793 [details]
update timestamp of rx and tx in virtio-net driver

Comment 2 Neil Horman 2010-11-18 15:48:01 UTC
Triage assignment.  If you feel this bug doesn't belong to you, or that it cannot be handled in a timely fashion, please contact me for re-assignment

Comment 3 Jiri Olsa 2011-01-03 17:23:42 UTC
I cannot reproduce this one in RHEL6, while I can on same setup with RHEL5 (which is taken care via BZ 653828) .. so I believe this is not an issue for RHEL6

as for RHEL5:
seems like the last_rx stat should be taken care by skb_bond_should_drop
function.. which is already ready in RHEL5.. so my guess is that the proposed
patch is probably workaround, and we might need other fix for this issue
on RHEL5

if I don't hear from you otherwise, I'll close this one in a week..

thanks,
jirka

Comment 4 Jiri Olsa 2011-01-04 14:47:26 UTC
closing, as the reason/fix was found for RHEL5 BZ 653828