Bug 641112
| Field | Value |
|---|---|
| Summary | bonding does not switch to slave |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.8 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Reporter | Marc Milgram <mmilgram> |
| Assignee | Andy Gospodarek <agospoda> |
| QA Contact | Network QE <network-qe> |
| CC | dhoward, hjia, jwest, kzhang, peterm, plyons, qcai, syeghiay, tumeya, vgoyal |
| Target Milestone | rc |
| Keywords | ZStream |
| Doc Type | Bug Fix |
| Last Closed | 2011-02-16 15:15:59 UTC |
| Bug Blocks | 672465 |

Description
Marc Milgram
2010-10-07 19:09:38 UTC
[Analysis]

    3334 static void bond_activebackup_arp_mon(struct net_device *bond_dev)
    3335 {
       :
    3443         if (slave) {
    3444                 /* if we have sent traffic in the past 2*arp_intervals but
    3445                  * haven't xmit and rx traffic in that time interval, select
    3446                  * a different slave. slave->jiffies is only updated when
    3447                  * a slave first becomes the curr_active_slave - not necessarily
    3448                  * after every arp; this ensures the slave has a full 2*delta
    3449                  * before being taken out. if a primary is being used, check
    3450                  * if it is up and needs to take over as the curr_active_slave
    3451                  */
    3452                 if ((time_after_eq(jiffies, slave->dev->trans_start + 2*delta_in_ticks) ||
    3453                      (time_after_eq(jiffies, slave_last_rx(bond, slave) + 2*delta_in_ticks) &&
    3454                       bond_has_ip(bond))) &&
    3455                     time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks)) {
    3456
    3457                         slave->link = BOND_LINK_DOWN;
    3458
    3459                         if (slave->link_failure_count < UINT_MAX) {
    3460                                 slave->link_failure_count++;
    3461                         }
    3462
    3463                         printk(KERN_INFO DRV_NAME
    3464                                ": %s: link status down for active interface "
    3465                                "%s, disabling it\n",
    3466                                bond_dev->name,
    3467                                slave->dev->name);
       :

If the conditions on lines 3452 - 3455 are satisfied, the bonding driver switches to the secondary NIC. We found a problem with this time_after_eq() call:

    time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks)

    jiffies:         counter that is incremented every 1 ms. The value wraps to
                     zero 5 minutes after boot.
    slave->jiffies:  value of jiffies when the current slave was set
    delta_in_ticks:  1 sec (arp_interval=1000)

    #define time_after_eq(a,b)              \
            (typecheck(unsigned long, a) && \
             typecheck(unsigned long, b) && \
             ((long)(a) - (long)(b) >= 0))

If the slave is set 5 minutes after boot, slave->jiffies == 0 and the expression above becomes:

    time_after_eq(jiffies, 2*1000)

If no slave change happens, this expression becomes false about 24.8 days after boot, and while time_after_eq() is false the problem occurs: the driver does not switch slaves. The sign of jiffies, read as a signed long, flips every 24.8 days, and the counter overflows after about 49.7 days.
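The wrap-around behaviour described in this analysis can be reproduced in user space. The sketch below is only a model, not driver code: it assumes a 32-bit unsigned long and HZ=1000, and the helper names (time_after_eq32, grace_elapsed, days_to_ticks, check_windows) are made up for illustration. It uses an unsigned subtraction followed by a signed cast, which gives the same answer as the kernel macro on two's-complement machines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: HZ=1000 and a 32-bit jiffies counter, as in the analysis. */
#define HZ_ASSUMED 1000u

/* Model of time_after_eq(a, b) for 32-bit longs: the result is decided
 * by the sign of the wrapped difference a - b. */
static bool time_after_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Hypothetical helper mirroring the third condition on line 3455:
 * has the slave been active for at least 2*delta_in_ticks? */
static bool grace_elapsed(uint32_t now, uint32_t slave_jiffies,
                          uint32_t delta_in_ticks)
{
    return time_after_eq32(now, slave_jiffies + 2u * delta_in_ticks);
}

/* Convert days to ticks; the multiplication wraps modulo 2^32 exactly
 * like a 32-bit jiffies counter would. */
static uint32_t days_to_ticks(uint32_t days)
{
    return days * 86400u * HZ_ASSUMED;
}

/* Self-check with slave->jiffies == 0 and delta_in_ticks == 1000. */
static void check_windows(void)
{
    assert(grace_elapsed(days_to_ticks(10), 0, 1000));   /* before 24.8 days */
    assert(!grace_elapsed(days_to_ticks(31), 0, 1000));  /* customer case */
    assert(!grace_elapsed(days_to_ticks(33), 0, 1000));  /* customer case */
    assert(grace_elapsed(days_to_ticks(60), 0, 1000));   /* after the wrap */
}
```

Calling check_windows() from a main() exercises the cases: with a stale slave->jiffies of 0, the check is true at 10 days, false at 31 and 33 days (the uptimes reported by the customer), and true again at 60 days once the counter has wrapped.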
    0         - 24.8 days   true    problem does not happen
    24.8 days - 49.6 days   false   problem happens
    49.6 days - 74.4 days   true    problem does not happen
    74.4 days - 99.2 days   false   problem happens
       :

On customer systems, this problem occurred 31 days and 33 days after system boot.

Starting with RHEL5.4, the implementation of bond_activebackup_arp_mon was changed and the problem no longer seems to happen.

The fix was included in RHEL5 as part of Bug 462632. Upstream, it is included in Linus' tree as b2220cad583c9b63e085476df448fa2aff5ea906.

What is the newest kernel the customer has tried? It looks like the issue where slave->jiffies is unset is already resolved under bug 489362 and included in all kernels after 2.6.9-85.EL (Thanks, Don!). It seems odd to me that this is actually a concern.

The customer reported the issue against 2.6.9-89.0.20.ELsmp.

It looks like my suspicion was incorrect. Has the customer really confirmed that this works on RHEL5 and is broken only on RHEL4?

Created attachment 459537: initial patch
Though this patch doesn't add all of the changes made in upstream commit:

    commit cb32f2a0d194212e4e750a8cdedcc610c9ca4876
    Author: Jiri Bohac <jbohac>
    Date:   Thu Sep 2 05:45:54 2010 +0000

        bonding: Fix jiffies overflow problems (again)

I think this will prevent the jiffies-wrapping issues the customer is seeing. The patch looks larger than it is, since most of the changes are indentation, but you can see the basic changes.
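For readers who want to see how a wrap-tolerant form of the check might look, here is a minimal user-space sketch. It is not the attached patch and makes no claim about what that patch changes. It models the kernel's time_in_range() macro for a 32-bit counter (an assumption, as is HZ=1000) and asks whether the current time still lies inside a short grace window after the slave became active, instead of asking whether it lies after the end of that window. The names in_grace_window and grace_elapsed_ranged are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32-bit models of the kernel comparison macros (assumed word size). */
static bool time_after_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

static bool time_before_eq32(uint32_t a, uint32_t b)
{
    return time_after_eq32(b, a);
}

/* Model of time_in_range(a, b, c): true when b <= a <= c. */
static bool time_in_range32(uint32_t a, uint32_t b, uint32_t c)
{
    return time_after_eq32(a, b) && time_before_eq32(a, c);
}

/* Hypothetical helper: is "now" still within 2*delta ticks of the
 * moment the slave became active (stamp)? */
static bool in_grace_window(uint32_t now, uint32_t stamp, uint32_t delta)
{
    return time_in_range32(now, stamp, stamp + 2u * delta - 1u);
}

/* Hypothetical replacement for the third condition: the grace period
 * has elapsed whenever we are outside the short window. */
static bool grace_elapsed_ranged(uint32_t now, uint32_t stamp, uint32_t delta)
{
    return !in_grace_window(now, stamp, delta);
}
```

With stamp 0 and delta 1000, grace_elapsed_ranged() is false at tick 1000, true at tick 2000, and stays true at 31 days (2678400000 ticks), where the original comparison turned false. The trade-off is that a window test still misfires for the 2*delta ticks in which the counter has wrapped exactly back onto a stale stamp, roughly once every 49.7 days under the assumptions above, so refreshing the timestamp periodically remains the more robust option.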
Committed in 99.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html