Description of problem:
Bonding with arp_monitor does not switch to the slave NIC when the primary NIC goes down under certain conditions.

/etc/modprobe.conf contains the following:
===
alias bond0 bonding
options bond0 mode=active-backup arp_interval=1000 arp_ip_target=130.196.156.62 primary=eth0
===

With the above bonding configuration, if the link of the primary NIC (eth0) goes down, the bonding driver does not switch to the secondary NIC (eth2). This symptom occurs only on 32-bit (x86) systems.

When the symptom occurs, the link-down message of the bonding device is not logged; only the NIC driver message appears:
==
Jul 12 16:21:10 xxxxxx kernel: bnx2: eth0 NIC Copper Link is Down
==

In the normal case, the bonding messages marked below are also logged:
==
Aug 10 15:23:14 xxxxxx kernel: bnx2: eth0 NIC Copper Link is Down
Aug 10 15:23:16 xxxxxx kernel: bonding: bond0: link status down for active interface eth0, disabling it <==
Aug 10 15:23:16 xxxxxx kernel: bonding: bond0: making interface eth2 the new active one. <===
==

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up RHEL4.8 and configure a bonding device with arp_monitor.
2. Wait for 24.8 days.
3. Disconnect the cable of the primary NIC.

Actual results:
Bonding device does not switch to the secondary NIC.

Expected results:
Bonding device switches to the secondary NIC.

Additional info:
[Analysis]

3334 static void bond_activebackup_arp_mon(struct net_device *bond_dev)
3335 {
        :
3443         if (slave) {
3444                 /* if we have sent traffic in the past 2*arp_intervals but
3445                  * haven't xmit and rx traffic in that time interval, select
3446                  * a different slave. slave->jiffies is only updated when
3447                  * a slave first becomes the curr_active_slave - not necessarily
3448                  * after every arp; this ensures the slave has a full 2*delta
3449                  * before being taken out. if a primary is being used, check
3450                  * if it is up and needs to take over as the curr_active_slave
3451                  */
3452                 if ((time_after_eq(jiffies, slave->dev->trans_start + 2*delta_in_ticks) ||
3453                      (time_after_eq(jiffies, slave_last_rx(bond, slave) + 2*delta_in_ticks) &&
3454                       bond_has_ip(bond))) &&
3455                     time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks)) {
3456
3457                         slave->link = BOND_LINK_DOWN;
3458
3459                         if (slave->link_failure_count < UINT_MAX) {
3460                                 slave->link_failure_count++;
3461                         }
3462
3463                         printk(KERN_INFO DRV_NAME
3464                                ": %s: link status down for active interface "
3465                                "%s, disabling it\n",
3466                                bond_dev->name,
3467                                slave->dev->name);
        :

If the conditions on lines 3452 - 3455 are satisfied, the bonding driver switches to the secondary NIC. We found a problem with the last time_after_eq() call:

  time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks)

  jiffies: counter incremented every 1 ms. This value is reset to zero 5 minutes after boot.
  slave->jiffies: value of jiffies when the current slave was set
  delta_in_ticks: 1 sec (arp_interval=1000)

  #define time_after_eq(a,b) \
          (typecheck(unsigned long, a) && \
           typecheck(unsigned long, b) && \
           ((long)(a) - (long)(b) >= 0))

If the slave is set after 5 minutes from boot, slave->jiffies == 0 and the above expression becomes:

  time_after_eq(jiffies, 2*1000)

If no slave change happens, this expression becomes false about 24.8 days after boot, and while time_after_eq() is false the problem occurs. On a 32-bit system, jiffies cast to long changes sign after about 24.8 days and the counter overflows after about 49.7 days.
  Uptime                  time_after_eq()   Result
  0 - 24.8 days           true              problem does not happen
  24.8 days - 49.6 days   false             problem happens
  49.6 days - 74.4 days   true              problem does not happen
  74.4 days - 99.2 days   false             problem happens
  :

On customer systems, this problem occurred 31 days and 33 days after system boot.

From RHEL5.4, the implementation of bond_activebackup_arp_mon was changed and it seems the problem no longer happens.
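The uptime windows above can be checked with a small user-space sketch. This is not kernel code: time_after_eq32(), jiffies_at_uptime() and failover_allowed() are hypothetical helper names, uint32_t/int32_t stand in for unsigned long/long on 32-bit x86, and the 1 ms tick and the 5-minute offset to zero are taken from the analysis as assumptions.

```c
#include <stdint.h>

#define HZ_ASSUMED    1000u  /* assumption: one jiffy is 1 ms */
#define BOOT_OFFSET_S 300u   /* assumption: jiffies reaches 0 five minutes after boot */

/* 32-bit model of time_after_eq(a, b): true when a is at or after b,
 * judged by the sign of the wrapped difference. */
static int time_after_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Hypothetical helper: the 32-bit jiffies value at a given uptime in seconds. */
static uint32_t jiffies_at_uptime(uint64_t uptime_s)
{
    return (uint32_t)((uptime_s - BOOT_OFFSET_S) * HZ_ASSUMED);
}

/* Models only the condition on line 3455 of the listing:
 * time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks). */
static int failover_allowed(uint64_t uptime_s, uint32_t slave_jiffies,
                            uint32_t delta_in_ticks)
{
    return time_after_eq32(jiffies_at_uptime(uptime_s),
                           slave_jiffies + 2u * delta_in_ticks);
}
```

Under these assumptions, failover_allowed(31 * 86400, 0, 1000) and the same call for day 33 return 0, which matches the uptimes at which the customer saw the failure, while the call for day 10 returns 1.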
The fix was included in RHEL5 as part of Bug 462632. Upstream, it is included in Linus' tree as commit b2220cad583c9b63e085476df448fa2aff5ea906.
What is the newest kernel the customer has tried? It looks like the issue where slave->jiffies is unset is already resolved under bug 489362 and included in all kernels after 2.6.9-85.EL (Thanks, Don!). It seems odd to me that this is actually a concern.
The customer reported the issue against 2.6.9-89.0.20.ELsmp.
It looks like my suspicion was incorrect. Has the customer really confirmed that this works on RHEL5 and is broken only on RHEL4?
Created attachment 459537
initial patch

This patch doesn't include all of the changes added with upstream commit:

commit cb32f2a0d194212e4e750a8cdedcc610c9ca4876
Author: Jiri Bohac <jbohac>
Date: Thu Sep 2 05:45:54 2010 +0000

    bonding: Fix jiffies overflow problems (again)

but I think it will prevent the jiffies-wrapping issues the customer is seeing. The patch looks larger than it is since most of the changes are indentation, but you can see the basic changes.
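The attached patch is not reproduced in this report, so the following is only a rough illustration of how a staleness check can be made to behave the same in both halves of the jiffies cycle. It uses a bounded window in the style of the kernel's time_in_range() macro; whether the RHEL4 patch takes this form is an assumption, and time_in_range32() and stamp_expired() are hypothetical user-space names modelling 32-bit arithmetic.

```c
#include <stdint.h>

/* 32-bit models of time_after_eq() and time_before_eq(). */
static int time_after_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

static int time_before_eq32(uint32_t a, uint32_t b)
{
    return time_after_eq32(b, a);
}

/* Model of time_in_range(a, b, c): b <= a <= c in wrapped time. */
static int time_in_range32(uint32_t a, uint32_t b, uint32_t c)
{
    return time_after_eq32(a, b) && time_before_eq32(a, c);
}

/* Hypothetical staleness test: the stamp counts as recent only while
 * now lies in [stamp - delta, stamp + 2*delta]; anything outside that
 * window is treated as expired, however old the stamp is. */
static int stamp_expired(uint32_t now, uint32_t stamp, uint32_t delta)
{
    return !time_in_range32(now, stamp - delta, stamp + 2u * delta);
}
```

With a stamp of 0 and delta of 1000, this sketch reports the stamp as expired both for a jiffies value from day 10 and for one from day 31. A stale stamp would still look recent for a window of 3*delta ticks once per 32-bit wrap, so this narrows the problem rather than removing every corner case.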
Committed in 99.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html