Description of problem:

Failover to a new active slave device in a bond using ARP monitoring and active-backup as the bonding policy frequently shows excessively long latencies, during which network connectivity is lost. The configuration in which we are seeing this problem consists of a bond containing four Ethernet slaves connected pair-wise to two switches, which are in turn connected to a backbone. When the backbone uplink cable is pulled from the switch connected to the currently active adapter, the bonding driver has difficulty resolving which standby slave should become the new active slave.

Failover times as long as 30 seconds have been observed when no primary slave device is specified. When a primary slave is specified, the failover times are typically between 5 and 15 seconds. The failover latencies do not seem to be affected by the value chosen for arp_interval. We have used values ranging from 100 to 1000 ms, and failover times are not reliably reduced by smaller intervals.

Version-Release number of selected component (if applicable):

We have tested this only with RHEL4 U4.

How reproducible:

Reproducibility varies, but failover latencies in excess of 15 seconds occur roughly once in every three uplink cable pulls.

Steps to Reproduce:
1. Configure a system as described above.
2. Generate some network traffic.
3. Pull the uplink cable from the switch connected to the active slave.

Actual results:

The attached fragment from /var/log/messages illustrates an instance in which selecting a new active slave took ~13 seconds (arp_interval=1000).

Expected results:

Failover times of some small multiple of arp_interval (i.e., not much greater than 3-5 times) would be expected.

Additional info:

No excessive failover times have been observed when the direct link to the switch is broken, using either MII or ARP monitoring. However, ARP monitoring is the only sensible choice for the configuration described above, since carrier is always present on the direct link from the adapter to the switch.
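For anyone reproducing this, the outage can be quantified by probing a host beyond the uplink at a fixed rate while the cable is pulled, then comparing the longest gap against the expected bound. The following is only a rough Python sketch under assumptions: probe results are assumed to be already collected as (timestamp in seconds, success) pairs, the function names are hypothetical, and the default factor of 5 is taken from the "3-5" expectation stated above, not from the bonding driver.

```python
def longest_outage(samples):
    """Return the longest outage in seconds.

    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it. A run of failures
    that never recovers (or that precedes any success) is not counted.

    samples: iterable of (timestamp_seconds, ok) pairs in time order.
    """
    longest = 0.0
    last_ok = None      # timestamp of the most recent successful probe
    in_outage = False   # True while we are inside a run of failed probes
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


def exceeds_expected(outage_s, arp_interval_ms, multiple=5):
    """True if the outage is longer than `multiple` ARP intervals.

    `multiple` is an assumed threshold, not a driver constant.
    """
    return outage_s * 1000.0 > arp_interval_ms * multiple
```

With arp_interval=1000, an outage such as the ~13 seconds reported above would be flagged by `exceeds_expected`, while a 4 second outage would not.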
Created attachment 142576
Fragment of /var/log/messages
I would suggest trying one of my latest RHEL4 test kernels:

http://people.redhat.com/agospoda/#rhel4

I recently backported an upstream fix that improves the behavior of the ARP monitoring function on active-backup bonds by validating all ARP frames when the 'arp_validate' option is used. Several people have reported that this is working well for them, so I would guess it will resolve your issue. Please test one of these kernels and report your results here.

Here is a description of this change and its usage:

+arp_validate
+
+        Specifies whether or not ARP probes and replies should be
+        validated in the active-backup mode.  This causes the ARP
+        monitor to examine the incoming ARP requests and replies, and
+        only consider a slave to be up if it is receiving the
+        appropriate ARP traffic.
+
+        Possible values are:
+
+        none or 0
+
+                No validation is performed.  This is the default.
+
+        active or 1
+
+                Validation is performed only for the active slave.
+
+        backup or 2
+
+                Validation is performed only for backup slaves.
+
+        all or 3
+
+                Validation is performed for all slaves.
+
+        For the active slave, the validation checks ARP replies to
+        confirm that they were generated by an arp_ip_target.  Since
+        backup slaves do not typically receive these replies, the
+        validation performed for backup slaves is on the ARP request
+        sent out via the active slave.  It is possible that some
+        switch or network configurations may result in situations
+        wherein the backup slaves do not receive the ARP requests; in
+        such a situation, validation of backup slaves must be
+        disabled.
+
+        This option is useful in network configurations in which
+        multiple bonding hosts are concurrently issuing ARPs to one or
+        more targets beyond a common switch.  Should the link between
+        the switch and target fail (but not the switch itself), the
+        probe traffic generated by the multiple bonding instances will
+        fool the standard ARP monitor into considering the links as
+        still up.  Use of the arp_validate option can resolve this, as
+        the ARP monitor will only consider ARP requests and replies
+        associated with its own instance of bonding.
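To make the four arp_validate settings concrete, here is a small Python sketch of which slaves are subject to validation under each value. It is an illustration only: the function name is hypothetical, and treating the numbers as a bitmask (active=1, backup=2, all=3) is an assumption that happens to match the values listed above, not a statement about how the driver is written.

```python
# Values as listed in the option description above.
ARP_VALIDATE = {"none": 0, "active": 1, "backup": 2, "all": 3}


def is_validated(setting, slave_is_active):
    """Return True if ARP validation applies to a slave under `setting`.

    setting: option name ("none", "active", "backup", "all") or its
             numeric value 0-3.
    slave_is_active: True for the active slave, False for a backup slave.
    """
    value = ARP_VALIDATE[setting] if isinstance(setting, str) else int(setting)
    if value not in (0, 1, 2, 3):
        raise ValueError("arp_validate must be in the range 0-3")
    # Assumed bit layout: bit 0 covers the active slave, bit 1 the backups.
    bit = 1 if slave_is_active else 2
    return bool(value & bit)
```

For the topology in this report, where an uplink beyond the switch fails while carrier stays up, `is_validated("all", ...)` is true for both the active and the backup slaves, which is the case the description recommends unless backup slaves cannot see the ARP requests.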
Looks like a duplicate of BZ 223100.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
*** This bug has been marked as a duplicate of 223100 ***