218045 – Excessive failover times for active-backup bonds w/ ARP monitoring


Keywords:
Status:	CLOSED DUPLICATE of bug 223100
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Brian Brock

Reported:	2006-12-01 15:29 UTC by Mark DeWandel
Modified:	2014-06-29 22:58 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Last Closed:	2008-01-29 19:15:23 UTC

Attachments
Fragment of /var/log/messages (3.84 KB, text/plain) 2006-12-01 15:29 UTC, Mark DeWandel

Description Mark DeWandel 2006-12-01 15:29:12 UTC

Description of problem:

Failover to a new active slave in a bond that uses ARP monitoring with the
active-backup bonding policy frequently shows excessively long latencies,
during which network connectivity is lost.  The configuration in which we see
this problem is a bond of four Ethernet slaves connected pair-wise to two
switches, which are in turn connected to a backbone.  When the backbone uplink
cable is pulled from the switch connected to the currently active adapter, the
bonding driver has difficulty resolving which standby slave should become the
new active slave.  Failover times as long as 30 seconds have been observed when
no primary slave is specified.  When a primary slave is specified, failover
times are typically between 5 and 15 seconds.  The failover latency does not
seem to depend on the value chosen for arp_interval: we have used values
ranging from 100 to 1000 ms, and smaller intervals do not reliably reduce the
failover time.

Version-Release number of selected component (if applicable):

We have tested this only with RHEL4 U4.

How reproducible:

Reproducibility varies but it's fair to say that failover latencies in excess of
15 seconds occur roughly once in every three uplink cable pulls.

Steps to Reproduce:
1. Configure a system as described above.
2. Generate some network traffic.
3. Pull the uplink cable from the switch connected to the active slave.
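
One way to put a number on the connectivity loss in step 3 is to record the
time of every successful probe (for example, each ping reply) while the cable
is pulled and then take the longest gap between consecutive successes.  The
following is only a minimal Python sketch of that calculation; the function
name and the sample timestamps are illustrative assumptions and are not taken
from the attached log.

```python
from typing import Sequence


def longest_outage(success_times: Sequence[float]) -> float:
    """Return the longest gap, in seconds, between consecutive successful probes."""
    ordered = sorted(success_times)
    # Pair each timestamp with its successor and keep the largest difference.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=0.0)


# Made-up sample: one reply per second, then a hole while the bond fails over.
samples = [0.0, 1.0, 2.0, 15.0, 16.0]
print(longest_outage(samples))
```

With fewer than two successful probes there is no gap to measure, so the
sketch returns 0.0 in that case.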
  
Actual results:

The attached fragment from /var/log/messages illustrates an instance in which
selecting a new active slave took ~13 seconds (arp_interval=1000).

Expected results:

Failover times of a small multiple of arp_interval (i.e., not much greater
than 3-5 intervals) would be expected.
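
To make that expectation concrete, a failover time can be expressed as a
multiple of arp_interval and compared against a threshold.  This is a rough
Python sketch only: the function names are hypothetical, and the default
threshold of 5 intervals is an assumption taken from the upper end of the
"3-5" range above.

```python
def failover_in_intervals(failover_seconds: float, arp_interval_ms: int) -> float:
    """Express a failover time as a multiple of arp_interval (milliseconds)."""
    return failover_seconds * 1000.0 / arp_interval_ms


def within_expected(failover_seconds: float, arp_interval_ms: int,
                    max_intervals: float = 5.0) -> bool:
    """True if failover took at most max_intervals ARP intervals.

    The default of 5 is an assumed threshold, not a documented limit.
    """
    return failover_in_intervals(failover_seconds, arp_interval_ms) <= max_intervals


# The ~13 second failover reported under "Actual results", arp_interval=1000.
print(failover_in_intervals(13.0, 1000))
print(within_expected(13.0, 1000))
```

By this measure the reported case took about 13 intervals, well outside the
expected range.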

Additional info:

No excessive failover times have been observed when the direct link between
the adapter and the switch is broken, with either MII or ARP monitoring.
However, ARP monitoring is the only sensible choice for the configuration
described above, since carrier is always present on the direct link from the
adapter to the switch.

Comment 1 Mark DeWandel 2006-12-01 15:29:12 UTC

Created attachment 142576 [details]
Fragment of /var/log/messages

Comment 2 Andy Gospodarek 2007-02-12 14:19:09 UTC

I would suggest trying one of my latest RHEL4 test kernels:

http://people.redhat.com/agospoda/#rhel4

I recently backported an upstream fix that improves the behavior of the ARP
monitoring function on active-backup bonds by validating ARP frames when the
'arp_validate' option is set.  Several users have reported that this is
working well for them, so I would guess it will resolve your issue.  Please test
one of these kernels and report back your results here.

Here is a description of this change and its usage:

+arp_validate
+
+	Specifies whether or not ARP probes and replies should be
+	validated in the active-backup mode.  This causes the ARP
+	monitor to examine the incoming ARP requests and replies, and
+	only consider a slave to be up if it is receiving the
+	appropriate ARP traffic.
+
+	Possible values are:
+
+	none or 0
+
+		No validation is performed.  This is the default.
+
+	active or 1
+
+		Validation is performed only for the active slave.
+
+	backup or 2
+
+		Validation is performed only for backup slaves.
+
+	all or 3
+
+		Validation is performed for all slaves.
+
+	For the active slave, the validation checks ARP replies to
+	confirm that they were generated by an arp_ip_target.  Since
+	backup slaves do not typically receive these replies, the
+	validation performed for backup slaves is on the ARP request
+	sent out via the active slave.  It is possible that some
+	switch or network configurations may result in situations
+	wherein the backup slaves do not receive the ARP requests; in
+	such a situation, validation of backup slaves must be
+	disabled.
+
+	This option is useful in network configurations in which
+	multiple bonding hosts are concurrently issuing ARPs to one or
+	more targets beyond a common switch.  Should the link between
+	the switch and target fail (but not the switch itself), the
+	probe traffic generated by the multiple bonding instances will
+	fool the standard ARP monitor into considering the links as
+	still up.  Use of the arp_validate option can resolve this, as
+	the ARP monitor will only consider ARP requests and replies
+	associated with its own instance of bonding.
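
As an illustration of how the option might be combined with the other ARP
monitoring parameters, the Python sketch below assembles a bonding "options"
line of the kind placed in /etc/modprobe.conf.  The helper name, the bond name
bond0, and the target address 192.0.2.1 are assumptions made for the example;
the numeric arp_validate values follow the description above, and the option
is only accepted by a kernel that contains the fix.

```python
# Name-to-number mapping for arp_validate, as listed in the description above.
ARP_VALIDATE = {"none": 0, "active": 1, "backup": 2, "all": 3}


def bonding_options(bond, arp_ip_targets, arp_interval_ms=1000,
                    arp_validate="all", primary=None):
    """Build an 'options' line for an active-backup bond with ARP monitoring."""
    if arp_validate not in ARP_VALIDATE:
        raise ValueError("unknown arp_validate value: %r" % (arp_validate,))
    parts = [
        "options %s" % bond,
        "mode=active-backup",
        "arp_interval=%d" % arp_interval_ms,
        # Multiple targets are given as a comma-separated list.
        "arp_ip_target=" + ",".join(arp_ip_targets),
        "arp_validate=%d" % ARP_VALIDATE[arp_validate],
    ]
    if primary:
        parts.append("primary=%s" % primary)
    return " ".join(parts)


# Hypothetical bond name and target address, for illustration only.
print(bonding_options("bond0", ["192.0.2.1"]))
```

If the backup slaves never see the ARP requests in a given network, passing
arp_validate="active" instead of "all" matches the advice in the description
to disable validation of backup slaves.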

Comment 3 Andy Gospodarek 2007-02-12 14:21:29 UTC

Looks like a duplicate of BZ 223100.

Comment 4 RHEL Program Management 2007-05-09 08:41:36 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 RHEL Program Management 2007-09-07 19:38:40 UTC

This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 6 Andrius Benokraitis 2008-01-29 19:15:23 UTC


*** This bug has been marked as a duplicate of 223100 ***
