Description of problem:

Some customers are experiencing unexpected link failovers within a bond. We are using the ARP monitoring option to validate the link state of each interface within the bond (eth0 & eth1). The unexpected failovers seem to occur when the system is under high CPU load and/or high disk I/O load.

Version-Release number of selected component (if applicable):

The kernel used is 2.6.18-134 (official hot-fix).

How reproducible:

Run a kernel with high CPU and disk load. You will see something like:

Sep 10 17:08:48 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: Setting eth0 as primary slave.
Sep 10 17:08:48 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: making interface eth1 the new active one.
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: link status down for active interface eth1, disabling it
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: now running without any active interface !
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:50 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:50 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:51 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:51 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:52 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:52 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
...

Steps to Reproduce:
1.
2.
3.

Actual results:
Link failovers.

Expected results:
No link failovers.

Additional info:
Also, if we renice [bond0] to -20 the bug is still reproducible, but much harder to trigger.
Just checking that I understand everything. The system runs another real-time application, and under high load, bonding fails over constantly. Looking at the logs and the tcpdump output, we were able to see the ARP probes happening in sequence after a period of silence. That indicates the [bond0] thread didn't run for some time. It seems that renicing [bond0] to -20 works around this problem. There was another event, but the investigation traced that fault to a router plus third-party application combination. So, it seems to me that renicing the [bond0] thread still works around the failover problem. Is that correct?

IMHO, the [bond0] thread should run with higher priority, but I don't know whether it fits the real-time class.
The bond0 workqueue is a single-threaded workqueue. Currently there isn't support for creating realtime, single-threaded workqueues, but I don't see a reason why that would not work (though I've not tested it). It's a bit strange, since you would expect any workqueue that needs to be serviced in real time to be available on any CPU, but it could still be done. A patch like this (against the latest upstream) would probably be all that is needed:

$ git diff
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 69c5b15..65f308c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5015,7 +5015,7 @@ static int bond_init(struct net_device *bond_dev)

 	pr_debug("Begin bond_init for %s\n", bond_dev->name);

-	bond->wq = create_singlethread_workqueue(bond_dev->name);
+	bond->wq = create_singlethread_rt_workqueue(bond_dev->name);
 	if (!bond->wq)
 		return -ENOMEM;

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7ef0c7b..56d6f52 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -194,6 +194,7 @@ __create_workqueue_key(const char *name, int singlethread,
 #define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
 #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
 #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_singlethread_rt_workqueue(name) __create_workqueue((name), 1, 0, 1)

 extern void destroy_workqueue(struct workqueue_struct *wq);

Another interesting point to consider is how cpu0 is being used for this workload. Is something else bound specifically to the first CPU on the system? Maybe interrupts for the network device, or some other process?

When creating a single-threaded workqueue, it should be bound to cpu0, so if other processes/interrupts/tasks are specifically bound to cpu0, that could be preventing the bonding workqueue from processing its data.
Can they check their third-party app and see whether it is binding itself specifically to cpu0, or whether there is overlap between that third-party process and the bond0 pid?
Just for the record, the patch in comment #6 produces a bond workqueue thread with much different priority:

# ps alx | egrep 'bond0|UID'
F   UID   PID  PPID  PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1     0  2487     2 -100   -      0     0 -      S<   ?          0:00 [bond0]

than without:

# ps alx | egrep 'bond0|UID'
F   UID   PID  PPID  PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1     0  4183     2   15  -5      0     0 -      S<   ?          0:00 [bond0]

This could be a reasonable change, but I'd prefer to understand how the customer's application is running first.
I haven't looked at scheduling for quite some time, so please correct me if I'm wrong, but I recall that RT tasks need to explicitly release the CPU, as they don't have a time slice like non-RT tasks do. Since that thread is used for all bonding modes and it deals with locking, I'm worried about what would happen if that RT thread preempted a non-RT task (ethtool/ifenslave...) that was holding one of those locks on a UP system. My guess is that it would deadlock.

Flavio
(In reply to comment #6)
> Another interesting point to consider would be how cpu0 is being used for this
> workload. Is something else bound specifically to the first CPU on the system?
> Maybe interrupts for the network device or some other process?
>
> When creating a singlethreaded workqueue it should be bound to cpu0, so if
> other processes/interrupts/tasks are specifically bound to cpu0 then that could
> be preventing the bonding workqueue from processing their data.
>
> Can they check their third party app and see if it is binding itself
> specifically to cpu0 or if there is overlap between that third-party process
> and the bond0 pid?

Any feedback on this?
Thanks for the update, Pierre. That's quite an interesting test case. Though they are not handling interrupts on core1, I would be interested to know which core was handling the SIGSEGV and what the processor load was like during this time. Simply assigning the interrupts to a different core makes sure that servicing the interrupts won't preempt us on that core, but the kernel is not yet at the point where we can make sure to schedule a process on the same core that signaled the interrupt.

It might be interesting to figure out whether core0 or core1 is consistently servicing the SIGSEGV, and whether that is having an effect on the ability to service the bond0 workqueue. mpstat could be used to collect this data, since it can report which core is being used heavily over a time period.

Just to give you an idea of what I'm talking about, here is a quick test that I ran on my system. 'busy' is a program that consumes CPU. If I renice it to -20, I can quickly make my bond fail:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 10.0.3.1

Slave Interface: eth1
MII Status: down
Link Failure Count: 5
Permanent HW addr: 00:10:18:36:0a:d4

Slave Interface: eth2
MII Status: down
Link Failure Count: 6
Permanent HW addr: 00:10:18:36:0a:d6

Here are the processes in question:

# ps alx | grep bond0
1     0  4183     2  30  10      0     0 -      SN   ?          0:00 [bond0]
# ps alx | grep busy
0     0  7959  3654   0 -20  63816  1096 -      S<   ttyS0      0:00 ./busy
1     0  7963  7959   0 -20  63816   436 -      R<   ttyS0      5:17 ./busy
1     0  7964  7959   0 -20  63816   436 -      R<   ttyS0      5:17 ./busy

And here is how mpstat (modified output) reports my system is loaded:

# mpstat -P ALL 10 1
Linux 2.6.30 (xw4400)   10/06/2009

12:24:34 PM  CPU   %user   %nice    %irq   %soft  %steal   %idle        intr/s
12:24:44 PM  all   99.97    0.03    0.00    0.00    0.00    0.00       2009.50
12:24:44 PM    0   99.94    0.00    0.00    0.00    0.00    0.00          0.00
12:24:44 PM    1  100.00    0.00    0.00    0.00    0.00    0.00  237159982.66

If I move the 'busy' processes to cpu1, let's see what happens:

# taskset -p 02 7959
pid 7959's current affinity mask: 3
pid 7959's new affinity mask: 2
# taskset -p 02 7963
pid 7963's current affinity mask: 3
pid 7963's new affinity mask: 2
# taskset -p 02 7964
pid 7964's current affinity mask: 3
pid 7964's new affinity mask: 2
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 10.0.3.1

Slave Interface: eth1
MII Status: up
Link Failure Count: 10
Permanent HW addr: 00:10:18:36:0a:d4

Slave Interface: eth2
MII Status: up
Link Failure Count: 11
Permanent HW addr: 00:10:18:36:0a:d6

The bond comes right back up (look how the failure count increased).

# mpstat -P ALL 10 1
Linux 2.6.30 (xw4400)   10/06/2009

12:27:58 PM  CPU   %user   %nice    %irq   %soft  %steal   %idle        intr/s
12:28:08 PM  all   49.98    0.00    0.00    0.00    0.00   49.98       2028.67
12:28:08 PM    0    0.00    0.00    0.00    0.00    0.00   99.90          0.00
12:28:08 PM    1  100.00    0.00    0.00    0.00    0.00    0.00          0.00

Now you can see that the system is much less loaded and the bond is stable.
What might be interesting is to see whether the application that causes the SIGSEGV and uses the resources on the system can also be set to run only on cpu1, either for its lifetime or right before it handles the crash and dumps the output to a file. I've worked with large programs that register their own SIGSEGV handler in the past, and a call to the userspace program 'taskset' or the system call 'sched_setaffinity' might be a good thing to use in their program.
The original reporter complained that the bonding workqueue (the '[bond0]' task) had too low a priority and wasn't given the chance to run when competing against higher-priority threads on the system. A call to 'renice' to adjust the priority of the thread helped, but the real benefit came from using 'chrt' to raise the priority above that of the threads from the custom application, and to change the scheduling policy to SCHED_FIFO.

A priority and scheduling-policy change is something that should be done on an as-needed basis, when a customer has a heavily loaded system that cannot give the bonding workqueue thread the CPU time it needs. The patch in comment #6 is probably what I would propose upstream if it were needed, but I don't see the need to change this upstream.

Closing as NOTABUG.