Description of problem:

Some customers are experiencing unexpected link failovers within a bond. We are using the ARP monitoring option to validate the link state of each interface within the bond (eth0 & eth1). The unexpected failovers seem to occur when the system is under high CPU load and/or high disk I/O load.

Version-Release number of selected component (if applicable):

The kernel used is 2.6.18-134 (official hot-fix).

How reproducible:

Run a kernel with high CPU and disk load. You will see something like:

Sep 10 17:08:48 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: Setting eth0 as primary slave.
Sep 10 17:08:48 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: making interface eth1 the new active one.
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: link status down for active interface eth1, disabling it
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: now running without any active interface !
Sep 10 17:08:49 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:50 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:50 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:51 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:51 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
Sep 10 17:08:52 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth1 is now up
Sep 10 17:08:52 LnX_CAPACITY_ccm_1_1_4 kernel: bonding: bond0: backup interface eth0 is now up
...

Steps to Reproduce:
1.
2.
3.

Actual results:
Link failovers.

Expected results:
No link failovers.

Additional info:
Also, if we renice [bond0] to -20 the bug is still reproducible, but much harder to trigger.
Just checking that I understand everything. The system runs another real-time application, and under high load, bonding fails over constantly. Looking at the logs and the tcpdump output, we were able to see the ARP probes happening in sequence after a period of silence. That indicates the [bond0] thread didn't run for some time. It seems that renicing [bond0] to -20 works around this problem. There was another event, but the investigation traced that fault to a router plus third-party application combination. So, it seems to me that renicing the [bond0] thread still works around the failover problem. Is that correct?

IMHO, the [bond0] thread should run with higher priority, but I don't know whether it fits the real-time class.
The bond0 workqueue is a single-threaded workqueue. Currently there isn't support for creating realtime, single-threaded workqueues, but I don't see a reason why that would not work (though I've not tested it). It's a bit strange, since you would expect any workqueue that needs to be serviced in real time to be available on any CPU, but it could still be done. A patch like this (against the latest upstream) would probably be all that is needed:

$ git diff
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 69c5b15..65f308c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5015,7 +5015,7 @@ static int bond_init(struct net_device *bond_dev)

 	pr_debug("Begin bond_init for %s\n", bond_dev->name);

-	bond->wq = create_singlethread_workqueue(bond_dev->name);
+	bond->wq = create_singlethread_rt_workqueue(bond_dev->name);
 	if (!bond->wq)
 		return -ENOMEM;

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7ef0c7b..56d6f52 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -194,6 +194,7 @@ __create_workqueue_key(const char *name, int singlethread,
 #define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
 #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
 #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_singlethread_rt_workqueue(name) __create_workqueue((name), 1, 0, 1)

 extern void destroy_workqueue(struct workqueue_struct *wq);

Another interesting point to consider is how cpu0 is being used for this workload. Is something else bound specifically to the first CPU on the system? Maybe interrupts for the network device, or some other process?

When creating a single-threaded workqueue, it should be bound to cpu0, so if other processes/interrupts/tasks are specifically bound to cpu0, that could be preventing the bonding workqueue from processing its data.
Can they check their third-party app and see whether it is binding itself specifically to cpu0, or whether there is overlap between that third-party process and the bond0 pid?
Just for the record, the patch in comment #6 produces a bond workqueue thread with much different priority:

# ps alx | egrep 'bond0|UID'
F   UID   PID  PPID  PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1     0  2487     2 -100   -      0     0 -      S<   ?          0:00 [bond0]

than without:

# ps alx | egrep 'bond0|UID'
F   UID   PID  PPID  PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1     0  4183     2   15  -5      0     0 -      S<   ?          0:00 [bond0]

This could be a reasonable change, but I'd prefer to understand how the customer's application is running first.
I haven't looked at scheduling for quite some time, so please correct me if I'm wrong, but I recall that RT tasks need to explicitly release the CPU, as they don't have a time slice like non-RT tasks do. Since that thread is used for all bonding modes and it deals with locking, I'm worried about what would happen if that RT thread preempted a non-RT task (ethtool/ifenslave...) that was holding one of those locks on a UP system. My guess is that it would deadlock.

Flavio
(In reply to comment #6)
> Another interesting point to consider would be how cpu0 is being used for this
> workload. Is something else bound specifically to the first CPU on the system?
> Maybe interrupts for the network device or some other process?
>
> When creating a singlethreaded workqueue it should be bound to cpu0, so if
> other processes/interrupts/tasks are specifically bound to cpu0 then that could
> be preventing the bonding workqueue from processing their data.
>
> Can they check their third party app and see if it is binding itself
> specifically to cpu0 or if there is overlap between that third-party process
> and the bond0 pid?

Any feedback on this?
Thanks for the update, Pierre. That's quite an interesting test case. Though they are not handling interrupts on core1, I would be interested to know which core was handling the SIGSEGV and what the processor load was like during this time. Simply assigning the interrupts to a different core makes sure that servicing the interrupts won't preempt us on that core, but the kernel is not yet at the point where we can make sure to schedule a process on the same core that signaled the interrupt.

It might be interesting to figure out whether core0 or core1 is consistently servicing the SIGSEGV, and whether that is having an effect on the ability to service the bond0 workqueue. mpstat could be used to collect this data, since it can report which core is being used heavily over a time period.

Just to give you an idea of what I'm talking about, here is a quick test that I ran on my system. 'busy' is a program that consumes CPU. If I renice it to -20, I can quickly make my bond fail:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 10.0.3.1

Slave Interface: eth1
MII Status: down
Link Failure Count: 5
Permanent HW addr: 00:10:18:36:0a:d4

Slave Interface: eth2
MII Status: down
Link Failure Count: 6
Permanent HW addr: 00:10:18:36:0a:d6

Here are the processes in question:

# ps alx | grep bond0
1     0  4183     2  30  10      0     0 -      SN   ?          0:00 [bond0]
# ps alx | grep busy
0     0  7959  3654   0 -20  63816  1096 -      S<   ttyS0      0:00 ./busy
1     0  7963  7959   0 -20  63816   436 -      R<   ttyS0      5:17 ./busy
1     0  7964  7959   0 -20  63816   436 -      R<   ttyS0      5:17 ./busy

And here is how mpstat (modified output) reports my system is loaded:

# mpstat -P ALL 10 1
Linux 2.6.30 (xw4400)   10/06/2009

12:24:34 PM  CPU   %user   %nice    %irq   %soft  %steal   %idle        intr/s
12:24:44 PM  all   99.97    0.03    0.00    0.00    0.00    0.00       2009.50
12:24:44 PM    0   99.94    0.00    0.00    0.00    0.00    0.00          0.00
12:24:44 PM    1  100.00    0.00    0.00    0.00    0.00    0.00  237159982.66

If I move the 'busy' processes to cpu1, let's see what happens:

# taskset -p 02 7959
pid 7959's current affinity mask: 3
pid 7959's new affinity mask: 2
# taskset -p 02 7963
pid 7963's current affinity mask: 3
pid 7963's new affinity mask: 2
# taskset -p 02 7964
pid 7964's current affinity mask: 3
pid 7964's new affinity mask: 2
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 10.0.3.1

Slave Interface: eth1
MII Status: up
Link Failure Count: 10
Permanent HW addr: 00:10:18:36:0a:d4

Slave Interface: eth2
MII Status: up
Link Failure Count: 11
Permanent HW addr: 00:10:18:36:0a:d6

The bond comes right back up (look how the failure count increased).

# mpstat -P ALL 10 1
Linux 2.6.30 (xw4400)   10/06/2009

12:27:58 PM  CPU   %user   %nice    %irq   %soft  %steal   %idle        intr/s
12:28:08 PM  all   49.98    0.00    0.00    0.00    0.00   49.98       2028.67
12:28:08 PM    0    0.00    0.00    0.00    0.00    0.00   99.90          0.00
12:28:08 PM    1  100.00    0.00    0.00    0.00    0.00    0.00          0.00

Now you can see that the system is much less loaded and the bond is stable.
What might be interesting is to see whether the application that causes the SIGSEGV and uses the resources on the system can also be set to run only on cpu1, either for its lifetime or right before it handles the crash and dumps the output to a file. I've worked with large programs that register their own SIGSEGV handler in the past, and a call to the userspace program 'taskset' or the system call 'sched_setaffinity' might be a good thing to use in their program.
The original reporter complained that the bonding workqueue (the '[bond0]' task) had too low a priority and wasn't given the chance to run when competing against higher-priority threads on the system. A call to 'renice' to adjust the priority of the thread helped, but the real benefit came from using 'chrt' to raise the priority above that of the threads from the custom application, and to change the scheduling policy to SCHED_FIFO.

A priority and scheduling-policy change is something that should be done on an as-needed basis, when a customer has a heavily loaded system that cannot give the bonding workqueue thread the CPU time it needs. The patch in comment #6 is probably what I would propose upstream if it were needed, but I don't see the need to change this upstream.

Closing as NOTABUG.