Bug 167630

Summary: Multicast domain membership doesn't follow bonding failover
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Bastien Nocera <bnocera>
Assignee: John W. Linville <linville>
QA Contact: Brian Brock <bbrock>
CC: andy.williams, jbaron, jon.stanley, tao, vanhoof
Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Last Closed: 2006-03-07 19:47:43 UTC
Bug Blocks: 168429
Attachments:
  Description                                   Flags
  jwltest-bond_activebackup-igmp-hack.patch     none
  jwltest-bond_activebackup-igmp-hack.patc      none
  jwltest-bond_activebackup-igmp-hack.patch     none

Description Bastien Nocera 2005-09-06 14:08:41 UTC
Using kernel-2.6.9-11.EL, the machine runs an application that is a member of a
multicast domain, with the traffic carried over one interface of a failover (HA)
bond.

If that interface fails, the network traffic correctly fails over, but the
multicast domain will not be rejoined until the router (which might be a
different one from the one in use before the failover) sends an IGMP membership
query to check whether domain members are still present, which can take up to
60 seconds.

The bonding driver should send a new membership join (IGMP report) when a
failover occurs.
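
For context, a minimal sketch of how an application typically acquires such a
membership: joining a group on the bond device with setsockopt(). The bond0
device and the 224.0.1.1 NTP group below match the setup described later in
this report; everything else is illustrative. The kernel records the membership
against bond0 and transmits the IGMP report on whichever slave is currently
active, so nothing re-announces the membership after a failover.

/* join_group.c - illustrative sketch only */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ip_mreqn mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.imr_multiaddr.s_addr = inet_addr("224.0.1.1"); /* multicast NTP group */
    mreq.imr_ifindex = if_nametoindex("bond0");         /* join via the bond device */

    /* The kernel adds 224.0.1.1 to bond0's multicast list and sends the IGMP
     * membership report out the currently active slave. */
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP");
        return 1;
    }

    pause();   /* the membership lasts as long as the socket is open */
    close(fd);
    return 0;
}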

Comment 2 John W. Linville 2005-09-07 14:49:40 UTC
Is the problem only that the bond does not send an IGMP join as part of the 
failover?  Or that multicast traffic isn't received on the new master even if 
the switch at the other end sends it correctly?  It looks like there is an 
effort made to propagate the multicast address list from the old master to the 
new one.  Hopefully that is working? 
 
Generating the IGMP join(s) in response to the failover may require additional 
infrastructure...still investigating... 

Comment 4 Bastien Nocera 2005-09-07 16:27:32 UTC
The problem is that, in that particular case, the router is also a different one.

Issuing a new membership join (in user space) will do nothing except increment
the refcount. The application would need to leave the multicast domain and
rejoin it for a new membership report to be generated.
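
A hedged sketch of that user-space workaround (the function name, group
address, and device are placeholders, and this is not the eventual fix): drop
the membership and immediately re-add it so that the kernel generates a fresh
membership report.

/* rejoin.c - illustrative workaround sketch, not part of the actual fix */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Leave and immediately rejoin a group on an existing socket so the kernel
 * emits a new IGMP membership report; re-adding without the drop would only
 * bump the refcount, as noted above. */
int rejoin_group(int fd, const char *group, const char *ifname)
{
    struct ip_mreqn mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_ifindex = if_nametoindex(ifname);

    if (setsockopt(fd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
        perror("IP_DROP_MEMBERSHIP");

    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
}

The application would have to call something like rejoin_group(fd, "224.0.1.1",
"bond0") whenever it learned that the bond had failed over, which is exactly
the coupling a driver-side fix avoids.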

Comment 6 John W. Linville 2005-09-09 12:58:56 UTC
Created attachment 118633 [details]
jwltest-bond_activebackup-igmp-hack.patch

A patch to cause IGMP frames transmitted on a bond in active-backup mode to be
flooded to all of the slaves.
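
To illustrate the classification step such a patch depends on, here is a
self-contained user-space sketch (not the attached kernel code) that decides
whether an outgoing Ethernet frame carries IGMP; only frames of this kind would
be replicated onto the inactive slaves.

/* igmp_check.c - sketch of the "is this frame IGMP?" test; the attached
 * patch performs the equivalent check inside the bonding driver. */
#include <arpa/inet.h>       /* htons, ntohs */
#include <linux/if_ether.h>  /* struct ethhdr, ETH_P_IP */
#include <linux/ip.h>        /* struct iphdr */
#include <netinet/in.h>      /* IPPROTO_IGMP */
#include <stdio.h>
#include <string.h>

/* Assumes a plain Ethernet II frame: 14-byte header followed by the IP header. */
static int frame_is_igmp(const unsigned char *frame, size_t len)
{
    if (len < sizeof(struct ethhdr) + sizeof(struct iphdr))
        return 0;

    const struct ethhdr *eth = (const struct ethhdr *)frame;
    if (ntohs(eth->h_proto) != ETH_P_IP)
        return 0;

    const struct iphdr *ip = (const struct iphdr *)(frame + sizeof(struct ethhdr));
    return ip->protocol == IPPROTO_IGMP;
}

int main(void)
{
    /* Build a minimal fake frame: Ethernet header plus an IP header carrying IGMP. */
    unsigned char frame[sizeof(struct ethhdr) + sizeof(struct iphdr)];
    struct ethhdr eth = { .h_proto = htons(ETH_P_IP) };
    struct iphdr  ip  = { .protocol = IPPROTO_IGMP };

    memset(frame, 0, sizeof(frame));
    memcpy(frame, &eth, sizeof(eth));
    memcpy(frame + sizeof(eth), &ip, sizeof(ip));

    printf("frame is IGMP: %d\n", frame_is_igmp(frame, sizeof(frame)));
    return 0;
}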

Comment 7 John W. Linville 2005-09-09 13:01:05 UTC
Test kernels w/ the above patch are available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give them a try and post the results here...thanks! 

Comment 8 Bastien Nocera 2005-09-09 15:45:32 UTC
Crashes on boot:

Remounting root filesystem in read-write mode:             [  OK  ]        
 [<f88d6329>] bond_activebackup_xmit_clone+0x3c/0x70 [bonding]                  
 [<f88d6418>] bond_xmit_activebackup+0xbb/0x111 [bonding]                       
 [<c027dbdf>] dev_queue_xmit+0x167/0x207                                        
 [<c0283075>] neigh_resolve_output+0x113/0x152                                  
 [<c0296e96>] ip_finish_output2+0x12e/0x16d                                     
 [<c0286a81>] nf_hook_slow+0x83/0xb4                                            
 [<c0296d5f>] ip_finish_output+0x1a5/0x1ae                                      
 [<c0296d68>] ip_finish_output2+0x0/0x16d                                       
 [<c02b88b2>] igmp_send_report+0x22c/0x271                                      
 [<c02b8a2a>] igmp_timer_expire+0x96/0xa2                                       
 [<c02b8994>] igmp_timer_expire+0x0/0xa2                                        
 [<c0129e15>] run_timer_softirq+0x123/0x145                                     
 [<c012641c>] __do_softirq+0x4c/0xb1                               
 [<c010812f>] do_softirq+0x4f/0x56                                 
 =======================                                                       
 [<c01173c0>] smp_apic_timer_interrupt+0x9a/0x9c                                
 [<c02d1b0e>] apic_timer_interrupt+0x1a/0x20                                    
 [<c01040e5>] mwait_idle+0x33/0x42                                 
 [<c010409d>] cpu_idle+0x26/0x3b                                   
 [<c0390786>] start_kernel+0x199/0x19d                                         
Code: ff 21 e2 3b 42 18 73 06 8b 50 fd 31 c0 c3 31 d2 b8 f2 ff ff ff c3 90 57 56
 89 d6 89 ca 53 53 53 89 c3 89 c8 c1 e8 02 89 df 89 c1 <f3> a5 f6 c2 02 74 02 66
 a5 f6 c2 01 74 01 a4 5a 89 d8 59 5b 5e                            
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Comment 9 John W. Linville 2005-09-09 16:54:36 UTC
Grrr...terribly sorry!  Perhaps you can tell I don't have a good IGMP setup 
here... 
 
Will try to get something better soon... 

Comment 10 David Juran 2005-09-12 09:13:56 UTC
Update from customer:

Just in case it helps, at the time of the panic the application hasn't started;
the system is still going through startup. I may be completely off track here,
but experience over several reboots shows this crash occurring at slightly
different times. Once or twice it would get as far as a login prompt on the
console; other times it would crash before it got that far. The crash footprint
was the same in both cases. All crashes occurred 'hands-off'; in other words,
if you just observed the system it would panic without me doing anything. Each
boot resulted in an identical panic, certainly within 30 seconds of the login
prompt appearing on the console.

Is it possible that the crash is a result of the kernel responding to the
periodic (60 second) poll from the Cisco switch? At this stage of the boot
there may or may not be a live multicast subscription, since the system is
configured to use multicast NTP. However, the main application is started
manually and so isn't running at this point.

If I can be of more help, please let me know.


-Andy 

Comment 13 John W. Linville 2005-09-14 02:24:46 UTC
Created attachment 118785 [details]
jwltest-bond_activebackup-igmp-hack.patc

This one works for me...

Comment 14 John W. Linville 2005-09-14 02:27:17 UTC
New test kernels are available at the same location as in comment 7.  Please 
give them a try and report the results. 
 
I was able to fix my IGMP test to recreate the problem with the first patch 
and to verify that this patch does what I intended... :-) 

Comment 16 Andy Williams 2005-09-15 09:28:58 UTC
Copied from the 'issue tracker' report:

The initial POC failover testing looks like it's worked. I don't have the 
report of the network guy running the tests but it seems that we now see 
multicast traffic being received on both NICs in the bond set irrespective of 
which one's active & which is passive. I don't believe this is an issue for us 
but will confirm with our network team and am also awaiting a report from the 
application guys to ensure the system downstream that receives the multicast 
transmitted from this application is also behaving appropriately.

One anomaly we have seen is a 10-second delay in system response, but I'm 
still trying to qualify this in more detail. I have several terminals logged 
in via SSH and see the 'hang' on all of them. These seem to be occurring at 
intervals of just under a minute - example 'hang times' are 16:56:14->16:56:24 
then 16:57:17->16:57:27, then 16:58:16->16:58:25 then 16:59:11->16:59:15 
followed immediately by another hang from 16:59:15->16:59:25. Unfortunately 
the network guy's gone home; I was keen to try to tie these up with the 
receipt of the IGMP membership reports from the Cisco kit. 

At this stage these delays could be attributable to a multitude of things, 
since we've only performed cursory testing on our application. However, 
nothing seems to be chewing CPU time on the box.

No errors are logged against either eth0 or eth1.

If John could have a think on whether his changes could be causing this, I'll 
try to arrange for the system to be rebooted on the 'standard' kernel for 
crude comparison testing tomorrow.


Comment 18 Andy Williams 2005-09-15 15:23:25 UTC
Copied from updated Issue Tracker call:

OK, further observations from the testing done so far:

(Note: kernel versions used: "standard" is 2.6.9-11.ELsmp and "test" is 2.6.9-19.EL.jwltest.60smp)

First, no delays of any kind noticed with "standard" kernel.

Second: On the test kernel, delays are only noticed when we subscribe to 
multicast. In other words, 'netstat -g' in the "no multicast subscription" case 
shows:

[15:57:19][root@lshlha1501 ~]$
[15:57:20][root@lshlha1501 ~]$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
bond0           1      224.0.0.1
eth0            1      224.0.0.1
eth1            1      224.0.0.1
lo              1      ff02::1
bond0           1      ff02::1:ff6b:2617
bond0           1      ff02::1
eth0            1      ff02::1
eth1            1      ff02::1
[15:57:22][root@lshlha1501 ~]$

Conversely, the "subscribed to multicast" shows:

[15:51:45]$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
bond0           1      224.0.1.1
bond0           1      224.0.0.1
eth0            1      224.0.0.1
eth1            1      224.0.0.1
lo              1      ff02::1
bond0           1      ff02::1:ff6b:2617
bond0           1      ff02::1
eth0            1      ff02::1
eth1            1      ff02::1
[15:51:52]$ 

The 224.0.1.1 subscription is NTP...

[root@lshlha1501 ~]#
[root@lshlha1501 ~]# cat /etc/ntp.conf
keys /etc/ntp/keys
enable auth
trustedkey 10
multicastclient 224.0.1.1
driftfile /var/lib/ntp/drift
restrict 127.0.0.1 nomodify notrap
restrict default nomodify notrap noquery
[root@lshlha1501 ~]#

Almost as soon as we do a 'service ntpd start' the delays cut in. However, we 
can go one stage further here:

First, I have two sessions to this system. The first is a standard SSH 
connection via PuTTY. The second is via the system console (in fact it's the 
integrated lights-out card but there's no material difference). The delays 
don't show up on the console, only on the SSH session. While the SSH session 
is 'hung' the console remains responsive.

Secondly, the delays appear to be on input rather than output. I did two 
simple things: I changed my session prompt to the time of day (PS1="[\t]$ ") 
and I created this silly script:

[root@lshlha1501 ~]# cat a1.sh
#!/bin/bash
while :
 do
        date
        sleep 1
 done
[root@lshlha1501 ~]#

If I leave the script running in my SSH session and start NTP (i.e. start a 
multicast subscription) from the console, I get the date output every second in 
the SSH session. I've left this running for 15 minutes and it doesn't miss a 
beat. However, if I start NTP from the console window and simply keep hitting 
the return key in my SSH session ("What did you do in the office today, Daddy? 
I wished my keyboard many happy returns...") I see immediate hangs, and on 
several occasions the session was dropped.

So, in summary: No apparent delays with no multicast subscriptions; no delays 
using the console; delays appear to be on input rather than output; observed 
delays range from 2-3 seconds up to the loss of a session.

The only testing I couldn't get organised is to try to correlate the delays 
with IGMP traffic because the network guru is out of the office. Not sure how 
useful that is based on this experimental evidence...

Cheers,

-Andy

Comment 19 John W. Linville 2005-09-15 19:21:19 UTC
Actually, this is very thorough testing.  If you ever need a reference for a 
QA job, have them contact me! :-) 
 
I think I know the problem: the IGMP frames are getting sent out multiple 
ports but all with the same source MAC address.  The switch is then having 
learning problems related to that situation.  I had put code in place to 
prevent that, but apparently I was copying the wrong MAC address into the 
frames transmitted on the inactive slaves. 
 
I have a patch that corrects that issue.  More test kernels to follow! 
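
As a hedged, user-space style illustration of that correction (not the actual
driver patch): each copy of the IGMP frame transmitted on an inactive slave
needs that slave's own hardware address written in as the Ethernet source MAC,
otherwise the switch keeps re-learning the bond's MAC address on whichever port
it last saw it.

/* set_src_mac.c - illustrative sketch: stamp a cloned frame with the MAC of
 * the slave it will be sent on, so the switch's address-learning table does
 * not see the same source MAC arriving on several ports. */
#include <linux/if_ether.h>  /* struct ethhdr, ETH_ALEN */
#include <stddef.h>
#include <string.h>

int set_source_mac(unsigned char *frame, size_t len,
                   const unsigned char slave_mac[ETH_ALEN])
{
    if (len < sizeof(struct ethhdr))
        return -1;

    struct ethhdr *eth = (struct ethhdr *)frame;
    memcpy(eth->h_source, slave_mac, ETH_ALEN);  /* overwrite the source MAC */
    return 0;
}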

Comment 20 John W. Linville 2005-09-15 19:26:57 UTC
Created attachment 118864 [details]
jwltest-bond_activebackup-igmp-hack.patch

Comment 21 John W. Linville 2005-09-15 20:08:22 UTC
Test kernels with the above patch are available at the same location as in 
comment 7.  Please give these a try and see if the disruptions caused by the 
IGMP traffic go away...thanks! 

Comment 22 Andy Williams 2005-09-15 22:36:23 UTC
Thanks for the very fast turnaround!

The new test kernel has been downloaded & deployed. I ran through the quick and 
dirty 'idiot tests' with multicast NTP that showed up the delays in the 
previous update and they don't appear to be happening, which is great news.

Tomorrow (Friday) we'll move on to the application itself and will run through 
all the failover tests again. I hope to have a useful update for you tomorrow 
evening.



Comment 23 Andy Williams 2005-09-28 18:51:24 UTC
I'm currently helping our System Test group define the tests to be run against 
the POC kernel. After reading John's excellent explanation of the fix, one 
particular scenario sprang to mind, but since it's still a week or so away 
before testing can begin here I thought I'd give John the chance to mull it 
over.

The fix is described as "replicate any outgoing IGMP traffic to all 'inactive' 
slaves in the bond". I may be taking this too literally, but I plan to put four 
NICs in a system and create two bond sets, each containing two NICs. Device 
'bond0' will contain 'eth0' and 'eth1', while 'bond1' will contain 'eth2' and 
'eth3'. I'll subscribe each bond device to a separate multicast stream before 
shutting the NICs down in the controlled tests, to ensure there's no 'cross 
pollination' of multicast/IGMP across the two bond sets.

John: Do you foresee this causing a problem?


Comment 24 John W. Linville 2005-09-28 20:39:47 UTC
The traffic is only replicated between devices in the same bond.  There should 
be no "cross pollination". 

Comment 26 John W. Linville 2005-10-17 16:45:21 UTC
Apparently using "primary=ethX" reduces the effectiveness of this fix..."don't 
do that"... 

Comment 27 Andy Williams 2005-10-17 19:54:57 UTC
Thanks for the advice!  We'd initially used this option (primary=eth0) but 
removed it early in the bonding testing when we noticed that as soon as eth0 
came back everything immediately failed back to it. Once bonding had failed 
across to the 'standby' NIC we wanted the traffic to stay there; removing this 
option achieves that. (We have found other issues with bonding apparently 
fouling up but that's outside the focus of this call).

-Andy

Comment 29 Andy Williams 2005-11-07 13:48:01 UTC
Copied from Issue Tracker:

Software Engineering have now completed their testing on the POC kernel. They 
were completely successful and we would now like to formally request a hotfix. 
Since this is RHEL 4, could the hotfix please be based on the latest released 
RHEL 4 Update 2 kernel, which we believe is 2.6.9-22.0.1?

Since the Engineering test group have now gone on to other things, I'd also be 
grateful if you could give us an estimate of when we could expect it, since 
we'll have to schedule acceptance testing.

If you have any issues or questions then please contact either Phil or me in 
the office.

Cheers,

-Andy 

Comment 33 Andy Williams 2005-12-13 10:25:41 UTC
We now have all our RHEL4 systems upgraded with the Hotfix kernel and have 
transferred some live services to them.

From our side, we can now close this call but a big thank you to all concerned 
for making this happen, permitting us to provide a significant improvement in 
resilience to what is a very critical part of our trading infrastructure.

-Andy

Comment 36 Red Hat Bugzilla 2006-03-07 19:47:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html