Bug 167630
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Multicast domain membership doesn't follow bonding failover | | |
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Bastien Nocera <bnocera> |
| Component: | kernel | Assignee: | John W. Linville <linville> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | andy.williams, jbaron, jon.stanley, tao, vanhoof |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHSA-2006-0132 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-03-07 19:47:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 168429 | | |
| Attachments: | | | |
Description (Bastien Nocera, 2005-09-06 14:08:41 UTC)
Is the problem only that the bond does not send an IGMP join as part of the failover? Or that multicast traffic isn't received on the new master even if the switch at the other end sends it correctly? It looks like there is an effort made to propagate the multicast address list from the old master to the new one. Hopefully that is working? Generating the IGMP join(s) in response to the failover may require additional infrastructure... still investigating...

The problem is that, in that particular case, the router is also a different one. Issuing a new membership join (in user space) will do nothing; it will only increment the refcount. The application would need to leave the multicast domain and rejoin it to have a new membership report generated.

Created attachment 118633 [details]
jwltest-bond_activebackup-igmp-hack.patch
A patch to cause IGMP frames transmitted on a bond in active backup state to be flooded to all of the slaves.
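The refcounting point above can be illustrated with a small user-space sketch (Python is used here purely for illustration; the group address is the multicast NTP group that appears later in this report). A second join to a group the host is already a member of only bumps the kernel's per-device refcount; no new IGMP membership report goes on the wire, which is why a user-space "rejoin" cannot notify a different router after a bonding failover.

```python
import socket
import struct

# The multicast NTP group used later in this report.
GROUP = "224.0.1.1"

def make_mreq(group, iface="0.0.0.0"):
    # struct ip_mreq: 4-byte group address followed by 4-byte local
    # interface address (0.0.0.0 lets the kernel pick an interface).
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))

def join(group):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, make_mreq(group))
    return s

# First join: the kernel sends an IGMP membership report.
# Second join (another socket): only the in-kernel refcount increases,
# so nothing new reaches the router.  To force a fresh report the
# application would have to drop the membership entirely and join again.
try:
    s1 = join(GROUP)
    s2 = join(GROUP)
    s1.close()
    s2.close()
except OSError:
    pass  # the join can fail on hosts with no multicast-capable route
```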
Test kernels w/ the above patch are available here: http://people.redhat.com/linville/kernels/rhel4/ Please give them a try and post the results here... thanks!

Crashes on boot:

Remounting root filesystem in read-write mode: [ OK ]
[<f88d6329>] bond_activebackup_xmit_clone+0x3c/0x70 [bonding]
[<f88d6418>] bond_xmit_activebackup+0xbb/0x111 [bonding]
[<c027dbdf>] dev_queue_xmit+0x167/0x207
[<c0283075>] neigh_resolve_output+0x113/0x152
[<c0296e96>] ip_finish_output2+0x12e/0x16d
[<c0286a81>] nf_hook_slow+0x83/0xb4
[<c0296d5f>] ip_finish_output+0x1a5/0x1ae
[<c0296d68>] ip_finish_output2+0x0/0x16d
[<c02b88b2>] igmp_send_report+0x22c/0x271
[<c02b8a2a>] igmp_timer_expire+0x96/0xa2
[<c02b8994>] igmp_timer_expire+0x0/0xa2
[<c0129e15>] run_timer_softirq+0x123/0x145
[<c012641c>] __do_softirq+0x4c/0xb1
[<c010812f>] do_softirq+0x4f/0x56
=======================
[<c01173c0>] smp_apic_timer_interrupt+0x9a/0x9c
[<c02d1b0e>] apic_timer_interrupt+0x1a/0x20
[<c01040e5>] mwait_idle+0x33/0x42
[<c010409d>] cpu_idle+0x26/0x3b
[<c0390786>] start_kernel+0x199/0x19d
Code: ff 21 e2 3b 42 18 73 06 8b 50 fd 31 c0 c3 31 d2 b8 f2 ff ff ff c3 90 57 56 89 d6 89 ca 53 53 53 89 c3 89 c8 c1 e8 02 89 df 89 c1 <f3> a5 f6 c2 02 74 02 66 a5 f6 c2 01 74 01 a4 5a 89 d8 59 5b 5e
<0>Kernel panic - not syncing: Fatal exception in interrupt

Grrr... terribly sorry! Perhaps you can tell I don't have a good IGMP setup here... Will try to get something better soon...

Update from customer: Just in case it helps, at the time of the panic the application hasn't started; the system is still going through startup. I may be completely off track here, but experience over several reboots of the system shows this crash occurring at slightly different times. Once or twice it would get as far as a login prompt on the console; other times it would crash before it got that far. The crash footprint was the same in both cases.
All crashes occurred 'hands-off'; in other words, if you just observed the system it would panic without me doing anything. Each boot resulted in an identical panic, certainly within 30 seconds of the login prompt appearing on the console. Is it possible that the crash is a result of the kernel responding to the periodic (60-second) poll from the Cisco switch? At this stage of the boot there may or may not be a live multicast subscription, since the system is configured to use multicast NTP. However, the main application is started manually and so isn't running at this point. If I can be of more help, please let me know.

-Andy

Created attachment 118785 [details]
jwltest-bond_activebackup-igmp-hack.patc
This one works for me...
New test kernels are available at the same location as in comment 7. Please give them a try and report the results. I was able to fix my IGMP test to recreate the problem with the first patch and to verify that this patch does what I intended... :-)

Copied from the 'issue tracker' report: The initial POC failover testing looks like it's worked. I don't have the report of the network guy running the tests, but it seems that we now see multicast traffic being received on both NICs in the bond set irrespective of which one is active and which is passive. I don't believe this is an issue for us but will confirm with our network team; I am also awaiting a report from the application guys to ensure the downstream system that receives the multicast transmitted from this application is also behaving appropriately.

One anomaly we have seen is a 10-second delay in system response, but I'm still trying to qualify this in more detail. I have several terminals logged in via SSH and see the 'hang' on all of them. These seem to be occurring at intervals of just under a minute - example 'hang times' are 16:56:14->16:56:24, then 16:57:17->16:57:27, then 16:58:16->16:58:25, then 16:59:11->16:59:15, followed immediately by another hang from 16:59:15->16:59:25. Unfortunately the network guy's gone home; I was keen to try to tie these up with the receipt of the IGMP membership reports from the Cisco kit. At this stage these delays could be attributable to a multitude of things, since we've only performed cursory testing on our application. However, nothing seems to be chewing CPU time on the box, and no errors are logged against either eth0 or eth1. If John could have a think on whether his changes could be causing this, I'll try to arrange for the system to be rebooted on the 'standard' kernel for crude comparison testing tomorrow.
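The reported hang times are worth a quick sanity check: the onsets are spaced just under a minute apart, consistent with the earlier suspicion of a periodic (60-second) poll from the Cisco switch. A throwaway calculation over the start times quoted above:

```python
from datetime import datetime

# Onset times of the reported hangs, taken from the update above
# (the back-to-back 16:59:15 hang is omitted).
hang_starts = ["16:56:14", "16:57:17", "16:58:16", "16:59:11"]

times = [datetime.strptime(t, "%H:%M:%S") for t in hang_starts]
intervals = [(b - a).seconds for a, b in zip(times, times[1:])]
print(intervals)  # -> [63, 59, 55]
```

All three gaps land within a few seconds of a minute, which is at least suggestive of a once-per-minute trigger.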
Copied from updated Issue Tracker call: OK, further observations from the testing done so far. (Note: kernel versions used: "standard" is 2.6.9-11.ELsmp and "test" is 2.6.9-19.EL.jwltest.60smp.)

First, no delays of any kind were noticed with the "standard" kernel.

Second, on the test kernel, delays were only noticed when we subscribe to multicast. In other words, using 'netstat -g', the "no multicast subscription" case shows:

[15:57:20][root@lshlha1501 ~]$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
bond0           1      224.0.0.1
eth0            1      224.0.0.1
eth1            1      224.0.0.1
lo              1      ff02::1
bond0           1      ff02::1:ff6b:2617
bond0           1      ff02::1
eth0            1      ff02::1
eth1            1      ff02::1
[15:57:22][root@lshlha1501 ~]$

Conversely, the "subscribed to multicast" case shows:

[15:51:45]$ netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
bond0           1      224.0.1.1
bond0           1      224.0.0.1
eth0            1      224.0.0.1
eth1            1      224.0.0.1
lo              1      ff02::1
bond0           1      ff02::1:ff6b:2617
bond0           1      ff02::1
eth0            1      ff02::1
eth1            1      ff02::1
[15:51:52]$

The 224.0.1.1 subscription is NTP...

[root@lshlha1501 ~]# cat /etc/ntp.conf
keys /etc/ntp/keys
enable auth
trustedkey 10
multicastclient 224.0.1.1
driftfile /var/lib/ntp/drift
restrict 127.0.0.1 nomodify notrap
restrict default nomodify notrap noquery
[root@lshlha1501 ~]#

Almost immediately after we do a 'service ntpd start' the delays cut in. However, we can go one stage further here. First, I have two sessions to this system: one is a standard SSH connection via PuTTY; the other is via the system console (in fact it's the integrated lights-out card, but there's no material difference). The delays don't show up on the console, only on the SSH session; while the SSH session is 'hung' the console remains responsive. Secondly, the delays appear to be on input rather than output.
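As an aside, the 'netstat -g' listings above ultimately come from /proc/net/igmp, where the kernel prints each group address as a raw 32-bit hex value in host byte order. A small decoder sketch (the little-endian assumption matches the i686 system in this report; the sample value is hypothetical but matches the /proc format):

```python
import socket
import struct

def decode_group(hexaddr):
    # /proc/net/igmp prints the group as a native-endian 32-bit value in
    # hex; on a little-endian box the bytes must be reversed before
    # rendering as a dotted quad.
    raw = struct.pack("<I", int(hexaddr, 16))
    return socket.inet_ntoa(raw)

print(decode_group("010000E0"))  # -> 224.0.0.1 (the all-hosts group)
print(decode_group("010100E0"))  # -> 224.0.1.1 (the multicast NTP group)
```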
I did two simple things: I changed my session prompt to the time of day (PS1="[\t]$ ") and I created this silly script:

[root@lshlha1501 ~]# cat a1.sh
#!/bin/bash
while :
do
    date
    sleep 1
done
[root@lshlha1501 ~]#

If I leave the script running in my SSH session and start NTP (i.e. start a multicast subscription) from the console, I get the date output every second in the SSH session. I've left this running for 15 minutes and it doesn't miss a beat. However, if I start NTP from the console window and simply keep hitting the return key in my SSH session ("What did you do in the office today, Daddy? I wished my keyboard many happy returns..."), I see immediate hangs, and on several occasions the session was dropped.

So, in summary: no apparent delays with no multicast subscriptions; no delays using the console; delays appear to be on input rather than output; observed delays range from 2-3 seconds up to the loss of a session. The only testing I couldn't get organised is correlating the delays with IGMP traffic, because the network guru is out of the office. Not sure how useful that is based on this experimental evidence... Cheers, -Andy

Actually, this is very thorough testing. If you ever need a reference for a QA job, have them contact me! :-) I think I know the problem: the IGMP frames are getting sent out multiple ports, but all with the same source MAC address, and the switch is then having learning problems related to that situation. I had put code in place to prevent that, but apparently I was copying the wrong MAC address into the frames transmitted on the inactive slaves. I have a patch that corrects that issue. More test kernels to follow!

Created attachment 118864 [details]
jwltest-bond_activebackup-igmp-hack.patch
Test kernels with the above patch are available at the same location as in comment 7. Please give these a try and see if the disruptions caused by the IGMP traffic go away... thanks!

Thanks for the very fast turnaround! The new test kernel has been downloaded and deployed. I ran through the quick and dirty 'idiot tests' with multicast NTP that showed up the delays in the previous update, and they don't appear to be happening, which is great news. Tomorrow (Friday) we'll move on to the application itself and will run through all the failover tests again. I hope to have a useful update for you tomorrow evening.

I'm currently helping our System Test group define the tests to be run against the POC kernel. After reading John's excellent explanation of the fix, one particular scenario sprang to mind, but since it's still a week or so away before testing can begin here I thought I'd give John the chance to mull it over. The fix is described as "replicate any outgoing IGMP traffic to all 'inactive' slaves in bond". I may be taking this too literally, but I plan to put four NICs in a system and create two bond sets, each containing two NICs. Device 'bond0' will contain 'eth0' and 'eth1', while 'bond1' will contain 'eth2' and 'eth3'. I'll subscribe each bond device to a separate multicast stream before shutting the NICs down in the controlled tests to ensure there's no 'cross pollination' of multicast/IGMP across the two bond sets. John: do you perceive this will cause a problem?

The traffic is only replicated between devices in the same bond. There should be no "cross pollination".

Apparently using "primary=ethX" reduces the effectiveness of this fix... "don't do that"...

Thanks for the advice! We'd initially used this option (primary=eth0) but removed it early in the bonding testing when we noticed that as soon as eth0 came back everything immediately failed back to it.
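For reference, an active-backup bond without the primary= option (as recommended above) would look roughly like this in a RHEL 4-era /etc/modprobe.conf. This is a sketch: the miimon value and device name are common defaults, not taken from this report.

```text
# /etc/modprobe.conf (RHEL 4): active-backup bonding, no primary= option,
# so traffic stays on the standby NIC after a failover.
alias bond0 bonding
options bond0 mode=1 miimon=100
```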
Once bonding had failed across to the 'standby' NIC we wanted the traffic to stay there; removing this option achieves that. (We have found other issues with bonding apparently fouling up, but that's outside the focus of this call.) -Andy

Copied from Issue Tracker: Software Engineering have now completed their testing on the POC kernel. They were completely successful and we would now like to formally request a hotfix. Since this is RHEL 4, could the hotfix please be based on the latest released RHEL 4 Update 2 kernel, which we believe is 2.6.9-22.0.1? Since the Engineering test group have now moved on to other things, I'd also be grateful if you could give us an estimate of when we could expect it, since we'll have to schedule acceptance testing. If you have any issues or questions, please contact either Phil or me in the office. Cheers, -Andy

We now have all our RHEL4 systems upgraded with the Hotfix kernel and have transferred some live services to them. From our side, we can now close this call, but a big thank you to all concerned for making this happen, permitting us to provide a significant improvement in resilience in what is a very critical part of our trading infrastructure. -Andy

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html