Bug 861135
Summary: | Cluster Traffic does not send IGMP joins | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Rob Marti <robmartiwork> |
Component: | openais | Assignee: | Jan Friesse <jfriesse> |
Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.8 | CC: | cluster-maint, edamato, mppz3wzs7k, rmarti, sdake |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2013-09-06 18:04:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Rob Marti
2012-09-27 15:38:30 UTC
OpenAIS is correctly calling setsockopt (sockets->mcast_recv, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof (mreq)); the kernel should then take care of creating the IGMP join message. If it does not, there may be a problem either in glibc or in the kernel. How did you find out that the IGMP join is not sent?

I started the cluster services and our networking people watched the packets at the switch, never seeing an IGMP join. We also used tcpdump/wireshark to try to find it. Using a Cisco Catalyst switch, the cluster goes quorate without any issues (according to Cisco, because those switches treat any multicast traffic as an implicit join). Swapping to the N7Ks leads to an unquorate cluster. I've moved on from that job (and now work for Red Hat), but I know this is still an issue. I'll try to get one of my ex-coworkers copied on the bug.

The IGMP join is handled by the kernel. You can see the kernel info in /proc/net/igmp. Displaying this file would help determine whether the kernel saw the openais request to add membership.

Regards
-steve

Here are the contents of /proc/net/igmp from one of the clusters that split-brains.

    Idx Device    : Count Querier   Group    Users Timer      Reporter
    1   lo        :     0      V3
                                    010000E0     1 0:00000000 0
    5   bond0     :     7      V3
                                    2E16C0EF     1 0:00000000 0
                                    FB0000E0     1 0:00000000 0
                                    010000E0     1 0:00000000 0

Corey, I would expect that if you kill openais you will see one of the querier entries disappear. I would recommend running this through Red Hat's GSS, since they have a better handle on how to properly configure your switches. My guess is that you have IGMP timeouts (may not be the proper term) turned on in the switch, and the Cisco switch is dropping the IGMP data from its querier tables after the timeout.

Regards
-steve

Steve, I can certainly put in a ticket, but I work for an educational institution with only self-support. Unless it's something on Satellite (Satellite's not clustered, no issue there), I believe we're on our own. Thank you.

Corey Crawford

One extra comment. Corosync uses ASM (Any Source Multicast). Maybe Cisco (or your Cisco configuration) prefers SSM. Sadly, we don't support SSM in OpenAIS/Corosync. You can test that by trying omping (should be in EPEL). If you get the expected behavior with the -M ssm option, we will know the source of your problem (sadly, not a solution).

I don't know if it matters, but we're not using openais or corosync. I'm trying omping this morning.

Corey Crawford

Oh, sorry. We're using openais, but the cluster I'm testing on is currently not running. We're not using corosync, though. It doesn't even seem to be in the RHEL5 repos.

(In reply to comment #9)
> Oh, sorry. We're using openais, but the cluster I'm testing on is currently
> not running.
> We're not using corosync, though. It doesn't even seem to be in the RHEL5
> repos.

Ya, sorry for confusing you. Actually, wherever I say corosync, I mean openais. The transport code is almost the same in both openais and corosync, and the cluster (as cman) just executes openais (aisexec) or corosync, depending on the RHEL version.
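For anyone reading the /proc/net/igmp output quoted above: the Group column is a hex dump of the group address, and on a little-endian host (assumed here, e.g. x86) the bytes appear in reverse order. The following is a minimal sketch, not part of openais, that decodes those values; the function name decode_igmp_group is made up for illustration.

```python
import socket
import struct


def decode_igmp_group(hex_group: str) -> str:
    """Convert a Group value from /proc/net/igmp to dotted-quad notation.

    Assumption: the file was produced on a little-endian machine, so the
    hex string shows the address bytes in reversed order.
    """
    # Re-pack the integer as little-endian to restore network byte order.
    raw = struct.pack("<I", int(hex_group, 16))
    return socket.inet_ntoa(raw)


if __name__ == "__main__":
    # Group values copied from the bond0 entry quoted in this bug.
    for group in ("2E16C0EF", "FB0000E0", "010000E0"):
        print(group, "->", decode_igmp_group(group))
```

Under that assumption, 2E16C0EF decodes to 239.192.22.46, which is presumably the cluster's multicast address, while FB0000E0 (224.0.0.251, mDNS) and 010000E0 (224.0.0.1, all hosts) are unrelated to the cluster. If the cluster address is listed, the kernel has registered the membership requested by openais.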