Bug 1289169
Summary: corosync does not start
Product: Red Hat Enterprise Linux 7 | Reporter: Huy VU <huy.vu>
Component: corosync | Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA | QA Contact: cluster-qe <cluster-qe>
Severity: urgent | Docs Contact: Steven J. Levine <slevine>
Priority: urgent
Version: 7.1 | CC: akarlsso, ccaulfie, cfeist, cluster-maint, huy.vu, jfriesse, jkortus, kkeithle, mdolezel, mnavrati, rcyriac, rsteiger, skoduri, tojeline
Target Milestone: rc | Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: corosync-2.3.6-1.el7 | Doc Type: Bug Fix
Doc Text:
Corosync starts correctly when configured to use IPv4 and DNS is set to return both IPv4 and IPv6 addresses.

Previously, when a pcs-generated `corosync.conf` file used hostnames instead of IP addresses together with Internet Protocol version 4 (IPv4), and the DNS server was set to return both IPv4 and IPv6 addresses, the `corosync` utility failed to start. With this fix, when Corosync is configured to use IPv4, name resolution is restricted to IPv4 addresses. As a result, `corosync` starts as expected in the described circumstances.
Story Points: ---
Clones: 1333396, 1333397 (view as bug list)
Last Closed: 2016-11-04 06:49:10 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1324332, 1333396, 1333397
Description
Huy VU
2015-12-07 15:09:43 UTC
Found a work-around here: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs (look for Freddy's comment).

Comment on attachment 1103265 [details]
See events around Dec 7 09:52:50

I've set the attachment to private as it contains some private communication instead of logs.
Huy Vu, were you able to solve your problem by adding the hostnames to /etc/hosts on both machines? If yes, it would indicate that either your local domain name resolution does not work properly or the host names used in the cluster are not known by the DNS service.

Created attachment 1103613 [details]
Provided logs
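As background for the diagnosis above, a quick way to see exactly what addresses a given hostname resolves to, per address family, is a short sketch like the following (Python rather than corosync's C; the hostnames passed in are whatever your cluster uses, `localhost` here is just a safe demonstration value):

```python
import socket

def resolve(name, family=socket.AF_UNSPEC):
    """Return the unique addresses `name` resolves to for the given family."""
    try:
        infos = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
    except socket.gaierror as err:
        return ["unresolvable: %s" % err]
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    seen = []
    for info in infos:
        addr = info[4][0]
        if addr not in seen:
            seen.append(addr)
    return seen

print(resolve("localhost"))                  # all families, resolver's order
print(resolve("localhost", socket.AF_INET))  # IPv4 only
```

Running this for each cluster node name, with and without the /etc/hosts entries, shows whether DNS returns both A and AAAA records and in which order the resolver prefers them, which is the behavior at the heart of this bug.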
Hello Radek,

Thanks for spotting the wrong logs. I have attached the correct logs.

I was able to resolve the problem by adding both host names to my /etc/hosts. What was confusing to me was that the DNS server resolved the names of both nodes correctly; I was able to ping the nodes by name from each of the nodes before adding the names to the hosts file.

(In reply to Huy VU from comment #5)
> Hello Radek,
> Thanks for spotting the wrong logs. I have attached the correct logs.
>
> I was able to resolve the problem by adding both host names to my
> /etc/hosts. What was confusing to me was that the DNS server resolved the
> names of both nodes correctly; I was able to ping nodes by names from each
> of the nodes before adding the names to the hosts file.

The DNS server had records for node1 and node2 and was able to look up their IPs. What happens if you remove the /etc/hosts records and start corosync manually on all nodes, i.e.:

# systemctl start corosync

Removed entry of self from /etc/hosts.

On Node 1:

[root@huysnpmvm9 ~]# systemctl start corosync
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
You have new mail in /var/spool/mail/root
[root@huysnpmvm9 ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: failed (Result: exit-code) since Tue 2015-12-08 11:25:43 EST; 19s ago
  Process: 8050 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Dec 08 11:24:42 huysnpmvm9 systemd[1]: Starting Corosync Cluster Engine...
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.

[root@huysnpmvm9 ~]# journalctl -xn
-- Logs begin at Mon 2015-12-07 15:05:02 EST, end at Tue 2015-12-08 11:25:43 EST. --
Dec 08 11:24:43 huysnpmvm9 postfix/cleanup[8112]: DC32D3402FAA: message-id=<5667044b.XeUrThbdjPfDse1m%user@localhost>
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: from=<user>, size=22215, nrcpt=1 (queue active)
Dec 08 11:24:43 huysnpmvm9 postfix/local[8114]: DC32D3402FAA: to=<root>, orig_to=<root@localhost>, relay=local
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: removed
Dec 08 11:24:57 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.39.
Dec 08 11:24:59 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.40.
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@huysnpmvm9 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.222 huysnpmvm10 huysnpmvm10.mitel.com
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm9 ~]# ping huysnpmvm9

On Node 2:

[root@huysnpmvm10 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.221 huysnpmvm9 huysnpmvm9.mitel.com
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm10 ~]# ping huysnpmvm10
PING huysnpmvm10 (10.35.29.222) 56(84) bytes of data.
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=1 ttl=64 time=0.000 ms
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=2 ttl=64 time=0.060 ms
^C
--- huysnpmvm10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.000/0.030/0.060/0.030 ms
[root@huysnpmvm10 ~]# ping huysnpmvm9
PING huysnpmvm9 (10.35.29.221) 56(84) bytes of data.
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=1 ttl=64 time=1.07 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=2 ttl=64 time=0.476 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=3 ttl=64 time=0.298 ms
^C
--- huysnpmvm9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.298/0.615/1.073/0.332 ms

Huy VU, can you please provide your corosync.conf (please change any confidential information you find in it)? Also, abrt was triggered, so can you please provide the abrt result (coredump, ...)?

Huy VU, another interesting piece of information may come from running corosync with debug set to trace. Just set "debug:" in the logging section, so it looks like:

logging {
    ...
    debug: trace
    ...
}

and execute corosync in foreground mode by running "corosync -f".

Huy VU, can you please try to provide the information I was asking for? Corosync is evidently crashing, and that's not good. Sadly, I am not able to reproduce this bug.

Created attachment 1144236 [details]
Proposed patch
totemconfig: Explicitly pass IP version

If the resolver was set to prefer IPv6 (almost always) and the interface
section was not defined (almost all config files created by pcs), the IP
version was set to mcast_addr.family. Because mcast_addr.family was unset
(reset to zero), an IPv6 address was returned, causing a failure in
totemsrp.

The solution is to pass the correct IP version stored in
totem_config->ip_version.

The patch also simplifies get_cluster_mcast_addr, which was using a mix of
the explicitly passed IP version and the bindnet IP version.

The return value of get_cluster_mcast_addr is now also properly checked.
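The logic of the fix can be illustrated with a short sketch. Note that this is Python rather than corosync's C code, and the names (`FAMILY_FOR_IP_VERSION`, `resolve_ring_addr`) are hypothetical, not corosync's own: the point is that when the configured IP version is passed to the resolver explicitly, a name with both A and AAAA records yields an address of the configured family, whereas leaving the family unspecified lets the resolver's preference (often IPv6) decide.

```python
import socket

# Hypothetical mapping mirroring the role of totem_config->ip_version;
# the real corosync code is C and uses its own totemip_* helpers.
FAMILY_FOR_IP_VERSION = {
    "ipv4": socket.AF_INET,
    "ipv6": socket.AF_INET6,
}

def resolve_ring_addr(name, ip_version):
    """Resolve `name`, forcing the family the cluster is configured for.

    Passing socket.AF_UNSPEC here instead would reproduce the bug:
    getaddrinfo returns addresses in the resolver's preferred order, so a
    host with both A and AAAA records may yield an IPv6 address even
    though the cluster is configured for IPv4.
    """
    family = FAMILY_FOR_IP_VERSION[ip_version]
    infos = socket.getaddrinfo(name, None, family, socket.SOCK_DGRAM)
    return infos[0][4][0]  # address string of the first sockaddr

print(resolve_ring_addr("localhost", "ipv4"))
```

This also explains the reporter's observation that ping worked: ping happily uses whichever family the resolver prefers, while corosync's totemsrp failed when it was handed an IPv6 address for an IPv4-configured ring.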
Reworking the doc text for inclusion in the 7.3 release notes.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2463.html