Created attachment 1103265 [details]
See events around Dec 7 09:52:50

Description of problem:

I am following the document "Clusters from Scratch" from clusterlabs.org, specifically this version:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Clusters_from_Scratch/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf

Everything went fine until Chapter 4, where I had to execute the command:

pcs cluster start --all

This is what transpired:

[root@localhost ~]# pcs cluster start --all
huysnpmvm10: Starting Cluster...
Redirecting to /bin/systemctl start corosync.service
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
huysnpmvm9: Starting Cluster...
Redirecting to /bin/systemctl start corosync.service
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
You have new mail in /var/spool/mail/root

[root@localhost ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-12-07 09:53:51 EST; 15s ago
  Process: 14736 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Dec 07 09:52:50 huysnpmvm9 systemd[1]: Starting Corosync Cluster Engine...
Dec 07 09:52:50 huysnpmvm9 corosync[14742]: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Dec 07 09:52:50 huysnpmvm9 corosync[14742]: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Dec 07 09:52:50 huysnpmvm9 corosync[14743]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 07 09:52:50 huysnpmvm9 corosync[14743]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 07 09:53:51 huysnpmvm9 corosync[14736]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 07 09:53:51 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.

[root@localhost ~]# journalctl -xn
-- Logs begin at Fri 2015-12-04 14:32:34 EST, end at Mon 2015-12-07 09:53:51 EST. --
Dec 07 09:52:51 huysnpmvm9 abrt-server[14748]: Email was sent to: root@localhost
Dec 07 09:52:51 huysnpmvm9 postfix/pickup[12456]: D97CF316428D: uid=0 from=<user@localhost>
Dec 07 09:52:51 huysnpmvm9 postfix/cleanup[14799]: D97CF316428D: message-id=<56659d43.dCxiTCGvgeHtHfo4%user@localhost>
Dec 07 09:52:51 huysnpmvm9 postfix/qmgr[2469]: D97CF316428D: from=<user>, size=22215, nrcpt=1 (queue active)
Dec 07 09:52:51 huysnpmvm9 postfix/local[14801]: D97CF316428D: to=<root>, orig_to=<root@localhost>, relay=loca
Dec 07 09:52:51 huysnpmvm9 postfix/qmgr[2469]: D97CF316428D: removed
Dec 07 09:53:51 huysnpmvm9 corosync[14736]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 07 09:53:51 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@localhost ~]#

Version-Release number of selected component (if applicable):

[root@localhost ~]# rpm -q pacemaker corosync pcs
pacemaker-1.1.12-22.el7_1.4.x86_64
corosync-2.3.4-4.el7_1.3.x86_64
pcs-0.9.137-13.el7_1.4.x86_64
[root@localhost ~]# cat /etc/system-release
CentOS Linux release 7.1.1503 (Core)
[root@localhost ~]# cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)
[root@localhost ~]#

How reproducible:

Steps to Reproduce:
1. Install CentOS 7.1 on two servers.
2. Follow the instructions in http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Clusters_from_Scratch/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf
3. In Chapter 4, you'll be asked to run 'pcs cluster start --all'. This is when the problem happened to me.

Actual results:
See above.

Expected results:
The cluster should have started fine.

Additional info:
Found a work-around here: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs Look for Freddy's comment.
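For reference, the gist of that workaround is adding both node names to /etc/hosts on both machines. A minimal sketch, using the addresses that appear later in this report (adjust names and addresses to your own environment):

# append to /etc/hosts on both nodes
10.35.29.221   huysnpmvm9    huysnpmvm9.mitel.com
10.35.29.222   huysnpmvm10   huysnpmvm10.mitel.com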
Comment on attachment 1103265 [details]
See events around Dec 7 09:52:50

I've set the attachment to private as it contains some private communication instead of logs.
Huy Vu, were you able to solve your problem by adding the hostnames to /etc/hosts on both machines? If yes, that would indicate that either your local domain name resolution does not work properly or the host names used in the cluster are not known to the DNS service.
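For anyone checking the same thing, a few standard commands can show how a node name is actually being resolved (the hostnames below are the ones from this report):

# resolution through the system resolver, honoring the nsswitch order
getent hosts huysnpmvm9
getent hosts huysnpmvm10
# which sources are consulted, and in what order
grep ^hosts /etc/nsswitch.conf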
Created attachment 1103613 [details]
Provided logs
Hello Radek,

Thanks for spotting the wrong logs. I have attached the correct logs.

I was able to resolve the problem by adding both host names to my /etc/hosts. What was confusing to me was that the DNS server resolved the names of both nodes correctly; I was able to ping the nodes by name from each of the nodes before adding the names to the hosts file.
(In reply to Huy VU from comment #5)
> Hello Radek,
> Thanks for spotting the wrong logs. I have attached the correct logs.
>
> I was able to resolve the problem by adding both host names to my
> /etc/hosts. What was confusing to me was that the DNS server resolved the
> names of both nodes correctly; I was able to ping nodes by names from each
> of the nodes before adding the names to the hosts file.

The DNS server had records for node1 and node2 and was able to look up their IPs.
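That makes the symptom look puzzling on the surface, which is why it can be worth checking what record types the DNS actually returns for the nodes; whether IPv6 answers can come back turns out to be relevant to the root cause eventually identified below (see the proposed patch). A purely illustrative check:

# IPv4 vs IPv6 records for a node name
dig +short A huysnpmvm9.mitel.com
dig +short AAAA huysnpmvm9.mitel.com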
What happens if you remove the /etc/hosts records and start corosync manually on all nodes, i.e.:

# systemctl start corosync
Removed entry of self from /etc/hosts.

On Node 1:

[root@huysnpmvm9 ~]# systemctl start corosync
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
You have new mail in /var/spool/mail/root
[root@huysnpmvm9 ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: failed (Result: exit-code) since Tue 2015-12-08 11:25:43 EST; 19s ago
  Process: 8050 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Dec 08 11:24:42 huysnpmvm9 systemd[1]: Starting Corosync Cluster Engine...
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.

[root@huysnpmvm9 ~]# journalctl -xn
-- Logs begin at Mon 2015-12-07 15:05:02 EST, end at Tue 2015-12-08 11:25:43 EST. --
Dec 08 11:24:43 huysnpmvm9 postfix/cleanup[8112]: DC32D3402FAA: message-id=<5667044b.XeUrThbdjPfDse1m%user@localhost>
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: from=<user>, size=22215, nrcpt=1 (queue active)
Dec 08 11:24:43 huysnpmvm9 postfix/local[8114]: DC32D3402FAA: to=<root>, orig_to=<root@localhost>, relay=local
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: removed
Dec 08 11:24:57 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.39.
Dec 08 11:24:59 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.40.
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.

[root@huysnpmvm9 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.222 huysnpmvm10 huysnpmvm10.mitel.com
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm9 ~]# ping huysnpmvm9

On Node 2:

[root@huysnpmvm10 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.221 huysnpmvm9 huysnpmvm9.mitel.com
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm10 ~]# ping huysnpmvm10
PING huysnpmvm10 (10.35.29.222) 56(84) bytes of data.
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=1 ttl=64 time=0.000 ms
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=2 ttl=64 time=0.060 ms
^C
--- huysnpmvm10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.000/0.030/0.060/0.030 ms
[root@huysnpmvm10 ~]# ping huysnpmvm9
PING huysnpmvm9 (10.35.29.221) 56(84) bytes of data.
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=1 ttl=64 time=1.07 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=2 ttl=64 time=0.476 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=3 ttl=64 time=0.298 ms
^C
--- huysnpmvm9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.298/0.615/1.073/0.332 ms
Huy VU, can you please provide your corosync.conf (please redact any confidential information)? Also, abrt was triggered, so can you please provide the abrt result (coredump, ...)?
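For context, a pcs-generated corosync.conf for a two-node cluster typically looks roughly like the sketch below. The node names are the ones from this report and the cluster name is a placeholder; the actual file may differ. Note the udpu transport, which matches the "UDP/IP Unicast" line in the logs above:

totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: huysnpmvm9
        nodeid: 1
    }
    node {
        ring0_addr: huysnpmvm10
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}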
Huy VU, another piece of interesting information would be the output of corosync running with debug set to trace. Just set debug: trace in the logging section, so it looks like:

logging {
    ...
    debug: trace
    ...
}

and then execute corosync in foreground mode by running "corosync -f".
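If the foreground trace is needed as an attachment, it can be captured to a file, e.g. (a sketch, assuming "to_stderr: yes" is also set in the logging section so the messages reach the terminal):

corosync -f 2>&1 | tee /tmp/corosync-trace.log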
Huy VU, can you please try to provide the information I asked for? Corosync is evidently crashing, and that's not good. Sadly, I'm not able to reproduce this bug.
Created attachment 1144236 [details]
Proposed patch

totemconfig: Explicitly pass IP version

If the resolver was set to prefer IPv6 (almost always the case) and the interface section was not defined (true of almost all config files created by pcs), the IP version was taken from mcast_addr.family. Because mcast_addr.family was unset (reset to zero), an IPv6 address was returned, causing a failure in totemsrp. The solution is to pass the correct IP version stored in totem_config->ip_version.

The patch also simplifies get_cluster_mcast_addr, which was using a mix of the explicitly passed IP version and the bindnet IP version. In addition, the return value of get_cluster_mcast_addr is now properly checked.
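This also explains the /etc/hosts workaround: entries there typically carry only IPv4 addresses, so the resolver hands back IPv4 first. An illustrative way to see which address family the system resolver prefers for a node name (hostname from this report; ordering depends on /etc/gai.conf and the records available):

# getaddrinfo-based lookup; the first entries are what a caller effectively
# asking for AF_UNSPEC (roughly what the unpatched code did, per the patch
# description above) gets back first
getent ahosts huysnpmvm9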
Reworking the doc text for inclusion in the 7.3 release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2463.html
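For completeness, a minimal sketch of picking up the fix on an affected machine (the exact fixed package version is the one listed in the advisory):

yum update corosync          # pull the errata build from the updates repository
rpm -q corosync              # confirm the installed version matches the advisory
pcs cluster start --all      # restart the cluster as in the original steps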