Bug 1289169

Summary: corosync does not start
Product: Red Hat Enterprise Linux 7
Reporter: Huy VU <huy.vu>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact: Steven J. Levine <slevine>
Priority: urgent
Version: 7.1
CC: akarlsso, ccaulfie, cfeist, cluster-maint, huy.vu, jfriesse, jkortus, kkeithle, mdolezel, mnavrati, rcyriac, rsteiger, skoduri, tojeline
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: corosync-2.3.6-1.el7
Doc Type: Bug Fix
Doc Text:
Corosync starts correctly when configured to use IPv4 and DNS is set to return both IPv4 and IPv6 addresses

Previously, when a pcs-generated `corosync.conf` file used hostnames instead of IP addresses with Internet Protocol version 4 (IPv4), and the DNS server was set to return both IPv4 and IPv6 addresses, the `corosync` utility failed to start. With this fix, if Corosync is configured to use IPv4, IPv4 is actually used. As a result, `corosync` starts as expected in the described circumstances.
Story Points: ---
Clone Of: ---
Clones: 1333396, 1333397
Last Closed: 2016-11-04 06:49:10 UTC
Type: Bug
Bug Blocks: 1324332, 1333396, 1333397

Attachments:
Provided logs (flags: none)
Proposed patch (flags: none)

Description Huy VU 2015-12-07 15:09:43 UTC
Created attachment 1103265 [details]
See events around Dec 7 09:52:50

Description of problem:
I am following the document "Clusters from Scratch" from clusterlabs.org. Specifically this version: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Clusters_from_Scratch/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf

Everything went fine until Chapter 4, where I had to execute the command:

pcs cluster start --all

This is what transpired:

[root@localhost ~]# pcs cluster start --all
huysnpmvm10: Starting Cluster...
Redirecting to /bin/systemctl start  corosync.service
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
huysnpmvm9: Starting Cluster...
Redirecting to /bin/systemctl start  corosync.service
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
You have new mail in /var/spool/mail/root
[root@localhost ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-12-07 09:53:51 EST; 15s ago
  Process: 14736 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Dec 07 09:52:50 huysnpmvm9 systemd[1]: Starting Corosync Cluster Engine...
Dec 07 09:52:50 huysnpmvm9 corosync[14742]: [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Dec 07 09:52:50 huysnpmvm9 corosync[14742]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Dec 07 09:52:50 huysnpmvm9 corosync[14743]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 07 09:52:50 huysnpmvm9 corosync[14743]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 07 09:53:51 huysnpmvm9 corosync[14736]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 07 09:53:51 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@localhost ~]# journalctl -xn
-- Logs begin at Fri 2015-12-04 14:32:34 EST, end at Mon 2015-12-07 09:53:51 EST. --
Dec 07 09:52:51 huysnpmvm9 abrt-server[14748]: Email was sent to: root@localhost
Dec 07 09:52:51 huysnpmvm9 postfix/pickup[12456]: D97CF316428D: uid=0 from=<user@localhost>
Dec 07 09:52:51 huysnpmvm9 postfix/cleanup[14799]: D97CF316428D: message-id=<56659d43.dCxiTCGvgeHtHfo4%user@localhost>
Dec 07 09:52:51 huysnpmvm9 postfix/qmgr[2469]: D97CF316428D: from=<user>, size=22215, nrcpt=1 (queue active)
Dec 07 09:52:51 huysnpmvm9 postfix/local[14801]: D97CF316428D: to=<root>, orig_to=<root@localhost>, relay=loca
Dec 07 09:52:51 huysnpmvm9 postfix/qmgr[2469]: D97CF316428D: removed
Dec 07 09:53:51 huysnpmvm9 corosync[14736]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 07 09:53:51 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 07 09:53:51 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@localhost ~]#


Version-Release number of selected component (if applicable):
[root@localhost ~]# rpm -q pacemaker corosync pcs
pacemaker-1.1.12-22.el7_1.4.x86_64
corosync-2.3.4-4.el7_1.3.x86_64
pcs-0.9.137-13.el7_1.4.x86_64



[root@localhost ~]# cat /etc/system-release
CentOS Linux release 7.1.1503 (Core)
[root@localhost ~]# cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)
[root@localhost ~]#

How reproducible:

Steps to Reproduce:
1. Install CentOS 7.1 on two servers.
2. Follow the instructions in
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/pdf/Clusters_from_Scratch/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf
3. In Chapter 4, you'll be asked to run 'pcs cluster start --all'. This is when the problem occurred for me.

Actual results:
See above

Expected results:
The cluster should have started fine.


Additional info:

Comment 1 Huy VU 2015-12-07 20:49:37 UTC
Found a work-around here:

http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs

Look for Freddy's comment.
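
(For reference, the workaround amounts to adding each node's hostname to /etc/hosts on both machines, as comment 5 below confirms. A minimal sketch, using the node names and addresses that appear later in this report:

10.35.29.221   huysnpmvm9 huysnpmvm9.mitel.com
10.35.29.222   huysnpmvm10 huysnpmvm10.mitel.com

With both entries present on both nodes, name resolution no longer depends on what the DNS server returns.)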

Comment 2 Radek Steiger 2015-12-08 11:45:58 UTC
Comment on attachment 1103265 [details]
See events around Dec 7 09:52:50

I've set the attachment to private as it contains some private communication instead of logs.

Comment 3 Radek Steiger 2015-12-08 13:01:24 UTC
Huy Vu,

were you able to solve your problem by adding the hostnames to /etc/hosts on both machines? If yes, it would indicate that either your local domain name resolution does not work properly or the host names used in the cluster are not known to the DNS service.
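
(One quick way to see what the resolver actually returns for a node name, and in which order, is getent, which exercises roughly the same getaddrinfo lookup path the daemon uses; the hostname below is one of the node names from this report:

# getent ahosts huysnpmvm9

If IPv6 addresses are listed first, the resolver prefers IPv6, which turned out to be relevant to the root cause identified later in this bug.)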

Comment 4 Huy VU 2015-12-08 13:59:08 UTC
Created attachment 1103613 [details]
Provided logs

Comment 5 Huy VU 2015-12-08 14:03:58 UTC
Hello Radek,
Thanks for spotting the wrong logs. I have attached the correct logs.

I was able to resolve the problem by adding both host names to my /etc/hosts. What was confusing to me was that the DNS server resolved the names of both nodes correctly; I was able to ping nodes by names from each of the nodes before adding the names to the hosts file.

Comment 6 Huy VU 2015-12-08 16:01:51 UTC
(In reply to Huy VU from comment #5)
> Hello Radek,
> Thanks for spotting the wrong logs. I have attached the correct logs.
> 
> I was able to resolve the problem by adding both host names to my
> /etc/hosts. What was confusing to me was that the DNS server resolved the
> names of both nodes correctly; I was able to ping nodes by names from each
> of the nodes before adding the names to the hosts file.

The DNS server had records for node1 and node2 and was able to look up their IPs.

Comment 7 Radek Steiger 2015-12-08 16:12:12 UTC
What happens if you remove the /etc/hosts records and start corosync manually on all nodes, i.e.:

# systemctl start corosync

Comment 8 Huy VU 2015-12-08 16:29:06 UTC
Removed the entry for the node itself from /etc/hosts on each node.

On Node 1:

[root@huysnpmvm9 ~]# systemctl start corosync
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
You have new mail in /var/spool/mail/root
[root@huysnpmvm9 ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
   Active: failed (Result: exit-code) since Tue 2015-12-08 11:25:43 EST; 19s ago
  Process: 8050 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Dec 08 11:24:42 huysnpmvm9 systemd[1]: Starting Corosync Cluster Engine...
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Dec 08 11:24:42 huysnpmvm9 corosync[8056]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 08 11:24:42 huysnpmvm9 corosync[8057]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@huysnpmvm9 ~]# journalctl -xn
-- Logs begin at Mon 2015-12-07 15:05:02 EST, end at Tue 2015-12-08 11:25:43 EST. --
Dec 08 11:24:43 huysnpmvm9 postfix/cleanup[8112]: DC32D3402FAA: message-id=<5667044b.XeUrThbdjPfDse1m%user@localhost>
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: from=<user>, size=22215, nrcpt=1 (queue active)
Dec 08 11:24:43 huysnpmvm9 postfix/local[8114]: DC32D3402FAA: to=<root>, orig_to=<root@localhost>, relay=local
Dec 08 11:24:43 huysnpmvm9 postfix/qmgr[1972]: DC32D3402FAA: removed
Dec 08 11:24:57 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.39.
Dec 08 11:24:59 huysnpmvm9 avahi-daemon[598]: Invalid response packet from host 10.35.29.40.
Dec 08 11:25:43 huysnpmvm9 corosync[8050]: Starting Corosync Cluster Engine (corosync): [FAILED]
Dec 08 11:25:43 huysnpmvm9 systemd[1]: corosync.service: control process exited, code=exited status=1
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 08 11:25:43 huysnpmvm9 systemd[1]: Unit corosync.service entered failed state.
[root@huysnpmvm9 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.222   huysnpmvm10 huysnpmvm10.mitel.com
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm9 ~]# ping huysnpmvm9





On Node 2:



[root@huysnpmvm10 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
10.35.29.221   huysnpmvm9 huysnpmvm9.mitel.com
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@huysnpmvm10 ~]# ping huysnpmvm10
PING huysnpmvm10 (10.35.29.222) 56(84) bytes of data.
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=1 ttl=64 time=0.000 ms
64 bytes from huysnpmvm10 (10.35.29.222): icmp_seq=2 ttl=64 time=0.060 ms
^C
--- huysnpmvm10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.000/0.030/0.060/0.030 ms
[root@huysnpmvm10 ~]# ping huysnpmvm9
PING huysnpmvm9 (10.35.29.221) 56(84) bytes of data.
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=1 ttl=64 time=1.07 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=2 ttl=64 time=0.476 ms
64 bytes from huysnpmvm9 (10.35.29.221): icmp_seq=3 ttl=64 time=0.298 ms
^C
--- huysnpmvm9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.298/0.615/1.073/0.332 ms

Comment 10 Jan Friesse 2015-12-08 17:00:35 UTC
Huy VU,
can you please provide your corosync.conf (redacting any confidential information you find)? Also, abrt was triggered, so can you please provide the abrt result (coredump, ...)?
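
(For context, a corosync.conf generated by pcs of this vintage typically looks like the sketch below: udpu transport, node names in the nodelist, and no interface section; the cluster name is illustrative:

totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: huysnpmvm9
        nodeid: 1
    }

    node {
        ring0_addr: huysnpmvm10
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_syslog: yes
}
)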

Comment 11 Jan Friesse 2015-12-08 17:03:06 UTC
Huy VU,
also interesting information may come from running corosync with debug set to trace (just set debug: trace in the logging section, so it looks like:

logging {
...
debug: trace
...
}

) and then executing corosync in foreground mode by running "corosync -f".
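
(Putting the two suggestions together, a minimal debugging session might look like this; /etc/corosync/corosync.conf is the stock config path for these packages:

# vi /etc/corosync/corosync.conf    <- add "debug: trace" to the logging section
# corosync -f

Depending on the logging settings, the trace output goes to the terminal, to syslog, or to the configured logfile, and should show where startup stalls during the roughly 60 seconds before the [FAILED] message in the logs above.)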

Comment 12 Jan Friesse 2016-01-25 16:21:45 UTC
Huy VU,
can you please try to provide the information I asked for? Corosync is evidently crashing, and that's not good. Sadly, I am not able to reproduce this bug.

Comment 21 Jan Friesse 2016-04-06 14:05:43 UTC
Created attachment 1144236 [details]
Proposed patch

totemconfig: Explicitly pass IP version

If the resolver was set to prefer IPv6 (almost always the case) and the
interface section was not defined (true of almost all config files
created by pcs), the IP version was taken from mcast_addr.family.
Because mcast_addr.family was unset (reset to zero), an IPv6 address
was returned, causing a failure in totemsrp. The solution is to pass
the correct IP version stored in totem_config->ip_version.

The patch also simplifies get_cluster_mcast_addr, which was using a mix
of the explicitly passed IP version and the bindnet IP version.

Also, the return value of get_cluster_mcast_addr is now properly
checked.
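
(A simplified sketch of the idea behind the fix; the names mirror corosync's totemconfig code but are abbreviated, so treat it as an illustration rather than the actual patch:

/* Before: the address family came from mcast_addr, which is never
 * filled in when the config has no interface section, so family == 0
 * and the resolver's own preference (IPv6) won. */
totemip_parse(&node_addr, node_name, mcast_addr.family);

/* After: the configured IP version is passed explicitly, so a
 * configuration meant for IPv4 really resolves to an IPv4 address. */
totemip_parse(&node_addr, node_name, totem_config->ip_version);
)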

Comment 39 Steven J. Levine 2016-10-21 16:50:03 UTC
Reworking the doc text for inclusion in the 7.3 release notes.

Comment 41 errata-xmlrpc 2016-11-04 06:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2463.html