Bug 1297850

Summary: Corosync fails to start in an ipv6 deployment
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: yeylon <yeylon>
Severity: high
Priority: urgent
Version: 7.0 (Kilo)
CC: dmacpher, emacchi, gdubreui, jslagle, mburns, michele, rhel-osp-director-maint, sathlang, srevivo, yeylon
Target Milestone: y3
Keywords: ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-99.el7ost
Doc Type: Bug Fix
Doc Text:
Corosync failed to start in an IPv6-based Overcloud because the '--ipv6' option was missing when the director set up the Corosync cluster. This fix adds the option to the Controller's Puppet manifest and adds the related parameters to the Heat template collection. Corosync now starts successfully in IPv6-based Overclouds.
Last Closed: 2016-02-18 16:48:58 UTC
Type: Bug

Description Marius Cornea 2016-01-12 15:45:58 UTC
Description of problem:
Corosync fails to start in an IPv6 deployment with the following error:
[MAIN  ] parse error in config: No interfaces defined
[MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1278.


Version-Release number of selected component (if applicable):
I'm doing the test following the instructions in:
https://etherpad.openstack.org/p/tripleo-ipv6-support
and enabling pacemaker by passing an additional $THT/environments/puppet-pacemaker.yaml environment file

How reproducible:
100%

Steps to Reproduce:
1. Deploy an IPv6-enabled overcloud with Pacemaker

Actual results:
Deployment eventually fails.

Expected results:
Deployment completes successfully.

Additional info:
This is the corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: tripleo_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: overcloud-controller-0
        nodeid: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

overcloud-controller-0 resolves to an IPv6 address:
[root@overcloud-controller-0 ~]# ping6 -n -c1 overcloud-controller-0
PING overcloud-controller-0(fd00:fd00:fd00:2000:f816:3eff:fe45:bec3) 56 data bytes
64 bytes from fd00:fd00:fd00:2000:f816:3eff:fe45:bec3: icmp_seq=1 ttl=64 time=0.032 ms

[root@overcloud-controller-0 ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2016-01-12 14:46:01 UTC; 57min ago
  Process: 1004 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 1255 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)
 Main PID: 814 (code=exited, status=0/SUCCESS)

Jan 12 14:46:01 overcloud-controller-0 systemd[1]: Starting Corosync Cluster Engine...
Jan 12 14:46:01 overcloud-controller-0 corosync[1261]:  [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Jan 12 14:46:01 overcloud-controller-0 corosync[1261]:  [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Jan 12 14:46:01 overcloud-controller-0 corosync[1261]:  [MAIN  ] parse error in config: No interfaces defined
Jan 12 14:46:01 overcloud-controller-0 corosync[1261]:  [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
Jan 12 14:46:01 overcloud-controller-0 corosync[1255]: Starting Corosync Cluster Engine (corosync): [FAILED]
Jan 12 14:46:01 overcloud-controller-0 systemd[1]: corosync.service: control process exited, code=exited status=1
Jan 12 14:46:01 overcloud-controller-0 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 12 14:46:01 overcloud-controller-0 systemd[1]: Unit corosync.service entered failed state.
Jan 12 14:46:01 overcloud-controller-0 systemd[1]: corosync.service failed.

Comment 1 Emilien Macchi 2016-01-12 17:13:42 UTC
I think TripleO Heat Templates is missing some useful options to enable Corosync on the overcloud:
https://github.com/redhat-openstack/puppet-pacemaker/blob/master/manifests/corosync.pp#L25-L27

Looking at THT now, cluster_setup_extras appears to be empty, which could be why Corosync is configured for IPv4 by default.

Comment 2 Gilles Dubreuil 2016-01-13 23:53:18 UTC
Indeed, when using IPv6, cluster_setup_extras must include the --ipv6 option.

For instance:

-------------
class { 'pacemaker::corosync':
  cluster_name         => 'cluster_test',
  cluster_members      => 'one.pcs.tst two.pcs.tst three.pcs.tst',
  # '--ipv6' is a bare flag, so its value is an empty string.
  cluster_setup_extras => { '--ipv6' => '' },
}
--------------

With the above, the cluster starts properly.
The option must be added to the deployment parameters.
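
A minimal sketch of how the option might be wired into the Controller manifest so that it is only set for IPv6 deployments. It assumes a boolean Hiera key named corosync_ipv6 (a hypothetical name used here for illustration) that the Heat templates would populate, and it uses str2bool from puppetlabs-stdlib. The cluster name and member list are reused from the example above. This is not necessarily the change that shipped in the fix.

-------------
# Assumption: 'corosync_ipv6' is an illustrative Hiera key; it defaults to
# false so that IPv4 deployments are unaffected.
$corosync_ipv6 = str2bool(hiera('corosync_ipv6', false))

if $corosync_ipv6 {
  # Bare flag with no value, same form as in the example above.
  $cluster_setup_extras = { '--ipv6' => '' }
} else {
  $cluster_setup_extras = {}
}

class { 'pacemaker::corosync':
  cluster_name         => 'cluster_test',
  cluster_members      => 'one.pcs.tst two.pcs.tst three.pcs.tst',
  cluster_setup_extras => $cluster_setup_extras,
}
-------------

Keeping the flag behind a parameter rather than hard-coding it means the same manifest can serve both IPv4 and IPv6 Overclouds.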

Comment 4 Marius Cornea 2016-01-19 10:52:53 UTC
openstack-tripleo-heat-templates-0.8.6-106.el7ost.noarch

[root@overcloud-controller-0 ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2016-01-19 05:30:34 EST; 22min ago
 Main PID: 25781 (corosync)
   CGroup: /system.slice/corosync.service
           └─25781 corosync

Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [QB    ] server name: votequorum
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [QB    ] server name: quorum
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [TOTEM ] adding new UDPU member {fd00:fd00:fd00:2000:f816:3eff:feeb:3100}
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [TOTEM ] A new membership (fd00:fd00:fd00:2000:f816:3eff:feeb:3100:4) was formed. Members joined: 1
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [QUORUM] Members[1]: 1
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25781]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 19 05:30:34 overcloud-controller-0.localdomain corosync[25774]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Jan 19 05:30:34 overcloud-controller-0.localdomain systemd[1]: Started Corosync Cluster Engine.

Comment 8 errata-xmlrpc 2016-02-18 16:48:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0264.html