Bug 1533204

Summary:	Upgrade from OSP11->OSP12 fails - ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>
Product:	Red Hat OpenStack	Reporter:	Chris Janiszewski <cjanisze>
Component:	puppet-tripleo	Assignee:	Damien Ciabrini <dciabrin>
Status:	CLOSED ERRATA	QA Contact:	Yurii Prokulevych <yprokule>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	12.0 (Pike)	CC:	aschultz, dciabrin, dsorrent, jjoyce, jschluet, mbayer, mbultel, mburns, mcornea, michele, slinaber, tvignaud
Target Milestone:	z3	Keywords:	Triaged, ZStream
Target Release:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	puppet-tripleo-7.4.12-5.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-08-20 12:58:39 UTC	Type:	Bug

Description Chris Janiszewski 2018-01-10 17:44:09 UTC

Description of problem:
Major upgrade of overcloud from OSP11-> OSP12 fails:

deploy_status_code : Deployment exited with non-zero status code: 2

 Stack chrisj UPDATE_FAILED

chrisj.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 2d11a076-fd0f-4fb9-97d2-9129ea3a82a5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "\u001b[1;33mWarning: This method is deprecated, please use match expressions with Stdlib::Compat::Ipv6 instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at [\"/etc/puppet/modules/tripleo/manifests/pacemaker/haproxy_with_vip.pp\", 62]:",
            "\u001b[1;33mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\u001b[0m"
        ],
        "failed_when_result": true
    }   
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/53896bf9-3c19-48e8-81cc-e67f4d0e5a49_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=7    changed=2    unreachable=0    failed=1

One of the controllers is reporting:

Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:    error: Failed to receive meta-data for ocf:heartbeat:galera
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:    error: No metadata for ocf::heartbeat:galera
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:   notice: Result of start operation for galera on galera-bundle-0: 6 (not configured)
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:   notice: galera-bundle-0-galera_start_0:18969 [ ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>.\n ]
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:  warning: Action 78 (galera_start_0) on galera-bundle-0 failed (target: 0 vs. rc: 6): Error
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:   notice: Transition aborted by operation galera_start_0 'modify' on chrisj-controller-0: Event failed
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:  warning: Action 78 (galera_start_0) on galera-bundle-0 failed (target: 0 vs. rc: 6): Error
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:   notice: Transition 895 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-940.bz2): Complete
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:    error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:    error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Recover galera:0        (Slave galera-bundle-0)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Start   galera:1        (galera-bundle-1)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Start   galera:2        (galera-bundle-2)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Calculated transition 896, saving inputs in /var/lib/pacemaker/pengine/pe-input-941.bz2
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:    error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:    error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:  warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Recover galera:0        (Slave galera-bundle-0)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Start   galera:1        (galera-bundle-1)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Start   galera:2        (galera-bundle-2)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]:   notice: Calculated transition 897, saving inputs in /var/lib/pacemaker/pengine/pe-input-942.bz2
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]:   notice: Initiating stop operation galera_stop_0 locally on galera-bundle-0

Version-Release number of selected component (if applicable):
osp12

How reproducible:
Every time

Steps to Reproduce:
1. deploy osp11
2. upgrade undercloud to osp12
3. upgrade overcloud to osp12

Actual results:
Fails

Expected results:


Additional info:
This bug looks somewhat similar - https://bugs.launchpad.net/tripleo/+bug/1721497

sosreports:
http://chrisj.cloud/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619.tar.xz
http://chrisj.cloud/sosreport-chrisj-undercloud-gallera-issue-upgrade-20180109200210.tar.xz

Comment 1 Darin Sorrentino 2018-01-11 18:03:06 UTC

I've hit this one as well.  I noticed that in OSP11, we used a different galera resource name than we do in OSP12.

In OSP11, for galera.cnf we use the following:

wsrep_cluster_address = gcomm://dshackfestosp12-controller-0,dshackfestosp12-controller-1,dshackfestosp12-controller-2

When looking in the mariadb containers, it looks like it is using the following:

wsrep_cluster_address = gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com

My thought is that this change is now causing a mismatch between the resource names when trying to start the cluster.

If this is the actual problem, we might be able to use 'cluster_host_map' to map the resource to the host name.

Comment 2 Marius Cornea 2018-01-11 19:35:47 UTC

I've seen a similar issue and it was caused by the pacemaker and resource-agents packages not being up to date. Checking the sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619 I can see the pacemaker-1.1.16-12.el7_4.2.x86_64 package installed, while on a system which got successfully upgraded I have pacemaker-1.1.16-12.el7_4.5.x86_64.

Could you please check that the overcloud nodes have the rhel-ha-for-rhel-7-server-rpms repo enabled which provides pacemaker-1.1.16-12.el7_4.5?

Thanks,
Marius

Comment 3 Chris Janiszewski 2018-01-11 22:55:35 UTC

Marius,

It looks like all 3 failed environments were at pacemaker-1.1.16-12.el7_4.5.x86_64.

BTW, we were able to perform a heat stack-delete and then a fresh openstack deploy without the /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml extension, and the fresh deployment worked OK. This indicates it's an upgrade-related issue and not a configuration issue.

Comment 4 Marius Cornea 2018-01-11 23:30:24 UTC

(In reply to Chris Janiszewski from comment #3)
> Marius,
> 
> It looks like all 3 failed environments were at
> pacemaker-1.1.16-12.el7_4.5.x86_64.
> 
> BTW. we could perform heat stack-delete and then fresh openstack deploy
> without
> /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-
> composable-steps-docker.yaml extension, and fresh deployment worked ok. This
> indicates it's upgrade related issue and not configuration issue.

In case of a fresh deployment the packages come pre-installed inside the overcloud image so it's different compared to the upgrade where the packages get updated during the upgrade step. 

Would it be possible to attach the controller nodes sosreports from an environment which got pacemaker-1.1.16-12.el7_4.5.x86_64 at upgrade failure time? Checking the attached sosreport I could see pacemaker was at version 1.1.16-12.el7_4.2:

~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep ^pacemaker-1 installed-rpms 
pacemaker-1.1.16-12.el7_4.2.x86_64                          Mon Oct  2 19:41:00 2017

~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep 'Starting Pacemaker' var/log/cluster/corosync.log 
Jan 09 16:30:52 [2727] chrisj-controller-0 pacemakerd:   notice: main:	Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios  corosync-native atomic-attrd acls
Jan 09 20:38:33 [2565] chrisj-controller-0 pacemakerd:   notice: main:	Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios  corosync-native atomic-attrd acls
Jan 09 21:54:19 [268576] chrisj-controller-0 pacemakerd:   notice: main:	Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios  corosync-native atomic-attrd acls


Also checking the /etc/yum.repos.d/osp12.repo file in the attached sosreport I couldn't find the rhel-ha-for-rhel-7-server-rpms repository which would provide a pacemaker package update.
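A minimal Python sketch of the two version checks above, assuming the sosreport's installed-rpms file and corosync.log use the line formats quoted in this comment. The function names are hypothetical, and the regular expressions only cover the x86_64/noarch package lines and the 'Starting Pacemaker' log lines shown here:

```python
import re

# Matches e.g. "pacemaker-1.1.16-12.el7_4.2.x86_64   Mon Oct  2 19:41:00 2017"
# but not "pacemaker-libs-..." (a digit must follow "pacemaker-").
INSTALLED_RE = re.compile(r"^pacemaker-(\d\S*?)\.(?:x86_64|noarch)(?:\s|$)")

# Matches the version that pacemakerd logs at startup.
STARTED_RE = re.compile(r"Starting Pacemaker (\S+)")


def installed_version(installed_rpms_lines):
    """Return the pacemaker version-release found in installed-rpms, or None."""
    for line in installed_rpms_lines:
        match = INSTALLED_RE.match(line)
        if match:
            return match.group(1)
    return None


def started_versions(log_lines):
    """Return the distinct pacemaker versions logged at startup, in order."""
    seen = []
    for line in log_lines:
        match = STARTED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample lines copied (and shortened) from the sosreport output above.
INSTALLED_SAMPLE = [
    "pacemaker-1.1.16-12.el7_4.2.x86_64                          Mon Oct  2 19:41:00 2017",
]
LOG_SAMPLE = [
    "Jan 09 16:30:52 [2727] chrisj-controller-0 pacemakerd:   notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df",
    "Jan 09 21:54:19 [268576] chrisj-controller-0 pacemakerd:   notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df",
]

if __name__ == "__main__":
    print("installed:", installed_version(INSTALLED_SAMPLE))
    print("started:  ", started_versions(LOG_SAMPLE))
```

Comparing both values against the version seen on a successfully upgraded system (1.1.16-12.el7_4.5) shows whether the package update was applied and whether pacemaker was restarted on it.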

Comment 5 Darin Sorrentino 2018-01-12 16:53:11 UTC

I've hit the same exact issue as Chris. I am going to generate a sosreport from one of my controllers and attach it.  Just so we don't get any "red-herrings" here, after I hit the same issue on the environment, it subsequently lost connectivity to the external network but that's related to some underlying infrastructure issue and not related to the deployment.  The environment was able to hit the external gateway while I experienced the issue.


[root@dshackfestosp12-controller-2 ~]# grep -B9 'enabled = 1' /etc/yum.repos.d/redhat.repo 

[rhel-7-rc-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/rc/rhel/server/7/x86_64/os
sslverify = 1
name = Red Hat Enterprise Linux 7 Server Release Candidate (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-htb-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/htb/rhel/server/7/$basearch/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server HTB (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-tus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/tus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - TUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-aus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/aus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - AUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-nfv-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/nfv/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time for NFV (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/rt/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-beta-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/beta/rhel/server/7/$basearch/rt/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time Beta (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1

[root@dshackfestosp12-controller-2 ~]# rpm -qa | grep ^pacemaker-
pacemaker-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-cli-1.1.16-12.el7_4.5.x86_64
pacemaker-1.1.16-12.el7_4.5.x86_64
pacemaker-remote-1.1.16-12.el7_4.5.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.5.x86_64
[root@dshackfestosp12-controller-2 ~]#
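For anyone repeating the repo check from comment 2 across several nodes, here is a rough Python sketch that lists the enabled repositories from a .repo file. It assumes the file is plain INI as in the grep output above; enabled_repos and has_repo are hypothetical helper names, and treating a missing "enabled" key as enabled is an assumption of this sketch:

```python
import configparser


def enabled_repos(repo_text):
    """Return the ids of all repo sections with enabled = 1."""
    # interpolation=None keeps characters such as '%' in URLs literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(repo_text)
    return [
        section
        for section in parser.sections()
        # Assumption: a section without an "enabled" key counts as enabled.
        if parser.get(section, "enabled", fallback="1").strip() == "1"
    ]


def has_repo(repo_text, repo_id):
    """Return True if repo_id is present and enabled in repo_text."""
    return repo_id in enabled_repos(repo_text)


# First section copied from the output above; the second one is a
# made-up disabled entry, included only to exercise the filter.
SAMPLE = """\
[rhel-7-server-rpms]
metadata_expire = 86400
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os
sslverify = 1
name = Red Hat Enterprise Linux 7 Server (RPMs)
enabled = 1

[example-disabled-repo]
name = Hypothetical disabled repository
enabled = 0
"""

if __name__ == "__main__":
    print(enabled_repos(SAMPLE))
    print(has_repo(SAMPLE, "rhel-ha-for-rhel-7-server-rpms"))
```

On a real node the text would come from reading /etc/yum.repos.d/redhat.repo, and the id to look for is the rhel-ha-for-rhel-7-server-rpms repo mentioned in comment 2.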

Comment 6 Darin Sorrentino 2018-01-12 16:59:22 UTC

sosreport uploaded here: http://chrisj.cloud/sosreport-dshackfestosp12-controller-2.BZ1533204-20180112104244.tar.xz

Comment 7 Damien Ciabrini 2018-01-16 09:23:50 UTC

So starting with OSP12, the wsrep_cluster_address uses the FQDN of the nodes the galera servers run on. As noted in comment #1, for that to work, we use the new cluster_host_map configuration option when we set up the galera resource in pacemaker.

The start of the galera resource did not succeed because the cluster_host_map has been populated with host names that contain uppercase characters, and that doesn't match the wsrep_cluster_address configuration:

From the sosreports:

            <nvpair id="galera-instance_attributes-cluster_host_map" name="cluster_host_map" value="DShackfestOSP12-controller-0:DShackfestOSP12-controller-0.internalapi.redhat.com;DShackfestOSP12-controller-1:DShackfestOSP12-controller-1.internalapi.redhat.com;DShackfestOSP12-controller-2:DShackfestOSP12-

            <nvpair id="galera-instance_attributes-wsrep_cluster_address" name="wsrep_cluster_address" value="gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com"/>
          </instance_attributes>

I'll look at the puppet code to see where we take the values from to generate those two lines. Meanwhile, using lowercase would unblock you.
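To make the mismatch concrete, here is a small Python sketch that compares the two attribute values case-sensitively. It is not the resource agent's code: the helper names are hypothetical, and it assumes the cluster_host_map format "pcmk_node:galera_name;..." seen in the nvpair above. Only the first two map entries are used because the third is truncated in the paste:

```python
def parse_host_map(value):
    """Parse 'pcmk_node:galera_name;...' into a dict, skipping malformed items."""
    result = {}
    for item in value.split(";"):
        if ":" not in item:
            continue
        pcmk_node, galera_name = item.split(":", 1)
        result[pcmk_node.strip()] = galera_name.strip()
    return result


def parse_cluster_address(value):
    """Split 'gcomm://a,b,c' into the list of galera node names."""
    prefix = "gcomm://"
    if not value.startswith(prefix):
        raise ValueError("expected a gcomm:// address, got %r" % value)
    return [host for host in value[len(prefix):].split(",") if host]


def unmatched_galera_names(host_map_value, cluster_address_value):
    """Return mapped galera names that are absent from wsrep_cluster_address.

    The comparison is deliberately case-sensitive, which is the behaviour
    this bug report points at.
    """
    members = set(parse_cluster_address(cluster_address_value))
    names = parse_host_map(host_map_value).values()
    return sorted(name for name in names if name not in members)


# Values taken from the sosreport excerpt above (first two map entries only).
HOST_MAP = (
    "DShackfestOSP12-controller-0:DShackfestOSP12-controller-0.internalapi.redhat.com;"
    "DShackfestOSP12-controller-1:DShackfestOSP12-controller-1.internalapi.redhat.com"
)
CLUSTER_ADDRESS = (
    "gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,"
    "dshackfestosp12-controller-1.internalapi.redhat.com,"
    "dshackfestosp12-controller-2.internalapi.redhat.com"
)

if __name__ == "__main__":
    print(unmatched_galera_names(HOST_MAP, CLUSTER_ADDRESS))
    print(unmatched_galera_names(HOST_MAP.lower(), CLUSTER_ADDRESS))
```

With the mixed-case map, both mapped names come back as unmatched; lowercasing the map makes the list empty. The sketch only compares the galera names, so the pacemaker node names on the left-hand side of each pair still need to match what the cluster reports.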

Comment 11 Yurii Prokulevych 2018-08-03 13:17:57 UTC

Verified with puppet-tripleo-7.4.12-6.el7ost.noarch and an overcloud (oc) named QECLOUD.

Comment 13 errata-xmlrpc 2018-08-20 12:58:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2331