Description of problem:

Major upgrade of the overcloud from OSP11 to OSP12 fails with:

deploy_status_code : Deployment exited with non-zero status code: 2

Stack chrisj UPDATE_FAILED

chrisj.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 2d11a076-fd0f-4fb9-97d2-9129ea3a82a5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    "\u001b[1;33mWarning: This method is deprecated, please use match expressions with Stdlib::Compat::Ipv6 instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at [\"/etc/puppet/modules/tripleo/manifests/pacemaker/haproxy_with_vip.pp\", 62]:",
    "\u001b[1;33mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\u001b[0m"
    ],
    "failed_when_result": true
    }
    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/53896bf9-3c19-48e8-81cc-e67f4d0e5a49_playbook.retry

    PLAY RECAP *********************************************************************
    localhost : ok=7 changed=2 unreachable=0 failed=1

One of the controllers is reporting:

Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: error: Failed to receive meta-data for ocf:heartbeat:galera
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: error: No metadata for ocf::heartbeat:galera
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: notice: Result of start operation for galera on galera-bundle-0: 6 (not configured)
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: notice: galera-bundle-0-galera_start_0:18969 [ ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>.\n ]
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: warning: Action 78 (galera_start_0) on galera-bundle-0 failed (target: 0 vs. rc: 6): Error
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: notice: Transition aborted by operation galera_start_0 'modify' on chrisj-controller-0: Event failed
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: warning: Action 78 (galera_start_0) on galera-bundle-0 failed (target: 0 vs. rc: 6): Error
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: notice: Transition 895 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-940.bz2): Complete
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Recover galera:0 (Slave galera-bundle-0)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Start galera:1 (galera-bundle-1)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Start galera:2 (galera-bundle-2)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Calculated transition 896, saving inputs in /var/lib/pacemaker/pengine/pe-input-941.bz2
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Processing failed op start for galera:0 on galera-bundle-0: not configured (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: error: Preventing galera-bundle-master from re-starting anywhere: operation start failed 'not configured' (6)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: warning: Forcing galera-bundle-master away from galera-bundle-0 after 1000000 failures (max=1000000)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Recover galera:0 (Slave galera-bundle-0)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Start galera:1 (galera-bundle-1)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Start galera:2 (galera-bundle-2)
Jan 09 23:23:09 chrisj-controller-0 pengine[268582]: notice: Calculated transition 897, saving inputs in /var/lib/pacemaker/pengine/pe-input-942.bz2
Jan 09 23:23:09 chrisj-controller-0 crmd[268583]: notice: Initiating stop operation galera_stop_0 locally on galera-bundle-0

Version-Release number of selected component (if applicable):
osp12

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OSP11.
2. Upgrade the undercloud to OSP12.
3. Upgrade the overcloud to OSP12 (see the command sketch at the end of this description).

Actual results:
The overcloud upgrade fails.

Expected results:
The overcloud upgrade completes successfully.

Additional info:
This bug looks somewhat similar: https://bugs.launchpad.net/tripleo/+bug/1721497

sosreports:
http://chrisj.cloud/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619.tar.xz
http://chrisj.cloud/sosreport-chrisj-undercloud-gallera-issue-upgrade-20180109200210.tar.xz
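For reference, the overcloud upgrade in step 3 is the original deploy command re-run with the major-upgrade environment file added. A rough sketch only; the template list and the other -e files are environment-specific and are placeholders here:

  openstack overcloud deploy --templates \
    <existing -e environment files> \
    -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml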
I've hit this one as well. I noticed that in OSP11 we used a different galera resource name than we do in OSP12. In OSP11, galera.cnf uses the following:

wsrep_cluster_address = gcomm://dshackfestosp12-controller-0,dshackfestosp12-controller-1,dshackfestosp12-controller-2

Looking in the mariadb containers, it appears to be using the following:

wsrep_cluster_address = gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com

My thought is that this change is now causing a mismatch in the resource name when trying to start the cluster. If this is the actual problem, we might be able to use 'cluster_host_map' to map the resource name to the host name; see the sketch below.
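The galera resource agent's cluster_host_map parameter takes semicolon-separated pairs mapping each pacemaker node name to the name galera uses in wsrep_cluster_address. A hypothetical value built from the names above, for illustration only:

  # format: "<pacemaker node name>:<name used in wsrep_cluster_address>;..."
  cluster_host_map="dshackfestosp12-controller-0:dshackfestosp12-controller-0.internalapi.redhat.com;dshackfestosp12-controller-1:dshackfestosp12-controller-1.internalapi.redhat.com;dshackfestosp12-controller-2:dshackfestosp12-controller-2.internalapi.redhat.com"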
I've seen a similar issue, and it was caused by the pacemaker and resource-agents packages not being up to date. Checking sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619, I can see the pacemaker-1.1.16-12.el7_4.2.x86_64 package installed, while on a system which was successfully upgraded I have pacemaker-1.1.16-12.el7_4.5.x86_64. Could you please check that the overcloud nodes have the rhel-ha-for-rhel-7-server-rpms repo enabled, which provides pacemaker-1.1.16-12.el7_4.5? Thanks, Marius
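For reference, enabling that repository on a subscribed RHEL 7 node and pulling in the newer packages would look roughly like this (standard subscription-manager/yum usage, not a transcript captured from this environment):

  subscription-manager repos --enable=rhel-ha-for-rhel-7-server-rpms
  yum update pacemaker resource-agents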
Marius,

It looks like all 3 failed environments were at pacemaker-1.1.16-12.el7_4.5.x86_64.

BTW, we were able to perform a heat stack-delete and then a fresh openstack deploy without the /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml environment file, and the fresh deployment worked fine. This indicates it's an upgrade-related issue and not a configuration issue.
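In other words, the working path was roughly the following (a sketch; 'chrisj' is the stack name from the description, and the -e list is whatever the original deployment used, minus the major-upgrade file):

  heat stack-delete chrisj
  openstack overcloud deploy --templates \
    <existing -e environment files, omitting major-upgrade-composable-steps-docker.yaml>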
(In reply to Chris Janiszewski from comment #3)
> Marius,
>
> It looks like all 3 failed environments were at
> pacemaker-1.1.16-12.el7_4.5.x86_64.
>
> BTW, we were able to perform a heat stack-delete and then a fresh openstack
> deploy without
> /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-
> composable-steps-docker.yaml, and the fresh deployment worked fine. This
> indicates it's an upgrade-related issue and not a configuration issue.

In the case of a fresh deployment the packages come pre-installed inside the overcloud image, so it is different from an upgrade, where the packages get updated during the upgrade step.

Would it be possible to attach the controller node sosreports from an environment which had pacemaker-1.1.16-12.el7_4.5.x86_64 at upgrade failure time? Checking the attached sosreport, I can see pacemaker was at version 1.1.16-12.el7_4.2:

~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep ^pacemaker-1 installed-rpms
pacemaker-1.1.16-12.el7_4.2.x86_64    Mon Oct 2 19:41:00 2017

~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep 'Starting Pacemaker' var/log/cluster/corosync.log
Jan 09 16:30:52 [2727] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Jan 09 20:38:33 [2565] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Jan 09 21:54:19 [268576] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls

Also, checking the /etc/yum.repos.d/osp12.repo file in the attached sosreport, I couldn't find the rhel-ha-for-rhel-7-server-rpms repository which would provide a pacemaker package update.
I've hit the same exact issue as Chris. I am going to generate a sosreport from one of my controllers and attach it.

Just so we don't get any red herrings here: after I hit this issue, the environment subsequently lost connectivity to the external network, but that's related to some underlying infrastructure issue and not to the deployment. The environment was still able to reach the external gateway at the time I hit the issue.

[root@dshackfestosp12-controller-2 ~]# grep -B9 'enabled = 1' /etc/yum.repos.d/redhat.repo
[rhel-7-rc-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/rc/rhel/server/7/x86_64/os
sslverify = 1
name = Red Hat Enterprise Linux 7 Server Release Candidate (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-htb-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/htb/rhel/server/7/$basearch/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server HTB (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-tus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/tus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - TUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-aus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/aus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - AUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-nfv-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/nfv/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time for NFV (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/rt/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-beta-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/beta/rhel/server/7/$basearch/rt/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time Beta (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1

[root@dshackfestosp12-controller-2 ~]# rpm -qa | grep ^pacemaker-
pacemaker-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-cli-1.1.16-12.el7_4.5.x86_64
pacemaker-1.1.16-12.el7_4.5.x86_64
pacemaker-remote-1.1.16-12.el7_4.5.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.5.x86_64
[root@dshackfestosp12-controller-2 ~]#
sosreport uploaded here: http://chrisj.cloud/sosreport-dshackfestosp12-controller-2.BZ1533204-20180112104244.tar.xz
So, starting with OSP12, wsrep_cluster_address uses the FQDNs of the nodes the galera servers run on. As noted in comment #1, for that to work we use the new cluster_host_map configuration option when we set up the galera resource in pacemaker.

The start of the galera resource did not succeed because cluster_host_map has been populated with host names which contain upper case, and that doesn't match the wsrep_cluster_address configuration. From the sosreports:

<nvpair id="galera-instance_attributes-cluster_host_map" name="cluster_host_map" value="DShackfestOSP12-controller-0:DShackfestOSP12-controller-0.internalapi.redhat.com;DShackfestOSP12-controller-1:DShackfestOSP12-controller-1.internalapi.redhat.com;DShackfestOSP12-controller-2:DShackfestOSP12-
<nvpair id="galera-instance_attributes-wsrep_cluster_address" name="wsrep_cluster_address" value="gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com"/>
</instance_attributes>

I'll look at the puppet code to see where we take the values from when generating those two lines. Meanwhile, using lowercase would unblock you (see the sketch below).
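As a concrete illustration of that interim workaround, rewriting the map with lowercase names so it matches wsrep_cluster_address could look like this (a hypothetical sketch using standard pcs syntax; the resource name 'galera' is taken from the nvpair ids above):

  pcs resource update galera \
    cluster_host_map="dshackfestosp12-controller-0:dshackfestosp12-controller-0.internalapi.redhat.com;dshackfestosp12-controller-1:dshackfestosp12-controller-1.internalapi.redhat.com;dshackfestosp12-controller-2:dshackfestosp12-controller-2.internalapi.redhat.com"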
Verified with puppet-tripleo-7.4.12-6.el7ost.noarch and an overcloud named QECLOUD.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2331