Bug 1533204
Summary: | Upgrade from OSP11->OSP12 fails - ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0> | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Chris Janiszewski <cjanisze>
Component: | puppet-tripleo | Assignee: | Damien Ciabrini <dciabrin>
Status: | CLOSED ERRATA | QA Contact: | Yurii Prokulevych <yprokule>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 12.0 (Pike) | CC: | aschultz, dciabrin, dsorrent, jjoyce, jschluet, mbayer, mbultel, mburns, mcornea, michele, slinaber, tvignaud
Target Milestone: | z3 | Keywords: | Triaged, ZStream
Target Release: | 12.0 (Pike) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | puppet-tripleo-7.4.12-5.el7ost | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-08-20 12:58:39 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Chris Janiszewski
2018-01-10 17:44:09 UTC
I've hit this one as well. I noticed that in OSP11 we used a different galera resource name than we do in OSP12. In OSP11, galera.cnf uses the following:

```
wsrep_cluster_address = gcomm://dshackfestosp12-controller-0,dshackfestosp12-controller-1,dshackfestosp12-controller-2
```

Looking in the mariadb containers, it appears to be using the following:

```
wsrep_cluster_address = gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com
```

My thought is that this change is now causing a mismatch with the resource name when trying to start the cluster. If this is the actual problem, we might be able to use 'cluster_host_map' to map the resource to the host name.

I've seen a similar issue, and it was caused by the pacemaker and resource-agents packages not being up to date. Checking sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619, I can see the pacemaker-1.1.16-12.el7_4.2.x86_64 package installed, while on a system which was successfully upgraded I have pacemaker-1.1.16-12.el7_4.5.x86_64. Could you please check that the overcloud nodes have the rhel-ha-for-rhel-7-server-rpms repo enabled, which provides pacemaker-1.1.16-12.el7_4.5?

Thanks,
Marius

Marius,

It looks like all 3 failed environments were at pacemaker-1.1.16-12.el7_4.5.x86_64.

BTW, we could perform a heat stack-delete and then a fresh openstack deploy without the /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml extension, and the fresh deployment worked OK. This indicates it's an upgrade-related issue and not a configuration issue.

(In reply to Chris Janiszewski from comment #3)
> It looks like all 3 failed environments were at
> pacemaker-1.1.16-12.el7_4.5.x86_64.
>
> BTW, we could perform heat stack-delete and then fresh openstack deploy
> without /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml
> extension, and fresh deployment worked ok.

In the case of a fresh deployment, the packages come pre-installed inside the overcloud image, so it is different from the upgrade, where the packages get updated during the upgrade step. Would it be possible to attach the controller nodes' sosreports from an environment which had pacemaker-1.1.16-12.el7_4.5.x86_64 at upgrade failure time?

Checking the attached sosreport, I could see pacemaker was at version 1.1.16-12.el7_4.2:

```
~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep ^pacemaker-1 installed-rpms
pacemaker-1.1.16-12.el7_4.2.x86_64    Mon Oct  2 19:41:00 2017
~/sosreport-chrisj-controller-0-galera-bundle-upgrade-issue-20180110010619>>> grep 'Starting Pacemaker' var/log/cluster/corosync.log
Jan 09 16:30:52 [2727] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Jan 09 20:38:33 [2565] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Jan 09 21:54:19 [268576] chrisj-controller-0 pacemakerd: notice: main: Starting Pacemaker 1.1.16-12.el7_4.2 | build=94ff4df features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
```

Also checking the /etc/yum.repos.d/osp12.repo file in the attached sosreport, I couldn't find the rhel-ha-for-rhel-7-server-rpms repository, which would provide a pacemaker package update.

I've hit the same exact issue as Chris.
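The version comparison above can be reproduced with a small shell check. This is only a sketch: the minimum known-good version 1.1.16-12.el7_4.5 is taken from the comments above, and `sort -V` is used as a portable stand-in for rpm's own version comparison; on a real node the installed value would come from `rpm -q pacemaker`.

```shell
# Compare an installed pacemaker version-release against the minimum
# known-good one from this bug. Values are hard-coded for illustration.
installed="1.1.16-12.el7_4.2"
required="1.1.16-12.el7_4.5"

# sort -V orders version strings; if the smallest of the two is "required",
# then installed >= required.
if [ "$(printf '%s\n' "$required" "$installed" | sort -V | head -n1)" = "$required" ]; then
  echo "pacemaker $installed is new enough"
else
  echo "pacemaker $installed is too old; enable rhel-ha-for-rhel-7-server-rpms and update"
fi
```

With the values shown, the check reports the package as too old, matching the failing environments in the sosreports.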
I am going to generate a sosreport from one of my controllers and attach it. Just so we don't get any "red herrings" here: after I hit the same issue on the environment, it subsequently lost connectivity to the external network, but that's related to some underlying infrastructure issue and not related to the deployment. The environment was able to hit the external gateway while I experienced the issue.

```
[root@dshackfestosp12-controller-2 ~]# grep -B9 'enabled = 1' /etc/yum.repos.d/redhat.repo
[rhel-7-rc-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/rc/rhel/server/7/x86_64/os
sslverify = 1
name = Red Hat Enterprise Linux 7 Server Release Candidate (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-htb-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/htb/rhel/server/7/$basearch/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server HTB (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-tus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/tus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - TUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-aus-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/aus/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server - AUS (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-nfv-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/nfv/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time for NFV (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/rt/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rt-beta-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/beta/rhel/server/7/$basearch/rt/os
ui_repoid_vars = basearch
sslverify = 1
name = Red Hat Enterprise Linux for Real Time Beta (RHEL 7 Server) (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-beta,file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1
--
[rhel-7-server-rpms]
metadata_expire = 86400
sslclientcert = /etc/pki/entitlement/3969176143091601766.pem
baseurl = https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/os
ui_repoid_vars = releasever basearch
sslverify = 1
name = Red Hat Enterprise Linux 7 Server (RPMs)
sslclientkey = /etc/pki/entitlement/3969176143091601766-key.pem
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
enabled = 1

[root@dshackfestosp12-controller-2 ~]# rpm -qa | grep ^pacemaker-
pacemaker-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-cli-1.1.16-12.el7_4.5.x86_64
pacemaker-1.1.16-12.el7_4.5.x86_64
pacemaker-remote-1.1.16-12.el7_4.5.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.5.x86_64
```

The sosreport is uploaded here: http://chrisj.cloud/sosreport-dshackfestosp12-controller-2.BZ1533204-20180112104244.tar.xz

So starting with OSP12, wsrep_cluster_address uses the FQDNs of the nodes the galera servers run on. As noted in comment #1, for that to work we use the new cluster_host_map configuration option when we set up the galera resource in pacemaker.

The start of the galera resource did not succeed because cluster_host_map has been populated with host names that contain upper case, and that doesn't match the wsrep_cluster_address configuration. From the sosreports:

```
<nvpair id="galera-instance_attributes-cluster_host_map" name="cluster_host_map" value="DShackfestOSP12-controller-0:DShackfestOSP12-controller-0.internalapi.redhat.com;DShackfestOSP12-controller-1:DShackfestOSP12-controller-1.internalapi.redhat.com;DShackfestOSP12-controller-2:DShackfestOSP12-
<nvpair id="galera-instance_attributes-wsrep_cluster_address" name="wsrep_cluster_address" value="gcomm://dshackfestosp12-controller-0.internalapi.redhat.com,dshackfestosp12-controller-1.internalapi.redhat.com,dshackfestosp12-controller-2.internalapi.redhat.com"/>
</instance_attributes>
```

I'll look at the puppet code to see where we take the values from to generate those two lines. Meanwhile, using lowercase would unblock you.

Verified with puppet-tripleo-7.4.12-6.el7ost.noarch and an overcloud named QECLOUD.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2331
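As a footnote to the root-cause analysis above, the case mismatch between cluster_host_map and wsrep_cluster_address can be illustrated with a small Python sketch. This is hypothetical code, not the actual galera resource agent (which is shell): it only mimics the lookup of a pacemaker node through the map and the check against the gcomm:// member list, showing why mixed-case entries trigger the "Could not determine galera name" failure while lowercased values (the suggested workaround) succeed.

```python
# Hypothetical sketch of the galera name lookup; names are illustrative.
def galera_address_for(node, cluster_host_map, wsrep_addresses):
    """Map a pacemaker node name to its galera address and validate it
    against the gcomm:// member list, mimicking the ocf-exit-reason failure."""
    addr = cluster_host_map.get(node)
    if addr is None or addr not in wsrep_addresses:
        raise LookupError(
            f"Could not determine galera name from pacemaker node <{node}>")
    return addr

# Values adapted from the sosreport excerpts above (one node shown).
host_map = {
    "DShackfestOSP12-controller-0":
        "DShackfestOSP12-controller-0.internalapi.redhat.com",
}
wsrep = {"dshackfestosp12-controller-0.internalapi.redhat.com"}

# Mixed-case map entry: the mapped address is not in the lowercase
# wsrep_cluster_address list, so the lookup fails.
try:
    galera_address_for("DShackfestOSP12-controller-0", host_map, wsrep)
except LookupError as exc:
    print(exc)

# Workaround: lowercase the map (keys and values) so both sides agree.
lower_map = {k.lower(): v.lower() for k, v in host_map.items()}
print(galera_address_for("dshackfestosp12-controller-0", lower_map, wsrep))
```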