Description of problem:
During step 3.4.6 of the official documentation for the RHOS 8 to RHOS 9 overcloud upgrade, the heat-engine service is not stopped together with the other services during the cluster restart procedure and only stops once the cluster is starting back up. This causes the upgrade to fail with a time-out.

Version-Release number of selected component (if applicable):
openstack-heat-engine-6.0.0-11.el7ost.noarch
collect-config-0.1.37-6.el7ost.noarch
systemd-219-19.el7_2.13.x86_64

How reproducible:
Only reproducible in the customer's environment while upgrading to OSP 9.

Steps to Reproduce:
openstack overcloud deploy --templates \
  -e ~/templates/environments/network-isolation.yaml \
  -e ~/templates/environments/network-environment.yaml \
  -e ~/templates/environments/network-management.yaml \
  -e ~/ceilometer.yaml \
  -e ~/templates/environments/storage-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  --compute-scale 2 --control-scale 3 --ceph-storage-scale 1 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --libvirt-type kvm --ntp-server xx.xx.xx.xx --timeout 120

Actual results:
The upgrade stalls in 'UPDATE_IN_PROGRESS' for a prolonged period and eventually fails with a time-out.

Expected results:
Successful upgrade with the major-upgrade-pacemaker.yaml template.
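For reference, a quick way to see whether heat-engine is still running on the controllers while the cluster is being stopped (a minimal sketch; the controller host names and the heat-admin user are assumptions based on a default TripleO deployment):

  # Run from the undercloud; adjust node names/addresses to match the environment
  for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
      echo "== $node =="
      ssh heat-admin@$node "sudo systemctl is-active openstack-heat-engine"
      ssh heat-admin@$node "sudo pcs status | grep -i heat"
  done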
OK, so the last error seems pretty clear: the Galera cluster node is not synced (HTTP/1.1 503 Service Unavailable). There is an issue with Galera in the overcloud, not related to Heat AFAICT.
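For completeness, the 503 points at checking the Galera sync state directly on a controller (a sketch; assumes the clustercheck script shipped with the overcloud images and local MySQL root access):

  # What the HAProxy health check sees (returns 503 when the node is not synced)
  sudo clustercheck
  # Query the wsrep state directly
  sudo mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
  sudo mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"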
Yes, I have noticed this too now. It seems the earlier issue of the stack stalling in UPDATE_IN_PROGRESS is happening again, with no failed status reported any more. Looks similar to this: https://bugzilla.redhat.com/show_bug.cgi?id=1240394 - any pointers?
We lack sosreports from _all_ controllers to determine the state of the galera cluster at the time of the log reported in #c15. I'm pretty sure it is not similar to https://bugzilla.redhat.com/show_bug.cgi?id=1240394 though, since that bug only mentions old behaviours which have been fixed in recent versions of resource-agents. Navneet, I need sosreports from all controllers because one of the three will contain the journalctl logs from pacemaker's DC. All other logs from the sosreports are needed so that I can trace the progression of the galera bootstrap process across the controller nodes. Could you link them to the bz?
The reason for this failure is that the upgrade code always assumed that the mariadb-* packages are upgraded together with mariadb-galera-server (which is the owner of /var/lib/mysql). In this case, at the time of the upgrade, only the mariadb packages were upgraded, which caused the absence of /var/lib/mysql on the non-bootstrap controller nodes. That is why galera failed to start (from crm_mon.txt):

Failed Actions:
* galera_start_0 on overcloud-controller-2 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist', last-rc-change='Mon Nov 7 16:46:27 2016', queued=0ms, exec=73ms
* galera_start_0 on overcloud-controller-1 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist', last-rc-change='Mon Nov 7 16:46:26 2016', queued=0ms, exec=75ms

We need to backport a fix to cater for this situation. Navneet, do you need help bringing galera up again, or can we assume you created the /var/lib/mysql directories on the non-bootstrap nodes, assigned the right permissions and restarted the resource?
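For anyone hitting the same thing, a quick way to confirm this state on a controller (a sketch; package and path names as discussed above):

  # Which mariadb packages are installed, and at what versions?
  rpm -qa | grep -i mariadb
  # Does the datadir exist, and who owns it on disk?
  ls -ld /var/lib/mysql
  # Which package owns the datadir (expected to be mariadb-galera-server)?
  rpm -qf /var/lib/mysql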
I can try this workaround as suggested:
1. Manually create the /var/lib/mysql directories on the non-bootstrap nodes: controller-1 and controller-2 (per the failed actions above).
2. Chown the directories to the mysql user (commands sketched below).
3. Bring the galera-master resource up on the cluster:
   $ sudo pcs resource restart galera-master   # if the resource is stopped on the non-bootstrap controllers
   $ sudo pcs resource enable galera-master    # if the resource is stopped on all controllers
4. Check "pcs status" and clean up if required.
5. Re-run step 3.4.6.
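Steps 1 and 2 as concrete commands (a minimal sketch; the mysql:mysql ownership and 0755 mode are assumptions based on the default MariaDB datadir, so verify them against the bootstrap node before applying):

  # Run on each non-bootstrap controller that is missing the datadir
  sudo mkdir -p /var/lib/mysql
  sudo chown mysql:mysql /var/lib/mysql
  sudo chmod 0755 /var/lib/mysql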
Also, please "restorecon" the created directory in step2 for SELinux Note that "pcs resource cleanup galera" might be better than "pcs resource restart" as it won't restart galera if it's already started on a node, thus preventing service outage.
Fix backported to Mitaka upstream
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2983.html