Bug 1389166

Summary: Galera fails to upgrade along with MySQL during overcloud upgrade from OSP 8 to 9.
Product: Red Hat OpenStack Reporter: Navneet Krishnan <nkrishna>
Component: openstack-tripleo-heat-templates    Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED ERRATA QA Contact: Arik Chernetsky <achernet>
Severity: high Docs Contact:
Priority: high    
Version: 8.0 (Liberty)    CC: dciabrin, fdinitto, jslagle, mburns, michele, nkrishna, rhel-osp-director-maint, sbaker, shardy, slinaber, srevivo, therve, ushkalim, vaggarwa, zbitter
Target Milestone: ---    Keywords: ZStream
Target Release: 9.0 (Mitaka)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-2.0.0-38.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-21 16:51:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 5 Navneet Krishnan 2016-10-27 06:25:08 UTC
Description of problem:

During step 3.4.6 of the official documentation for the RHOS 8 to RHOS 9
overcloud upgrade, the heat-engine service is not stopped along with the
other services during the cluster restart procedure, and only stops while
the cluster is starting back up. This causes the upgrade to fail with a
time-out.
 
Version-Release number of selected component (if applicable)


openstack-heat-engine-6.0.0-11.el7ost.noarch
collect-config-0.1.37-6.el7ost.noarch
systemd-219-19.el7_2.13.x86_64

How reproducible:

Only reproducible in the customer's environment while upgrading to OSP 9.


Steps to Reproduce: 

openstack overcloud deploy --templates -e
~/templates/environments/network-isolation.yaml -e
~/templates/environments/network-environment.yaml -e
~/templates/environments/network-management.yaml -e ~/ceilometer.yaml -e
~/templates/environments/storage-environment.yaml -e
/usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-
pacemaker.yaml --compute-scale 2 --control-scale 3 --ceph-storage-scale 1
--compute-flavor compute --control-flavor control --ceph-storage-flavor
ceph-storage --libvirt-type kvm --ntp-server xx.xx.xx.xx --timeout 120


Actual results:

Upgrade stalls in a prolonged 'UPDATE_IN_PROGRESS' state and eventually fails due to time-out.
 
Expected results: Successful upgrade with the major-upgrade-pacemaker.yaml template.

Comment 15 Thomas Hervé 2016-11-08 08:44:16 UTC
OK, so the last error seems pretty clear:

Galera cluster node is not synced.
HTTP/1.1 503 Service Unavailable

There is an issue with Galera in the overcloud. Not related to Heat AFAICT.

Comment 16 Navneet Krishnan 2016-11-08 08:59:50 UTC
Yes, I have noticed this too now. The earlier issue of the stack stalling in UPDATE_IN_PROGRESS persists, with no failed status reported any more.


Looks similar to this: https://bugzilla.redhat.com/show_bug.cgi?id=1240394

Any pointers?

Comment 18 Damien Ciabrini 2016-11-08 15:02:27 UTC
We lack sosreports from _all_ controllers to determine the state of the galera cluster at the time of the log reported in #c15.

I'm pretty sure it is not similar to https://bugzilla.redhat.com/show_bug.cgi?id=1240394 though, since that bug only mentions old behaviours which have been fixed in recent versions of resource-agents.

Navneet, I need sosreports from all controllers, because one of the three will contain the journalctl logs from pacemaker's DC. The other logs from the sosreports are needed so that I can trace the progression of the galera bootstrap process across the controller nodes.

Could you link them to the bz?

Comment 22 Damien Ciabrini 2016-11-14 10:32:52 UTC
The reason for this failure is that the upgrade code always assumed
that the mariadb-* packages are upgraded together with
mariadb-galera-server (the owner of /var/lib/mysql). In this case, at
the time of the upgrade, only the mariadb packages were upgraded,
which caused the absence of /var/lib/mysql on the non-bootstrap
controller nodes.

That is why galera failed to start (from crm_mon.txt):
Failed Actions:
* galera_start_0 on overcloud-controller-2 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist',
    last-rc-change='Mon Nov  7 16:46:27 2016', queued=0ms, exec=73ms
* galera_start_0 on overcloud-controller-1 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist',
    last-rc-change='Mon Nov  7 16:46:26 2016', queued=0ms, exec=75ms
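The root cause above can be confirmed on each controller with standard commands; a minimal sketch, where the `datadir_state` helper is purely illustrative and not part of any tool mentioned in this bug:

```shell
# Illustrative helper (not from this bug's tooling): report whether a
# galera datadir exists at the given path.
datadir_state() {
    [ -d "${1:-/var/lib/mysql}" ] && echo present || echo missing
}

# On an affected controller, the checks relevant to this bug would be:
#   rpm -qf /var/lib/mysql   -> mariadb-galera-server (the datadir owner)
#   datadir_state            -> "missing" on the non-bootstrap nodes
datadir_state /var/lib/mysql
```

Seeing "missing" alongside an upgraded mariadb but a non-upgraded mariadb-galera-server matches the failure mode described here.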

We need to backport a fix to cater for this situation.

Navneet, do you need help in bringing galera up again or can we assume
you created the /var/lib/mysql folders on the non-bootstrap nodes,
assigned the right permissions and restarted the resource?

Comment 23 Navneet Krishnan 2016-11-14 11:26:03 UTC
I can try this workaround as suggested:

1. Manually create the /var/lib/mysql folders on the non-bootstrap nodes: controller0 and controller1. 

2. Chown the folders to the mysql user.

3. Bring the galera-master resource up on the cluster.

$ sudo pcs resource restart galera-master  # if the resource is stopped on the non-bootstrap controllers

$ sudo pcs resource enable galera-master   # if the resource is stopped on all controllers


4. Run "pcs status" and clean up if required.


5. Restart from step 3.4.6.

Comment 24 Damien Ciabrini 2016-11-14 11:42:56 UTC
Also, please "restorecon" the created directory in step 2 for SELinux.

Note that "pcs resource cleanup galera" might be better than "pcs resource restart" as it won't restart galera if it's already started on a node, thus preventing service outage.
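Putting the workaround from comment 23 together with the SELinux and cleanup notes above, the per-node recovery can be sketched roughly as follows. This is a hedged sketch, not the exact commands run in the customer's environment; the `ensure_datadir` helper name is invented for illustration:

```shell
# Recreate a missing galera datadir with the expected ownership and
# SELinux context (steps 1-2 of comment 23 plus the restorecon note).
ensure_datadir() {
    dir=${1:?usage: ensure_datadir <path>}
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir"
        # Needs root and an existing mysql user on a real controller.
        chown mysql:mysql "$dir" 2>/dev/null || true
        # Restore the SELinux context if restorecon is available.
        if command -v restorecon >/dev/null 2>&1; then
            restorecon -R "$dir"
        fi
    fi
}

# On each non-bootstrap controller (as root):
#   ensure_datadir /var/lib/mysql
# then let pacemaker retry the start without bouncing healthy nodes:
#   pcs resource cleanup galera
```

Using "pcs resource cleanup galera" afterwards clears the failed start actions so pacemaker retries on its own, avoiding the outage a full restart would cause.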

Comment 26 Damien Ciabrini 2016-11-15 13:25:44 UTC
Fix backported to Mitaka upstream

Comment 30 errata-xmlrpc 2016-12-21 16:51:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2983.html