Description of problem:
During step 3.4.6 of the official documentation for the RHOS 8 to RHOS 9 overcloud upgrade, the heat-engine service is not stopped together with the other services during the cluster restart procedure and only stops once the cluster is starting back up. This causes the upgrade to fail with a time-out.

Version-Release number of selected component (if applicable):
openstack-heat-engine-6.0.0-11.el7ost.noarch
collect-config-0.1.37-6.el7ost.noarch
systemd-219-19.el7_2.13.x86_64

How reproducible:
Only reproducible in the customer's environment while upgrading to OSP 9.

Steps to Reproduce:
openstack overcloud deploy --templates \
  -e ~/templates/environments/network-isolation.yaml \
  -e ~/templates/environments/network-environment.yaml \
  -e ~/templates/environments/network-management.yaml \
  -e ~/ceilometer.yaml \
  -e ~/templates/environments/storage-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  --compute-scale 2 --control-scale 3 --ceph-storage-scale 1 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --libvirt-type kvm --ntp-server xx.xx.xx.xx --timeout 120

Actual results:
The upgrade stalls in 'UPDATE_IN_PROGRESS' for a prolonged period and eventually fails with a time-out.

Expected results:
Successful upgrade with the major-upgrade-pacemaker.yaml template.
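For reference, a quick way to see whether heat-engine is still running on the controllers while the cluster is being stopped (a minimal sketch; the controller host names and the heat-admin user are assumptions based on a default TripleO deployment):

  # Run from the undercloud; adjust node names/addresses to match the environment
  for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
      echo "== $node =="
      ssh heat-admin@$node "sudo systemctl is-active openstack-heat-engine"
      ssh heat-admin@$node "sudo pcs status | grep -i heat"
  done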
OK, so the last error seems pretty clear: the Galera cluster node is not synced (HTTP/1.1 503 Service Unavailable). There is an issue with Galera in the overcloud, not related to Heat AFAICT.
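For completeness, the 503 points at checking the Galera sync state directly on a controller (a sketch; assumes the clustercheck script shipped with the overcloud images and local MySQL root access):

  # What the HAProxy health check sees (returns 503 when the node is not synced)
  sudo clustercheck
  # Query the wsrep state directly
  sudo mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
  sudo mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"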
Yes, I have noticed this too now. It seems the earlier issue of the stack stalling in UPDATE_IN_PROGRESS is happening again, with no failed status reported any more. Looks similar to this: https://bugzilla.redhat.com/show_bug.cgi?id=1240394 - any pointers?
We lack sosreports from _all_ controllers to determine the state of the galera cluster at the time of the log reported in #c15. I'm pretty sure it is not similar to https://bugzilla.redhat.com/show_bug.cgi?id=1240394 though, since that bug only mentions old behaviours which have been fixed in recent versions of resource-agents. Navneet, I need sosreports from all controllers because one of the three will contain the journalctl logs from pacemaker's DC. All other logs from the sosreports are needed so that I can trace the progression of the galera bootstrap process across the controller nodes. Could you link them to the bz?
The reason for this failure is that the upgrade code always assumed that the mariadb-* packages are upgraded together with mariadb-galera-server (which is the owner of /var/lib/mysql). In this case, at the time of the upgrade, only the mariadb packages were upgraded, which caused the absence of /var/lib/mysql on the non-bootstrap controller nodes. That is why galera failed to start (from crm_mon.txt):

Failed Actions:
* galera_start_0 on overcloud-controller-2 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist', last-rc-change='Mon Nov 7 16:46:27 2016', queued=0ms, exec=73ms
* galera_start_0 on overcloud-controller-1 'not installed' (5): call=220, status=complete, exitreason='Datadir /var/lib/mysql doesn't exist', last-rc-change='Mon Nov 7 16:46:26 2016', queued=0ms, exec=75ms

We need to backport a fix to cater for this situation. Navneet, do you need help bringing galera up again, or can we assume you created the /var/lib/mysql directories on the non-bootstrap nodes, assigned the right permissions and restarted the resource?
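For anyone hitting the same thing, a quick way to confirm this state on a controller (a sketch; package and path names as discussed above):

  # Which mariadb packages are installed, and at what versions?
  rpm -qa | grep -i mariadb
  # Does the datadir exist, and who owns it on disk?
  ls -ld /var/lib/mysql
  # Which package owns the datadir (expected to be mariadb-galera-server)?
  rpm -qf /var/lib/mysql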
I can try this workaround as suggested:
1. Manually create the /var/lib/mysql directories on the non-bootstrap nodes: controller-1 and controller-2 (per the failed actions above).
2. Chown the directories to the mysql user (commands sketched below).
3. Bring the galera-master resource up on the cluster:
   $ sudo pcs resource restart galera-master   # if the resource is stopped on the non-bootstrap controllers
   $ sudo pcs resource enable galera-master    # if the resource is stopped on all controllers
4. Check "pcs status" and clean up if required.
5. Re-run step 3.4.6.
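Steps 1 and 2 as concrete commands (a minimal sketch; the mysql:mysql ownership and 0755 mode are assumptions based on the default MariaDB datadir, so verify them against the bootstrap node before applying):

  # Run on each non-bootstrap controller that is missing the datadir
  sudo mkdir -p /var/lib/mysql
  sudo chown mysql:mysql /var/lib/mysql
  sudo chmod 0755 /var/lib/mysql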
Also, please "restorecon" the created directory in step2 for SELinux Note that "pcs resource cleanup galera" might be better than "pcs resource restart" as it won't restart galera if it's already started on a node, thus preventing service outage.
Fix backported to Mitaka upstream
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2983.html