Bug 1602071

Summary: [RFE] Provide a graceful recovery if procedure fails part-way through
Product: Red Hat OpenStack
Reporter: Valli Annamalai <vannamal>
Component: RFEs
Assignee: Jesse Pretorius <jpretori>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: ccamacho, hbrock, jfrancoa, jpretori, jslagle, lbezdick, markmc, mburns, morazi, spower
Target Milestone: ---
Keywords: FutureFeature, Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2022-06-21 13:46:44 UTC
Type: Bug

Description Valli Annamalai 2018-07-17 18:38:51 UTC
Description of problem:

OSP10 was deployed with 3 controllers and 2 computes.
The undercloud was upgraded from OSP10 to OSP13.
The fast forward prepare step was run with all the templates included.
However, I skipped the ffwd-upgrade run command and went straight to the controller upgrade.

As a result, during the controller upgrade_steps, the task "Install docker packages on upgrade if missing" failed:

 u'TASK [Install docker packages on upgrade if missing] ***************************',
 u'Tuesday 17 July 2018  11:47:43 -0400 (0:00:00.101)       0:20:22.448 ********** ',
 u'fatal: [192.168.24.7]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'fatal: [192.168.24.15]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'fatal: [192.168.24.12]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.12              : ok=354  changed=226  unreachable=0    failed=1   ',
 u'192.168.24.15              : ok=354  changed=226  unreachable=0    failed=1   ',
 u'192.168.24.7               : ok=354  changed=226  unreachable=0    failed=1   ',
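
The recap above can be checked mechanically before moving on to the next command. The following is a minimal sketch, assuming the recap is available as plain text; the names `parse_recap` and `failed_hosts` are illustrative and are not part of Ansible or TripleO.

```python
import re

# Matches one host line of an Ansible PLAY RECAP, for example:
# "192.168.24.12 : ok=354 changed=226 unreachable=0 failed=1"
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Return {host: {counter_name: int}} for every host line found in text."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            pairs = (pair.split("=") for pair in match.group("counters").split())
            results[match.group("host")] = {name: int(value) for name, value in pairs}
    return results


def failed_hosts(text):
    """Return the sorted hosts that report a failed or unreachable task."""
    return sorted(
        host
        for host, counters in parse_recap(text).items()
        if counters.get("failed", 0) > 0 or counters.get("unreachable", 0) > 0
    )


# Recap lines copied from the log in this report.
RECAP = """\
PLAY RECAP *********************************************************************
192.168.24.12              : ok=354  changed=226  unreachable=0    failed=1
192.168.24.15              : ok=354  changed=226  unreachable=0    failed=1
192.168.24.7               : ok=354  changed=226  unreachable=0    failed=1
"""

for host in failed_hosts(RECAP):
    print("upgrade_steps failed on " + host)
```

A check like this only reports which hosts failed; it does not undo any change the playbook had already made.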


When I then ran the ffwd-upgrade run command, it failed with this error:
An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-3f978f6a-a1df-4d5d-a636-26e7d1b26bad)

The keystone log showed:
 [root@lorenzo stack]# tail /var/log/keystone/keystone.log
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1152, in _request_authentication
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     auth_packet = self._read_packet()
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1014, in _read_packet
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     packet.check_error()
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     err.raise_mysql_exception(self._data)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     raise errorclass(errno, errval)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi DBNonExistentDatabase: (pymysql.err.InternalError) (1049, u"Unknown database 'keystone'") (Background on this error at: http://sqlalche.me/e/2j85)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi 


Because the upgrade_steps playbook failed part-way through, after it had already disabled all the services, the openstack CLI commands no longer worked.

There should be a way to recover from this other than the hard way of starting over with OSP 10 from scratch. The playbook could revert all of its changes when it fails part-way through, or the controller upgrade could begin with a validation step that checks whether the ffwd-upgrade run command completed successfully.
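
The validation idea can be sketched as an ordering check. This is only an illustration under stated assumptions: `ORDERED_STEPS` and `missing_prerequisites` are hypothetical names, and how the set of completed steps would actually be recorded is left open. Nothing here is an existing TripleO or openstack CLI feature.

```python
# Hypothetical sketch: ORDERED_STEPS and missing_prerequisites are
# illustrative names, not part of TripleO or the openstack CLI.
ORDERED_STEPS = [
    "openstack overcloud ffwd-upgrade prepare",
    "openstack overcloud ffwd-upgrade run --yes",
    "openstack overcloud upgrade run --roles Controller",
]


def missing_prerequisites(step, completed):
    """Return the earlier steps, in order, that are not yet in `completed`."""
    index = ORDERED_STEPS.index(step)  # raises ValueError for an unknown step
    return [earlier for earlier in ORDERED_STEPS[:index] if earlier not in completed]


# The situation from this report: prepare was run, ffwd-upgrade run was skipped.
completed = {"openstack overcloud ffwd-upgrade prepare"}
blocked_by = missing_prerequisites(
    "openstack overcloud upgrade run --roles Controller", completed
)
if blocked_by:
    print("Refusing to start the controller upgrade; not yet completed:")
    for step in blocked_by:
        print("  " + step)
```

Running the check before any service is disabled would turn the failure described above into an early, harmless refusal.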


Version-Release number of selected component (if applicable):


How reproducible:
Can be reproduced whenever the ffwd-upgrade run command is skipped and the controller upgrade is started.


Steps to Reproduce:
1. Deploy OSP10
2. Upgrade undercloud from 10 to 13
3. openstack overcloud ffwd-upgrade prepare
4. openstack overcloud upgrade run --roles Controller
5. Step 4 fails at the task: Install docker packages on upgrade if missing
6. openstack overcloud ffwd-upgrade run --yes
7. Step 6 fails with a keystone error (HTTP 500)

Actual results:
When the upgrade steps on the controllers fail, it is impossible to recover the cloud.

Expected results:
When the upgrade steps fail, the changes should be reverted so that the cloud is not disturbed. Alternatively, a validation step should be added to make sure all previous commands completed successfully.

Additional info:

Comment 8 spower 2022-05-11 10:19:51 UTC
This RFE is not marked as an MVP for 17.0, so it is being moved for consideration to OSP 17.1. As stated in the OSP Program Call, QE and Docs only have the capacity to verify and document MVP features for OSP 17.0.

Comment 9 Lukas Bezdicka 2022-06-21 13:46:44 UTC
I think we pretty much addressed this in OSP13->OSP16: if an issue happens, the usual procedure is to run the proper step, unless a change to THT is needed. In that case, one edits the templates, reruns prepare, and continues with the same step they were at.