Bug 1602071 - [RFE] Provide a graceful recovery if procedure fails part-way through
Summary: [RFE] Provide a graceful recovery if procedure fails part-way through
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: RFEs
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jesse Pretorius
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-17 18:38 UTC by Valli Annamalai
Modified: 2023-08-07 08:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-21 13:46:44 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-2867 | 0 | None | None | None | 2022-01-05 05:34:28 UTC
Red Hat Issue Tracker | UPG-22 | 0 | None | None | None | 2021-09-28 15:31:45 UTC

Description Valli Annamalai 2018-07-17 18:38:51 UTC
Description of problem:

OSP10 was deployed with 3 controllers and 2 computes.
The undercloud was upgraded from OSP10 to OSP13.
Fast-forward prepare was run with all the templates included.
However, I skipped the ffwd-upgrade run command and went straight to the controller upgrade.

As a result, during the controller upgrade_steps, the task "Install docker packages on upgrade if missing" failed:

 u'TASK [Install docker packages on upgrade if missing] ***************************',
 u'Tuesday 17 July 2018  11:47:43 -0400 (0:00:00.101)       0:20:22.448 ********** ',
 u'fatal: [192.168.24.7]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'fatal: [192.168.24.15]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'fatal: [192.168.24.12]: FAILED! => {"changed": false, "msg": "There are no enabled repos.\\n Run \\"yum repolist all\\" to see the repos you have.\\n To enable Red Hat Subscription Management repositories:\\n     subscription-manager repos --enable <repo>\\n To enable custom repositories:\\n     yum-config-manager --enable <repo>\\n", "rc": 1, "results": []}',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.12              : ok=354  changed=226  unreachable=0    failed=1   ',
 u'192.168.24.15              : ok=354  changed=226  unreachable=0    failed=1   ',
 u'192.168.24.7               : ok=354  changed=226  unreachable=0    failed=1   ',
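
The recap above can be checked mechanically. The following is only a sketch: it assumes the recap is available as plain text (without the u'...' wrappers shown in this log), and the helper name failed_hosts is made up for illustration and is not part of Ansible or TripleO.

```python
import re

# Matches one host line of an Ansible PLAY RECAP, for example:
# "192.168.24.12 : ok=354 changed=226 unreachable=0 failed=1"
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def failed_hosts(recap_text):
    """Return the hosts whose recap line reports failed > 0 or unreachable > 0."""
    bad = []
    for line in recap_text.splitlines():
        match = RECAP_LINE.match(line)
        if not match:
            continue  # skip the "PLAY RECAP ****" banner and blank lines
        counters = {}
        for pair in match.group("counters").split():
            key, value = pair.split("=")
            counters[key] = int(value)
        if counters.get("failed", 0) > 0 or counters.get("unreachable", 0) > 0:
            bad.append(match.group("host"))
    return bad


recap = """\
PLAY RECAP *********************************************************************
192.168.24.12              : ok=354  changed=226  unreachable=0    failed=1
192.168.24.15              : ok=354  changed=226  unreachable=0    failed=1
192.168.24.7               : ok=354  changed=226  unreachable=0    failed=1
"""
print(failed_hosts(recap))  # ['192.168.24.12', '192.168.24.15', '192.168.24.7']
```

Here all three controllers are reported, which matches the three "fatal" lines in the task output.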


When I then ran the ffwd-upgrade run command, it failed with this error:
An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-3f978f6a-a1df-4d5d-a636-26e7d1b26bad)

The keystone log shows:
 [root@lorenzo stack]# tail /var/log/keystone/keystone.log
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1152, in _request_authentication
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     auth_packet = self._read_packet()
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1014, in _read_packet
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     packet.check_error()
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     err.raise_mysql_exception(self._data)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi     raise errorclass(errno, errval)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi DBNonExistentDatabase: (pymysql.err.InternalError) (1049, u"Unknown database 'keystone'") (Background on this error at: http://sqlalche.me/e/2j85)
2018-07-17 13:17:26.316 48958 ERROR keystone.common.wsgi 


Because the upgrade_steps playbook failed part-way through, after it had already disabled all the services, the openstack CLI commands no longer worked.

There should be a way to recover from this other than the hard way of redeploying OSP 10 from scratch. The playbook could revert all the changes it has made when it fails part-way through, or the controller upgrade could begin with a validation step that checks whether the ffwd-upgrade run command completed successfully.
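
To illustrate the validation idea, here is a minimal sketch of an ordering check. Everything in it is an assumption made for illustration: the stage names are hypothetical labels for the commands in this report, and the set of completed stages would have to come from some record that the tooling keeps; none of these names exist in TripleO.

```python
# Ordered stages of the fast-forward upgrade as described in this report.
# These names are hypothetical labels, not identifiers used by TripleO.
STAGES = [
    "undercloud_upgrade",
    "ffwd_upgrade_prepare",
    "ffwd_upgrade_run",
    "controller_upgrade",
]


class MissingPrerequisite(Exception):
    """Raised when a stage is started before an earlier stage has completed."""


def check_prerequisites(stage, completed):
    """Raise MissingPrerequisite if any stage ordered before `stage` is not in `completed`."""
    earlier = STAGES[:STAGES.index(stage)]
    missing = [name for name in earlier if name not in completed]
    if missing:
        raise MissingPrerequisite(
            "cannot start %s: not completed: %s" % (stage, ", ".join(missing))
        )


# The situation from this report: prepare was run, ffwd-upgrade run was skipped.
completed = {"undercloud_upgrade", "ffwd_upgrade_prepare"}
try:
    check_prerequisites("controller_upgrade", completed)
except MissingPrerequisite as exc:
    print(exc)  # cannot start controller_upgrade: not completed: ffwd_upgrade_run
```

With a check like this at the start of the controller upgrade, the command would stop before disabling any services instead of failing at the docker package task.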


Version-Release number of selected component (if applicable):


How reproducible:
Reproducible whenever the ffwd-upgrade run command is skipped and the controller upgrade is started.


Steps to Reproduce:
1. Deploy OSP10
2. Upgrade undercloud from 10 to 13
3. openstack overcloud ffwd-upgrade prepare
4. openstack overcloud upgrade run --roles Controller
5. Step 4 fails at the task "Install docker packages on upgrade if missing"
6. openstack overcloud ffwd-upgrade run --yes
7. Step 6 fails with a keystone error (HTTP 500)

Actual results:
When the upgrade steps fail on the controllers, it's impossible to recover the cloud.

Expected results:
When the upgrade steps fail, the changes should be reverted so the cloud is not disturbed. Alternatively, a validation step should be added to make sure all previous commands completed successfully.

Additional info:

Comment 8 spower 2022-05-11 10:19:51 UTC
This RFE is not marked as an MVP for 17.0, so it is being moved for consideration to OSP 17.1. As stated in the OSP Program Call, QE and Docs only have the capacity to verify and document MVP features for OSP 17.0.

Comment 9 Lukas Bezdicka 2022-06-21 13:46:44 UTC
I think we pretty much addressed this in OSP13->OSP16, where, if an issue happens, the usual procedure is to run the proper step unless a change to THT is needed. In that case, one edits the templates, reruns prepare, and continues with the same step they were at.
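
The procedure described in this comment can be sketched as a small decision function. This is only an illustration of the logic: the function name and the action strings are made up here and do not correspond to any real CLI commands or flags.

```python
def recovery_plan(step, needs_template_change):
    """Return the ordered actions to take after `step` has failed.

    `step` is a free-form label for the step the operator was at;
    the action strings are descriptive only, not real commands.
    """
    plan = []
    if needs_template_change:
        # A THT change is needed: fix the templates and rerun prepare first.
        plan.append("edit templates")
        plan.append("rerun prepare")
    # In both cases, continue with the step that was in progress.
    plan.append("run step: " + step)
    return plan


print(recovery_plan("controller upgrade", needs_template_change=False))
print(recovery_plan("controller upgrade", needs_template_change=True))
```

The point of the sketch is that recovery continues from the current step rather than starting over from the beginning.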

