Bug 1434437

Summary: [UPDATES] Reconnect to interrupted updates
Product: Red Hat OpenStack
Reporter: Yurii Prokulevych <yprokule>
Component: python-tripleoclient
Assignee: Adriano Petrich <apetrich>
Status: CLOSED WONTFIX
QA Contact: Arik Chernetsky <achernet>
Severity: high
Docs Contact:
Priority: high
Version: 11.0 (Ocata)
CC: beth.white, brad, hbrock, jcoufal, jjoyce, jpichon, jschluet, jslagle, lbezdick, mburns, rbrady, rhel-osp-director-maint, sathlang, slinaber, tvignaud
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-29 15:37:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yurii Prokulevych 2017-03-21 14:01:05 UTC
Description of problem:
-----------------------
If a minor update of RHOS-11 is cancelled, there is no easy way to resume it other than restarting heat-engine.


Version-Release number of selected component (if applicable):
--------------------------------------------------------------
puppet-mistral-10.3.0-0.20170213130808.a6e8ebc.el7ost.noarch
openstack-mistral-api-4.0.0-3.el7ost.noarch
python-mistralclient-3.0.0-0.20170208192040.a7bf138.el7ost.noarch
openstack-mistral-engine-4.0.0-3.el7ost.noarch
openstack-mistral-executor-4.0.0-3.el7ost.noarch
python-openstack-mistral-4.0.0-3.el7ost.noarch
openstack-mistral-common-4.0.0-3.el7ost.noarch

python-heatclient-1.8.0-0.20170208192329.17dd306.el7ost.noarch
openstack-heat-engine-8.0.0-4.el7ost.noarch
puppet-heat-10.3.0-0.20170210225040.920f4f9.el7ost.noarch
openstack-heat-common-8.0.0-4.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-0.20170307170102.3134785.0rc2.el7ost.noarch
python-heat-agent-1.0.0-0.20170224185834.8e6dbb1.el7ost.noarch
openstack-heat-api-8.0.0-4.el7ost.noarch
openstack-heat-api-cfn-8.0.0-4.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch

Steps to Reproduce:
1. Deploy RHOS-11 (2017-03-14.2)
2. Set up the latest repos on the undercloud (uc) and overcloud (oc) (2017-03-15.2)
3. Update undercloud
4. Start interactive overcloud update
    openstack overcloud update stack -i overcloud
5. After updating at least one node, hit CTRL-C
6. Try to re-run update
    ERROR: Stack overcloud already has an action (UPDATE) in progress.

Actual results:
---------------
Update cannot be re-run

Expected results:
-----------------
There is a way to reconnect to an existing update


Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph

Comment 2 Brad P. Crochet 2017-03-21 19:17:52 UTC
One option is to abort the update:

openstack overcloud update abort overcloud

This does not block. You will need to monitor stack status until it returns to 'ROLLBACK_COMPLETE'.
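
The monitoring mentioned above could be scripted roughly as follows. This is only a sketch: the helper names is_settled and wait_for_stack, the stack name default, and the 30-second interval are illustrative assumptions, and it relies on `openstack stack show` exposing a `stack_status` column to the `-f value -c` formatter options.

```shell
# Sketch only: is_settled and wait_for_stack are hypothetical helper names.

# Succeeds when a Heat stack status is no longer *_IN_PROGRESS.
is_settled() {
    case "$1" in
        *_IN_PROGRESS) return 1 ;;
        *) return 0 ;;
    esac
}

# Poll the stack status until it settles.
# $1 = stack name (default: overcloud), $2 = poll interval in seconds (default: 30).
wait_for_stack() {
    wfs_stack="${1:-overcloud}"
    wfs_interval="${2:-30}"
    while :; do
        # Assumes the stack_status column is available from 'stack show'.
        wfs_status=$(openstack stack show "$wfs_stack" -f value -c stack_status) || return 1
        echo "$wfs_stack: $wfs_status"
        if is_settled "$wfs_status"; then
            return 0
        fi
        sleep "$wfs_interval"
    done
}
```

Calling `wait_for_stack overcloud` would then print the status on each poll and return once the stack reaches any state that is not in progress, which includes failure states, so the final status still needs to be checked.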

I am still investigating continuing the update instead of aborting.

Comment 3 Brad P. Crochet 2017-03-22 13:06:02 UTC
Steps to continue an update:

1. openstack stack resource list -n 5 -f yaml --filter name=UpdateDeployment overcloud

This command will give you the resource that the hook is on. You will get output similar to this:

- physical_resource_id: 50da754b-2e09-45ff-8836-6a8b097337b6
  resource_name: UpdateDeployment
  resource_status: UPDATE_COMPLETE
  resource_type: OS::Heat::SoftwareDeployment
  stack_name: overcloud-Controller-cnu5246du7ej-0-qcu5iunuiqmm
  updated_time: '2017-03-22T12:53:27Z'
- physical_resource_id: 24718cee-63fb-4620-955a-20fca5678316
  resource_name: UpdateDeployment
  resource_status: UPDATE_COMPLETE
  resource_type: OS::Heat::SoftwareDeployment
  stack_name: overcloud-Compute-mwb5lla6twbn-0-tuysaeiyisoi
  updated_time: '2017-03-22T12:58:04Z'

2. openstack stack event list --resource UpdateDeployment overcloud-Controller-cnu5246du7ej-0-qcu5iunuiqmm
openstack stack event list --resource UpdateDeployment overcloud-Compute-mwb5lla6twbn-0-tuysaeiyisoi

The last few lines of the output from these commands (you will run this for every UpdateDeployment resource) will look similar to this:

2017-03-22 12:32:31Z [UpdateDeployment]: UPDATE_COMPLETE  UPDATE paused until Hook pre-update is cleared

If you see that, the breakpoint has been reached and needs to be cleared.

3. openstack stack hook clear overcloud-Controller-cnu5246du7ej-0-qcu5iunuiqmm UpdateDeployment

This, as you can guess, will clear the breakpoint and allow the update to continue. If you run the event list again, you will see:

2017-03-22 12:53:27Z [UpdateDeployment]: UPDATE_COMPLETE  Hook pre-update is cleared
2017-03-22 12:53:27Z [UpdateDeployment]: UPDATE_IN_PROGRESS  state changed
2017-03-22 12:54:21Z [UpdateDeployment]: SIGNAL_IN_PROGRESS  Signal: deployment 50da754b-2e09-45ff-8836-6a8b097337b6 succeeded
2017-03-22 12:54:22Z [UpdateDeployment]: UPDATE_COMPLETE  state changed

4. openstack stack list

Monitor the stack for completion.
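
The steps above might be combined into a small helper along these lines. Treat it as a sketch under assumptions, not a tested procedure: the function names paused_on_hook and clear_paused_hooks are hypothetical, and it assumes that `stack resource list` accepts `-f value -c stack_name` to print only the nested stack names (the YAML output in step 1 shows a stack_name field). The hook message it looks for is the one quoted in step 2.

```shell
# Sketch only: paused_on_hook and clear_paused_hooks are hypothetical helper names.

# Reads 'stack event list' output on stdin; succeeds if the last non-blank
# line says the update is paused on the pre-update hook.
paused_on_hook() {
    grep -v '^[[:space:]]*$' | tail -n 1 |
        grep -q 'paused until Hook pre-update is cleared'
}

# For each nested stack that owns an UpdateDeployment resource, clear the
# hook only if that resource is currently paused on it.
# $1 = top-level stack name (default: overcloud).
clear_paused_hooks() {
    cph_stack="${1:-overcloud}"
    # Assumes the stack_name column can be selected with -c (step 1 shows it in YAML).
    openstack stack resource list -n 5 -f value -c stack_name \
        --filter name=UpdateDeployment "$cph_stack" |
    while read -r cph_nested; do
        if openstack stack event list --resource UpdateDeployment \
               "$cph_nested" < /dev/null | paused_on_hook; then
            echo "Clearing hook on $cph_nested"
            openstack stack hook clear "$cph_nested" UpdateDeployment < /dev/null
        fi
    done
}
```

Note that this clears every paused hook it finds in one pass, whereas the interactive update pauses so an operator can proceed node by node. Clearing one nested stack at a time, as in step 3, keeps that control.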

Comment 4 Brad P. Crochet 2017-03-23 12:38:25 UTC
Based on comments by zaneb, abort should not be used. In fact, he advocates for complete removal of that command, and I agree.

Comment 5 Julie Pichon 2017-03-29 17:05:13 UTC
Assigning back after confirming with Brad. I hadn't noticed all your debugging work around this - thank you!

Comment 7 Red Hat Bugzilla Rules Engine 2017-08-16 12:15:00 UTC
This bugzilla has been removed from the release since it has not been Triaged, and needs to be reviewed for targeting another release.