Bug 1389115

Summary: cancelling "openstack overcloud update stack" during a breakpoint behaves badly
Product: Red Hat OpenStack
Component: openstack-tripleo-common
Version: unspecified
Target Release: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: urgent
Priority: urgent
Keywords: Triaged, ZStream
Type: Bug
Reporter: Matt Flusche <mflusche>
Assignee: Adriano Petrich <apetrich>
QA Contact: Alexander Chuzhoy <sasha>
CC: aschultz, ccamacho, dbecker, emacchi, mbracho, mburns, morazi, rhel-osp-director-maint, sathlang, sbaker, shardy, slinaber, tvignaud
Last Closed: 2018-11-19 21:39:12 UTC

Description Matt Flusche 2016-10-26 21:42:41 UTC
Description of problem:

The "no" option during an "openstack overcloud update stack" breakpoint seems to be broken.

  Breakpoint reached, continue? Regexp or Enter=proceed (will clear 12300056-ffff-dddd-1111-12345678ffff), no=cancel update, C-c=quit interactive mode:

When "no" is selected, a stack rollback occurs, and the rollback causes all overcloud nodes to run yum updates in parallel (assuming patches are available). All controller nodes then run pcs cluster stop at about the same time, which can cause fencing if stonith is enabled. This is not the desired behavior.

Version-Release number of selected component (if applicable):
Current OSP 9 bits

How reproducible:
100% so far (once for a customer, once in a lab for me)

Steps to Reproduce:
1. Deploy OSP 9 via Director
2. Ensure nodes are registered or have update repos configured. 
3. Run the patching procedure
  openstack overcloud update stack overcloud -i \
  --templates -e [env file] -e [more env files] \
  ....

4. At the first breakpoint, cancel the update by answering "no":
  on_breakpoint: [u'mflusche-osd001', u'mflusche-osd000', u'mflusche-osd002', u'mflusche-compute001', u'mflusche-compute000']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear fafe8cc9-e4d4-46d9-8dc1-57b62cf73b58), no=cancel update, C-c=quit interactive mode: no
canceling update, doing rollback
canceling update

5. Log in to the overcloud nodes and observe the behavior.

  journalctl -u os-collect-config -f 
Oct 25 23:01:12 mflusche-control000.flusche.co os-collect-config[3848]: [2016-10-25 23:01:12,543] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/feffcf44-753b-4eaf-9cd0-7b9abd0272ff.json
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: 1:openssl-libs-1.0.1e-51.el7_2.7.x86_64
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: systemd-libs-219-19.el7_2.13.x86_64
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: 1:librados2-0.94.9-3.el7cp.x86_64
...

  tail -f /var/log/yum.log

  On the controllers, monitor the cluster state: pcs status
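
To confirm that the package updates really ran in parallel rather than node by node, the yum lines collected from each node's journal can be compared. The following is only a rough sketch, not part of any TripleO tooling: the function names `update_windows` and `overlapping_hosts` are hypothetical, the year is an assumption because syslog timestamps omit it, and month parsing assumes an English locale.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches syslog-style yum lines such as:
#   Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: 1:openssl-libs-...
YUM_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) yum\[\d+\]: "
    r"Updated: (?P<pkg>\S+)$"
)


def update_windows(lines, year=2016):
    """Return {host: (first_update, last_update)} for yum 'Updated:' lines.

    The year is an assumption: syslog timestamps do not carry one.
    """
    seen = defaultdict(list)
    for line in lines:
        match = YUM_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
            )
            seen[match["host"]].append(stamp)
    return {host: (min(stamps), max(stamps)) for host, stamps in seen.items()}


def overlapping_hosts(windows):
    """Return sorted host pairs whose update windows overlap in time."""
    hosts = sorted(windows)
    pairs = []
    for i, first in enumerate(hosts):
        for second in hosts[i + 1:]:
            (a_start, a_end) = windows[first]
            (b_start, b_end) = windows[second]
            # Two closed intervals overlap when each starts before the other ends.
            if a_start <= b_end and b_start <= a_end:
                pairs.append((first, second))
    return pairs
```

Passing the combined journal output of several nodes to `update_windows` and then to `overlapping_hosts` lists the node pairs that were updating at the same time. A serialized update should produce an empty list.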

Actual results:

The "no" option at a breakpoint seems to cause a parallel patch update on all overcloud nodes.

Expected results:

The update operation is cancelled without triggering package updates on the overcloud nodes.

Additional info:

Comment 4 Sofer Athlan-Guyot 2018-09-04 12:57:19 UTC
Hi,

This is still happening; see https://bugzilla.redhat.com/show_bug.cgi?id=1613063 for more information.

Comment 7 Zane Bitter 2018-10-22 16:35:05 UTC
I'm not sure why we ever allowed the user to cancel an update, because doing a rollback has never been safe in TripleO.

It wasn't until Queens (OSP13) that Heat offered a way for users to cancel a stack update without triggering a rollback: https://bugs.launchpad.net/heat/+bug/1709041

The code to cancel an update was removed from tripleo-common in Pike and backported to Ocata:

https://review.openstack.org/#/q/I752e061979d667c1fb2b115c1a7339002e1824d5

So OSP 10 and earlier are presumably still affected, which is what the testing discussed above appears to show.

(Ironically, it would be a useful thing to add back in now that we can cancel without triggering a rollback, as long as the cancel was done that way.)
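
The release-by-release picture described in this comment can be summarized as a small lookup. This is only an illustrative sketch: the function name `cancel_behaviour` and the dictionary keys are hypothetical, the release-to-OSP numbering is general product knowledge rather than something stated in this bug, and the Ocata entry assumes the backported removal is actually installed.

```python
# Upstream releases in chronological order, with their OSP version numbers.
RELEASES = ["mitaka", "newton", "ocata", "pike", "queens"]
OSP_VERSION = {"mitaka": 9, "newton": 10, "ocata": 11, "pike": 12, "queens": 13}


def cancel_behaviour(release):
    """Summarize how update cancellation behaves for a given release name."""
    name = release.lower()
    idx = RELEASES.index(name)  # raises ValueError for unknown releases
    return {
        "osp": OSP_VERSION[name],
        # The cancel code was removed from tripleo-common in Pike and
        # backported to Ocata, so only earlier releases still offer the
        # rollback-triggering "no" answer (assumes the backport is installed).
        "rollback_cancel_present": idx < RELEASES.index("ocata"),
        # Heat gained cancel-without-rollback in Queens.
        "heat_cancel_without_rollback": idx >= RELEASES.index("queens"),
    }
```

For example, `cancel_behaviour("Mitaka")` reports that the rollback-triggering cancel is still present, which matches the OSP 9 behavior in the original description.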

Comment 8 Alex Schultz 2018-11-19 21:39:12 UTC
Closing as WONTFIX: we have provided a way to cancel in Queens, and it is unlikely that we will be able to address this in any of the older versions before they reach EOL.