Description of problem:
The "no" option during an "openstack overcloud update stack" breakpoint seems to be broken.
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 12300056-ffff-dddd-1111-12345678ffff), no=cancel update, C-c=quit interactive mode:
When "no" is selected, a stack rollback occurs, and this causes all overcloud nodes to run yum updates in parallel (assuming patches are available). All controller nodes run a pcs cluster stop at about the same time, which can cause fencing if stonith is enabled. Obviously this is not the desired behavior.
Version-Release number of selected component (if applicable):
Current OSP 9 bits
How reproducible:
100% so far (once for a customer, once in a lab for me)
Steps to Reproduce:
1. Deploy OSP 9 via Director
2. Ensure nodes are registered or have update repos configured.
3. Run the patching procedure
openstack overcloud update stack overcloud -i \
--templates -e [env file] -e [more env files] \
....
4. At the first breakpoint, cancel the update by answering "no"
on_breakpoint: [u'mflusche-osd001', u'mflusche-osd000', u'mflusche-osd002', u'mflusche-compute001', u'mflusche-compute000']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear fafe8cc9-e4d4-46d9-8dc1-57b62cf73b58), no=cancel update, C-c=quit interactive mode: no
canceling update, doing rollback
canceling update
5. Log in to the overcloud nodes and observe the behavior.
journalctl -u os-collect-config -f
Oct 25 23:01:12 mflusche-control000.flusche.co os-collect-config[3848]: [2016-10-25 23:01:12,543] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/script < /var/lib/heat-config/deployed/feffcf44-753b-4eaf-9cd0-7b9abd0272ff.json
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: 1:openssl-libs-1.0.1e-51.el7_2.7.x86_64
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: systemd-libs-219-19.el7_2.13.x86_64
Oct 25 23:07:10 mflusche-control000 yum[17346]: Updated: 1:librados2-0.94.9-3.el7cp.x86_64
...
tail -f /var/log/yum.log
monitor on controllers: pcs status
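To confirm that the updates really do run on several nodes at the same time, the yum journal entries from each node can be reduced to a per-host update window and compared. The following is only a rough sketch: the helper name yum_update_window is made up for illustration, it assumes the journal line layout shown in the output above (month, day, time, host, yum[pid]:, Updated:, package), and the heat-admin login and node placeholder in the usage comment are assumptions about the environment.

```shell
# Hypothetical helper: summarize yum "Updated:" journal entries read from stdin.
# Prints one line per host: "<host> <first-time> <last-time> <package-count>".
# Assumes lines are in chronological order, as journalctl emits them.
yum_update_window() {
  awk '$5 ~ /^yum\[[0-9]+\]:$/ && $6 == "Updated:" {
         host = $4
         if (!(host in first)) first[host] = $3   # time of first updated package
         last[host] = $3                          # time of most recent updated package
         count[host]++
       }
       END {
         for (h in count) print h, first[h], last[h], count[h]
       }' | sort
}

# Example usage (not run here; user and node name are placeholders):
#   ssh heat-admin@<node> 'sudo journalctl -t yum' | yum_update_window
```

If the windows printed for the controllers overlap, the nodes were patched in parallel rather than one at a time.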
Actual results:
The "no" option at a breakpoint appears to trigger a parallel package update on all overcloud nodes.
Expected results:
The update operation is cancelled.
Additional info:
Comment 4 Sofer Athlan-Guyot
2018-09-04 12:57:19 UTC
I'm not sure why we ever allowed the user to cancel an update, because doing a rollback has never been safe in TripleO.
It wasn't until Queens (OSP13) that Heat offered a way for users to cancel a stack update without triggering a rollback: https://bugs.launchpad.net/heat/+bug/1709041
The code to cancel an update was removed from tripleo-common in Pike and backported to Ocata:
https://review.openstack.org/#/q/I752e061979d667c1fb2b115c1a7339002e1824d5
So OSP 10 and earlier are presumably still affected, which is what the testing discussed above appears to show.
(Ironically, a cancel option would be a useful thing to add back in now that we can cancel without triggering a rollback, as long as it cancelled that way.)
Closing as a wontfix, as we have provided a way to cancel in Queens and it is unlikely that we will be able to address this in any of the older versions prior to their EOL.