Description of problem:
Step 1 of the controller upgrade fails when upgrading OSP 8 to OSP 9 because os-collect-config is restarted during yum update. As a result, the rest of the upgrade step 1 script does not finish, and the success of step 1 is never reported to Heat. Heat waits and times out.
Looking at dist-git, this looks like a packaging regression of the fixed bug 1272254. It most probably has the same cause, and hopefully the same fix will apply.
Version-Release number of selected component (if applicable):
Hmm so it seems the OSP 8 postun script has the same issue, and it's actually *that* postun script which gets executed during upgrade to OSP 9.
So this fix is still important because it will fix minor updates of OSP 9 and major upgrade from OSP 9 to OSP 10, but it doesn't fix the upgrade from OSP 8 to OSP 9.
We'll probably need to document an out-of-band (= out-of-heat) `sudo yum -y update os-collect-config` to be run on all nodes after the `overcloud deploy` that does the repository switch, but before the `overcloud deploy` that upgrades controllers.
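A minimal sketch of what that out-of-band step could look like when run from the undercloud. The function name, the `RUNNER` dry-run variable, the `heat-admin` SSH user, and the placeholder addresses are all assumptions for illustration; only the `sudo yum -y update os-collect-config` command itself comes from the comment above.

```shell
#!/bin/sh
# Sketch: update os-collect-config on each node outside of Heat.
# Assumptions (not from this report): nodes are reachable over SSH as the
# "heat-admin" user, and RUNNER exists only to allow a dry run.

update_os_collect_config() {
    for node in "$@"; do
        # RUNNER defaults to ssh; set RUNNER=echo to print instead of run.
        ${RUNNER:-ssh} "heat-admin@${node}" \
            "sudo yum -y update os-collect-config" || return 1
    done
}

# Example (addresses are placeholders):
#   update_os_collect_config 192.0.2.11 192.0.2.12
```

Running it with `RUNNER=echo` first prints the commands that would be executed, which is a cheap way to check the node list before touching anything.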
(In reply to Jiri Stransky from comment #2)
> Hmm so it seems the OSP 8 postun script has the same issue, and it's
> actually *that* postun script which gets executed during upgrade to OSP 9.
> So this fix is still important because it will fix minor updates of OSP 9
> and major upgrade from OSP 9 to OSP 10, but it doesn't fix the upgrade from
> OSP 8 to OSP 9.
> We'll probably need to document an out-of-band (= out-of-heat) `sudo yum -y
> update os-collect-config` to be run on all nodes after the `overcloud
> deploy` that does the repository switch, but before the `overcloud deploy`
> that upgrades controllers.
We already document that we have to update to the latest OSP 8 before upgrading to OSP 9, right? Should we just fix this in 8 and push it out quickly?
(In reply to Mike Burns from comment #3)
> We already document that we have to update to the latest osp 8 before
> upgrade to osp9, right? Should we just fix this in 8 and push it out
> quickly?
I'm not sure such a minor update to OSP 8 would work, though. It will probably fail with the same problem: the minor update script will be triggered by Heat via os-collect-config and will run yum update, which will restart os-collect-config at an inconvenient time. The minor update will then not finish and will not be reported to Heat, leaving the stack-update stuck. I'm not 100% sure here, but I don't think it would work.
The real fix, as discussed with sbaker (some context below), is in the packaging: in particular, we need to add SendSIGKILL=no to the systemd unit file. Moving back to ON_DEV since we definitely need the packaging to happen for 8..9 upgrades.
---------------------context from sbaker:--------------------
"We had os-collect-config upgrades working in OSP 7, yes the postrun
restarts the service, but the systemd unit file also has this:
This means that when os-collect-config gets restarted, the current
running os-refresh-config will continue to completion. It's not ideal
because os-refresh-config output stops getting logged to the journal,
and you only find out what happened in the rest of the run if/when the
various heat deployment resources get signaled.
> > This reminds me: when upgrading to the package which contains the SendSIGKILL=no fix, my testing showed that when the postrun restart happens, it uses the *new* unit file behavior. This means that if the SendSIGKILL=no fix is released into OSP 8 and 9, no special pre-upgrade handling is needed.
> So you're telling me we fix this whole thing by simply shipping an os-collect-config RPM with the right stuff in the systemd unit file and everything just works?
> ... facepalm...
I remember now, that's how we fixed the 7.3 upgrade.
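For anyone wanting to confirm that a node already has a package carrying the fix, a rough sketch of a check for the SendSIGKILL=no setting discussed above follows. The function name and the default unit file path are assumptions, not taken from the package; the check only matches the literal value `no`, although systemd accepts other boolean spellings as well.

```shell
#!/bin/sh
# Sketch: check whether a systemd unit file sets SendSIGKILL=no.
# The default path below is an assumption; adjust it to wherever the
# package installs the unit on your nodes.

unit_has_sendsigkill_no() {
    unit_file="${1:-/usr/lib/systemd/system/os-collect-config.service}"
    # Match the directive on its own line, tolerating surrounding whitespace.
    grep -Eq '^[[:space:]]*SendSIGKILL[[:space:]]*=[[:space:]]*no[[:space:]]*$' \
        "$unit_file"
}

# Example:
#   if unit_has_sendsigkill_no; then echo "fix present"; else echo "fix missing"; fi
```

On a live node, `systemctl show os-collect-config -p SendSIGKILL` should report the value systemd is actually using, which also covers drop-in overrides that a plain file check would miss.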
RDO fix posted. I'm on PTO next week, so someone can take ownership of that change if anything needs fixing.
8.0 GA -> 9 Upgrade
I followed the latest upgrade guide and finished on 02 AUG 16
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1
[root@overcloud-controller-1 ~]# rpm -qa | grep os-collect-config
[root@overcloud-controller-1 ~]# date
Fri Aug 5 13:29:02 UTC 2016
I did not see this issue reported during this upgrade.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.