Description of problem:
Step 1 of the controller upgrade fails when upgrading OSP 8 to OSP 9 because os-collect-config is restarted during yum update. As a result, it doesn't finish the rest of the upgrade step 1 script and never reports the success of step 1 to Heat. Heat waits and times out.

Looking at dist git, this looks like a packaging regression of fixed bug 1272254. It most probably has the same cause, and hopefully the same fix will apply.

Reference links:
https://bugzilla.redhat.com/show_bug.cgi?id=1272254#c2
https://review.gerrithub.io/#/c/249945/1

Version-Release number of selected component (if applicable):
original: os-collect-config-0.1.37-2.el7ost.noarch
updated: os-collect-config-0.1.37-4.el7ost.noarch
Hmm, so it seems the OSP 8 postun script has the same issue, and it's actually *that* postun script which gets executed during the upgrade to OSP 9.

So this fix is still important because it will fix minor updates of OSP 9 and the major upgrade from OSP 9 to OSP 10, but it doesn't fix the upgrade from OSP 8 to OSP 9.

We'll probably need to document an out-of-band (= out-of-heat) `sudo yum -y update os-collect-config` to be run on all nodes after the `overcloud deploy` that does the repository switch, but before the `overcloud deploy` that upgrades controllers.
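A minimal sketch of what that out-of-band step could look like, assuming the nodes are reachable over ssh. The `update_occ_on_nodes` helper, the `heat-admin` login, and the node addresses are illustrative assumptions, not something confirmed in this report; only the `yum` command itself comes from the comment above.

```shell
#!/bin/sh
# Sketch only: the helper name, the heat-admin login and the node addresses
# passed in by the caller are assumptions, not taken from this report.
update_occ_on_nodes() {
    for node in "$@"; do
        echo "Updating os-collect-config on ${node}"
        # Same command as proposed above, run outside of Heat.
        # Stop at the first node that fails so it can be inspected.
        ssh "heat-admin@${node}" 'sudo yum -y update os-collect-config' || return 1
    done
}
```

It would be called with the list of overcloud node addresses, for example `update_occ_on_nodes 192.0.2.10 192.0.2.11` (placeholder addresses), between the two `overcloud deploy` runs mentioned above.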
(In reply to Jiri Stransky from comment #2)
> Hmm so it seems the OSP 8 postun script has the same issue, and it's
> actually *that* postun script which gets executed during upgrade to OSP 9.
>
> So this fix is still important because it will fix minor updates of OSP 9
> and major upgrade from OSP 9 to OSP 10, but it doesn't fix the upgrade from
> OSP 8 to OSP 9.
>
> We'll probably need to document an out-of-band (= out-of-heat) `sudo yum -y
> update os-collect-config` to be run on all nodes after the `overcloud
> deploy` that does the repository switch, but before the `overcloud deploy`
> that upgrades controllers.

We already document that we have to update to the latest OSP 8 before upgrading to OSP 9, right? Should we just fix this in 8 and push it out quickly?
(In reply to Mike Burns from comment #3)
> We already document that we have to update to the latest osp 8 before
> upgrade to osp9, right? Should we just fix this in 8 and push it out
> quickly?

I'm not sure such a minor update to OSP 8 will work, though. It will probably fail with the same problem: the minor update script will be triggered by Heat via os-collect-config, it will run yum update, which will restart os-collect-config at an inconvenient time, and the minor update will not finish and will not be reported to Heat, leaving the stack-update stuck. I'm not 100% sure here, but I don't think it would work.
The real fix, as discussed with sbaker (some context below), is in the packaging: we need to add SendSIGKILL=no to the systemd unit file. Moving back to ON_DEV since we definitely need the packaging to happen for 8..9 upgrades.

---------------------context from sbaker:--------------------

"We had os-collect-config upgrades working in OSP 7. Yes, the postrun restarts the service, but the systemd unit file also has this:

KillMode=process
SendSIGKILL=no

This means that when os-collect-config gets restarted, the currently running os-refresh-config will continue to completion. It's not ideal because os-refresh-config output stops getting logged to the journal, and you only find out what happened in the rest of the run if/when the various heat deployment resources get signaled.

...

> > This reminds me, when upgrading to the package which contains the SendSIGKILL=no fix my testing showed that when the postrun restart happens it uses the *new* unit file behavior. This means that if the SendSIGKILL=no fix is released into OSP 8 and 9 no special pre-upgrade handling is needed.
> > > So you're telling me we fix this whole thing by simply shipping an os-collect-config RPM with the right stuff in the systemd unit file and everything just works?
> > ... facepalm...
> I remember now, that's how we fixed 7.3 upgrade.
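To check whether an installed package already carries the two directives quoted above, something along these lines could be used. This is a rough sketch: the `unit_has_fix` helper is hypothetical, and the unit file path in the usage note is an assumption about where the RPM installs it.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given systemd unit file sets
# both KillMode=process and SendSIGKILL=no (the directives quoted above).
unit_has_fix() {
    unit_file="$1"
    grep -Eq '^[[:space:]]*KillMode[[:space:]]*=[[:space:]]*process[[:space:]]*$' "$unit_file" &&
    grep -Eq '^[[:space:]]*SendSIGKILL[[:space:]]*=[[:space:]]*no[[:space:]]*$' "$unit_file"
}
```

For example, `unit_has_fix /usr/lib/systemd/system/os-collect-config.service && echo "fix present"` (path assumed). On a running node, `systemctl show os-collect-config -p KillMode -p SendSIGKILL` reports the values systemd has actually loaded.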
RDO fix posted, I'm on PTO next week so someone can take ownership of that change if anything needs fixing.
8.0 GA -> 9 upgrade

I followed the latest upgrade guide and finished on 02 AUG 16 (http://etherpad.corp.redhat.com/ospd9-upgrade)

Initial deployment:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

[root@overcloud-controller-1 ~]# rpm -qa | grep os-collect-config
os-collect-config-0.1.37-6.el7ost.noarch
[root@overcloud-controller-1 ~]# date
Fri Aug 5 13:29:02 UTC 2016

I did not see this issue reported during this upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1599.html