Description of problem:
When upgrading from 7.0 to 7.1, the following command hangs:

    openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

During the 7.0 -> 7.1 update, the yum update script on each node is aborted because the os-collect-config service (the parent process of heat-config) is itself updated and restarted by the yum update: http://paste.openstack.org/show/476428/

In other words, the yum update script unintentionally kills itself. Because heat-config scripts are marked as deployed before they are run, the restarted os-collect-config considers the script already deployed and never sends a signal back to heat: https://github.com/openstack/heat-templates/blob/master/hot/software-config/elements/heat-config/os-refresh-config/configure.d/55-heat-config#L113

As a result, the CLI update command keeps waiting for the update to finish on the failed node.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-71.el7ost.noarch

On overcloud nodes:
[heat-admin@overcloud-compute-0 ~]$ rpm -qa|grep os-collect-config
os-collect-config-0.1.35-3.el7ost.noarch
os-collect-config-0.1.35-2.el7ost.noarch

Steps to Reproduce:
1. Deploy RHOS-d 7.0
2. Update the undercloud (UC) to 7.1
3. Run openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

Actual results:
Update runs until timeout

Expected results:
Update finishes successfully
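The ordering problem described in this report can be modeled roughly as follows. This is a simplified sketch, not the real 55-heat-config code (which is Python); the function name, the marker file, and the signal file are all hypothetical and exist only to show why a restart between "mark as deployed" and "signal heat" leaves heat waiting.

```shell
#!/bin/bash
# Simplified model of the mark-before-run ordering; NOT the real heat-config code.
# run_deployment, the ".deployed" marker and the ".signal" file are hypothetical.

run_deployment() {
    local state_dir="$1" id="$2" script="$3"
    local marker="${state_dir}/${id}.deployed"

    # A deployment that is already marked is skipped, and nothing is signalled.
    if [ -e "$marker" ]; then
        return 0
    fi

    # The marker is written BEFORE the script runs ...
    touch "$marker"

    # ... so if the run dies here, the signal below is never sent. A non-zero
    # exit stands in for the process being killed in the middle of the run.
    "$script" || return 1

    # Stand-in for signalling heat that the deployment has finished.
    echo "signalled" > "${state_dir}/${id}.signal"
}
```

In this model, an interrupted first run followed by a second run of the same deployment id leaves the marker in place and no signal file, which corresponds to the update command waiting until it times out.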
os-collect-config is designed to gracefully restart at the end of each run if any data changes, so the rpm spec does not need to specify a restart for the os-collect-config service. https://github.com/openstack/os-collect-config/blob/master/os_collect_config/collect.py#L287

As an urgent fix, I would suggest releasing an os-collect-config package which doesn't restart the service.
Unfortunately os-collect-config is still being restarted:

Oct 16 04:34:51 overcloud-controller-0.localdomain os-collect-config[2003]: 2015-10-16 04:34:51.390 2003 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
Oct 16 04:35:27 overcloud-controller-0.localdomain yum[28308]: Updated: os-collect-config-0.1.35-4.el7ost.noarch
Oct 16 04:35:33 overcloud-controller-0.localdomain os-collect-config[29174]: 2015-10-16 04:35:33.061 29174 WARNING os-collect-config [-] Source [request] Unavailable.
The restart is actually triggered by the old rpm being removed, not by the new one being installed. The script in question is %postun, which tells rpm what to do after the old package is removed (including during an upgrade). There isn't anything we can do about that other than document that users need to manually update the rpm on each host *first*, then run the stack update.
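A minimal sketch of that manual workaround, assuming the nodes are reachable over ssh as the heat-admin user seen in the report. The helper name update_occ_first and the DRY_RUN switch are hypothetical; by default the function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/bash
# Hypothetical helper: update os-collect-config on each node before the
# stack update. Host names are passed as arguments; DRY_RUN=1 (the default)
# prints the commands instead of executing them.

update_occ_first() {
    local host
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh heat-admin@${host} sudo yum -y update os-collect-config"
        else
            ssh "heat-admin@${host}" sudo yum -y update os-collect-config
        fi
    done
}
```

For example, `update_occ_first overcloud-controller-0 overcloud-compute-0` lists the commands, and `DRY_RUN=0 update_occ_first ...` runs them. The package update still restarts os-collect-config, but at that point no update script is running underneath it.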
If we can't avoid a restart, then we should be able to get systemd to not kill os-collect-config's child processes. According to man systemd.kill [1], setting

[Service]
SendSIGKILL=no

would prevent os-refresh-config from being killed when os-collect-config is. This would allow the full os-refresh-config run to continue until its natural exit.

The restarted os-collect-config may attempt to start another os-refresh-config while the old one is still running, but this is fine, as os-refresh-config prevents concurrent runs with a lockfile [2].

It would be nice if we could fix this in the systemd unit rather than requiring a manual upgrade of the package.

[1] http://www.freedesktop.org/software/systemd/man/systemd.kill.html
[2] https://github.com/openstack/os-refresh-config/blob/master/os_refresh_config/os_refresh_config.py#L93
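To illustrate why a second os-refresh-config attempt is harmless, here is a sketch of the non-blocking lockfile pattern. It is not how os-refresh-config itself is written (that is the Python code linked as [2]); it uses flock(1) from util-linux to show the same behavior, and the run_exclusive helper and lock path are hypothetical.

```shell
#!/bin/bash
# Sketch only: mirrors "one run at a time" behavior with flock(1).
# run_exclusive is a hypothetical helper, not part of os-refresh-config.

run_exclusive() {
    local lockfile="$1"
    shift
    # -n: fail immediately instead of waiting when another process
    # already holds the lock, so a second run simply gives up.
    flock -n "$lockfile" "$@"
}
```

With this pattern, a call such as `run_exclusive /tmp/example.lock some-command` returns a non-zero status while an earlier run still holds the lock, and succeeds again once that run has exited.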
I think we should make that change to the package, but also add code in the upgrade script to set SendSIGKILL=no in the service file if it is not already present and then do a systemctl daemon-reload so that when yum runs the %postun stanza it will not kill the existing os-collect-config. I think that will allow us to make the initial transition (from not having SendSIGKILL=no to having it) without a manual workaround. The thing to watch out for would be how yum treats modified files on an uninstall (I think it renames them with a suffix instead of removing them), and how that interacts with systemd (I think it probably works because the directory it actually starts things from just contains symlinks to the actual unit files). It should work but there may be subtleties.
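The patch step proposed here could look something like the sketch below. The function name ensure_no_sigkill is hypothetical, the sed invocation assumes GNU sed, and the unit file path in the usage note is an assumption rather than something taken from this report.

```shell
#!/bin/bash
# Sketch of an idempotent unit-file patch; ensure_no_sigkill is a
# hypothetical helper. Requires GNU sed for the one-line "a" command.

ensure_no_sigkill() {
    local unit_file="$1"

    if grep -q '^SendSIGKILL=' "$unit_file"; then
        # The directive exists already: force its value to "no".
        sed -i 's/^SendSIGKILL=.*/SendSIGKILL=no/' "$unit_file"
    else
        # Otherwise add it directly after the [Service] section header.
        sed -i '/^\[Service\]/a SendSIGKILL=no' "$unit_file"
    fi
}
```

In the update script this would be called before yum runs, for example `ensure_no_sigkill /usr/lib/systemd/system/os-collect-config.service && systemctl daemon-reload` (path assumed), so that systemd has already loaded the new setting when the %postun restart happens.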
I'll look into patching the unit file in the update script too.
The fixed package works for me when upgrading puddles 2015-07-30-1 -> 2015-10-21-1. One quirk is that journalctl -u os-collect-config stops logging output from the orphaned os-refresh-config, so the results of the rest of the update script can't be seen until heat is signalled with the full deploy_stdout. This is to be expected; it's just something to keep in mind.
Upgrading from 7.0 to 7.2 works now. The original error is binary: either the upgrade works or it doesn't. Since it does, that is enough to mark this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2015:2651