Bug 1272254

Summary: Overcloud update fails due to os-collect-config restart
Product: Red Hat OpenStack Reporter: Jan Provaznik <jprovazn>
Component: os-collect-config Assignee: Mike Burns <mburns>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: high Docs Contact:
Priority: urgent    
Version: 7.0 (Kilo) CC: apevec, augol, cylopez, dmacpher, dsavinea, glambert, hrosnet, lhh, mburns, mchappel, rhel-osp-director-maint, sasha, sbaker, yeylon, zbitter
Target Milestone: y2 Keywords: Triaged
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: os-collect-config-0.1.35-4.el7ost Doc Type: Bug Fix
Doc Text:
The "os-collect-config" service on the Overcloud restarted on an RPM update. This caused Overcloud updates to fail. This fix changes the behavior so that "os-collect-config" does not restart on an RPM update. The Overcloud updates now succeed after an update of "os-collect-config". Note that "os-collect-config" gracefully restarts itself when "os-refresh-config" runs, so the restart on update is not required.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-21 16:56:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1274859, 1275814    
Bug Blocks:    

Description Jan Provaznik 2015-10-15 21:35:52 UTC
Description of problem:
When upgrading from 7.0 to 7.1 and running command:
openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

This command never finishes. During the 7.0 -> 7.1 update, the yum update script on each node is aborted because the os-collect-config service (the parent process of heat-config) is itself updated and restarted by that yum update: http://paste.openstack.org/show/476428/

In other words, the yum update script unintentionally kills itself.

Because heat-config scripts are marked as deployed before they run, when os-collect-config is restarted and runs again it treats this script as already deployed, so it never sends a signal back to heat:
https://github.com/openstack/heat-templates/blob/master/hot/software-config/elements/heat-config/os-refresh-config/configure.d/55-heat-config#L113

As a result, the CLI update command keeps waiting for the update on the failed node to finish.
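
The ordering problem described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the real 55-heat-config code: the names `deployed`, `signals`, `run_hook` and `yum_update` are hypothetical, and an exception stands in for the hook process being killed by the service restart.

```python
# Hypothetical simulation of the ordering problem; not the real 55-heat-config code.
deployed = set()   # stands in for the "already deployed" markers kept on disk
signals = []       # stands in for signals sent back to heat


class Killed(Exception):
    """Models the hook being killed mid-run by the service restart."""


def run_hook(deployment_id, script):
    if deployment_id in deployed:
        return "skipped"            # treated as done, so nothing is signalled
    deployed.add(deployment_id)     # marked as deployed BEFORE the script runs
    script()                        # never returns normally if the parent is restarted
    signals.append(deployment_id)   # only reached when the script completes
    return "signalled"


def yum_update():
    # The update restarts os-collect-config, which takes the hook down with it.
    raise Killed("os-collect-config restarted during yum update")


try:
    run_hook("update-1", yum_update)
except Killed:
    pass

# After the restart the same deployment is seen again, but it is skipped.
result_after_restart = run_hook("update-1", yum_update)
```

In this sketch the second call returns without running the script and without appending to `signals`, which mirrors why heat never hears back from the node.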

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-71.el7ost.noarch

on overcloud nodes:
[heat-admin@overcloud-compute-0 ~]$ rpm -qa|grep os-collect-config
os-collect-config-0.1.35-3.el7ost.noarch
os-collect-config-0.1.35-2.el7ost.noarch


Steps to Reproduce:
1. deploy RHOS-d 7.0
2. update UC to 7.1
3. run openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

Actual results:
update runs until timeout

Expected results:
update finishes successfully

Comment 2 Steve Baker 2015-10-15 21:45:08 UTC
os-collect-config is designed to gracefully restart at the end of each run if any data changes, so the rpm spec does not need to specify a restart for the os-collect-config service.

https://github.com/openstack/os-collect-config/blob/master/os_collect_config/collect.py#L287

I would suggest as an urgent fix to release an os-collect-config package which doesn't restart the service.

Comment 5 Jan Provaznik 2015-10-16 08:46:41 UTC
Unfortunately os-collect-config is still being restarted:
Oct 16 04:34:51 overcloud-controller-0.localdomain os-collect-config[2003]: 2015-10-16 04:34:51.390 2003 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
Oct 16 04:35:27 overcloud-controller-0.localdomain yum[28308]: Updated: os-collect-config-0.1.35-4.el7ost.noarch
Oct 16 04:35:33 overcloud-controller-0.localdomain os-collect-config[29174]: 2015-10-16 04:35:33.061 29174 WARNING os-collect-config [-] Source [request] Unavailable.

Comment 6 Mike Burns 2015-10-16 11:57:29 UTC
The restart is actually triggered by the old rpm being removed, not by the new one being installed.

The rpm scriptlet is %postun, which tells rpm what to do when the package is being removed (or upgraded). There isn't anything we can do about that other than document that users need to manually update the rpm on each host *first*, then run the stack update.

Comment 8 Steve Baker 2015-10-18 20:27:04 UTC
If we can't avoid a restart then we should be able to get systemd to not kill os-collect-config's child processes.

According to man systemd.kill [1], setting [Service] SendSIGKILL=no would prevent os-refresh-config from being killed when os-collect-config is. This would allow the full os-refresh-config run to continue until its natural exit.

The restarted os-collect-config may attempt to do another os-refresh-config while the old one is still running, but this is fine as os-refresh-config prevents concurrent runs with a lockfile [2].

It would be nice if we could fix this in the systemd unit rather than requiring a manual upgrade of the package.

[1] http://www.freedesktop.org/software/systemd/man/systemd.kill.html
[2] https://github.com/openstack/os-refresh-config/blob/master/os_refresh_config/os_refresh_config.py#L93
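
As a rough illustration of the lockfile idea mentioned above, here is a minimal sketch of a non-blocking exclusive lock using Python's standard fcntl.flock. It is not the os-refresh-config implementation linked in [2]; the helper name `try_lock` and the lock path are hypothetical, and whether the real tool waits or exits when the lock is held is not modelled here.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock path in a temporary directory, for illustration only.
lock_path = os.path.join(tempfile.mkdtemp(), "refresh.lock")

first = try_lock(lock_path)    # the still-running (orphaned) run holds the lock
second = try_lock(lock_path)   # the run started by the restarted service is refused

first.close()                  # closing the file releases the lock
third = try_lock(lock_path)    # a later run can now take it
```

The lock is released when the holding file is closed or the process exits, so a run that dies does not leave a stale lock behind in this sketch.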

Comment 9 Zane Bitter 2015-10-19 20:58:29 UTC
I think we should make that change to the package, but also add code in the upgrade script to set SendSIGKILL=no in the service file if it is not already present and then do a systemctl daemon-reload so that when yum runs the %postun stanza it will not kill the existing os-collect-config. I think that will allow us to make the initial transition (from not having SendSIGKILL=no to having it) without a manual workaround.

The thing to watch out for would be how yum treats modified files on an uninstall (I think it renames them with a suffix instead of removing them), and how that interacts with systemd (I think it probably works because the directory it actually starts things from just contains symlinks to the actual unit files). It should work but there may be subtleties.
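
The "add it only if not already present" step could look roughly like the sketch below. It is a hedged illustration, not the actual upgrade script: the function name `ensure_send_sigkill_no` and the example unit contents are hypothetical, and the code only edits text in memory. Writing the file back and running systemctl daemon-reload, as described above, are left out.

```python
def ensure_send_sigkill_no(unit_text):
    """Return unit_text with SendSIGKILL=no under [Service], adding it only if absent.

    Limitation of this sketch: an existing SendSIGKILL line with another value
    is not rewritten.
    """
    lines = unit_text.splitlines()
    in_service = False
    service_index = None
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section we are in; remember where [Service] starts.
            in_service = stripped == "[Service]"
            if in_service:
                service_index = index
        elif in_service and stripped.replace(" ", "") == "SendSIGKILL=no":
            return unit_text  # already present, leave the text untouched
    if service_index is None:
        raise ValueError("unit file has no [Service] section")
    lines.insert(service_index + 1, "SendSIGKILL=no")
    return "\n".join(lines) + "\n"


# Hypothetical unit file contents, for illustration only.
example = "[Unit]\nDescription=example\n\n[Service]\nExecStart=/usr/bin/example\n"
patched = ensure_send_sigkill_no(example)
```

Running the function a second time on `patched` returns the text unchanged, which is what makes it safe to call on hosts that already carry the setting.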

Comment 10 Steve Baker 2015-10-19 21:01:05 UTC
I'll look into patching the unit file in the update script too.

Comment 12 Steve Baker 2015-11-09 00:34:46 UTC
The fixed package works for me when upgrading puddles 2015-07-30-1 -> 2015-10-21-1.

One quirk is that journalctl -u os-collect-config stops logging the orphaned os-refresh-config, so the results of the remaining update script can't be seen until heat is signalled with the full deploy_stdout. This is to be expected; it's just something to keep in mind.

Comment 14 Amit Ugol 2015-12-15 11:17:33 UTC
Upgrading from 7.0 to 7.2 works now. The original error is binary: either the upgrade works or it doesn't. Since it does, that is enough to mark this as verified.

Comment 16 errata-xmlrpc 2015-12-21 16:56:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2651