Description of problem:
During the converge step, the controller's swift_rsync container, which was healthy before the undercloud and overcloud upgrade, is restarted and gets stuck in a 'Restarting' state.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-12.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install RHOS 11.
2. Run the upgrade steps up to the converge operation.
3. The swift_rsync container gets set to 'Restarting'.

Actual results:
The swift_rsync container gets set to 'Restarting'.

Expected results:
The swift_rsync container should restart in a 'healthy' state.

Additional info:
During the converge step, somewhere around here:

2017-11-26 12:53:33Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4.0]: CREATE_COMPLETE state changed
2017-11-26 12:53:33Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4]: CREATE_COMPLETE Stack CREATE completed successfully
2017-11-26 12:53:34Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4.0]: SIGNAL_IN_PROGRESS Signal: deployment 0601ca23-f0d7-4581-acb2-837a2bac8a89 succeeded
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4.0]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4]: CREATE_COMPLETE Stack CREATE completed successfully
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.BlockStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:40Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ObjectStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:40Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.CephStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5]: CREATE_IN_PROGRESS Stack CREATE started
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5.0]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:42Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5]: CREATE_IN_PROGRESS Stack CREATE started
2017-11-26 12:54:42Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5.0]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.BlockStorageDeployment_Step5]: CREATE_COMPLETE state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ObjectStorageDeployment_Step5]: CREATE_COMPLETE state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.CephStorageDeployment_Step5]: CREATE_COMPLETE state changed

...the swift_rsync process goes from healthy to restarting.
Adding Marius' comment from https://bugzilla.redhat.com/show_bug.cgi?id=1493298#c14:

> The converge step basically does a stack update to set the nova upgrade_levels. At this point the services have already been upgraded and migrated into containers. If the swift_rsync container gets into a Restarting state at that point, I suspect the same issue would show up while doing a stack update of a fresh OSP12 deployment, so it's probably not related to the patch which addresses BZ#1493298. Checking the logs on your machine we can see:
>
> [root@controller-0 heat-admin]# docker logs --tail 5 swift_rsync
> INFO:__main__:Deleting /etc/rsyncd.conf
> INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/rsyncd.conf to /etc/rsyncd.conf
> INFO:__main__:Writing out command to execute
> failed to create pid file /var/run/rsyncd.pid: File exists
> Running command: '/usr/bin/rsync --daemon --no-detach --config=/etc/rsyncd.conf'
>
> After removing the existing rsync pid file the container is able to start:
> [root@controller-0 heat-admin]# mv /var/run/rsyncd.pid /var/run/rsyncd.pid.orig
> [root@controller-0 heat-admin]# docker restart swift_rsync
> swift_rsync
> [root@controller-0 heat-admin]# docker ps | grep swift_rsync
> a2e2c07ef6e8 rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp12/openstack-swift-object-docker:20171122.1 "kolla_start" 18 minutes ago Up About a minute (healthy) swift_rsync

I can confirm this; it also happens on master. After doing a stack update the rsync container is in a restarting loop, and removing /var/run/rsyncd.pid resolves it.
Note: I don't think this is a blocker (given that it can be worked around easily), but we should fix it as soon as possible, i.e. in the first zstream.
Looks like a stack update stops the rsyncd container in an unclean way: a normal "docker stop" && "docker start" works fine, whereas a "docker kill" leaves the rsyncd.pid file behind, preventing the container from starting later on.

I don't think we need the rsyncd PID file in the containerized environment. As seen in this case, if docker restarts the container and the file still exists, it won't start. From the rsyncd.conf manpage:

"pid file: This parameter tells the rsync daemon to write its process ID to that file. If the file already exists, the rsync daemon will abort rather than overwrite the file."

It's set in puppet-swift: https://github.com/openstack/puppet-swift/blob/master/templates/rsyncd.conf.erb#L6

So there are multiple ways to handle this:
1. Ensure a stack update doesn't kill a container (?)
2. Remove a possibly existing .pid file before starting rsyncd
3. Change rsyncd.conf and remove the "pid file" setting (only in a containerized environment; where should this be handled?)

Thoughts?
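For illustration, option 2 could look something like the following entrypoint wrapper. This is only a sketch of the idea, not the actual fix: the function name is made up, and the final exec line (taken from the container log above) is commented out so the stale-pidfile check can be shown on its own.

```shell
#!/bin/sh
# Hypothetical wrapper sketch for option 2: remove a stale rsyncd.pid
# left behind by an unclean stop ("docker kill") before starting rsyncd.

# clear_stale_pidfile PIDFILE
# Removes PIDFILE only if it exists and the PID it names is no longer
# running ("kill -0" probes for process existence without signaling it).
clear_stale_pidfile() {
    pidfile="$1"
    if [ -f "$pidfile" ] && ! kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        rm -f "$pidfile"
    fi
}

clear_stale_pidfile /var/run/rsyncd.pid
# exec /usr/bin/rsync --daemon --no-detach --config=/etc/rsyncd.conf
```

Note the guard: if the pidfile names a PID that is still alive, the file is left alone, so a genuinely running daemon is not clobbered. Option 3 (dropping the "pid file" line from rsyncd.conf in containerized deployments) would avoid the problem entirely, since rsync only aborts on an existing pidfile when one is configured.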
Reproduced on a stack configuration update, i.e. re-running the deployment command with some minor changes, like:

resource_registry:
  OS::TripleO::NodeExtraConfigPost: /home/stack/templates/nameserver.yaml

Environment:
openstack-swift-plugin-swift3-1.12.0-2.el7ost.noarch
openstack-swift-account-2.15.1-3.el7ost.noarch
python-swiftclient-3.4.0-1.el7ost.noarch
python-swift-2.15.1-3.el7ost.noarch
openstack-swift-container-2.15.1-3.el7ost.noarch
puppet-swift-11.3.0-1.el7ost.noarch
openstack-swift-object-2.15.1-3.el7ost.noarch
openstack-swift-proxy-2.15.1-3.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-18.el7ost.noarch
openstack-puppet-modules-11.0.0-1.el7ost.noarch
instack-undercloud-7.4.3-5.el7ost.noarch

Note: rebooting the entire setup (part of the automated job) resolves the situation.
Merged upstream, moving to POST.