Description of problem:
During the converge step, the controller's swift_rsync container, which was healthy before the undercloud and overcloud upgrade, is restarted and gets stuck in a 'Restarting' state.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-12.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install RHOS 11.
2. Run the upgrade steps up to the converge operation.
3. The swift_rsync container gets set to 'Restarting'.

Actual results:
The swift_rsync container gets set to 'Restarting'.

Expected results:
The swift_rsync container should restart in a 'healthy' state.

Additional info:
During the converge step, somewhere around here:

2017-11-26 12:53:33Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4.0]: CREATE_COMPLETE state changed
2017-11-26 12:53:33Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4]: CREATE_COMPLETE Stack CREATE completed successfully
2017-11-26 12:53:34Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step4]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4.0]: SIGNAL_IN_PROGRESS Signal: deployment 0601ca23-f0d7-4581-acb2-837a2bac8a89 succeeded
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4.0]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4]: CREATE_COMPLETE Stack CREATE completed successfully
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step4]: CREATE_COMPLETE state changed
2017-11-26 12:54:39Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.BlockStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:40Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ObjectStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:40Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.CephStorageDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5]: CREATE_IN_PROGRESS Stack CREATE started
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ComputeDeployment_Step5.0]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:41Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:42Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5]: CREATE_IN_PROGRESS Stack CREATE started
2017-11-26 12:54:42Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ControllerDeployment_Step5.0]: CREATE_IN_PROGRESS state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.BlockStorageDeployment_Step5]: CREATE_COMPLETE state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.ObjectStorageDeployment_Step5]: CREATE_COMPLETE state changed
2017-11-26 12:54:43Z [overcloud-AllNodesDeploySteps-tt3e7ghupl3a.CephStorageDeployment_Step5]: CREATE_COMPLETE state changed

...the swift_rsync process goes from healthy to restarting.
Adding Marius' comment from https://bugzilla.redhat.com/show_bug.cgi?id=1493298#c14:

> The converge step basically does a stack update to set the nova upgrade_levels. At this point the services have already been upgraded and migrated into containers. If the swift_rsync container gets into a Restarting state at that point, I suspect the same issue would show up while doing a stack update of a fresh OSP12 deployment, so it's probably not related to the patch which addresses BZ#1493298. Checking the logs on your machine we can see:
>
> [root@controller-0 heat-admin]# docker logs --tail 5 swift_rsync
> INFO:__main__:Deleting /etc/rsyncd.conf
> INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/rsyncd.conf to /etc/rsyncd.conf
> INFO:__main__:Writing out command to execute
> failed to create pid file /var/run/rsyncd.pid: File exists
> Running command: '/usr/bin/rsync --daemon --no-detach --config=/etc/rsyncd.conf'
>
> After removing the existing rsync pid file the container is able to start:
> [root@controller-0 heat-admin]# mv /var/run/rsyncd.pid /var/run/rsyncd.pid.orig
> [root@controller-0 heat-admin]# docker restart swift_rsync
> swift_rsync
> [root@controller-0 heat-admin]# docker ps | grep swift_rsync
> a2e2c07ef6e8 rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp12/openstack-swift-object-docker:20171122.1 "kolla_start" 18 minutes ago Up About a minute (healthy) swift_rsync

I can confirm this; it also happens on master. After doing a stack update the rsync container is in a restarting loop, and removing /var/run/rsyncd.pid resolves it.
Note: I don't think this is a blocker (given that it can be worked around easily), but we should fix it as soon as possible, i.e. in the first zstream.
Looks like a stack update stops the rsyncd container in an unclean way: a normal "docker stop" && "docker start" works fine, whereas a "docker kill" leaves the rsyncd.pid file behind, preventing the container from starting later on.

I don't think we need the rsyncd PID file in the containerized environment. As seen in this case, if docker restarts the container and the file still exists, it won't start. From the rsyncd.conf manpage:

"pid file: This parameter tells the rsync daemon to write its process ID to that file. If the file already exists, the rsync daemon will abort rather than overwrite the file."

It's set in puppet-swift: https://github.com/openstack/puppet-swift/blob/master/templates/rsyncd.conf.erb#L6

So there are multiple ways to handle this:
1. Ensure a stack update doesn't kill a container (?)
2. Remove a possibly existing .pid file before starting rsyncd
3. Change rsyncd.conf and remove the "pid file" setting (only in a containerized environment; where should this be handled?)

Thoughts?
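For illustration, option 2 could look something like the following entrypoint wrapper. This is only a sketch of the idea, not the actual fix: the function name is made up, and the final exec line (taken from the container log above) is commented out so the stale-pidfile check can be shown on its own.

```shell
#!/bin/sh
# Hypothetical wrapper sketch for option 2: remove a stale rsyncd.pid
# left behind by an unclean stop ("docker kill") before starting rsyncd.

# clear_stale_pidfile PIDFILE
# Removes PIDFILE only if it exists and the PID it names is no longer
# running ("kill -0" probes for process existence without signaling it).
clear_stale_pidfile() {
    pidfile="$1"
    if [ -f "$pidfile" ] && ! kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        rm -f "$pidfile"
    fi
}

clear_stale_pidfile /var/run/rsyncd.pid
# exec /usr/bin/rsync --daemon --no-detach --config=/etc/rsyncd.conf
```

Note the guard: if the pidfile names a PID that is still alive, the file is left alone, so a genuinely running daemon is not clobbered. Option 3 (dropping the "pid file" line from rsyncd.conf in containerized deployments) would avoid the problem entirely, since rsync only aborts on an existing pidfile when one is configured.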
Reproduced on a stack configuration update, i.e. re-running the deployment command with some minor changes, like:

resource_registry:
  OS::TripleO::NodeExtraConfigPost: /home/stack/templates/nameserver.yaml

Environment:
openstack-swift-plugin-swift3-1.12.0-2.el7ost.noarch
openstack-swift-account-2.15.1-3.el7ost.noarch
python-swiftclient-3.4.0-1.el7ost.noarch
python-swift-2.15.1-3.el7ost.noarch
openstack-swift-container-2.15.1-3.el7ost.noarch
puppet-swift-11.3.0-1.el7ost.noarch
openstack-swift-object-2.15.1-3.el7ost.noarch
openstack-swift-proxy-2.15.1-3.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-18.el7ost.noarch
openstack-puppet-modules-11.0.0-1.el7ost.noarch
instack-undercloud-7.4.3-5.el7ost.noarch

Note: rebooting the entire setup (part of the automated job) resolves the situation.
Merged upstream, moving to POST.