If the process is stopped gracefully, the entry should be deleted from the database: https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n458

If the process is killed, there is a cleanup that runs at startup: https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n2328 but it can only detect entries that have timed out (i.e. processes killed more than 3 minutes ago). That's not going to include the container that was just restarted, although it should eventually clear up older entries from previous restarts more than 3 minutes old.

The solution would be to start a timer after startup that checks for entries to clear out after 3 minutes, instead of doing it straight away.

A better question might be why the processes are getting killed instead of being stopped gracefully.
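A minimal sketch of that deferred cleanup, assuming a timer started from the engine's start() that fires the existing cleanup once the 3-minute service timeout has elapsed (the timer wiring here is hypothetical; only the service_manage_cleanup name is taken from the linked service.py):

import threading

# Sketch: defer the stale-engine cleanup until entries left by a process
# killed just before this restart are old enough to be detected as timed
# out (>3 minutes), instead of running it immediately at startup.
CLEANUP_DELAY = 180  # seconds; matches the 3-minute service timeout

class EngineService(object):

    def start(self):
        # ... existing startup work ...
        timer = threading.Timer(CLEANUP_DELAY, self.service_manage_cleanup)
        timer.daemon = True  # don't block interpreter shutdown
        timer.start()

    def service_manage_cleanup(self):
        # Existing logic: delete engine records whose updated_at is older
        # than the timeout, i.e. engines that died without cleaning up.
        pass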
> A better question might be why the processes are getting killed instead of gracefully stopped.

When docker stop/restart is done, SIGTERM is sent to the parent process. Kolla containers use dumb-init[1] as PID 1, and dumb-init forwards signals to all child processes in the root session. It seems the forked children also receive SIGTERM and are killed immediately[2] rather than the parent waiting for them to finish. Maybe dumb-init should forward signals to the direct child only, with the --single-child option?

[1] https://github.com/openstack/kolla/blob/master/docker/base/Dockerfile.j2#L403
[2] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L623
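As a rough Python illustration of the two behaviours (an approximation, not dumb-init's actual code): signalling the whole session/process group is what hits the forked oslo.service workers directly, while signalling only the direct child would let the parent stop its workers gracefully, which is roughly what --single-child gives you:

import os
import signal
import subprocess
import sys

# Usage: python forwarder.py <command> [args...]
# Start the real service as a direct child in its own session,
# similar to what dumb-init does as PID 1.
child = subprocess.Popen(sys.argv[1:], start_new_session=True)

def forward_group(signum, frame):
    # Default-style forwarding: signal the child's whole process group,
    # so forked workers receive SIGTERM directly and exit at once.
    os.killpg(os.getpgid(child.pid), signum)

def forward_single(signum, frame):
    # --single-child-style forwarding: signal only the direct child; the
    # oslo.service parent is then free to shut its workers down cleanly.
    child.send_signal(signum)

# Install one of the two handlers to compare the behaviours.
signal.signal(signal.SIGTERM, forward_single)
sys.exit(child.wait())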
Yeah, that sounds easier than teaching oslo.service to handle SIGTERM in the children as well as the parent.
So the funny thing is, after all the effort around the upstream kolla fix and backports, I noticed that we don't use dumb-init in the openstack-base image downstream, so the fix is not relevant for downstream.

We're seeing the issue on OSP13/14 because the default timeout for docker stop/restart to kill the container is 10s (see the help output below), which is very little for heat-engine to stop gracefully. A docker restart command with a proper timeout would fix the issue. Though it may vary with the amount of work heat-engine has to do before stopping gracefully, >60s seems to work fine in my setup.

(undercloud) [stack@undercloud-0 ~]$ sudo docker stop --help

Usage:  docker stop [OPTIONS] CONTAINER [CONTAINER...]

Stop one or more running containers

Options:
      --help       Print usage
  -t, --time int   Seconds to wait for stop before killing it (default 10)
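For example, restarting the engine container with a 60-second kill timeout (the container name heat_engine is assumed here):

(undercloud) [stack@undercloud-0 ~]$ sudo docker restart --time 60 heat_engine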
Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout.
> Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout.

Yeah, that's something we could possibly leverage (I don't see it being used by any service atm), but I don't know if we can set an optimum stop-timeout based on the number of engine workers and the work left to be completed by them. I guess the user can always provide an appropriate value when stopping the containers.
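For illustration, something like this in the heat-engine service template is what's being discussed; this is only a sketch (the key layout follows the usual docker_config sections in tripleo-heat-templates, and the 300s value is an arbitrary guess, not a tuned number):

docker_config:
  step_4:
    heat_engine:
      image: {get_param: DockerHeatEngineImage}
      restart: always
      # Hypothetical: seconds paunch would pass through to docker's
      # stop timeout before SIGKILL is sent
      stop_grace_period: 300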
Also, it seems stop_grace_period (paunch 3.1.0) is only available from OSP14 onwards, and it requires docker API 1.25+.
I believe an easy way to clean this up is to run "heat-manage service clean". Can we have this cron'd?
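For example, a crontab entry on a node with access to heat's config (the daily schedule here is arbitrary):

0 4 * * * heat-manage service clean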
> Can we have this cron'd?

I don't think we're trying to address how to clean up the stale engines from the db, but how to shut down the engines gracefully rather than killing them. The user can clean up manually whenever they want; the engines in the list don't cause any issues.
I'm sorry, I was trying to work around the fact that stop_grace_period is only available from paunch 3.1.0.
In order to address this we would have to package dumb-init for < OSP15 and get it added to the container build process. We'll need to investigate how much effort this will be, but it might also introduce some additional behavior changes.
*** Bug 1909579 has been marked as a duplicate of this bug. ***