Bug 1641667

Summary: Restarting the heat_engine container leaves dead services in heat service-list
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: openstack-tripleo-heat-templates
Assignee: Rabi Mishra <ramishra>
Status: CLOSED CURRENTRELEASE
QA Contact: Sasha Smolyak <ssmolyak>
Severity: medium
Priority: medium
Version: 13.0 (Queens)
CC: aschultz, dbecker, emacchi, jbuchta, jslagle, marjones, mburns, mcornea, morazi, ohochman, ramishra, rlondhe, sbaker, shardy, zbitter
Target Milestone: ga
Keywords: Reopened, Triaged, ZStream
Target Release: 16.0 (Train on RHEL 8.1)
Hardware: Unspecified
OS: Unspecified
Clone Of: 1390313
Last Closed: 2020-05-29 13:20:57 UTC
Type: Bug
Bug Depends On: 1390313, 1747442, 1747445

Comment 2 Zane Bitter 2018-10-22 16:20:20 UTC
If the process is stopped gracefully, the entry should be deleted in the database:

https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n458
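In outline, that path looks roughly like this (a condensed sketch of the linked code, not a verbatim copy; _wait_for_active_threads is a placeholder for the real wind-down logic):

    def stop(self):
        # Stop taking new RPC requests, then let in-flight work finish.
        self._stop_rpc_server()
        self._wait_for_active_threads()  # placeholder for the real logic

        # A graceful shutdown deletes this engine's own row from the
        # service table, so it never shows up as dead in `heat service-list`.
        ctxt = context.get_admin_context()
        service_objects.Service.delete(ctxt, self.service_id)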

If the process is killed then there is a cleanup that runs at startup:

https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n2328

but it can only detect entries that have timed out (i.e. that were killed more than 3 minutes ago). That obviously won't include the container that was just restarted, although it should eventually clear up older entries left over from previous restarts more than 3 minutes back.

One solution would be to start a timer at startup that checks for entries to clear out after 3 minutes, instead of doing the cleanup straight away.
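A minimal sketch of that idea (service_manage_cleanup is the existing sweep in the linked module; the timer wiring itself is illustrative, not an actual fix):

    import threading

    STALE_GRACE = 180  # seconds; matches the ~3 minute engine report timeout

    def deferred_cleanup(cleanup_fn):
        """Run the stale-service sweep once, STALE_GRACE seconds from now."""
        t = threading.Timer(STALE_GRACE, cleanup_fn)
        t.daemon = True  # don't block engine shutdown
        t.start()
        return t

    # in EngineService.start(), instead of calling service_manage_cleanup()
    # directly:
    #     deferred_cleanup(self.service_manage_cleanup)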

A better question might be why the processes are getting killed instead of gracefully stopped.

Comment 3 Rabi Mishra 2018-10-24 06:03:55 UTC
> A better question might be why the processes are getting killed instead of gracefully stopped.

When docker stop/restart is issued, SIGTERM is sent to the parent process. Kolla containers use dumb-init[1] as PID 1, and dumb-init forwards signals to all child processes in the root session. It seems the forked children also receive SIGTERM and are killed immediately[2], rather than the parent waiting for them to finish.

Maybe dumb-init should forward signals to the direct child only, using the --single-child option?

[1] https://github.com/openstack/kolla/blob/master/docker/base/Dockerfile.j2#L403
[2] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L623
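For illustration, the change would only be in how dumb-init is invoked as the container entrypoint (sketch only; the exact entrypoint lines in the kolla base image differ):

    # default: signals are forwarded to the whole process group, so every
    # forked heat-engine worker gets SIGTERM at once
    ENTRYPOINT ["dumb-init", "--"]

    # --single-child: only the direct child (the oslo.service parent) is
    # signalled, and it can then shut its workers down gracefully
    ENTRYPOINT ["dumb-init", "--single-child", "--"]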

Comment 5 Zane Bitter 2018-10-24 13:37:50 UTC
Yeah, that sounds easier than teaching oslo.service to handle SIGTERM in the children as well as the parent.

Comment 8 Rabi Mishra 2019-01-23 06:53:40 UTC
The funny thing is that after all the effort around the upstream kolla fix and its backports, I noticed that we don't use dumb-init in the openstack-base image downstream. So that fix is not relevant downstream.

We're seeing the issue on OSP13/14 because the default timeout for docker stop/restart to kill the container is 10s (see the help output below), which is too little for heat-engine to stop gracefully.

A docker restart command with a proper timeout would fix the issue. Though the required value may vary with the amount of work heat-engine has to do before stopping gracefully, >60s seems to work fine in my setup.

(undercloud) [stack@undercloud-0 ~]$ sudo docker stop --help

Usage:	docker stop [OPTIONS] CONTAINER [CONTAINER...]

Stop one or more running containers

Options:
      --help       Print usage
  -t, --time int   Seconds to wait for stop before killing it (default 10)
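So something like this (the 120s value is just illustrative; as noted above, anything >60s worked in my setup):

    (undercloud) [stack@undercloud-0 ~]$ sudo docker restart --time 120 heat_engine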

Comment 11 Steve Baker 2019-01-28 21:23:44 UTC
Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout

Comment 12 Rabi Mishra 2019-01-29 12:27:16 UTC
> Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout

Yeah, that's something we could leverage (I don't see it being used by any service atm), but I don't know if we can pick an optimum stop-timeout based on the number of engine workers and the work left for them to complete. I guess the user can always pass an appropriate value when stopping the containers.
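If we did wire it in, it would just be a per-container key in the service's docker_config section (sketch only; the surrounding template structure is abbreviated, the image value is a placeholder, and the 120s value is illustrative):

      docker_config:
        step_4:
          heat_engine:
            image: <heat-engine container image>
            restart: always
            stop_grace_period: 120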

Comment 13 Rabi Mishra 2019-01-29 13:58:30 UTC
Also, it seems stop_grace_period (paunch 3.1.0) is only available from OSP14 onwards, and it requires docker API 1.25+.

Comment 14 David Vallee Delisle 2019-01-29 16:38:38 UTC
I believe an easy way to clean this up is to run "heat-manage service clean". Can we have this cron'd, something like the sketch below?
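Illustrative only; the exact invocation depends on how the container is deployed:

    # purge engine records that stopped reporting, once an hour
    0 * * * * root docker exec heat_engine heat-manage service clean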

Comment 15 Rabi Mishra 2019-01-29 16:55:45 UTC
> Can we have this cron'd?

I don't think we're trying to address how to clean up the stale engines from the db; the goal is to shut the engines down gracefully rather than kill them. Users can clean up manually whenever they want, and the stale engines in the list don't cause any issues.

Comment 16 David Vallee Delisle 2019-01-29 16:58:08 UTC
I'm sorry, I was trying to work around the fact that stop_grace_period was only available in paunch 3.1.0.

Comment 17 Alex Schultz 2019-06-21 20:52:31 UTC
In order to address this, we would have to package dumb-init for < OSP15 and get it added to the container build process. We'll need to investigate how much effort that will take, but it might also introduce some additional behavior changes.

Comment 26 Rabi Mishra 2020-12-21 06:25:42 UTC
*** Bug 1909579 has been marked as a duplicate of this bug. ***