If the process is stopped gracefully, the entry should be deleted from the database: https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n458

If the process is killed, there is a cleanup that runs at startup: https://git.openstack.org/cgit/openstack/heat/tree/heat/engine/service.py?h=stable%2Fqueens#n2328 but it can only detect entries that have timed out (i.e. processes killed more than 3 minutes ago). That's not going to include the container that was just restarted, although it should eventually clear up older entries from previous restarts more than 3 minutes old.

The solution would be to start a timer after startup that checks for entries to clear out after 3 minutes, instead of doing it straight away.

A better question might be why the processes are getting killed instead of being stopped gracefully.
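A minimal sketch of that deferred cleanup, assuming a timer started from the engine's start() that fires the existing cleanup once the 3-minute service timeout has elapsed (the timer wiring here is hypothetical; only the service_manage_cleanup name is taken from the linked service.py):

import threading

# Sketch: defer the stale-engine cleanup until entries left by a process
# killed just before this restart are old enough to be detected as timed
# out (>3 minutes), instead of running it immediately at startup.
CLEANUP_DELAY = 180  # seconds; matches the 3-minute service timeout

class EngineService(object):

    def start(self):
        # ... existing startup work ...
        timer = threading.Timer(CLEANUP_DELAY, self.service_manage_cleanup)
        timer.daemon = True  # don't block interpreter shutdown
        timer.start()

    def service_manage_cleanup(self):
        # Existing logic: delete engine records whose updated_at is older
        # than the timeout, i.e. engines that died without cleaning up.
        pass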
> A better question might be why the processes are getting killed instead of gracefully stopped.

When docker stop/restart is done, SIGTERM is sent to the parent process. Kolla containers use dumb-init[1] as PID 1, and dumb-init forwards signals to all child processes in the root session. It seems the forked children also receive SIGTERM and are killed immediately[2] rather than the parent waiting for them to finish. Maybe dumb-init should forward signals to the direct child only, with the --single-child option?

[1] https://github.com/openstack/kolla/blob/master/docker/base/Dockerfile.j2#L403
[2] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L623
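As a rough Python illustration of the two behaviours (an approximation, not dumb-init's actual code): signalling the whole session/process group is what hits the forked oslo.service workers directly, while signalling only the direct child would let the parent stop its workers gracefully, which is roughly what --single-child gives you:

import os
import signal
import subprocess
import sys

# Usage: python forwarder.py <command> [args...]
# Start the real service as a direct child in its own session,
# similar to what dumb-init does as PID 1.
child = subprocess.Popen(sys.argv[1:], start_new_session=True)

def forward_group(signum, frame):
    # Default-style forwarding: signal the child's whole process group,
    # so forked workers receive SIGTERM directly and exit at once.
    os.killpg(os.getpgid(child.pid), signum)

def forward_single(signum, frame):
    # --single-child-style forwarding: signal only the direct child; the
    # oslo.service parent is then free to shut its workers down cleanly.
    child.send_signal(signum)

# Install one of the two handlers to compare the behaviours.
signal.signal(signal.SIGTERM, forward_single)
sys.exit(child.wait())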
Yeah, that sounds easier than teaching oslo.service to handle SIGTERM in the children as well as the parent.
So the funny thing is, after all the effort around the upstream kolla fix and backports, I noticed that we don't use dumb-init in the openstack-base image downstream, so the fix is not relevant for downstream.

We're seeing the issue on OSP13/14 because the default timeout for docker stop/restart to kill the container is 10s (see the help output below), which is very little for heat-engine to stop gracefully. A docker restart command with a proper timeout would fix the issue. Though it may vary with the amount of work heat-engine has to do before stopping gracefully, >60s seems to work fine in my setup.

(undercloud) [stack@undercloud-0 ~]$ sudo docker stop --help

Usage:  docker stop [OPTIONS] CONTAINER [CONTAINER...]

Stop one or more running containers

Options:
      --help       Print usage
  -t, --time int   Seconds to wait for stop before killing it (default 10)
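For example, restarting the engine container with a 60-second kill timeout (the container name heat_engine is assumed here):

(undercloud) [stack@undercloud-0 ~]$ sudo docker restart --time 60 heat_engine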
Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout.
> Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout.

Yeah, that's something we could possibly leverage (I don't see it being used by any service atm), but I don't know if we can set an optimum stop-timeout based on the number of engine workers and the work left to be completed by them. I guess the user can always provide an appropriate value when stopping the containers.
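For illustration, something like this in the heat-engine service template is what's being discussed; this is only a sketch (the key layout follows the usual docker_config sections in tripleo-heat-templates, and the 300s value is an arbitrary guess, not a tuned number):

docker_config:
  step_4:
    heat_engine:
      image: {get_param: DockerHeatEngineImage}
      restart: always
      # Hypothetical: seconds paunch would pass through to docker's
      # stop timeout before SIGKILL is sent
      stop_grace_period: 300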
Also, it seems stop_grace_period (paunch 3.1.0) is only available from OSP14 onwards, and it requires docker API 1.25+.
I believe an easy way to clean this up is to run "heat-manage service clean". Can we have this cron'd?
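For example, a crontab entry on a node with access to heat's config (the daily schedule here is arbitrary):

0 4 * * * heat-manage service clean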
> Can we have this cron'd?

I don't think we're trying to address how to clean up the stale engines from the db, but how to shut down the engines gracefully rather than killing them. The user can clean up manually whenever they want; the engines in the list don't cause any issues.
I'm sorry, I was trying to work around the fact that stop_grace_period is only available from paunch 3.1.0.
In order to address this we would have to package dumb-init for < OSP15 and get it added to the container build process. We'll need to investigate how much effort this will be, but it might also introduce some additional behavior changes.
*** Bug 1909579 has been marked as a duplicate of this bug. ***