Bug 1641667
Summary: | When restart heat_engine container, it leaves dead services in heat service-list | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | David Vallee Delisle <dvd> |
Component: | openstack-tripleo-heat-templates | Assignee: | Rabi Mishra <ramishra> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Sasha Smolyak <ssmolyak> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 13.0 (Queens) | CC: | aschultz, dbecker, emacchi, jbuchta, jslagle, marjones, mburns, mcornea, morazi, ohochman, ramishra, rlondhe, sbaker, shardy, zbitter |
Target Milestone: | ga | Keywords: | Reopened, Triaged, ZStream |
Target Release: | 16.0 (Train on RHEL 8.1) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1390313 | Environment: | |
Last Closed: | 2020-05-29 13:20:57 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1390313, 1747442, 1747445 | ||
Bug Blocks: |
Comment 2
Zane Bitter
2018-10-22 16:20:20 UTC
> A better question might be why the processes are getting killed instead of gracefully stopped.

When docker stop/restart is done, SIGTERM is sent to the parent process. Kolla containers use dumb-init[1] as PID 1; signals are forwarded by dumb-init to all child processes in the root session. It seems the forked children also receive SIGTERM and are killed immediately[2] rather than the parent waiting for them to finish. Maybe dumb-init should forward signals to the direct child only, with the --single-child option?
[1] https://github.com/openstack/kolla/blob/master/docker/base/Dockerfile.j2#L403
[2] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L623

Yeah, that sounds easier than teaching oslo.service to handle SIGTERM in the children as well as the parent.

So the funny thing is, after all the effort around the upstream kolla fix and backports, I noticed that we don't use dumb-init in the openstack-base image downstream. So the fix is not relevant for downstream.
We're seeing the issue on OSP13/14 because the default timeout[1] before docker stop/restart kills the container is 10s, which is too short for heat engine to stop gracefully. A docker restart command with a proper timeout would fix the issue. Though it may vary with the amount of work heat engine has to do before stopping gracefully, >60s seems to work fine in my setup.
(undercloud) [stack@undercloud-0 ~]$ sudo docker stop --help
Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Stop one or more running containers
Options:
      --help       Print usage
  -t, --time int   Seconds to wait for stop before killing it (default 10)

Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout

> Does the heat-engine container in tripleo-heat-templates have an appropriate stop_grace_period set? This maps to docker run --stop-timeout
Yeah, though that's something we can possibly leverage (I don't see it being used by any service atm), I don't know if we can set an optimum stop-timeout based on number of engine workers and the work left to be completed by those. I guess the user can always provide some appropriate value when stopping the containers.
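As a rough illustration of passing an explicit timeout when restarting the container, here is a minimal Python sketch. The container name heat_engine comes from the bug summary; the 90 second default is only an assumption based on the ">60s" observation in this thread, and the helper names are hypothetical. The sketch builds the command and only executes it when asked to.

```python
import subprocess


def docker_restart_cmd(container, timeout_s):
    """Build a `docker restart` command that waits timeout_s seconds before killing."""
    if timeout_s < 0:
        raise ValueError("timeout_s must be non-negative")
    # -t/--time: seconds to wait for stop before killing the container.
    return ["docker", "restart", "-t", str(timeout_s), container]


def restart_container(container="heat_engine", timeout_s=90, dry_run=True):
    """Return the command; run it through sudo only when dry_run is False."""
    cmd = docker_restart_cmd(container, timeout_s)
    if not dry_run:
        # Assumption: docker needs sudo on the node, as in the console output in this thread.
        subprocess.run(["sudo", *cmd], check=True)
    return cmd
```

Calling restart_container(dry_run=False) on a node would run the restart with the longer timeout; the right value still depends on how much work the engine workers have left.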
Also, it seems stop_grace_period (paunch 3.1.0) is only available from OSP14 onwards and also requires docker API 1.25+.

I believe an easy way to clean this up is to run "heat-manage service clean". Can we have this cron'd?

> Can we have this cron'd?
I don't think we're trying to address how to clean up the stale engines from the db, but rather how to shut down the engines gracefully instead of killing them. Users can clean up manually whenever they want. Those engines in the list don't create any issues.
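To make the graceful-stop-versus-kill distinction concrete, the following Python sketch imitates the stop semantics described in this thread: send SIGTERM, wait for a grace period, then fall back to SIGKILL. It is not heat-engine or docker code. The child script is a hypothetical stand-in for a worker that needs some time to finish in-flight work, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical worker: on SIGTERM it "drains" for argv[1] seconds, then exits cleanly.
CHILD = r"""
import signal, sys, time
def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # pretend to finish remaining work
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def stop_with_grace(drain_seconds, grace_seconds):
    """SIGTERM the worker, wait up to grace_seconds, then SIGKILL it if still running."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(drain_seconds)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the worker has installed its handler
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        outcome = "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period exceeded: force-kill the worker
        proc.wait()
        outcome = "killed"
    proc.stdout.close()
    return outcome, proc.returncode
```

When the grace period is longer than the drain time the worker exits on its own with status 0; when it is shorter the worker is force-killed before it can finish, which is the situation that leaves stale entries in heat service-list.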
I'm sorry, I was trying to work around the fact that stop_grace_period was only available in paunch 3.1.0.

In order to address this we have to package dumb-init for < OSP15 and get it added to the container build process. We'll need to investigate how much effort this will be, but it also might introduce some additional behavior changes.

*** Bug 1909579 has been marked as a duplicate of this bug. ***