Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:

`heat-manage service list` for each controller gives the following (example) output:

## the controller names have been masked in order for this comment to be made public

     1  ctrl0...  engine  down  2019-07-29 16:49:00
     2  ctrl0...  engine  down  2019-07-29 16:49:00
     3  ctrl0...  engine  down  2019-07-29 16:49:00
     4  ctrl0...  engine  down  2019-07-29 16:49:00
     5  ctrl0...  engine  down  2019-07-29 16:49:00
     6  ctrl0...  engine  down  2019-07-29 16:49:00
     7  ctrl0...  engine  down  2019-07-29 16:49:00
     8  ctrl0...  engine  down  2019-07-29 16:49:00
     9  ctrl0...  engine  up    2019-08-16 07:51:59
    10  ctrl0...  engine  up    2019-08-16 07:51:59
    11  ctrl0...  engine  up    2019-08-16 07:51:59
    12  ctrl0...  engine  up    2019-08-16 07:51:59
    13  ctrl0...  engine  up    2019-08-16 07:51:59
    14  ctrl0...  engine  up    2019-08-16 07:51:59
    15  ctrl0...  engine  up    2019-08-16 07:51:59
    16  ctrl0...  engine  up    2019-08-16 07:52:00
    17  ctrl1...  engine  down  2019-06-13 12:18:51
    18  ctrl1...  engine  down  2019-06-13 12:18:51
    19  ctrl1...  engine  down  2019-06-13 12:18:51
    20  ctrl1...  engine  down  2019-06-13 12:18:51
    21  ctrl1...  engine  down  2019-06-13 12:18:51
    22  ctrl1...  engine  down  2019-06-13 12:18:51
    23  ctrl1...  engine  down  2019-06-13 12:18:51
    24  ctrl1...  engine  down  2019-06-13 12:18:52
    25  ctrl1...  engine  up    2019-08-16 07:51:48
    26  ctrl1...  engine  up    2019-08-16 07:51:48
    27  ctrl1...  engine  up    2019-08-16 07:51:48
    28  ctrl1...  engine  up    2019-08-16 07:51:48
    29  ctrl1...  engine  up    2019-08-16 07:51:48
    30  ctrl1...  engine  up    2019-08-16 07:51:48
    31  ctrl1...  engine  up    2019-08-16 07:51:48
    32  ctrl1...  engine  up    2019-08-16 07:51:48
    33  ctrl2...  engine  down  2019-06-13 13:01:32
    34  ctrl2...  engine  down  2019-06-13 13:01:32
    35  ctrl2...  engine  down  2019-06-13 13:01:32
    36  ctrl2...  engine  down  2019-06-13 13:01:32
    37  ctrl2...  engine  down  2019-06-13 13:01:32
    38  ctrl2...  engine  down  2019-06-13 13:01:32
    39  ctrl2...  engine  down  2019-06-13 13:01:32
    40  ctrl2...  engine  down  2019-06-13 13:01:32
    41  ctrl2...  engine  up    2019-08-16 07:51:13
    42  ctrl2...  engine  up    2019-08-16 07:51:13
    43  ctrl2...  engine  up    2019-08-16 07:51:14
    44  ctrl2...  engine  up    2019-08-16 07:51:14
    45  ctrl2...  engine  up    2019-08-16 07:51:14
    46  ctrl2...  engine  up    2019-08-16 07:51:14
    47  ctrl2...  engine  up    2019-08-16 07:51:14
    48  ctrl2...  engine  up    2019-08-16 07:51:14

Each controller has:

    8 logical processors
    8 Intel Core Processor (Broadwell, IBRS) (flags: aes,constant_tsc,lm,nx,pae)

num_engine_workers is set to 'None':

    /etc/heat/heat.conf >> num_engine_workers = <None>

Observed behaviour of the environment:

1) Upon executing `heat-manage service clean`, the dead heat-engine processes *are* getting properly cleaned, i.e. we see only:

    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:51:59
    ctrl0...  engine  up    2019-08-16 07:52:00
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl1...  engine  up    2019-08-16 07:51:48
    ctrl2...  engine  up    2019-08-16 07:51:13
    ctrl2...  engine  up    2019-08-16 07:51:13
    ctrl2...  engine  up    2019-08-16 07:51:14
    ctrl2...  engine  up    2019-08-16 07:51:14
    ctrl2...  engine  up    2019-08-16 07:51:14
    ctrl2...  engine  up    2019-08-16 07:51:14
    ctrl2...  engine  up    2019-08-16 07:51:14
    ctrl2...  engine  up    2019-08-16 07:51:14

2) Observing the environment for a couple of days, we see *no* dead heat-engine processes re-appear on their own.

3) However, restarting the heat services often triggers the same behaviour, i.e. the old dead heat-engines are never getting cleaned up.
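To make the before/after comparison easier to repeat, here is a minimal sketch that tallies engine records per controller and status. It assumes the trimmed, masked column layout shown in this comment (optional row number, host, binary, status, date, time); the real `heat-manage service list` output has more columns, so the field positions would need adjusting. The function name `summarize_engines` and the sample text are hypothetical, for illustration only.

```python
from collections import Counter


def summarize_engines(listing):
    """Count engine records per (host, status) in the trimmed listing format.

    Assumed layout per line (as pasted in this comment, NOT the full
    heat-manage table): [row number] host binary status date time
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Drop the leading row number when one is present.
        if fields and fields[0].isdigit():
            fields = fields[1:]
        # Only count lines whose third field is a known status.
        if len(fields) >= 3 and fields[2] in ("up", "down"):
            counts[(fields[0], fields[2])] += 1
    return counts


# Hypothetical excerpt in the same shape as the listing above.
sample = """\
 1 ctrl0... engine down 2019-07-29 16:49:00
 9 ctrl0... engine up   2019-08-16 07:51:59
16 ctrl0... engine up   2019-08-16 07:52:00
17 ctrl1... engine down 2019-06-13 12:18:51
"""

for (host, status), n in sorted(summarize_engines(sample).items()):
    print(host, status, n)
```

Run against the full listing above, this should report 8 `down` and 8 `up` records per controller before the clean, and only the 8 `up` records per controller afterwards, which matches the 8 logical processors on each node.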

Given that RHOSP-10 is EOL and is heading towards the end of Maintenance Support as well (December 16, 2019), I would like to get a simple ACK/NACK: can we consider the behaviour described above as normal?

> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?

Graceful is SIGTERM. Non-graceful is SIGKILL.

systemd will do SIGTERM and then wait 90s for the process to exit before doing SIGKILL. To ensure a graceful restart, you can do:

    systemctl --signal=SIGTERM kill heat-engine.service
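
For anyone who wants to see the difference between the two outcomes, the SIGTERM-then-SIGKILL sequence described above can be sketched roughly as follows. This is only an illustration of the pattern, not systemd's implementation; `stop_gracefully` is a hypothetical helper, the 90-second default simply mirrors the timeout mentioned above, and a POSIX host is assumed.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, timeout=90.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Returns "graceful" if the process exited after SIGTERM,
    or "killed" if SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        # The process did not exit in time: this is the non-graceful path.
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in child process; a real service would run its own shutdown
    # handling when it receives SIGTERM.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(stop_gracefully(child, timeout=5))
```

A process that only ever takes the "killed" path gets no chance to run its shutdown code, which is why the restart method matters when asking whether stale engine records are expected.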