Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:
`heat-manage service list` for each controller gives the following (example) output:
## the controller names have been masked in order for this comment to be made public
1 ctrl0... engine down 2019-07-29 16:49:00
2 ctrl0... engine down 2019-07-29 16:49:00
3 ctrl0... engine down 2019-07-29 16:49:00
4 ctrl0... engine down 2019-07-29 16:49:00
5 ctrl0... engine down 2019-07-29 16:49:00
6 ctrl0... engine down 2019-07-29 16:49:00
7 ctrl0... engine down 2019-07-29 16:49:00
8 ctrl0... engine down 2019-07-29 16:49:00
9 ctrl0... engine up 2019-08-16 07:51:59
10 ctrl0... engine up 2019-08-16 07:51:59
11 ctrl0... engine up 2019-08-16 07:51:59
12 ctrl0... engine up 2019-08-16 07:51:59
13 ctrl0... engine up 2019-08-16 07:51:59
14 ctrl0... engine up 2019-08-16 07:51:59
15 ctrl0... engine up 2019-08-16 07:51:59
16 ctrl0... engine up 2019-08-16 07:52:00
17 ctrl1... engine down 2019-06-13 12:18:51
18 ctrl1... engine down 2019-06-13 12:18:51
19 ctrl1... engine down 2019-06-13 12:18:51
20 ctrl1... engine down 2019-06-13 12:18:51
21 ctrl1... engine down 2019-06-13 12:18:51
22 ctrl1... engine down 2019-06-13 12:18:51
23 ctrl1... engine down 2019-06-13 12:18:51
24 ctrl1... engine down 2019-06-13 12:18:52
25 ctrl1... engine up 2019-08-16 07:51:48
26 ctrl1... engine up 2019-08-16 07:51:48
27 ctrl1... engine up 2019-08-16 07:51:48
28 ctrl1... engine up 2019-08-16 07:51:48
29 ctrl1... engine up 2019-08-16 07:51:48
30 ctrl1... engine up 2019-08-16 07:51:48
31 ctrl1... engine up 2019-08-16 07:51:48
32 ctrl1... engine up 2019-08-16 07:51:48
33 ctrl2... engine down 2019-06-13 13:01:32
34 ctrl2... engine down 2019-06-13 13:01:32
35 ctrl2... engine down 2019-06-13 13:01:32
36 ctrl2... engine down 2019-06-13 13:01:32
37 ctrl2... engine down 2019-06-13 13:01:32
38 ctrl2... engine down 2019-06-13 13:01:32
39 ctrl2... engine down 2019-06-13 13:01:32
40 ctrl2... engine down 2019-06-13 13:01:32
41 ctrl2... engine up 2019-08-16 07:51:13
42 ctrl2... engine up 2019-08-16 07:51:13
43 ctrl2... engine up 2019-08-16 07:51:14
44 ctrl2... engine up 2019-08-16 07:51:14
45 ctrl2... engine up 2019-08-16 07:51:14
46 ctrl2... engine up 2019-08-16 07:51:14
47 ctrl2... engine up 2019-08-16 07:51:14
48 ctrl2... engine up 2019-08-16 07:51:14
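To make the pattern in the listing easier to see, here is a minimal sketch that counts 'up' and 'down' engine entries per controller. It assumes the abbreviated, masked line format used in this comment (optional id, host, binary, status, date, time), not the full table that `heat-manage service list` prints. The names `SAMPLE` and `summarize` are hypothetical.

```python
from collections import Counter

# A few lines copied from the listing in this comment (masked host names).
SAMPLE = """\
1 ctrl0... engine down 2019-07-29 16:49:00
9 ctrl0... engine up 2019-08-16 07:51:59
17 ctrl1... engine down 2019-06-13 12:18:51
25 ctrl1... engine up 2019-08-16 07:51:48
ctrl2... engine up 2019-08-16 07:51:13
"""


def summarize(listing):
    """Count entries per (host, status) in the abbreviated listing format."""
    counts = Counter()
    for line in listing.splitlines():
        parts = line.split()
        # Fields are read from the right so that the leading id is optional:
        # [id] host binary status date time
        if len(parts) < 5 or parts[-3] not in ("up", "down"):
            continue  # skip notes and anything that is not a service line
        host, status = parts[-5], parts[-3]
        counts[(host, status)] += 1
    return counts


if __name__ == "__main__":
    for (host, status), n in sorted(summarize(SAMPLE).items()):
        print(host, status, n)
```

Run against the full listing above, this would report 8 'down' and 8 'up' entries for each of the three controllers.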
Each controller has:
8 logical processors
8 Intel Core Processor (Broadwell, IBRS) (flags: aes,constant_tsc,lm,nx,pae)
num_engine_workers is set to 'None':
/etc/heat/heat.conf >> num_engine_workers = <None>
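The 8 'up' engines per controller match the 8 logical processors. A minimal sketch of why, under the assumption (taken from my reading of the option's help text, not verified against the RHOSP-10 code) that an unset num_engine_workers defaults to 4 or the number of CPUs, whichever is greater. The function name `default_engine_workers` is hypothetical.

```python
import os


def default_engine_workers(cpu_count=None):
    """Worker count used when num_engine_workers is unset.

    Assumption: heat-engine falls back to max(4, number of CPUs).
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(4, cpu_count)


if __name__ == "__main__":
    # 8 logical processors per controller, as reported in this comment.
    print(default_engine_workers(8))
```

If that assumption holds, every restart of heat-engine on a controller registers 8 new engine entries, which is consistent with the blocks of 8 seen in the listing.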
Observed behaviour of the environment:
1) After running `heat-manage service clean`, the dead heat-engine entries *are* cleaned up properly, i.e. we see only:
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:51:59
ctrl0... engine up 2019-08-16 07:52:00
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl1... engine up 2019-08-16 07:51:48
ctrl2... engine up 2019-08-16 07:51:13
ctrl2... engine up 2019-08-16 07:51:13
ctrl2... engine up 2019-08-16 07:51:14
ctrl2... engine up 2019-08-16 07:51:14
ctrl2... engine up 2019-08-16 07:51:14
ctrl2... engine up 2019-08-16 07:51:14
ctrl2... engine up 2019-08-16 07:51:14
ctrl2... engine up 2019-08-16 07:51:14
2) After observing the environment for a couple of days, we see *no* dead heat-engine entries reappear on their own.
3) However, restarting the heat services often triggers the same behaviour again, i.e. the old, dead heat-engine entries are never cleaned up.
Given that RHOSP-10 is EOL and is also heading towards the end of Maintenance Support (December 16, 2019), I would like to get a simple ACK/NACK on whether we can consider the behaviour described above normal.
> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?
Graceful is SIGTERM. Non-graceful is SIGKILL.
On stop or restart, systemd sends SIGTERM and then waits 90s (the default stop timeout) for the process to exit before sending SIGKILL.
To ensure a graceful restart, you can do:
systemctl --signal=SIGTERM kill heat-engine.service
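For illustration, the SIGTERM-then-SIGKILL sequence described above can be sketched as follows. This is not systemd's implementation, only the same idea applied to a child process. The function name `stop_gracefully` is hypothetical, and the 90 second default mirrors the timeout mentioned above.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, timeout=90.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Returns "graceful" if the process exited within the timeout after
    SIGTERM, otherwise "killed".
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; the process cannot clean up after this
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a service process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_gracefully(child, timeout=5.0))
```

A process that ends up in the "killed" branch gets no chance to run its shutdown code, which is the non-graceful case.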