Bug 1747443 - heat service-list shows heat-engine processes down
Summary: heat service-list shows heat-engine processes down
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Rabi Mishra
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On: 1330443
Blocks:
 
Reported: 2019-08-30 13:07 UTC by Irina Petrova
Modified: 2019-12-20 06:08 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1330443
Environment:
Last Closed: 2019-12-20 06:08:57 UTC
Target Upstream Version:
Embargoed:



Comment 2 Irina Petrova 2019-08-30 13:25:17 UTC
Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:

`heat-manage service list` for each controller gives the following (example) output:
 ## the controller names have been masked so that this comment can be made public

  1  ctrl0... engine     down       2019-07-29 16:49:00   
  2  ctrl0... engine     down       2019-07-29 16:49:00   
  3  ctrl0... engine     down       2019-07-29 16:49:00   
  4  ctrl0... engine     down       2019-07-29 16:49:00   
  5  ctrl0... engine     down       2019-07-29 16:49:00   
  6  ctrl0... engine     down       2019-07-29 16:49:00   
  7  ctrl0... engine     down       2019-07-29 16:49:00   
  8  ctrl0... engine     down       2019-07-29 16:49:00   
  9  ctrl0... engine     up         2019-08-16 07:51:59   
 10  ctrl0... engine     up         2019-08-16 07:51:59   
 11  ctrl0... engine     up         2019-08-16 07:51:59   
 12  ctrl0... engine     up         2019-08-16 07:51:59   
 13  ctrl0... engine     up         2019-08-16 07:51:59   
 14  ctrl0... engine     up         2019-08-16 07:51:59   
 15  ctrl0... engine     up         2019-08-16 07:51:59   
 16  ctrl0... engine     up         2019-08-16 07:52:00   

 17  ctrl1... engine     down       2019-06-13 12:18:51   
 18  ctrl1... engine     down       2019-06-13 12:18:51   
 19  ctrl1... engine     down       2019-06-13 12:18:51   
 20  ctrl1... engine     down       2019-06-13 12:18:51   
 21  ctrl1... engine     down       2019-06-13 12:18:51   
 22  ctrl1... engine     down       2019-06-13 12:18:51   
 23  ctrl1... engine     down       2019-06-13 12:18:51   
 24  ctrl1... engine     down       2019-06-13 12:18:52   
 25  ctrl1... engine     up         2019-08-16 07:51:48   
 26  ctrl1... engine     up         2019-08-16 07:51:48   
 27  ctrl1... engine     up         2019-08-16 07:51:48   
 28  ctrl1... engine     up         2019-08-16 07:51:48   
 29  ctrl1... engine     up         2019-08-16 07:51:48   
 30  ctrl1... engine     up         2019-08-16 07:51:48   
 31  ctrl1... engine     up         2019-08-16 07:51:48   
 32  ctrl1... engine     up         2019-08-16 07:51:48   

 33  ctrl2... engine     down       2019-06-13 13:01:32   
 34  ctrl2... engine     down       2019-06-13 13:01:32   
 35  ctrl2... engine     down       2019-06-13 13:01:32   
 36  ctrl2... engine     down       2019-06-13 13:01:32   
 37  ctrl2... engine     down       2019-06-13 13:01:32   
 38  ctrl2... engine     down       2019-06-13 13:01:32   
 39  ctrl2... engine     down       2019-06-13 13:01:32   
 40  ctrl2... engine     down       2019-06-13 13:01:32   
 41  ctrl2... engine     up         2019-08-16 07:51:13   
 42  ctrl2... engine     up         2019-08-16 07:51:13   
 43  ctrl2... engine     up         2019-08-16 07:51:14   
 44  ctrl2... engine     up         2019-08-16 07:51:14   
 45  ctrl2... engine     up         2019-08-16 07:51:14   
 46  ctrl2... engine     up         2019-08-16 07:51:14   
 47  ctrl2... engine     up         2019-08-16 07:51:14   
 48  ctrl2... engine     up         2019-08-16 07:51:14   

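The `down` rows above are stale service records whose heartbeat timestamp stopped being refreshed. This is not heat's actual code, but a hypothetical sketch of the kind of staleness check a service list can apply: a record is reported `down` once its last heartbeat is older than twice the periodic report interval (the function name `service_status` and the 60s default are assumptions for illustration):

```python
from datetime import datetime, timedelta

def service_status(updated_at, now=None, periodic_interval=60):
    """Report 'down' when the last heartbeat is older than twice the
    periodic report interval; otherwise 'up'. Hypothetical sketch, not
    heat's actual implementation."""
    now = now or datetime.utcnow()
    if now - updated_at > timedelta(seconds=2 * periodic_interval):
        return 'down'
    return 'up'
```

Under such a check, the rows last updated in June would stay `down` forever, because the processes that owned those records no longer exist to refresh them.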

Each controller has:

  8 logical processors 
  8 Intel Core Processor (Broadwell, IBRS) (flags: aes,constant_tsc,lm,nx,pae) 


num_engine_workers is set to 'None':

/etc/heat/heat.conf >> num_engine_workers = <None>
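This lines up with the output above: with `num_engine_workers` unset, the engine falls back to a CPU-derived worker count, and each 8-CPU controller shows 8 `up` engine rows. A minimal sketch of that fallback, assuming an oslo-style "default to CPU count when unset" rule (the helper name `engine_worker_count` is made up for illustration):

```python
import multiprocessing

def engine_worker_count(num_engine_workers=None):
    """Return the configured worker count, falling back to one worker
    per logical CPU when num_engine_workers is None (assumed default
    behaviour; 8 CPUs -> 8 engine rows per controller)."""
    if num_engine_workers:
        return num_engine_workers
    return multiprocessing.cpu_count()
```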


Observed behaviour of the environment:

1) Upon executing `heat-manage service clean`, the records for the dead heat-engine processes *are* properly cleaned up, i.e. we see only:

 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:52:00

 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48

 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14

2) Observing the environment for a couple of days, we see *no* dead heat-engine processes re-appear on their own.

3) However, restarting the heat services often reintroduces the same behaviour: the records for the old, dead heat-engines are left behind and are never cleaned up automatically.


Given that RHOSP-10 is EOL and also approaching the end of Maintenance Support (December 16, 2019), I would like a simple ACK/NACK on whether the behaviour described above can be considered normal.

Comment 8 Zane Bitter 2019-09-05 00:58:56 UTC
> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?

Graceful is SIGTERM. Non-graceful is SIGKILL.

systemd sends SIGTERM, waits up to 90s for the process to exit, and then sends SIGKILL.

To ensure a graceful restart, you can do:

  systemctl --signal=SIGTERM kill heat-engine.service

