Bug 1747443 - heat service-list shows heat-engine processes down
Summary: heat service-list shows heat-engine processes down
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Rabi Mishra
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On: 1330443
Blocks:
 
Reported: 2019-08-30 13:07 UTC by Irina Petrova
Modified: 2019-12-20 06:08 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1330443
Environment:
Last Closed: 2019-12-20 06:08:57 UTC
Target Upstream Version:
Embargoed:



Comment 2 Irina Petrova 2019-08-30 13:25:17 UTC
Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:

`heat-manage service list` for each controller gives the following (example) output:
 ## the controller names have been masked so that this comment can be made public

  1  ctrl0... engine     down       2019-07-29 16:49:00   
  2  ctrl0... engine     down       2019-07-29 16:49:00   
  3  ctrl0... engine     down       2019-07-29 16:49:00   
  4  ctrl0... engine     down       2019-07-29 16:49:00   
  5  ctrl0... engine     down       2019-07-29 16:49:00   
  6  ctrl0... engine     down       2019-07-29 16:49:00   
  7  ctrl0... engine     down       2019-07-29 16:49:00   
  8  ctrl0... engine     down       2019-07-29 16:49:00   
  9  ctrl0... engine     up         2019-08-16 07:51:59   
 10  ctrl0... engine     up         2019-08-16 07:51:59   
 11  ctrl0... engine     up         2019-08-16 07:51:59   
 12  ctrl0... engine     up         2019-08-16 07:51:59   
 13  ctrl0... engine     up         2019-08-16 07:51:59   
 14  ctrl0... engine     up         2019-08-16 07:51:59   
 15  ctrl0... engine     up         2019-08-16 07:51:59   
 16  ctrl0... engine     up         2019-08-16 07:52:00   

 17  ctrl1... engine     down       2019-06-13 12:18:51   
 18  ctrl1... engine     down       2019-06-13 12:18:51   
 19  ctrl1... engine     down       2019-06-13 12:18:51   
 20  ctrl1... engine     down       2019-06-13 12:18:51   
 21  ctrl1... engine     down       2019-06-13 12:18:51   
 22  ctrl1... engine     down       2019-06-13 12:18:51   
 23  ctrl1... engine     down       2019-06-13 12:18:51   
 24  ctrl1... engine     down       2019-06-13 12:18:52   
 25  ctrl1... engine     up         2019-08-16 07:51:48   
 26  ctrl1... engine     up         2019-08-16 07:51:48   
 27  ctrl1... engine     up         2019-08-16 07:51:48   
 28  ctrl1... engine     up         2019-08-16 07:51:48   
 29  ctrl1... engine     up         2019-08-16 07:51:48   
 30  ctrl1... engine     up         2019-08-16 07:51:48   
 31  ctrl1... engine     up         2019-08-16 07:51:48   
 32  ctrl1... engine     up         2019-08-16 07:51:48   

 33  ctrl2... engine     down       2019-06-13 13:01:32   
 34  ctrl2... engine     down       2019-06-13 13:01:32   
 35  ctrl2... engine     down       2019-06-13 13:01:32   
 36  ctrl2... engine     down       2019-06-13 13:01:32   
 37  ctrl2... engine     down       2019-06-13 13:01:32   
 38  ctrl2... engine     down       2019-06-13 13:01:32   
 39  ctrl2... engine     down       2019-06-13 13:01:32   
 40  ctrl2... engine     down       2019-06-13 13:01:32   
 41  ctrl2... engine     up         2019-08-16 07:51:13   
 42  ctrl2... engine     up         2019-08-16 07:51:13   
 43  ctrl2... engine     up         2019-08-16 07:51:14   
 44  ctrl2... engine     up         2019-08-16 07:51:14   
 45  ctrl2... engine     up         2019-08-16 07:51:14   
 46  ctrl2... engine     up         2019-08-16 07:51:14   
 47  ctrl2... engine     up         2019-08-16 07:51:14   
 48  ctrl2... engine     up         2019-08-16 07:51:14   

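The `down` rows above are stale service records whose heartbeat timestamp stopped being refreshed. This is not heat's actual code, but a hypothetical sketch of the kind of staleness check a service list can apply: a record is reported `down` once its last heartbeat is older than twice the periodic report interval (the function name `service_status` and the 60s default are assumptions for illustration):

```python
from datetime import datetime, timedelta

def service_status(updated_at, now=None, periodic_interval=60):
    """Report 'down' when the last heartbeat is older than twice the
    periodic report interval; otherwise 'up'. Hypothetical sketch, not
    heat's actual implementation."""
    now = now or datetime.utcnow()
    if now - updated_at > timedelta(seconds=2 * periodic_interval):
        return 'down'
    return 'up'
```

Under such a check, the rows last updated in June would stay `down` forever, because the processes that owned those records no longer exist to refresh them.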

Each controller has:

  8 logical processors 
  8 Intel Core Processor (Broadwell, IBRS) (flags: aes,constant_tsc,lm,nx,pae) 


num_engine_workers is set to 'None':

/etc/heat/heat.conf >> num_engine_workers = <None>
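This lines up with the output above: with `num_engine_workers` unset, the engine falls back to a CPU-derived worker count, and each 8-CPU controller shows 8 `up` engine rows. A minimal sketch of that fallback, assuming an oslo-style "default to CPU count when unset" rule (the helper name `engine_worker_count` is made up for illustration):

```python
import multiprocessing

def engine_worker_count(num_engine_workers=None):
    """Return the configured worker count, falling back to one worker
    per logical CPU when num_engine_workers is None (assumed default
    behaviour; 8 CPUs -> 8 engine rows per controller)."""
    if num_engine_workers:
        return num_engine_workers
    return multiprocessing.cpu_count()
```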


Observed behaviour of the environment:

1) Upon executing `heat-manage service clean`, the records for the dead heat-engine processes *are* properly cleaned up, i.e. we see only:

 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:52:00

 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48

 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14

2) Observing the environment for a couple of days, we see *no* dead heat-engine processes re-appear on their own.

3) However, restarting the heat services often reintroduces the same behaviour: the records for the old, dead heat-engines are left behind and are never cleaned up automatically.


Given that RHOSP-10 is EOL and also approaching the end of Maintenance Support (December 16, 2019), I would like a simple ACK/NACK on whether the behaviour described above can be considered normal.

Comment 8 Zane Bitter 2019-09-05 00:58:56 UTC
> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?

Graceful is SIGTERM. Non-graceful is SIGKILL.

systemd sends SIGTERM, waits up to 90s for the process to exit, and then sends SIGKILL.

To ensure a graceful restart, you can do:

  systemctl --signal=SIGTERM kill heat-engine.service

