Bug 1747443

Summary:	heat service-list shows heat-engine processes down
Product:	Red Hat OpenStack	Reporter:	Irina Petrova <ipetrova>
Component:	openstack-heat	Assignee:	Rabi Mishra <ramishra>
Status:	CLOSED NOTABUG	QA Contact:	Victor Voronkov <vvoronko>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	10.0 (Newton)	CC:	augol, jslagle, lruzicka, mburns, nchandek, ramishra, rhel-osp-director-maint, sbaker, shardy, srevivo, vaggarwa, zbitter
Target Milestone:	async	Keywords:	Triaged, ZStream
Target Release:	10.0 (Newton)
Hardware:	All
OS:	Linux
Clone Of:	1330443
Last Closed:	2019-12-20 06:08:57 UTC
Bug Depends On:	1330443

Comment 2 Irina Petrova 2019-08-30 13:25:17 UTC

Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:

`heat-manage service list` for each controller gives the following (example) output:
 ## the controller names have been masked in order for this comment to be made public

  1  ctrl0... engine     down       2019-07-29 16:49:00   
  2  ctrl0... engine     down       2019-07-29 16:49:00   
  3  ctrl0... engine     down       2019-07-29 16:49:00   
  4  ctrl0... engine     down       2019-07-29 16:49:00   
  5  ctrl0... engine     down       2019-07-29 16:49:00   
  6  ctrl0... engine     down       2019-07-29 16:49:00   
  7  ctrl0... engine     down       2019-07-29 16:49:00   
  8  ctrl0... engine     down       2019-07-29 16:49:00   
  9  ctrl0... engine     up         2019-08-16 07:51:59   
 10  ctrl0... engine     up         2019-08-16 07:51:59   
 11  ctrl0... engine     up         2019-08-16 07:51:59   
 12  ctrl0... engine     up         2019-08-16 07:51:59   
 13  ctrl0... engine     up         2019-08-16 07:51:59   
 14  ctrl0... engine     up         2019-08-16 07:51:59   
 15  ctrl0... engine     up         2019-08-16 07:51:59   
 16  ctrl0... engine     up         2019-08-16 07:52:00   

 17  ctrl1... engine     down       2019-06-13 12:18:51   
 18  ctrl1... engine     down       2019-06-13 12:18:51   
 19  ctrl1... engine     down       2019-06-13 12:18:51   
 20  ctrl1... engine     down       2019-06-13 12:18:51   
 21  ctrl1... engine     down       2019-06-13 12:18:51   
 22  ctrl1... engine     down       2019-06-13 12:18:51   
 23  ctrl1... engine     down       2019-06-13 12:18:51   
 24  ctrl1... engine     down       2019-06-13 12:18:52   
 25  ctrl1... engine     up         2019-08-16 07:51:48   
 26  ctrl1... engine     up         2019-08-16 07:51:48   
 27  ctrl1... engine     up         2019-08-16 07:51:48   
 28  ctrl1... engine     up         2019-08-16 07:51:48   
 29  ctrl1... engine     up         2019-08-16 07:51:48   
 30  ctrl1... engine     up         2019-08-16 07:51:48   
 31  ctrl1... engine     up         2019-08-16 07:51:48   
 32  ctrl1... engine     up         2019-08-16 07:51:48   

 33  ctrl2... engine     down       2019-06-13 13:01:32   
 34  ctrl2... engine     down       2019-06-13 13:01:32   
 35  ctrl2... engine     down       2019-06-13 13:01:32   
 36  ctrl2... engine     down       2019-06-13 13:01:32   
 37  ctrl2... engine     down       2019-06-13 13:01:32   
 38  ctrl2... engine     down       2019-06-13 13:01:32   
 39  ctrl2... engine     down       2019-06-13 13:01:32   
 40  ctrl2... engine     down       2019-06-13 13:01:32   
 41  ctrl2... engine     up         2019-08-16 07:51:13   
 42  ctrl2... engine     up         2019-08-16 07:51:13   
 43  ctrl2... engine     up         2019-08-16 07:51:14   
 44  ctrl2... engine     up         2019-08-16 07:51:14   
 45  ctrl2... engine     up         2019-08-16 07:51:14   
 46  ctrl2... engine     up         2019-08-16 07:51:14   
 47  ctrl2... engine     up         2019-08-16 07:51:14   
 48  ctrl2... engine     up         2019-08-16 07:51:14   
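
For anyone wanting to summarise a listing like the one above, here is a minimal sketch that tallies up/down engines per controller. It assumes the trimmed six-column layout used in this comment (index, host, binary, status, date, time), not the raw `heat-manage service list` table; the function name `tally_engines` and the `SAMPLE` constant are hypothetical, and the sample rows are copied from the listing above.

```python
from collections import Counter

# A few rows copied from the listing above (host names are masked in the source).
SAMPLE = """\
  8  ctrl0... engine     down       2019-07-29 16:49:00
  9  ctrl0... engine     up         2019-08-16 07:51:59
 24  ctrl1... engine     down       2019-06-13 12:18:52
 25  ctrl1... engine     up         2019-08-16 07:51:48
"""


def tally_engines(listing):
    """Count rows per (host, status) in the trimmed listing format."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Assumed layout: index, host, binary, status, date, time.
        if len(fields) != 6 or fields[3] not in ("up", "down"):
            continue  # skip blank lines and anything that does not match
        counts[(fields[1], fields[3])] += 1
    return counts


if __name__ == "__main__":
    for (host, status), n in sorted(tally_engines(SAMPLE).items()):
        print(host, status, n)
```

Run against the full listing, a controller with stale records would show a non-zero `down` count next to its `up` count.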


Each controller has:

  8 logical processors 
  8 Intel Core Processor (Broadwell, IBRS) (flags: aes,constant_tsc,lm,nx,pae) 


num_engine_workers is set to 'None':

/etc/heat/heat.conf >> num_engine_workers = <None>
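
As a rough illustration of why 8 live engines per controller would be expected here: to the best of my understanding, when `num_engine_workers` is unset heat-engine falls back to the larger of 4 and the host's CPU count. That fallback rule is an assumption that should be checked against the option's help text in the installed heat.conf, and the helper name below is hypothetical.

```python
import multiprocessing


def effective_engine_workers(num_engine_workers, cpu_count=None):
    """Return the worker count under the ASSUMED fallback rule.

    Assumption: an unset (None or 0) num_engine_workers means
    max(4, number of CPUs). This is not taken from the bug itself.
    """
    if num_engine_workers:
        return num_engine_workers
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(4, cpu_count)


if __name__ == "__main__":
    # 8 logical processors per controller, option left unset.
    print(effective_engine_workers(None, cpu_count=8))
```

Under that assumption, each 8-CPU controller runs 8 engine workers, which matches the 8 `up` rows per controller in the listing.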


Observed behaviour of the environment:

1) Upon executing `heat-manage service clean`, the dead heat-engine processes *are* getting properly cleaned, i.e. we see only:

 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:52:00

 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48

 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14

2) Observing the environment for a couple of days, we see *no* dead heat-engine processes re-appear on their own.

3) However, restarting the heat services often triggers the same behaviour again, i.e. the records of the old, dead heat-engine processes are never cleaned up.


Given that RHOSP-10 is EOL and is heading towards the end of Maintenance Support as well (December 16, 2019), I would like to get a simple ACK/NACK on whether we can consider the behaviour described above as normal.

Comment 8 Zane Bitter 2019-09-05 00:58:56 UTC

> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?

Graceful is SIGTERM. Non-graceful is SIGKILL.

systemd sends SIGTERM and then waits 90s for the process to exit before sending SIGKILL.

To ensure a graceful restart, you can do:

  systemctl --signal=SIGTERM kill heat-engine.service
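
The SIGTERM-then-SIGKILL sequence described above can be sketched generically as follows. This is only an illustration of the pattern on a throwaway child process, not how systemd or Heat implement it; the function name `stop_gracefully` is hypothetical, and the 90s default simply mirrors the wait mentioned above.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, timeout=90.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Returns True if the process exited after SIGTERM alone (graceful),
    False if it had to be killed (non-graceful).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
        return False


if __name__ == "__main__":
    # Stand-in child process; it has no SIGTERM handler, so it exits at once.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("graceful:", stop_gracefully(child, timeout=5.0))
```

A process that handles SIGTERM gets the chance to clean up after itself before exiting; one that is killed with SIGKILL gets no such chance.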