Bug 1747443

Summary: heat service-list shows heat-engine processes down
Product: Red Hat OpenStack
Reporter: Irina Petrova <ipetrova>
Component: openstack-heat
Assignee: Rabi Mishra <ramishra>
Status: CLOSED NOTABUG
QA Contact: Victor Voronkov <vvoronko>
Severity: medium
Priority: medium
Version: 10.0 (Newton)
CC: augol, jslagle, lruzicka, mburns, nchandek, ramishra, rhel-osp-director-maint, sbaker, shardy, srevivo, vaggarwa, zbitter
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: All
OS: Linux
Clone Of: 1330443
Last Closed: 2019-12-20 06:08:57 UTC
Bug Depends On: 1330443

Comment 2 Irina Petrova 2019-08-30 13:25:17 UTC
Cloning Bug #1330443 as we see the following behaviour in RHOSP-10:

`heat-manage service list` gives the following (example) output for each controller:
 ## controller names have been masked so that this comment can be made public

  1  ctrl0... engine     down       2019-07-29 16:49:00   
  2  ctrl0... engine     down       2019-07-29 16:49:00   
  3  ctrl0... engine     down       2019-07-29 16:49:00   
  4  ctrl0... engine     down       2019-07-29 16:49:00   
  5  ctrl0... engine     down       2019-07-29 16:49:00   
  6  ctrl0... engine     down       2019-07-29 16:49:00   
  7  ctrl0... engine     down       2019-07-29 16:49:00   
  8  ctrl0... engine     down       2019-07-29 16:49:00   
  9  ctrl0... engine     up         2019-08-16 07:51:59   
 10  ctrl0... engine     up         2019-08-16 07:51:59   
 11  ctrl0... engine     up         2019-08-16 07:51:59   
 12  ctrl0... engine     up         2019-08-16 07:51:59   
 13  ctrl0... engine     up         2019-08-16 07:51:59   
 14  ctrl0... engine     up         2019-08-16 07:51:59   
 15  ctrl0... engine     up         2019-08-16 07:51:59   
 16  ctrl0... engine     up         2019-08-16 07:52:00   

 17  ctrl1... engine     down       2019-06-13 12:18:51   
 18  ctrl1... engine     down       2019-06-13 12:18:51   
 19  ctrl1... engine     down       2019-06-13 12:18:51   
 20  ctrl1... engine     down       2019-06-13 12:18:51   
 21  ctrl1... engine     down       2019-06-13 12:18:51   
 22  ctrl1... engine     down       2019-06-13 12:18:51   
 23  ctrl1... engine     down       2019-06-13 12:18:51   
 24  ctrl1... engine     down       2019-06-13 12:18:52   
 25  ctrl1... engine     up         2019-08-16 07:51:48   
 26  ctrl1... engine     up         2019-08-16 07:51:48   
 27  ctrl1... engine     up         2019-08-16 07:51:48   
 28  ctrl1... engine     up         2019-08-16 07:51:48   
 29  ctrl1... engine     up         2019-08-16 07:51:48   
 30  ctrl1... engine     up         2019-08-16 07:51:48   
 31  ctrl1... engine     up         2019-08-16 07:51:48   
 32  ctrl1... engine     up         2019-08-16 07:51:48   

 33  ctrl2... engine     down       2019-06-13 13:01:32   
 34  ctrl2... engine     down       2019-06-13 13:01:32   
 35  ctrl2... engine     down       2019-06-13 13:01:32   
 36  ctrl2... engine     down       2019-06-13 13:01:32   
 37  ctrl2... engine     down       2019-06-13 13:01:32   
 38  ctrl2... engine     down       2019-06-13 13:01:32   
 39  ctrl2... engine     down       2019-06-13 13:01:32   
 40  ctrl2... engine     down       2019-06-13 13:01:32   
 41  ctrl2... engine     up         2019-08-16 07:51:13   
 42  ctrl2... engine     up         2019-08-16 07:51:13   
 43  ctrl2... engine     up         2019-08-16 07:51:14   
 44  ctrl2... engine     up         2019-08-16 07:51:14   
 45  ctrl2... engine     up         2019-08-16 07:51:14   
 46  ctrl2... engine     up         2019-08-16 07:51:14   
 47  ctrl2... engine     up         2019-08-16 07:51:14   
 48  ctrl2... engine     up         2019-08-16 07:51:14   


Each controller has:

  8 logical processors 
  8 × Intel Core Processor (Broadwell, IBRS) (flags: aes, constant_tsc, lm, nx, pae) 


num_engine_workers is not set explicitly (i.e. it is 'None'):

/etc/heat/heat.conf >> num_engine_workers = <None>
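
(For context: when num_engine_workers is unset, heat-engine forks one worker per host CPU -- with a minimum of 4, if I read the Newton default correctly -- so 8 workers per 8-CPU controller here. Each restart registers 8 new engine records, and the previous 8 stop updating and are reported as down.) A minimal sketch of pinning the worker count explicitly, purely illustrative and not a recommendation from this bug:

  # /etc/heat/heat.conf -- illustrative: explicit value instead of the
  # CPU-count default; takes effect after restarting heat-engine
  [DEFAULT]
  num_engine_workers = 8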


Observed behaviour of the environment:

1) Upon executing `heat-manage service clean`, the dead heat-engine records *are* properly cleaned up, i.e. we then see only:

 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:51:59
 ctrl0... engine     up         2019-08-16 07:52:00

 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48
 ctrl1... engine     up         2019-08-16 07:51:48

 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:13
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14
 ctrl2... engine     up         2019-08-16 07:51:14

2) Observing the environment for a couple of days, we see *no* dead heat-engine records re-appear on their own.

3) However, restarting the heat services often triggers the same behaviour again, i.e. the records of the old, now-dead heat-engine workers are never cleaned up on their own (a possible manual clean-up step is sketched below).
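
For completeness, a sketch of that manual clean-up (unit name taken from comment 8; on RHOSP deployments it may be openstack-heat-engine.service):

  # illustrative sequence, not an official procedure from this bug
  systemctl restart heat-engine.service
  # once the new workers have reported in, drop the stale 'down' records
  heat-manage service clean
  heat-manage service list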


Given that RHOSP-10 is EOL and is also approaching the end of Maintenance Support (December 16, 2019), I would like a simple ACK/NACK on whether the behaviour described above can be considered normal.

Comment 8 Zane Bitter 2019-09-05 00:58:56 UTC
> Just to be on the same page: what do we call a 'graceful restart' here;
> `systemctl restart < heat service >` vs `kill <ps>` (soft/clean restart vs
> hard restart)?

Graceful is SIGTERM. Non-graceful is SIGKILL.

systemd will do SIGTERM and then wait 90s for the process to exit before doing SIGKILL.
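
(The 90s corresponds to systemd's DefaultTimeoutStopSec. One way to confirm the effective stop timeout for the unit, purely illustrative:)

  systemctl show -p TimeoutStopUSec heat-engine.service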

To ensure a graceful restart, you can do:

  systemctl --signal=SIGTERM kill heat-engine.service
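
Note that `systemctl kill` only delivers the signal and does not start anything afterwards, so a full graceful restart would be the kill followed by a start once the old workers have exited (a sketch, assuming the unit is not configured to auto-restart):

  systemctl --signal=SIGTERM kill heat-engine.service
  systemctl start heat-engine.service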