Bug 1730994
Summary: | heat-engine services have their service status flapping in "openstack orchestration service list" | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Takashi Kajinami <tkajinam> |
Component: | openstack-heat | Assignee: | Rabi Mishra <ramishra> |
Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko> |
Severity: | low | Docs Contact: | |
Priority: | medium | ||
Version: | 13.0 (Queens) | CC: | mburns, sbaker, shardy, vvoronko |
Target Milestone: | --- | Keywords: | Triaged, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-heat-10.0.3-6.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-09-03 16:53:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Takashi Kajinami
2019-07-18 06:15:46 UTC
After looking at the current implementation of service status reporting in Heat, I think this is a bug in Heat: the code effectively expects zero turnaround time (TAT) for status reports.

In the current code, each engine reports its status repeatedly, on a timer whose interval is periodic_interval.

heat/engine/service.py

~~~
def start(self):
    self.engine_id = service_utils.generate_engine_id()
    ...
    if self.manage_thread_grp is None:
        self.manage_thread_grp = threadgroup.ThreadGroup()
        self.manage_thread_grp.add_timer(cfg.CONF.periodic_interval,
                                         self.service_manage_report)
~~~

On the other hand, a service is marked as down when it has not reported its status within service.report_interval.

heat/common/service_utils.py

~~~
def format_service(service):
    if service is None:
        return

    status = 'down'
    if service.updated_at is not None:
        if ((timeutils.utcnow() - service.updated_at).total_seconds()
                <= service.report_interval):
            status = 'up'
    else:
        if ((timeutils.utcnow() - service.created_at).total_seconds()
                <= service.report_interval):
            status = 'up'
~~~

And the same periodic_interval value is used to set service.report_interval.

heat/engine/service.py

~~~
def service_manage_report(self):
    cnxt = context.get_admin_context()

    if self.service_id is None:
        service_ref = service_objects.Service.create(
            cnxt,
            dict(host=self.host,
                 hostname=self.hostname,
                 binary=self.binary,
                 engine_id=self.engine_id,
                 topic=self.topic,
                 report_interval=cfg.CONF.periodic_interval)
        )
        self.service_id = service_ref['id']
        LOG.debug('Service %s is started', self.service_id)
~~~

Because the same value is used both as the reporting interval and as the threshold for judging status, even a small delay between heat-engine sending a report and the report being written to the database can cause the service to be judged as DOWN: any "openstack orchestration service list" call that runs after report_interval has elapsed since the last recorded report, but before the next report lands in the database, sees the service as down. IMO, we should allow some margin over the report interval when judging the status of engine services, so that we can avoid the flapping caused by this possibly very small overhead.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2625
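For illustration only, here is a minimal sketch of the kind of margin I have in mind. The helper name judge_service_status and the 2x factor are mine, chosen for the example; they are not taken from the Heat code or from the released fix in openstack-heat-10.0.3-6.

~~~
# Minimal sketch, assuming a service object with the fields shown in the
# snippets above. The margin_factor of 2 is an illustrative assumption,
# not the value used in the actual fix.
from oslo_utils import timeutils


def judge_service_status(service, margin_factor=2):
    """Return 'up'/'down', allowing a grace margin over report_interval."""
    if service is None:
        return None

    # Use the last report time if there is one, otherwise the creation
    # time, mirroring the logic in format_service() above.
    last_seen = service.updated_at or service.created_at

    # Allow margin_factor * report_interval instead of exactly one
    # interval, so a small delivery delay does not flip the status.
    allowed = margin_factor * service.report_interval
    elapsed = (timeutils.utcnow() - last_seen).total_seconds()
    return 'up' if elapsed <= allowed else 'down'
~~~

With a margin like this, a report that reaches the database a few seconds late no longer flips the status, while an engine that has really stopped reporting is still shown as down within a couple of intervals.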