Bug 1624588

Summary: Undercloud nova configuration does not have - sync_power_state_interval=-1
Product: Red Hat OpenStack
Version: 14.0 (Rocky)
Component: openstack-tripleo-heat-templates
Reporter: Marian Krcmarik <mkrcmari>
Assignee: Michele Baldessari <michele>
QA Contact: Archit Modi <amodi>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Keywords: AutomationBlocker, Regression, Triaged
CC: agurenko, aherr, bfournie, chjones, lyarwood, mburns, mcornea, michele, oblaut
Hardware: Unspecified
OS: Unspecified
Fixed In Version: instack-undercloud-9.3.1-0.20180831000259.e464799.el7ost, openstack-tripleo-heat-templates-9.0.0-0.20180906145841.66804ff.0rc1.el7ost, puppet-nova-13.3.1-0.20180831195237.ce0efbe.el7ost
Type: Bug
Last Closed: 2019-01-11 11:51:51 UTC

Description Marian Krcmarik 2018-09-01 21:50:54 UTC
Description of problem:
Undercloud nova has had the configuration item sync_power_state_interval=-1 since RHOS8, based on the decisions described in bug #1245298. It seems this setting is not being applied on the RHOS14 undercloud, so all the behaviour described in that bug is back.
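
For reference, the expected entry in the undercloud's /etc/nova/nova.conf looks like this (a minimal sketch; the option lives in the [DEFAULT] section):

[DEFAULT]
# A negative value disables the periodic task that syncs the power state
# recorded in the nova database with what the virt driver reports.
sync_power_state_interval=-1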

I am not sure what the right component is, since I do not know to what extent instack-undercloud is still used for the undercloud on RHOS14.

Version-Release number of selected component (if applicable):
instack-undercloud-9.2.1-0.20180803181448.be5fa97.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Start some HA tests which consist of overcloud node resets.

Actual results:
Even though the overcloud nodes recover from the reset/failover, undercloud nova sometimes keeps shutting them down if it hits the interval at which it syncs the overcloud nodes' power state with the DB state.

Expected results:
Undercloud nova should not touch the power state of the overcloud nodes.

Additional info:

Comment 1 Marian Krcmarik 2018-09-07 21:13:57 UTC
Maybe setting the config parameter sync_power_state_interval to -1 is not enough.
I can see a new (?) option in the nova config called handle_virt_lifecycle_events which may also need to be set to false (it is true by default) to disable power state synchronization. If I set only sync_power_state_interval to -1, I can still observe the unwanted behaviour. This is the description in nova.conf:
# * If ``handle_virt_lifecycle_events`` in workarounds_group is
#   false and this option is negative, then instances that get out
#   of sync between the hypervisor and the Nova database will have
#   to be synchronized manually.
#  (integer value)
#sync_power_state_interval=600

I do not really understand what the handle_virt_lifecycle_events option is about, but setting it to false seems to help.
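
For clarity, the combination tested above would look like this in nova.conf (a sketch; per the nova.conf excerpt above, handle_virt_lifecycle_events sits in the [workarounds] section):

[DEFAULT]
# Disable the periodic power state sync between the nova DB and the driver.
sync_power_state_interval=-1

[workarounds]
# Stop the compute manager from acting on virt-driver lifecycle events
# (the default is true).
handle_virt_lifecycle_events=false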

Comment 2 Bob Fournier 2018-09-10 22:32:48 UTC
This doesn't fall under the HardwareProvisioning DFG, moving to Compute.

Comment 3 Michele Baldessari 2018-09-12 14:23:45 UTC
So, in terms of getting back the exact same config as on the pre-containerized undercloud, here is a recap of the reviews:
- new puppet-nova param
  master: https://review.openstack.org/599480 (merged)
  rocky: https://review.openstack.org/602039 (not-merged)

- instack-undercloud to move to new param (not really needed for containerized undercloud)
  master: https://review.openstack.org/#/c/599580/ (merged)
  rocky: https://review.openstack.org/602042 (not-merged)

- tht change needed for the undercloud:
  master: https://review.openstack.org/599423 (merged)
  rocky: https://review.openstack.org/602041 (not merged)
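
Until the Rocky backports land, the same result can presumably be applied by hand through an undercloud hieradata override; the nova::compute::sync_power_state_interval key follows the puppet-nova review above, so treat the exact key and file names as assumptions:

# /home/stack/hieradata-override.yaml (hypothetical path), wired in via the
# hieradata_override option in undercloud.conf:
nova::compute::sync_power_state_interval: -1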

Comment #1 from Marian is a bit concerning, though; it might very well be that the previous config we had is now insufficient. Some feedback from the compute folks would be great to have here.

Comment 4 Artom Lifshitz 2018-09-13 14:22:53 UTC
handle_virt_lifecycle_events has been present since Liberty - setting it to false is a workaround to try and reduce the possibility of racing on the _sync_instance_power_state(), which is called by *both* the periodic task (unless sync_power_state_interval = -1) *and* the virt driver sending instance lifecycle events to the compute manager (unless handle_virt_lifecycle_events = false).

That being said, sending lifecycle events up from the virt driver to the compute manager is something that only libvirt and hyperv do, so handle_virt_lifecycle_events is irrelevant in this case, as the undercloud's nova uses the ironic driver. Therefore, I believe finding a way of getting sync_power_state_interval back to -1 should be enough.
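
To confirm the effective value on a containerized undercloud, something along these lines should do (the nova_compute container name is an assumption; the second line is the output expected once the fix is in place):

$ sudo docker exec nova_compute grep ^sync_power_state_interval /etc/nova/nova.conf
sync_power_state_interval=-1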

Comment 5 Marian Krcmarik 2018-09-13 14:37:08 UTC
(In reply to Artom Lifshitz from comment #4)
> handle_virt_lifecycle_events has been present since Liberty - setting it to
> false is a workaround to try and reduce the possibility of racing on the
> _sync_instance_power_state(), which is called by *both* the periodic task
> (unless sync_power_state_interval = -1) *and* the virt driver sending
> instance lifecycle events to the compute manager (unless
> handle_virt_lifecycle_events = false).
> 
> That being said, sending lifecycle events up from the virt driver to the
> compute manager is something that only libvirt and hyperv do, so
> handle_virt_lifecycle_events is irrelevant in this case, as the undercloud's
> nova uses the ironic driver. Therefore, I believe finding a way of
> getting sync_power_state_interval back to -1 should be enough.

Yes, it was probably a different hiccup that I had observed. I cannot see the problem now when only sync_power_state_interval = -1 is set, so let's proceed with the patch as it is.
Thanks.

Comment 14 errata-xmlrpc 2019-01-11 11:51:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045