Description of problem:

Since RHOS 8, undercloud nova has been configured with sync_power_state_interval=-1, following the decisions described in bug #1245298. This setting does not appear to be applied on the RHOS 14 undercloud, so all of the behaviour described in that bug is back. I am not sure which component is the right one, since I do not know how much instack-undercloud is still used for the undercloud in RHOS 14.

Version-Release number of selected component (if applicable):
instack-undercloud-9.2.1-0.20180803181448.be5fa97.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run HA tests that include overcloud node resets.

Actual results:
Even though the overcloud nodes recover from the reset/failover, undercloud nova sometimes keeps shutting them down when it hits the interval at which it syncs the overcloud node power state with the state recorded in its database.

Expected results:
Undercloud nova should not touch the recovered nodes.

Additional info:
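For reference, this is the setting from bug #1245298 that appears to be missing on the RHOS 14 undercloud. A minimal sketch of the relevant part of /etc/nova/nova.conf on the undercloud:

```ini
[DEFAULT]
# Disable the periodic power-state sync task so undercloud nova does not
# power off overcloud nodes that were reset/failed over outside its
# control. A negative value disables the periodic task entirely.
sync_power_state_interval=-1
```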
Setting the config parameter sync_power_state_interval to -1 alone may not be enough. I can see a (new?) option in the nova config called handle_virt_lifecycle_events, which may also need to be set to false (it is true by default) to disable power state synchronization. With only sync_power_state_interval=-1 set, I could still observe the unwanted behaviour. This is the description in nova.conf:

# * If ``handle_virt_lifecycle_events`` in workarounds_group is
#   false and this option is negative, then instances that get out
#   of sync between the hypervisor and the Nova database will have
#   to be synchronized manually.
# (integer value)
#sync_power_state_interval=600

I do not fully understand what handle_virt_lifecycle_events is about, but setting it to false seems to help.
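Put together, the combination described in this comment would look like the following sketch of nova.conf on the undercloud (note that handle_virt_lifecycle_events lives in the [workarounds] section, not [DEFAULT]):

```ini
[DEFAULT]
# Disable the periodic power-state sync task.
sync_power_state_interval=-1

[workarounds]
# Additionally ignore lifecycle events sent up by the virt driver
# (true by default). Setting this to false disables the second path
# into the power-state sync code.
handle_virt_lifecycle_events=false
```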
This doesn't fall under the HardwareProvisioning DFG; moving to Compute.
So in terms of getting back the exact same config as the pre-containerized undercloud, here is a recap of the reviews:

- new puppet-nova param:
  master: https://review.openstack.org/599480 (merged)
  rocky: https://review.openstack.org/602039 (not merged)
- instack-undercloud moved to the new param (not really needed for the containerized undercloud):
  master: https://review.openstack.org/#/c/599580/ (merged)
  rocky: https://review.openstack.org/602042 (not merged)
- tht change needed for the undercloud:
  master: https://review.openstack.org/599423 (merged)
  rocky: https://review.openstack.org/602041 (not merged)

Comment #1 from Marian is a bit concerning, though, and it might very well be that the previous conf we had is now insufficient. Some feedback from the compute folks would be great to have here.
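On a containerized undercloud, the tht change listed above would be consumed through a Heat parameter rather than by editing nova.conf directly. A sketch of a custom environment file, assuming the parameter is named NovaSyncPowerStateInterval (the actual name and the file name here are assumptions; verify against the merged tht review):

```yaml
# custom-undercloud-params.yaml -- hypothetical file name
parameter_defaults:
  # Assumed parameter name introduced by https://review.openstack.org/599423;
  # check the merged change for the real name before using this.
  NovaSyncPowerStateInterval: -1
```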
handle_virt_lifecycle_events has been present since Liberty. Setting it to false is a workaround to try to reduce the possibility of racing on _sync_instance_power_state(), which is called by *both* the periodic task (unless sync_power_state_interval = -1) *and* the virt driver sending instance lifecycle events to the compute manager (unless handle_virt_lifecycle_events = false).

That being said, sending lifecycle events up from the virt driver to the compute manager is something that only the libvirt and Hyper-V drivers do, so handle_virt_lifecycle_events is irrelevant for managing the overcloud nodes, as the undercloud uses the ironic driver. Therefore, I believe finding a way to get sync_power_state_interval back to -1 should be enough.
(In reply to Artom Lifshitz from comment #4)
> handle_virt_lifecycle_events has been present since Liberty - setting it to
> false is a workaround to try and reduce the possibility of racing on the
> _sync_instance_power_state(), which is called by *both* the periodic task
> (unless sync_power_state_interval = -1) *and* the virt driver sending
> instance lifecycle events to the compute manager (unless
> handle_virt_lifecycle_events = false).
>
> That being said, sending lifecycle events up from the virt driver to the
> compute manager is something that only libvirt and hyperv do, so
> handle_virt_lifecycle is irrelevant in the case of the overcloud, as the
> undercloud uses the ironic driver. Therefore, I believe finding a way of
> getting sync_power_state_interval back to -1 should be enough.

Yes, it was probably a different hiccup that I had observed. I cannot see the problem now with only sync_power_state_interval = -1 set, so let's proceed with the patch as it is. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045