Bug 1834901 - Re-running overcloud deploy ignores heat templates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Emilien Macchi
QA Contact: Roman Safronov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-12 15:52 UTC by Jakub Libosvar
Modified: 2020-07-29 07:52 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200616081527.396affd.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-29 07:52:44 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 737340 0 None MERGED Cleanup all container startup configs before generating the new ones 2020-07-30 13:47:02 UTC
Red Hat Product Errata RHBA-2020:3148 0 None None None 2020-07-29 07:52:57 UTC

Description Jakub Libosvar 2020-05-12 15:52:02 UTC
Description of problem:
This bug was discovered during migration from Neutron openvswitch ml2 mechanism driver to ovn. During the migration, heat resources related to ml2/ovs are set to None:
  OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::NeutronL3Agent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronL3Agent: OS::Heat::None
  OS::TripleO::Services::NeutronMetadataAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronMetadataAgent: OS::Heat::None
  OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronCorePlugin: OS::Heat::None
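
As a rough illustration of the mappings above, the sketch below lists which services a resource registry disables. It assumes the registry has already been loaded into a plain Python dict; the helper name `disabled_services` and the OVN template path are hypothetical and not part of TripleO.

```python
# Hypothetical helper, not part of TripleO: report which services a
# resource registry maps to OS::Heat::None.
HEAT_NONE = "OS::Heat::None"


def disabled_services(resource_registry):
    """Return the sorted service names that are mapped to OS::Heat::None."""
    return sorted(
        name
        for name, target in resource_registry.items()
        if target == HEAT_NONE
    )


# Two of the mappings quoted above, plus one enabled service for contrast.
# The template path is a placeholder, not a real file location.
registry = {
    "OS::TripleO::Services::NeutronOvsAgent": "OS::Heat::None",
    "OS::TripleO::Services::NeutronL3Agent": "OS::Heat::None",
    "OS::TripleO::Services::OVNController": "placeholder/ovn-controller.yaml",
}

for service in disabled_services(registry):
    print(service)
```

Comparing this list with the containers actually running on a node after the redeploy is one way to confirm the symptom reported here.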

Previously, tripleo even stopped and removed the ml2/ovs services, but I was told that this happened by chance and not intentionally. The migration role now stops and removes the services manually. However, when overcloud deploy is re-run with the OVN services, tripleo configures the ml2/ovs services again regardless of what is set in the heat templates.

This didn't happen in the OSP16 GA version.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200314025720.8c91b46.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200318124452.3fd14c9.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200405044624.ec9970c.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200302235857.a6fef08.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP16 with ml2/ovs networking backend
2. Run migration as described in docs - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_with_open_virtual_network/migrating-ml2ovs-to-ovn


Actual results:
ml2/ovs services running after the migration

Expected results:
ml2/ovs services are not re-deployed

Additional info:

Comment 1 Brent Eagles 2020-05-12 15:58:55 UTC
Looking at the neutron metadata agent as an example, the json files for the service containers are on the host under /var/lib/tripleo-config, and their timestamps match those of the other files, which suggests they are being regenerated. However, the hieradata for the metadata agent is missing. Perhaps some aspect of the deployment is dropping the service, but it hasn't been removed from some other data source created and used by the deployment framework?
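
A minimal sketch of the timestamp check described in this comment, for anyone repeating it. Only the /var/lib/tripleo-config path comes from the comment; the helper name `json_mtimes` is hypothetical, and the layout below that directory is not assumed (the whole tree is walked).

```python
import os
import time


def json_mtimes(root):
    """Yield (path, UTC modification time) for every .json file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith(".json"):
                path = os.path.join(dirpath, filename)
                mtime = time.gmtime(os.path.getmtime(path))
                yield path, time.strftime("%Y-%m-%d %H:%M:%S", mtime)


if __name__ == "__main__":
    # Prints nothing if the directory does not exist on this machine.
    for path, stamp in json_mtimes("/var/lib/tripleo-config"):
        print(stamp, path)
```

A file for a disabled service that carries the same timestamp as the files of enabled services would point to regeneration rather than a stale leftover.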

Comment 2 Alex Schultz 2020-05-13 21:30:52 UTC
When you set a service to OS::Heat::None, services that were previously running are not removed from the host. You need something to do that cleanup. Usually we recommend switching from the real service to a service that describes all the removal actions.


Can you please provide the full set of templates that were used and the commands that were run? Currently there is not enough information to understand the order of actions or what is actually performed.

Comment 3 Jakub Libosvar 2020-05-14 06:48:11 UTC
(In reply to Alex Schultz from comment #2)
> When you set a service to OS::Heat::None, the services are not removed from
> the host that were previously running. You need to have something to do that
> clean up.  Usually we recommend switch it from the real service, to a
> service that describes all the removal actions.  

The services used to be removed in OSP 16 GA when set to None, but that is not the problem this BZ is about. The real problem is that even when I remove the service manually and set it to None in the templates, it still gets configured regardless of the template settings.

I will provide the full templates once I have the env back and I re-run the migration.

Comment 4 Alex Schultz 2020-06-18 14:25:22 UTC
Jakub, were you able to reproduce this?

Comment 5 Alex Schultz 2020-06-18 16:31:40 UTC
I think the issue is that we shouldn't be setting them to None, but rather to a disabled service that is basically a noop service. We've done this in the past for services that we've removed, to ensure they get properly handled for things like FFU or just a basic upgrade. Example:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/queens/puppet/services/disabled/ceilometer-api-disabled.yaml

Comment 6 Alex Schultz 2020-06-19 22:04:31 UTC
Ok, so this is a regression from at least OSP13 and likely OSP16. Previously, when you set a service to OS::Heat::None, it would stop managing the service but leave it in place. This was likely caused by the fix for Bug 1726606, since we're likely removing the service definition, which causes heat not to recognize that it should be removed from the stack. We likely need to check whether a stack exists and the service is defined in it, and if so, not remove the OS::Heat::None service.
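
One possible reading of the check proposed in this comment, sketched in Python. This is not the TripleO code and not the fix that was eventually merged; the function name `prune_none_services` and the dict/set data shapes are assumptions made for illustration.

```python
# Hypothetical illustration of the proposed check, not actual TripleO code.
HEAT_NONE = "OS::Heat::None"


def prune_none_services(registry, stack_services=None):
    """Drop OS::Heat::None entries unless the existing stack defines them.

    registry: dict mapping service name -> template (or OS::Heat::None).
    stack_services: set of service names defined in an existing stack,
        or None when no stack exists yet.
    """
    existing = stack_services or set()
    return {
        name: target
        for name, target in registry.items()
        # Keep real services always; keep a None mapping only when the
        # stack already knows the service, so heat can see it go away.
        if target != HEAT_NONE or name in existing
    }
```

On a fresh deployment (no stack) every None mapping is dropped, while on an update the None mapping for a previously deployed service is kept.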

Comment 7 Alex Schultz 2020-06-19 22:07:17 UTC
The workaround would be to create a dummy/empty service to use instead of defining OS::Heat::None when you are removing a service. This issue shows up in the ml2->ovn migration because we're running something externally to the deployment to do the migration, rather than properly handling it during a deploy/update/upgrade procedure via deployment steps/host prep tasks/external tasks or something to that effect.

Comment 8 Rabi Mishra 2020-06-20 00:08:50 UTC
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch and openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch look too old to contain the fix for Bug 1726606, so that fix is unlikely to be causing the issue.

Comment 10 Alex Schultz 2020-06-22 17:48:56 UTC
So I attempted to reproduce this with 16.1 by deploying an overcloud, then disabling OS::TripleO::Services::Chrony by setting it to OS::Heat::None.  It didn't reproduce as the chrony tasks were not present on the subsequent update.  I'll now try with the ml2 -> ovn process with 16.0 to see if it's specific to that.

Comment 11 Emilien Macchi 2020-06-22 21:36:15 UTC
I think https://review.opendev.org/#/c/737337/ will fix the issue, where it'll clean up all containers that aren't supposed to be on a host or re-run once FFU is finished.

Comment 14 Alex Schultz 2020-06-23 12:58:26 UTC
I checked the ovn migration: the old containers are started up, but they aren't in the ansible playbook, so it looks like Heat is doing the correct thing and the bug is likely resolved via https://review.opendev.org/#/c/737340/
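
The idea named in the title of that review ("Cleanup all container startup configs before generating the new ones") can be sketched roughly as follows. This is not the merged change itself: the function name `regenerate_startup_configs`, the one-JSON-file-per-container layout, and the example config contents are all assumptions.

```python
import json
import os


def regenerate_startup_configs(config_dir, wanted):
    """Remove all existing *.json configs, then write one per wanted container.

    config_dir: directory holding the startup configs (layout assumed).
    wanted: dict mapping container name -> config dict for this deployment.
    Returns the sorted file names present afterwards.
    """
    os.makedirs(config_dir, exist_ok=True)

    # Step 1: clean up every old config, so containers of services that
    # are no longer deployed cannot be started again from a stale file.
    for filename in os.listdir(config_dir):
        if filename.endswith(".json"):
            os.remove(os.path.join(config_dir, filename))

    # Step 2: generate configs only for the containers wanted now.
    for name, config in wanted.items():
        with open(os.path.join(config_dir, name + ".json"), "w") as handle:
            json.dump(config, handle, indent=2)

    return sorted(os.listdir(config_dir))
```

Without the first step, a config left behind by an earlier ml2/ovs deployment would stay in the directory next to the new OVN ones, which matches the behaviour reported in this bug.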

Comment 30 Roman Safronov 2020-06-30 10:25:49 UTC
Verified on RHOS-16.1-RHEL-8-20200625.n.0 with openstack-tripleo-heat-templates-11.3.2-0.20200616081529.396affd.el8ost.noarch

Verified that ml2ovs services are not running after migration to ml2ovn.

Comment 33 errata-xmlrpc 2020-07-29 07:52:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

