Description of problem:

This bug was discovered during the migration from the Neutron openvswitch ml2 mechanism driver to OVN. During the migration, heat resources related to ml2/ovs are set to None:

OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::NeutronL3Agent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronL3Agent: OS::Heat::None
OS::TripleO::Services::NeutronMetadataAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronMetadataAgent: OS::Heat::None
OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronCorePlugin: OS::Heat::None

Previously, tripleo even stopped and removed the ml2/ovs services, but I was told that happened by chance rather than intentionally. The migration role now stops and removes the services manually; however, after running overcloud deploy again with the OVN services, tripleo configures the ml2/ovs services back regardless of what is set in the heat templates. This did not happen in the OSP16 GA version.

Version-Release number of selected component (if applicable):

openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200314025720.8c91b46.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200318124452.3fd14c9.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200405044624.ec9970c.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200302235857.a6fef08.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP16 with the ml2/ovs networking backend
2. Run the migration as described in the docs - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_with_open_virtual_network/migrating-ml2ovs-to-ovn

Actual results:
ml2/ovs services are running after the migration

Expected results:
ml2/ovs services are not re-deployed

Additional info:
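For context, these OS::Heat::None overrides are normally carried in the resource_registry section of an environment file passed to overcloud deploy. A minimal sketch (the file name is assumed for illustration):

```yaml
# remove-ml2ovs.yaml -- hypothetical file name, for illustration only
resource_registry:
  OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
  # ...and likewise for the remaining ml2/ovs services listed above
```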
Looking at the neutron metadata agent as an example, the json files for the service containers are present on the host under /var/lib/tripleo-config, and their timestamps match those of the other files, indicating that they may be regenerated. However, the hieradata for the metadata agent is missing. Perhaps some aspect of the deployment drops the service, but it has not been removed from some other data source created and used by the deployment framework?
When you set a service to OS::Heat::None, services that were previously running are not removed from the host. You need something to perform that cleanup. Usually we recommend switching from the real service to a service that describes all the removal actions. Can you please provide the full set of templates that were used and the commands that were run? Currently there is not enough information to understand the order of actions or what was actually performed.
(In reply to Alex Schultz from comment #2)
> When you set a service to OS::Heat::None, services that were previously
> running are not removed from the host. You need something to perform that
> cleanup. Usually we recommend switching from the real service to a
> service that describes all the removal actions.

The services used to be removed in OSP 16 GA when set to None, but that is not the problem this BZ is about. The real problem is that even when I remove the service manually and set it to None in the templates, it still gets configured regardless of the template settings. I will provide the full templates once I have the environment back and re-run the migration.
Jakub, were you able to reproduce this?
I think the issue is that we shouldn't be setting them to None, but rather to a disabled service that is basically a noop service. We've done this in the past for services that we've removed in order to ensure they get properly handled for things like FFU or just a basic upgrade. Example https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/queens/puppet/services/disabled/ceilometer-api-disabled.yaml
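For illustration, a noop/disabled service template roughly follows the shape below. This is only a sketch modeled loosely on the linked ceilometer-api-disabled.yaml; the service_name, parameters, and task details here are assumptions, not the actual file contents:

```yaml
heat_template_version: queens

description: >
  Disabled service stub (sketch): keeps the service registered in the
  stack so the deployment framework still tracks it, while its only
  tasks clean the old service up.

outputs:
  role_data:
    description: Role data for the disabled service (illustrative).
    value:
      service_name: neutron_ovs_agent_disabled  # assumed name
      upgrade_tasks:
        - name: Stop and disable the old agent during upgrade
          when: step|int == 1
          service: name=neutron-openvswitch-agent state=stopped enabled=no
```

The design point is that, unlike OS::Heat::None, a disabled template leaves an entry in the deployment plan, so upgrade/update tasks still have a hook from which to perform the removal.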
Ok, so this is a regression from at least OSP13 and likely OSP16. Previously, when you set a service to OS::Heat::None, it would stop managing the service but leave it in place. This was likely caused by the fix for Bug 1726606: we are probably removing the service definition, which causes heat not to recognize that the service should be removed from the stack. We likely need a check: if there is an existing stack and the service is defined in that stack, do not remove the OS::Heat::None service.
The workaround would be to create a dummy/empty service to use instead of OS::Heat::None when removing a service. This issue shows up in the ml2 -> ovn migration because we're running something external to the deployment to do the migration, rather than properly handling it during a deploy/update/upgrade procedure via deployment steps/host prep tasks/external tasks or something to that effect.
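Concretely, the workaround amounts to swapping the OS::Heat::None mapping for a dummy template in the resource_registry; a sketch, where the template path is hypothetical:

```yaml
resource_registry:
  # Point at a disabled/dummy service definition instead of OS::Heat::None
  # so the deployment still owns the service and can run its removal tasks.
  # The path below is illustrative, not a shipped template.
  OS::TripleO::Services::NeutronOvsAgent: /home/stack/templates/neutron-ovs-agent-disabled.yaml
```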
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch and openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch look too old to contain the fix for Bug 1726606 and thus to have caused the issue.
So I attempted to reproduce this with 16.1 by deploying an overcloud, then disabling OS::TripleO::Services::Chrony by setting it to OS::Heat::None. It didn't reproduce as the chrony tasks were not present on the subsequent update. I'll now try with the ml2 -> ovn process with 16.0 to see if it's specific to that.
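The reproduction attempt above amounts to re-running the deploy with an extra environment file along these lines (file name assumed):

```yaml
# disable-chrony.yaml -- hypothetical file name
resource_registry:
  OS::TripleO::Services::Chrony: OS::Heat::None
```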
I think https://review.opendev.org/#/c/737337/ will fix the issue, where it'll clean up all containers that aren't supposed to be on a host or re-run once FFU is finished.
I checked the OVN migration: the old containers are started up, but they aren't in the ansible playbook, so it looks like Heat is doing the correct thing and this is likely the bug resolved via https://review.opendev.org/#/c/737340/
Verified on RHOS-16.1-RHEL-8-20200625.n.0 with openstack-tripleo-heat-templates-11.3.2-0.20200616081529.396affd.el8ost.noarch

Verified that ml2/ovs services are not running after the migration to ml2/ovn.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148