Description of problem:

This bug was discovered during the migration from the Neutron openvswitch ml2 mechanism driver to OVN. During the migration, heat resources related to ml2/ovs are set to None:

OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
OS::TripleO::Services::NeutronL3Agent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronL3Agent: OS::Heat::None
OS::TripleO::Services::NeutronMetadataAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronMetadataAgent: OS::Heat::None
OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None
OS::TripleO::Services::ComputeNeutronCorePlugin: OS::Heat::None

Previously, tripleo even stopped and removed the ml2/ovs services, but I was told that happened by chance rather than intentionally. The migration role now stops and removes the services manually; however, after running overcloud deploy again with the OVN services, tripleo configures the ml2/ovs services back regardless of what is set in the heat templates. This did not happen in the OSP16 GA version.

Version-Release number of selected component (if applicable):

openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200314025720.8c91b46.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200318124452.3fd14c9.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200405044624.ec9970c.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200302235857.a6fef08.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP16 with the ml2/ovs networking backend
2. Run the migration as described in the docs - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_with_open_virtual_network/migrating-ml2ovs-to-ovn

Actual results:
ml2/ovs services are running after the migration

Expected results:
ml2/ovs services are not re-deployed

Additional info:
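For context, these OS::Heat::None overrides are normally carried in the resource_registry section of an environment file passed to overcloud deploy. A minimal sketch (the file name is assumed for illustration):

```yaml
# remove-ml2ovs.yaml -- hypothetical file name, for illustration only
resource_registry:
  OS::TripleO::Services::NeutronOvsAgent: OS::Heat::None
  OS::TripleO::Services::ComputeNeutronOvsAgent: OS::Heat::None
  # ...and likewise for the remaining ml2/ovs services listed above
```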
Looking at the neutron metadata agent as an example, the json files for the service containers are present on the host under /var/lib/tripleo-config, and their timestamps match those of the other files, indicating that they may be regenerated. However, the hieradata for the metadata agent is missing. Perhaps some aspect of the deployment drops the service, but it has not been removed from some other data source created and used by the deployment framework?
When you set a service to OS::Heat::None, services that were previously running are not removed from the host. You need something to perform that cleanup. Usually we recommend switching from the real service to a service that describes all the removal actions. Can you please provide the full set of templates that were used and the commands that were run? Currently there is not enough information to understand the order of actions or what was actually performed.
(In reply to Alex Schultz from comment #2)
> When you set a service to OS::Heat::None, services that were previously
> running are not removed from the host. You need something to perform that
> cleanup. Usually we recommend switching from the real service to a
> service that describes all the removal actions.

The services used to be removed in OSP 16 GA when set to None, but that is not the problem this BZ is about. The real problem is that even when I remove the service manually and set it to None in the templates, it still gets configured regardless of the template settings. I will provide the full templates once I have the environment back and re-run the migration.
Jakub, were you able to reproduce this?
I think the issue is that we shouldn't be setting them to None, but rather to a disabled service that is basically a noop service. We've done this in the past for services that we've removed in order to ensure they get properly handled for things like FFU or just a basic upgrade. Example https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/queens/puppet/services/disabled/ceilometer-api-disabled.yaml
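For illustration, a noop/disabled service template roughly follows the shape below. This is only a sketch modeled loosely on the linked ceilometer-api-disabled.yaml; the service_name, parameters, and task details here are assumptions, not the actual file contents:

```yaml
heat_template_version: queens

description: >
  Disabled service stub (sketch): keeps the service registered in the
  stack so the deployment framework still tracks it, while its only
  tasks clean the old service up.

outputs:
  role_data:
    description: Role data for the disabled service (illustrative).
    value:
      service_name: neutron_ovs_agent_disabled  # assumed name
      upgrade_tasks:
        - name: Stop and disable the old agent during upgrade
          when: step|int == 1
          service: name=neutron-openvswitch-agent state=stopped enabled=no
```

The design point is that, unlike OS::Heat::None, a disabled template leaves an entry in the deployment plan, so upgrade/update tasks still have a hook from which to perform the removal.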
Ok, so this is a regression from at least OSP13 and likely OSP16. Previously, when you set a service to OS::Heat::None, it would stop managing the service but leave it in place. This was likely caused by the fix for Bug 1726606: we are probably removing the service definition, which causes heat not to recognize that the service should be removed from the stack. We likely need a check: if there is an existing stack and the service is defined in that stack, do not remove the OS::Heat::None service.
The workaround would be to create a dummy/empty service to use instead of OS::Heat::None when removing a service. This issue shows up in the ml2 -> ovn migration because we're running something external to the deployment to do the migration, rather than properly handling it during a deploy/update/upgrade procedure via deployment steps/host prep tasks/external tasks or something to that effect.
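Concretely, the workaround amounts to swapping the OS::Heat::None mapping for a dummy template in the resource_registry; a sketch, where the template path is hypothetical:

```yaml
resource_registry:
  # Point at a disabled/dummy service definition instead of OS::Heat::None
  # so the deployment still owns the service and can run its removal tasks.
  # The path below is illustrative, not a shipped template.
  OS::TripleO::Services::NeutronOvsAgent: /home/stack/templates/neutron-ovs-agent-disabled.yaml
```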
openstack-tripleo-common-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch and openstack-tripleo-common-containers-11.3.3-0.20200403044649.56c0fd5.el8ost.noarch look too old to contain the fix for Bug 1726606 and thus to have caused the issue.
So I attempted to reproduce this with 16.1 by deploying an overcloud, then disabling OS::TripleO::Services::Chrony by setting it to OS::Heat::None. It didn't reproduce as the chrony tasks were not present on the subsequent update. I'll now try with the ml2 -> ovn process with 16.0 to see if it's specific to that.
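The reproduction attempt above amounts to re-running the deploy with an extra environment file along these lines (file name assumed):

```yaml
# disable-chrony.yaml -- hypothetical file name
resource_registry:
  OS::TripleO::Services::Chrony: OS::Heat::None
```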
I think https://review.opendev.org/#/c/737337/ will fix the issue, where it'll clean up all containers that aren't supposed to be on a host or re-run once FFU is finished.
I checked the OVN migration: the old containers are started up, but they aren't in the ansible playbook, so it looks like Heat is doing the correct thing and this is likely the bug resolved via https://review.opendev.org/#/c/737340/
Verified on RHOS-16.1-RHEL-8-20200625.n.0 with openstack-tripleo-heat-templates-11.3.2-0.20200616081529.396affd.el8ost.noarch

Verified that ml2/ovs services are not running after the migration to ml2/ovn.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148