Bug 1807826
Summary: [OSP15->16] Neutron is down after Controllers upgrade. Pacemaker allocating different IPs for ovn-dbserver

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Version | 16.0 (Train) |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Reporter | Jose Luis Franco <jfrancoa> |
| Assignee | Jose Luis Franco <jfrancoa> |
| QA Contact | nlevinki <nlevinki> |
| CC | batkisso, ccamacho, dciabrin, jjoyce, jschluet, lmiccini, mburns, shrjoshi, slinaber, tvignaud |
| Keywords | Triaged |
| Target Milestone | zstream |
| Target Release | 16.0 (Train on RHEL 8.1) |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-tripleo-heat-templates-11.3.2-0.20200310160324.b3d9c16.el8ost |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2020-05-14 12:16:10 UTC |
Description
Jose Luis Franco
2020-02-27 10:19:14 UTC
A change that landed in Train appears to create a dedicated VIP for OVN DBS:
https://github.com/openstack/tripleo-heat-templates/commit/c2d481684063af5a23fa922f028b383ecf81a3f4

This change will probably require adding some upgrade_tasks to the ovn-dbs pacemaker template service to deal with it:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L353

Luca and I have been looking at this since yesterday, and we think the problem is the following:

1. At the end of the controller upgrade, a deploy task runs puppet code to reassess the state of the ovn-dbs-bundle resource (it runs in the container ovn_dbs_init_bundle).
2. The puppet code correctly creates the new VIP and all its associated location and ordering constraints.
3. The ovndb_servers pacemaker resource is reconfigured to listen on the new VIP (the attribute "master_ip" is updated in the resource config).
4. All resource replicas that are marked as Slaves are stopped and then restarted. However, the Master resource is only demoted and re-promoted.
5. In the OVN resource agent, a demotion is not sufficient to stop the ovndb_servers process, so the new VIP is never picked up.

It is not yet clear whether this is expected pacemaker behaviour, but in any case forcing a restart of the resource with "pcs resource restart" is enough to restart all ovn processes and make them pick up the new config.
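The reasoning above can be illustrated with a small toy model. This is only a sketch of the described behaviour, not pacemaker or OVN resource agent code: the `Replica` class, the function names and the IP addresses are all made up for illustration. It assumes, as the description suggests, that a process only reads "master_ip" when it starts.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy stand-in for one ovndb_servers replica (hypothetical, not real agent code)."""
    name: str
    role: str      # "Master" or "Slave"
    bound_ip: str  # address the running process picked up when it started

    def start(self, master_ip: str) -> None:
        # Assumption: a freshly started process reads the current master_ip.
        self.bound_ip = master_ip

    def demote_and_promote(self) -> None:
        # The process keeps running, so bound_ip is left untouched.
        self.role = "Master"


def reconfigure(replicas, new_ip):
    """Mimic what the description reports after master_ip changes."""
    for replica in replicas:
        if replica.role == "Slave":
            replica.start(new_ip)         # stop + start: picks up the new VIP
        else:
            replica.demote_and_promote()  # demote + promote: old VIP remains


def full_restart(replicas, new_ip):
    """Mimic a forced restart of the whole resource: every process starts again."""
    for replica in replicas:
        replica.start(new_ip)


def stale(replicas, new_ip):
    """Names of replicas still bound to an address other than new_ip."""
    return [r.name for r in replicas if r.bound_ip != new_ip]


# Made-up documentation addresses, not taken from the bug report.
old_vip, new_vip = "192.0.2.10", "192.0.2.20"
cluster = [
    Replica("controller-0", "Master", old_vip),
    Replica("controller-1", "Slave", old_vip),
    Replica("controller-2", "Slave", old_vip),
]

reconfigure(cluster, new_vip)
print(stale(cluster, new_vip))  # ['controller-0']

full_restart(cluster, new_vip)
print(stale(cluster, new_vip))  # []
```

In this model only the replica that was Master stays on the old address, which matches the observation that a full restart of the resource is what clears the problem.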
Verified on a local environment with tht package:

    (undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
    openstack-tripleo-heat-templates-11.3.2-0.20200324120625.c3a8eb4.el8ost.noarch

    2020-04-06 12:24:46 | TASK [Restart ovn-dbs service (pacemaker)] *************************************
    2020-04-06 12:24:46 | Monday 06 April 2020 12:23:35 +0000 (0:00:02.278) 0:00:10.444 **********
    2020-04-06 12:24:46 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
    2020-04-06 12:24:46 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
    2020-04-06 12:24:46 | changed: [controller-0] => {"changed": true, "out": "ovn-dbs-bundle successfully restarted\n", "rc": 0}
    ....
    2020-04-06 12:24:53 | TASK [include_tasks] ***********************************************************
    2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.485) 0:01:27.925 **********
    2020-04-06 12:24:53 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
    2020-04-06 12:24:53 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
    2020-04-06 12:24:53 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
    2020-04-06 12:24:53 |
    2020-04-06 12:24:53 | PLAY RECAP *********************************************************************
    2020-04-06 12:24:53 | controller-0 : ok=13 changed=4 unreachable=0 failed=0 skipped=35 rescued=0 ignored=0
    2020-04-06 12:24:53 | controller-1 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
    2020-04-06 12:24:53 | controller-2 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
    2020-04-06 12:24:53 |
    2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.358) 0:01:28.284 **********
    2020-04-06 12:24:53 | ===============================================================================
    2020-04-06 12:24:54 |
    2020-04-06 12:24:54 | Updated nodes - Controller
    2020-04-06 12:24:54 | Success
    2020-04-06 12:24:54 |
    2020-04-06 12:24:54 | 2020-04-06 12:24:54.545 661020 INFO tripleoclient.v1.overcloud_upgrade.MajorUpgradeRun [-] Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
    2020-04-06 12:24:54 | 2020-04-06 12:24:54.546 661020 INFO osc_lib.shell [-] END return value: None

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114
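For anyone repeating the verification, the PLAY RECAP lines in the upgrade log can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the log keeps the "timestamp | message" layout shown in the verification output; the function names `parse_recap` and `upgrade_succeeded` are hypothetical helpers, not part of tripleoclient or Ansible.

```python
import re

# Matches recap lines such as "controller-0 : ok=13 changed=4 ... ignored=0".
RECAP_LINE = re.compile(
    r"(?P<host>[\w.-]+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$"
)


def parse_recap(log_text):
    """Return {host: {counter: int}} for every PLAY RECAP host line found."""
    results = {}
    for line in log_text.splitlines():
        # Drop the "timestamp | " prefix that the upgrade log prepends.
        payload = line.split(" | ", 1)[-1].strip()
        match = RECAP_LINE.match(payload)
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def upgrade_succeeded(recap):
    """True when at least one host was seen and none failed or was unreachable."""
    return bool(recap) and all(
        counters["failed"] == 0 and counters["unreachable"] == 0
        for counters in recap.values()
    )


sample = """\
2020-04-06 12:24:53 | PLAY RECAP *********************************************************************
2020-04-06 12:24:53 | controller-0 : ok=13 changed=4 unreachable=0 failed=0 skipped=35 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-1 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-2 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
"""

recap = parse_recap(sample)
print(sorted(recap))             # ['controller-0', 'controller-1', 'controller-2']
print(upgrade_succeeded(recap))  # True
```

The check deliberately treats an empty recap as a failure, so a truncated log is not mistaken for a clean run.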