Bug 1807826
| Summary: | [OSP15->16] Neutron is down after Controllers upgrade. Pacemaker allocating different IPs for ovn-dbserver | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jose Luis Franco <jfrancoa> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Jose Luis Franco <jfrancoa> |
| Status: | CLOSED ERRATA | QA Contact: | nlevinki <nlevinki> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.0 (Train) | CC: | batkisso, ccamacho, dciabrin, jjoyce, jschluet, lmiccini, mburns, shrjoshi, slinaber, tvignaud |
| Target Milestone: | zstream | Keywords: | Triaged |
| Target Release: | 16.0 (Train on RHEL 8.1) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-0.20200310160324.b3d9c16.el8ost | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-05-14 12:16:10 UTC | Type: | Bug |
|
Description
Jose Luis Franco
2020-02-27 10:19:14 UTC
A change that landed in Train appears to create a dedicated VIP for OVN DBS: https://github.com/openstack/tripleo-heat-templates/commit/c2d481684063af5a23fa922f028b383ecf81a3f4

This change will probably require adding some upgrade_tasks to the ovn-dbs pacemaker template service to deal with it: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L353

We have been looking at this with Luca since yesterday, and we think the problem is the following:

- At the end of the controller upgrade, a deploy task runs puppet code to reassess the state of the ovn-dbs-bundle resource (it runs in the ovn_dbs_init_bundle container).
- The puppet code correctly creates the new VIP and all its associated location and ordering constraints.
- The ovndb_servers pacemaker resource is reconfigured to listen on the new VIP (the "master_ip" attribute is updated in the resource config).
- All resource replicas that are marked as Slaves are stopped and then restarted. However, the Master resource is only demoted and re-promoted.
- In the OVN resource agent, a demotion is not sufficient to stop the ovndb_servers process, so the new VIP is never picked up.

It is not yet clear whether this is expected pacemaker behaviour, but in any case forcing a restart of the resource with "pcs resource restart" is enough to restart all OVN processes and make them pick up the new config.

Verified on a local environment with the tht package:
(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-0.20200324120625.c3a8eb4.el8ost.noarch
2020-04-06 12:24:46 | TASK [Restart ovn-dbs service (pacemaker)] *************************************
2020-04-06 12:24:46 | Monday 06 April 2020 12:23:35 +0000 (0:00:02.278) 0:00:10.444 **********
2020-04-06 12:24:46 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | changed: [controller-0] => {"changed": true, "out": "ovn-dbs-bundle successfully restarted\n", "rc": 0}
....
2020-04-06 12:24:53 | TASK [include_tasks] ***********************************************************
2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.485) 0:01:27.925 **********
2020-04-06 12:24:53 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | PLAY RECAP *********************************************************************
2020-04-06 12:24:53 | controller-0 : ok=13 changed=4 unreachable=0 failed=0 skipped=35 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-1 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-2 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.358) 0:01:28.284 **********
2020-04-06 12:24:53 | ===============================================================================
2020-04-06 12:24:54 |
2020-04-06 12:24:54 | Updated nodes - Controller
2020-04-06 12:24:54 | Success
2020-04-06 12:24:54 | 2020-04-06 12:24:54.545 661020 INFO tripleoclient.v1.overcloud_upgrade.MajorUpgradeRun [-] Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
2020-04-06 12:24:54 | 2020-04-06 12:24:54.546 661020 INFO osc_lib.shell [-] END return value: None
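For reference, the manual workaround mentioned in the description (forcing a restart with "pcs resource restart") could be scripted roughly as follows. This is only a sketch, not the fix shipped in tripleo-heat-templates: the helper names are hypothetical, the resource name ovn-dbs-bundle is taken from the log above, and it assumes that pcs prints the same "successfully restarted" message that shows up in the task output.

```python
import subprocess

# Bundle name as it appears in the upgrade log above.
RESOURCE = "ovn-dbs-bundle"


def build_restart_command(resource=RESOURCE):
    """Return the pcs invocation that forces a full restart of the resource."""
    return ["pcs", "resource", "restart", resource]


def restart_succeeded(output, resource=RESOURCE):
    """Look for the success message seen in the log output (assumed format)."""
    return f"{resource} successfully restarted" in output


def restart_resource(resource=RESOURCE, runner=subprocess.run):
    """Run the restart on one cluster node and report whether it worked.

    `runner` is injectable (hypothetical convenience) so the sketch can be
    exercised on a machine where pcs is not installed.
    """
    result = runner(
        build_restart_command(resource),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and restart_succeeded(result.stdout, resource)
```

The restart only needs to be issued from a single node, which matches the log above where the task changed controller-0 and was skipped on the other two controllers.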
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2114