Bug 1760405
Summary: | OSP15 update has a cut in control plane and loses HA of ovndb-servers. | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> | ||||||
Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Sasha Smolyak <ssmolyak> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 15.0 (Stein) | CC: | mburns, sgolovat | ||||||
Target Milestone: | async | Keywords: | Triaged, ZStream | ||||||
Target Release: | 15.0 (Stein) | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | openstack-tripleo-heat-templates-10.6.2-0.20191017030436.5dff146.el8ost | Doc Type: | No Doc Update | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1765247 1765257 (view as bug list) | Environment: | |||||||
Last Closed: | 2019-12-02 10:11:16 UTC | Type: | Bug | ||||||
Bug Depends On: | 1759974 | ||||||||
Description
Sofer Athlan-Guyot
2019-10-10 13:20:59 UTC
Created attachment 1625889: Ctlplane test logs.
Created attachment 1625891: Test script.
During the update, the ovndb server can have a schema change. The problem is that an updated slave ovndb won't connect to a master which still has the old db schema. At some point (200000ms) pacemaker puts the resource in a Timed Out error state. It then waits for the operator to clean up the resource.

This means that the update can go like this (M = Master, S = Slave, F = Failed):

- Original state: nothing updated
  - ctl0-M-old
  - ctl1-S-old
  - ctl2-S-old
- First state: after update of ctl0
  - ctl0-F-new
  - ctl1-M-old
  - ctl2-S-old
- Second state: after update of ctl1
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-old
- Third and final state: after update of ctl2
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-new

On the way to the third state we have a *cut* in the control plane, as ctl2 is the master and there is no slave to fall back to. Once ctl2 is updated it becomes the Master again, but we end up losing HA as it's the only active node.

The error persists after reboot. Only a =pcs resource cleanup= will bring the cluster back online.

The real solution will come from ovndb and the associated ocf agent, but in the meantime we need a workaround, as the next fasttrack ships around the end of November.

Now, for the cuts. First, we note that each time we have to migrate the master to another node, we lose the control plane for around a minute, until the new master settles on another node.

In the worst case scenario (which is the most likely one[1] and is the one described above), when we start with the Master, this implies a one-minute cut in the control plane in both the first and the second state.

Then, as things currently stand, we have a last cut that lasts around 5 minutes: the time it takes to stop the Master ovndb server on ctl2, update its image and restart it.

The attachment shows the result of the test. The test (test-ctlplane.sh) associates and disassociates a floating IP to an existing instance in a loop during the whole update.
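For illustration, the state progression and the cuts described above can be replayed with a small Python model. This is only a sketch under stated assumptions, not how pacemaker or the ocf agent is implemented: the `Node` class and the function names are hypothetical, the 60 s and 300 s values are the approximate durations quoted in this report, and `cleanup` only models the observed effect of =pcs resource cleanup=.

```python
from dataclasses import dataclass

# Approximate durations taken from the report, not measured by this code.
FAILOVER_CUT_S = 60   # "around a minute" each time the master migrates
LAST_CUT_S = 300      # "around 5 minutes" when there is no slave left


@dataclass
class Node:
    name: str
    role: str    # "M" (master), "S" (slave) or "F" (failed)
    schema: str  # "old" or "new"


def snapshot(nodes):
    """Render the cluster in the ctlN-ROLE-SCHEMA notation of the report."""
    return [f"{n.name}-{n.role}-{n.schema}" for n in nodes]


def update_node(nodes, target):
    """Update one node and return the modelled control plane cut in seconds."""
    was_master = target.role == "M"
    target.schema = "new"
    cut = 0
    if was_master:
        standby = [n for n in nodes if n is not target and n.role == "S"]
        if not standby:
            # Nothing to fail over to: the node stays master, and the cut
            # lasts for the whole stop/update/restart of that node.
            return LAST_CUT_S
        standby[0].role = "M"
        target.role = "S"
        cut = FAILOVER_CUT_S
    master = next(n for n in nodes if n.role == "M")
    if master.schema != target.schema:
        # Updated slave cannot connect to an old-schema master: it times
        # out and stays failed until the operator cleans up the resource.
        target.role = "F"
    return cut


def cleanup(nodes):
    """Model of the cleanup effect: failed nodes rejoin a same-schema master."""
    master = next(n for n in nodes if n.role == "M")
    for n in nodes:
        if n.role == "F" and n.schema == master.schema:
            n.role = "S"


def run_update():
    """Update ctl0, ctl1, ctl2 in order, starting with the master on ctl0."""
    nodes = [Node("ctl0", "M", "old"),
             Node("ctl1", "S", "old"),
             Node("ctl2", "S", "old")]
    history = [snapshot(nodes)]
    cuts = []
    for node in nodes:
        cuts.append(update_node(nodes, node))
        history.append(snapshot(nodes))
    return nodes, history, cuts


if __name__ == "__main__":
    nodes, history, cuts = run_update()
    for state, label in zip(history, ["original", "first", "second", "third"]):
        print(label, state)
    print("cuts (s):", cuts, "total:", sum(cuts))
    cleanup(nodes)
    print("after cleanup", snapshot(nodes))
```

In this model, `run_update()` walks through the four states listed above and reports three cuts (two failovers plus the final one), and `cleanup()` brings the two failed nodes back as slaves once the master has the new schema.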
The failures are shown with "FAILURE"; the "Unknown" ones should be investigated but are not the primary concern. We can see 3 FAILURE periods, with the longest one lasting around 5min.

[1] The master is on the bootstrap node, usually ctl-0, and during an update we start by default on ctl-0.

Refine to the exact version needed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4030