Bug 1760405

Summary:

OSP15 update causes a cut in the control plane and loss of HA for ovndb-servers.

Product:

Red Hat OpenStack

Reporter:

Sofer Athlan-Guyot <sathlang>

Component:

openstack-tripleo-heat-templates

Assignee:

Sofer Athlan-Guyot <sathlang>

Status:

CLOSED ERRATA

QA Contact:

Sasha Smolyak <ssmolyak>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

15.0 (Stein)

CC:

mburns, sgolovat

Target Milestone:

async

Keywords:

Triaged, ZStream

Target Release:

15.0 (Stein)

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

openstack-tripleo-heat-templates-10.6.2-0.20191017030436.5dff146.el8ost

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

1765247 1765257

Environment:

Last Closed:

2019-12-02 10:11:16 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1759974

Bug Blocks:

Attachments:

Description	Flags
Ctlplane test logs.	none
Test script.	none

Description Sofer Athlan-Guyot 2019-10-10 13:20:59 UTC

Description of problem:

Doing an update of OSP15 from GA to RHOS_TRUNK-15.0-RHEL-8-20190926.n.0.

The update completes and everything works, but the ovn servers
are no longer in HA: 2 of them are stopped.

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.4-0eb7991564) - partition with quorum
Last updated: Wed Oct  9 12:05:36 2019
Last change: Wed Oct  9 10:06:25 2019 by root via crm_resource on controller-2

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

     podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
       galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
       galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
       galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
     podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
       rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
       rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
       rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
     podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
       redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
       redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
       redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
     ip-192.168.24.15       (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-1
     ip-172.17.1.72 (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-172.17.1.108        (ocf::heartbeat:IPaddr2):       Started controller-2
     ip-172.17.3.110        (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-172.17.4.102        (ocf::heartbeat:IPaddr2):       Started controller-1
     podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
       haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
       haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-1
       haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Started controller-2
     podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
       ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped controller-0
       ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Stopped controller-1
       ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Master controller-2
     podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
       openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-1
    
    Failed Resource Actions:
    * ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct  9 09:12:56 2019', queued=0ms, exec=200002ms
    * ovndb_servers_start_0 on ovn-dbs-bundle-1 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct  9 09:42:35 2019', queued=0ms, exec=200002ms


    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled


For more information about what is happening and its consequences, see this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1759974#c4

I'm creating this bz for visibility, and because we may add a workaround for the issue in 1759974.

Comment 4 Sofer Athlan-Guyot 2019-10-15 09:43:35 UTC

Created attachment 1625889
Ctlplane test logs.

Comment 5 Sofer Athlan-Guyot 2019-10-15 09:54:20 UTC

Created attachment 1625891
Test script.

Comment 6 Sofer Athlan-Guyot 2019-10-15 09:56:21 UTC

During an update, the ovndb server schema can change.

The problem is that an updated slave ovndb will not connect to a
master which still has the old db schema.

When the start operation times out (200000ms), pacemaker puts the
resource in error with status Timed Out.  It then waits for the
operator to clean up the resource.

This means that the update can go like this:

- Original state, nothing updated (M = Master, S = Slave, F = Failed)
  - ctl0-M-old
  - ctl1-S-old
  - ctl2-S-old
- First state: after update of ctl0
  - ctl0-F-new
  - ctl1-M-old
  - ctl2-S-old
- Second state: after update of ctl1
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-old
- Third and final state: after update of ctl2
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-new
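
The state sequence above can be reproduced with a small model. This is
only an illustrative sketch, not the pacemaker or ovndb-servers logic:
the function name rolling_update and the promotion rule (first
remaining slave becomes master) are assumptions made for the example.

```python
# Hypothetical model of the rolling update described in this comment.
# Roles: "M" master, "S" slave, "F" failed (start timed out).

def rolling_update(nodes):
    """Update nodes in order; return the list of states after each step."""
    role = {n: "S" for n in nodes}
    schema = {n: "old" for n in nodes}
    role[nodes[0]] = "M"  # assumption: master starts on the first node

    def snapshot():
        return [f"{n}-{role[n]}-{schema[n]}" for n in nodes]

    history = [snapshot()]
    for node in nodes:
        was_master = role[node] == "M"
        # The node is stopped and restarted with the new schema.
        role[node], schema[node] = None, "new"
        if was_master:
            # Assumption: the first remaining slave is promoted.
            standby = next((n for n in nodes if role[n] == "S"), None)
            if standby is not None:
                role[standby] = "M"
        master = next((n for n in nodes if role[n] == "M"), None)
        if master is None:
            # No peer left to take over: the node comes back as master.
            role[node] = "M"
        elif schema[master] == "new":
            role[node] = "S"
        else:
            # New-schema slave cannot connect to an old-schema master.
            role[node] = "F"
        history.append(snapshot())
    return history


for step in rolling_update(["ctl0", "ctl1", "ctl2"]):
    print(step)
```

With the update order ctl0, ctl1, ctl2, the model goes through the
four states listed above and ends with a single master and two failed
nodes.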

While ctl2 is being updated to reach the third state, we have a *cut*
in the control plane, as ctl2 is the master and there is no slave to
fall back to.

Once updated, it becomes the Master again, but we end up losing HA as
it's the only active node.

The error persists after reboot. Only a =pcs resource cleanup= will
bring the cluster online.

The real solution will come from ovndb and the associated ocf agent,
but in the meantime we need a workaround, as the next fasttrack
ships around the end of November.

Now, for the cuts.

First, we note that each time the master has to migrate to another
node, we lose the control plane for around a minute until the new
master settles on that node.  In the worst-case scenario (which is
the most likely one[1] and is the one described above), where the
update starts on the node holding the Master, this implies a
one-minute cut in the control plane in both the first and the second
state.

Then, as things stand, we have a last cut that lasts around 5 minutes:
the time it takes to stop the Master ovndb server on ctl2, update its
image and restart it.

The attachment shows the result of the test.  The test
(test-ctlplane.sh) associated and disassociated a floating ip with
an existing instance in a loop during the whole update.

The failures are shown as "FAILURE"; the Unknown ones should be
investigated but are not the primary concern.  We can see 3 FAILURE
periods, with the longest one lasting around 5min.

[1] the master is on the bootstrap node, usually ctl-0, and the update
    starts on ctl-0 by default
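
The FAILURE periods mentioned for the test results could be extracted
from a log along these lines.  This is a sketch under assumptions: the
real format of the attached log is not reproduced here, so the input
is taken to be a hypothetical list of (timestamp in seconds, status)
pairs, and failure_periods is a name made up for the example.

```python
# Hypothetical helper: the (timestamp, status) input format is an
# assumption, not the format of the attached test log.

def failure_periods(samples):
    """Group consecutive FAILURE samples into (start, end) periods.

    samples must be in chronological order.  Any other status,
    including "Unknown", closes the current period.
    """
    periods = []
    start = last = None
    for timestamp, status in samples:
        if status == "FAILURE":
            if start is None:
                start = timestamp
            last = timestamp
        elif start is not None:
            periods.append((start, last))
            start = None
    if start is not None:
        # The log ended while still failing.
        periods.append((start, last))
    return periods


# Made-up samples for illustration only, not data from this bug.
example = [(0, "OK"), (10, "FAILURE"), (20, "FAILURE"), (30, "OK")]
longest = max(failure_periods(example), key=lambda p: p[1] - p[0])
```

Treating "Unknown" as the end of a period keeps those entries out of
the cut durations, which matches the note that they are a separate
concern.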

Comment 7 Sofer Athlan-Guyot 2019-10-25 07:54:00 UTC

Refine to the exact version needed.

Comment 14 errata-xmlrpc 2019-12-02 10:11:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4030