Bug 1760405 - OSP15 update has a cut in the control plane and loses HA of ovndb-servers.
Summary: OSP15 update has a cut in the control plane and loses HA of ovndb-servers.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: async
Target Release: 15.0 (Stein)
Assignee: Sofer Athlan-Guyot
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On: 1759974
Blocks:
 
Reported: 2019-10-10 13:20 UTC by Sofer Athlan-Guyot
Modified: 2019-12-02 10:11 UTC (History)
2 users

Fixed In Version: openstack-tripleo-heat-templates-10.6.2-0.20191017030436.5dff146.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1765247 1765257
Environment:
Last Closed: 2019-12-02 10:11:16 UTC
Target Upstream Version:
Embargoed:


Attachments
Ctlplane test logs. (36.45 KB, text/plain)
2019-10-15 09:43 UTC, Sofer Athlan-Guyot
Test script. (894 bytes, application/x-shellscript)
2019-10-15 09:54 UTC, Sofer Athlan-Guyot


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1847780 0 None None None 2019-10-15 09:56:53 UTC
OpenStack gerrit 688846 0 'None' MERGED Workaround ovn cluster failure during update when schema change. 2020-04-10 09:48:27 UTC
Red Hat Product Errata RHBA-2019:4030 0 None None None 2019-12-02 10:11:54 UTC

Description Sofer Athlan-Guyot 2019-10-10 13:20:59 UTC
Description of problem:

Doing an update of OSP15 from GA to RHOS_TRUNK-15.0-RHEL-8-20190926.n.0.

The update goes fine and everything is working, but the ovn servers
are not in HA anymore; 2 of them are stopped:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.4-0eb7991564) - partition with quorum
Last updated: Wed Oct  9 12:05:36 2019
Last change: Wed Oct  9 10:06:25 2019 by root via crm_resource on controller-2

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

     podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
       galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
       galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
       galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
     podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
       rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
       rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
       rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
     podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
       redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
       redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
       redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
     ip-192.168.24.15       (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-1
     ip-172.17.1.72 (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-172.17.1.108        (ocf::heartbeat:IPaddr2):       Started controller-2
     ip-172.17.3.110        (ocf::heartbeat:IPaddr2):       Started controller-0
     ip-172.17.4.102        (ocf::heartbeat:IPaddr2):       Started controller-1
     podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
       haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
       haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-1
       haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Started controller-2
     podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
       ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped controller-0
       ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Stopped controller-1
       ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Master controller-2
     podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
       openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-1
    
    Failed Resource Actions:
    * ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct  9 09:12:56 2019', queued=0ms, exec=200002ms
    * ovndb_servers_start_0 on ovn-dbs-bundle-1 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct  9 09:42:35 2019', queued=0ms, exec=200002ms


    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled


For more information about what is happening and the consequences, see this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1759974#c4

I'm creating this bz for visibility, and because we may end up making a workaround here for the issue in 1759974.

Comment 4 Sofer Athlan-Guyot 2019-10-15 09:43:35 UTC
Created attachment 1625889 [details]
Ctlplane test logs.

Comment 5 Sofer Athlan-Guyot 2019-10-15 09:54:20 UTC
Created attachment 1625891 [details]
Test script.

Comment 6 Sofer Athlan-Guyot 2019-10-15 09:56:21 UTC
During the update, the ovndb servers can get a schema change.

The problem is that an updated slave ovndb will not connect to a
master that still has the old db schema.
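
To see the mismatch by hand, the schema versions can be compared on each
node. This is only a sketch: the container name follows the usual
pacemaker bundle naming and the NB socket path is the common default for
the OVN bundled with OVS 2.11; both may differ on a given deployment.

    # Run on each controller and compare the reported NB schema versions;
    # a mismatch between the Master and an updated Slave is the situation
    # described above (same idea for the SB database / ovnsb_db.sock).
    sudo podman exec ovn-dbs-bundle-podman-0 \
        ovsdb-client get-schema-version unix:/var/run/openvswitch/ovnnb_db.sock OVN_Northbound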

At some point (after 200000ms) pacemaker puts the resource in a Timed Out
error state. It then waits for the operator to clean up the resource.

This means the update can go like this:

- Original state (M = Master, S = Slave, F = Failed; old/new = before/after update): nothing updated
  - ctl0-M-old
  - ctl1-S-old
  - ctl2-S-old
- First state: after update of ctl0
  - ctl0-F-new
  - ctl1-M-old
  - ctl2-S-old
- Second state: after update of ctl1
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-old
- Third and final state: after update of ctl2
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-new

During the third state we have a *cut* in the control plane as ctl2 is
the master and there is no slave to fall back to.

After it's updated it becomes the Master, but we end up losing HA as
it's the only active node.

The error persists after a reboot. Only a =pcs resource cleanup= will
bring the cluster back online.
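
The manual recovery is the usual cleanup, run from one of the
controllers (a sketch; the resource name is the one visible in the pcs
status output above):

    # Clear the failed ovndb_servers start actions so pacemaker retries them.
    sudo pcs resource cleanup ovn-dbs-bundle
    # Then check that the bundle comes back with one Master and two Slaves.
    sudo pcs status | grep -A 4 ovn-dbs-bundle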

The real solution will come from ovndb and the associated ocf agent,
but in the meantime we need a workaround, as the next fast-track
shipment is around the end of November.

Now, for the cuts.

First, we note that each time we have to migrate the master to another
node, we lose the control plane for around a minute until the new
master settles on another node.  In the worst case scenario (which is
the most likely one[1] and is the one described above), where the
update starts on the node holding the Master, this implies a one-minute
cut in the ctl plane during both the first and the second state.

Then, given the current behaviour, we have a last cut that lasts around
5 minutes: the time it takes from stopping the Master ovndb server on
ctl2, updating its image and restarting it.

The attachment shows the result of the test.  The test
(test-ctlplane.sh) associates and disassociates a floating ip to
an existing instance in a loop during the whole update.
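
In essence the loop looks like the following (a simplified sketch, not
the attached script; the instance name and floating ip are placeholders):

    # Exercise the neutron/OVN control plane during the whole update.
    while true; do
        openstack server add floating ip test-instance 10.0.0.210 \
            && echo "$(date -u) OK add"    || echo "$(date -u) FAILURE add"
        openstack server remove floating ip test-instance 10.0.0.210 \
            && echo "$(date -u) OK remove" || echo "$(date -u) FAILURE remove"
        sleep 1
    done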

The failures are shown with "FAILURE"; the "Unknown" ones should be
investigated but are not the primary concern.  We can see 3 FAILURE
periods, with the longest one lasting around 5 minutes.

[1] as the master is on the bootstrap node, usually ctl-0, and during
    the update we start by default with ctl-0

Comment 7 Sofer Athlan-Guyot 2019-10-25 07:54:00 UTC
Refine to the exact version needed.

Comment 14 errata-xmlrpc 2019-12-02 10:11:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4030

