Description of problem:

Doing an update of OSP15 from GA to RHOS_TRUNK-15.0-RHEL-8-20190926.n.0. Everything goes fine and everything is working, but the ovn servers are no longer in HA; 2 of them are stopped:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.4-0eb7991564) - partition with quorum
Last updated: Wed Oct 9 12:05:36 2019
Last change: Wed Oct 9 10:06:25 2019 by root via crm_resource on controller-2

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0   (ocf::heartbeat:galera):   Master controller-0
   galera-bundle-1   (ocf::heartbeat:galera):   Master controller-1
   galera-bundle-2   (ocf::heartbeat:galera):   Master controller-2
 podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started controller-2
 podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0   (ocf::heartbeat:redis):   Master controller-0
   redis-bundle-1   (ocf::heartbeat:redis):   Slave controller-1
   redis-bundle-2   (ocf::heartbeat:redis):   Slave controller-2
 ip-192.168.24.15   (ocf::heartbeat:IPaddr2):   Started controller-0
 ip-10.0.0.110   (ocf::heartbeat:IPaddr2):   Started controller-1
 ip-172.17.1.72   (ocf::heartbeat:IPaddr2):   Started controller-0
 ip-172.17.1.108   (ocf::heartbeat:IPaddr2):   Started controller-2
 ip-172.17.3.110   (ocf::heartbeat:IPaddr2):   Started controller-0
 ip-172.17.4.102   (ocf::heartbeat:IPaddr2):   Started controller-1
 podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0   (ocf::heartbeat:podman):   Started controller-0
   haproxy-bundle-podman-1   (ocf::heartbeat:podman):   Started controller-1
   haproxy-bundle-podman-2   (ocf::heartbeat:podman):   Started controller-2
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0   (ocf::ovn:ovndb-servers):   Stopped controller-0
   ovn-dbs-bundle-1   (ocf::ovn:ovndb-servers):   Stopped controller-1
   ovn-dbs-bundle-2   (ocf::ovn:ovndb-servers):   Master controller-2
 podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0   (ocf::heartbeat:podman):   Started controller-1

Failed Resource Actions:
* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='', last-rc-change='Wed Oct 9 09:12:56 2019', queued=0ms, exec=200002ms
* ovndb_servers_start_0 on ovn-dbs-bundle-1 'unknown error' (1): call=8, status=Timed Out, exitreason='', last-rc-change='Wed Oct 9 09:42:35 2019', queued=0ms, exec=200002ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

For more information about what is happening and its consequences, see this comment:
https://bugzilla.redhat.com/show_bug.cgi?id=1759974#c4

I'm filing this bz for visibility, and because we may end up implementing a workaround here for the issue tracked in bug 1759974.
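For reference, a minimal sketch of the manual recovery described further down (the =pcs resource cleanup= step), assuming the resource names shown in the pcs status output above; run on one controller once all nodes carry the new image:

  # show the failed start actions on the ovndb-servers replicas
  pcs status | grep -A4 'Failed Resource Actions'

  # clear the failed actions so pacemaker retries starting the stopped replicas
  pcs resource cleanup ovn-dbs-bundle

  # verify that all three replicas come back as Master/Slave
  pcs status | grep -A4 ovn-dbs-bundle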
Created attachment 1625889 [details] Ctlplane test logs.
Created attachment 1625891 [details] Test script.
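For readers without access to the attachment, a simplified sketch of what the test loop in test-ctlplane.sh does, with placeholder server and floating ip names; the actual attached script may differ:

  #!/bin/bash
  # Continuously exercise the OVN control plane during the update by
  # attaching and detaching a floating ip on an existing instance.
  # SERVER and FIP are placeholders, not the values used in the real test.
  SERVER=test-instance
  FIP=10.0.0.210

  while true; do
      ts=$(date +'%Y-%m-%d %H:%M:%S')
      if openstack server add floating ip "$SERVER" "$FIP" &&
         openstack server remove floating ip "$SERVER" "$FIP"; then
          echo "$ts SUCCESS"
      else
          echo "$ts FAILURE"
      fi
      sleep 5
  done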
During the update the ovndb server can have a schema change. The problem is that an updated slave ovndb will not connect to a master which still has the old db schema. At some point (200000ms) pacemaker puts the resource in a Timed Out error state and then waits for the operator to clean up the resource. This means the update can go like this:

- Original state (M=Master, S=Slave, F=Failed), nothing updated:
  - ctl0-M-old
  - ctl1-S-old
  - ctl2-S-old
- First state, after update of ctl0:
  - ctl0-F-new
  - ctl1-M-old
  - ctl2-S-old
- Second state, after update of ctl1:
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-old
- Third and final state, after update of ctl2:
  - ctl0-F-new
  - ctl1-F-new
  - ctl2-M-new

During the third state we have a *cut* in the control plane, as ctl2 is the master and there is no slave to fall back to. After it is updated it becomes the Master again, but we end up losing HA as it is the only active node. The error persists after a reboot; only a =pcs resource cleanup= will bring the cluster back online.

The real solution will come from ovndb and the associated ocf agent, but in the meantime we need a workaround, as the next fasttrack shipment is around the end of November.

Now, for the cuts. First, we note that each time the master has to migrate to another node we lose the control plane for around a minute, until the new master settles on another node. In the worst-case scenario (which is the most likely one[1] and is the one described above), where we start on the Master, this implies a one-minute cut in the control plane during each of the first and second states. On top of that there is a final cut that lasts around 5 minutes: the time it takes to stop the Master ovndb server on ctl2, update its image and restart it.

The attachments show the result of the test. The test (test-ctlplane.sh) associates and disassociates a floating ip on an existing instance in a loop during the whole update. The failures are marked with "FAILURE"; the "Unknown" ones should be investigated but are not the primary concern. We can see 3 FAILURE periods, with the longest one lasting around 5 minutes.

[1] As the master is on the bootstrap node, usually ctl-0, and the update starts by default on ctl-0.
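One possible way to confirm the schema mismatch between an updated replica and the still-old master is to compare the schema version shipped in the updated image with the version of the running NB database. This is only a sketch: the container name, the schema file path and the NB connection string ($OVN_NB_IP, port 6641) are assumptions and may differ on a given deployment.

  # schema version shipped with the updated ovn packages inside the container
  # (the path may be /usr/share/ovn/ovn-nb.ovsschema on other builds)
  podman exec ovn-dbs-bundle-podman-2 \
      ovsdb-tool schema-version /usr/share/openvswitch/ovn-nb.ovsschema

  # schema version actually served by the running master NB database
  ovsdb-client get-schema-version tcp:$OVN_NB_IP:6641 OVN_Northbound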
Refine to the exact version needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4030