1828287 – OVN can lose its database when the VIP is migrated off a down node to an active node and the OVN DB tries to replicate from itself.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openvswitch
Sub Component:
Version:	13.0 (Queens)
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Jakub Libosvar
QA Contact:	Eran Kuris
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-27 14:04 UTC by David Hill
Modified:	2023-09-07 23:02 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-06-11 08:34:50 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-28420	0	None	None	None	2023-09-07 23:02:07 UTC
Red Hat Knowledge Base (Solution)	5018411	0	None	None	None	2020-04-27 14:33:27 UTC

Description David Hill 2020-04-27 14:04:23 UTC

Description of problem:
OVN can lose its database when the VIP is migrated off a down node to an active node and the OVN DB on that node tries to replicate from itself.

Steps to reproduce:

Find the controller node where OVN is master


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Master overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3
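
The lookup in this step can be scripted. This is a minimal sketch, assuming the resource listing has the same layout as the excerpt above (as printed by `pcs status`); `find_ovn_master` is a hypothetical helper name, not part of pcs.

```shell
# find_ovn_master: hypothetical helper (not a pcs command).
# Reads a pcs resource listing on stdin and prints the node that currently
# runs the ovndb-servers instance in the Master role.
# Assumes the column layout shown in this report:
#   <bundle-name> (ocf::ovn:ovndb-servers): <role> <node>
find_ovn_master() {
    awk '/ocf::ovn:ovndb-servers/ && $3 == "Master" { print $4 }'
}
```

It would be used as `pcs status | find_ovn_master`. Instances in the Stopped state have no node column and are skipped.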

Stop pcs on the controller where OVN is master


[root@overcloud-controller-2 heat-admin]# pcs cluster stop
Another node is promoted to master


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Stopped
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Master overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3
Note that although ovn-dbs-bundle-0 is usually the master container, the master is now ovn-dbs-bundle-1.



Restart cluster on the previously stopped node


[root@overcloud-controller-2 heat-admin]# pcs cluster start
The master node does not change now; overcloud-controller-2 (ec-test-clt02) comes back as slave.


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Master overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3

Update the image of the PCS resource ovn-dbs-bundle


Note that this is not a real update: the image is the same and only the tag differs.


[root@overcloud-controller-1 heat-admin]# docker images |grep ovn-n
satellite.localdomain:5000/osp13-containers-ovn-northd                                                 13.0-hotfix-bz1766410-v2-20200110.1578623869   e283e80ff8aa        3 months ago        869 MB
capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd                pcmklatest                                     e283e80ff8aa        3 months ago        869 MB
img=capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest
pcs resource bundle update ovn-dbs-bundle container image=$img
 Docker container set: ovn-dbs-bundle [capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Master overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3

After the service has restarted, the OVN DBs are lost


[root@overcloud-controller-1 heat-admin]# docker ps |grep ovn-db
e56d5178a76b        capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest              "dumb-init --singl..."   About an hour ago   Up About an hour                            ovn-dbs-bundle-docker-1
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl show |wc -l
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list Logical_Switch_Port  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list acl  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list port_group  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Port_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Datapath_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list MAC_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Address_Set  |grep -c _uuid
0
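
The per-table checks above can be wrapped in a loop. This is a rough sketch under the same assumptions as the transcript (a running container such as ovn-dbs-bundle-docker-1, with `ovn-nbctl` and `ovn-sbctl` available inside it); `count_rows` and `check_ovn_tables` are hypothetical helper names.

```shell
# count_rows: hypothetical helper. Counts the records in `ovn-nbctl list` or
# `ovn-sbctl list` output read from stdin by counting the _uuid line that
# starts each record. Prints 0 (and still succeeds) when the table is empty.
count_rows() {
    grep -c '^_uuid' || true
}

# check_ovn_tables: hypothetical helper. Prints one row count per table for
# the container named in $1, using the table names from this report.
check_ovn_tables() {
    container=$1
    for table in Logical_Switch_Port acl port_group; do
        printf 'NB %s: %s\n' "$table" \
            "$(docker exec "$container" ovn-nbctl list "$table" | count_rows)"
    done
    for table in Port_binding Datapath_binding MAC_binding Address_Set; do
        printf 'SB %s: %s\n' "$table" \
            "$(docker exec "$container" ovn-sbctl list "$table" | count_rows)"
    done
}
```

Running `check_ovn_tables ovn-dbs-bundle-docker-1` on the affected node would be expected to report 0 for every table, matching the output above.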
Relevant log lines below:


Apr 23 08:43:45 overcloud-controller-3 crmd[42840]:  notice: Initiating stop operation ip-10.10.10.101_stop_0 on overcloud-controller-2
Apr 23 08:44:00 overcloud-controller-3 pengine[42839]:  notice:  * Start      ip-10.10.10.101                     (                      overcloud-controller-1 )
2020-04-23T08:44:01.019Z|00035|ovsdb_error|ERR|unexpected ovsdb error: Server ID check failed: Self replicating is not allowed
It seems the Internal API VIP (10.10.10.101) temporarily moved to ctl01 while the ovn-dbs-bundle instance there was still a slave (or had restarted as a slave), causing it to replicate from itself, which eventually resulted in the loss of the database.
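
To spot the same condition on other nodes, the error string from the log excerpt can be searched for directly. This is a small sketch; `has_self_replication_error` is a hypothetical helper name, and the location of the ovsdb-server log is not given in this report, so the caller must supply it.

```shell
# has_self_replication_error: hypothetical helper. Exits 0 if the log text on
# stdin contains the ovsdb-server error quoted in this report, non-zero
# otherwise. The log file location is deployment-specific and is an input.
has_self_replication_error() {
    grep -q 'Self replicating is not allowed'
}
```

For example, `has_self_replication_error < "$logfile" && echo "self-replication detected"`, where `$logfile` is a placeholder for the ovsdb-server log of the node under investigation.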




Comment 3 Jakub Libosvar 2020-06-11 08:32:33 UTC

I see the customer uses openvswitch-2.9.0-97.el7fdp.x86_64

There was a patch in OVS 2.11 that alleviates the race condition of losing the database:
https://github.com/openvswitch/ovs/commit/ecf44dd3b26904edf480ada1c72a22fadb6b1825

The customer should update to 2.11.
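
A quick way to check whether a node still runs a build older than 2.11 is to compare the installed package version. This is only a sketch: `ovs_version_at_least` is a hypothetical helper name, it assumes the package string looks like the one quoted above (optionally with a version suffix in the package name, which is an assumption), and it relies on GNU `sort -V`.

```shell
# ovs_version_at_least: hypothetical helper.
#   $1: package string, e.g. openvswitch-2.9.0-97.el7fdp.x86_64
#   $2: minimum version, e.g. 2.11
# Exits 0 if the package version is >= the minimum, 1 if it is older,
# and 2 if no version could be extracted from $1.
ovs_version_at_least() {
    ver=$(printf '%s\n' "$1" |
        sed -n -E 's/^openvswitch[0-9.]*-([0-9]+(\.[0-9]+)*)-.*/\1/p')
    [ -n "$ver" ] || return 2
    # With version sort, the minimum comes first only if ver >= minimum.
    lowest=$(printf '%s\n%s\n' "$2" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example, `ovs_version_at_least "$(rpm -q openvswitch)" 2.11 || echo "update needed"`. For the customer's openvswitch-2.9.0-97.el7fdp.x86_64 the check fails, since 2.9.0 is older than 2.11.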
