Bug 1828287 - ovn can lose its database when vip is migrated off a down node and moved on an active node and ovn db tries to self replicate to itself.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-27 14:04 UTC by David Hill
Modified: 2023-09-07 23:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-11 08:34:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-28420 0 None None None 2023-09-07 23:02:07 UTC
Red Hat Knowledge Base (Solution) 5018411 0 None None None 2020-04-27 14:33:27 UTC

Description David Hill 2020-04-27 14:04:23 UTC
Description of problem:
OVN can lose its database when the VIP is migrated off a down node onto an active node and the OVN DB server on that node tries to replicate from itself.

Reproduce steps

Find the controller node where OVN is master


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Master overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3
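
The master can also be picked out of the `pcs status` output automatically. This is a minimal sketch: `ovn_master_node` is a hypothetical helper name, and it assumes the resource lines keep the column layout shown above (bundle name, resource agent, role, node).

```shell
# Hypothetical helper: print the node that hosts the ovndb-servers master.
# Reads `pcs status` text on stdin and assumes the column layout shown above.
ovn_master_node() {
    awk '/ocf::ovn:ovndb-servers/ && $3 == "Master" { print $4 }'
}

# Usage on a controller (not executed here):
#   pcs status | ovn_master_node
```

A stopped bundle has no node column, so it is skipped by the role check.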

Stop pcs on the controller where OVN is master


[root@overcloud-controller-2 heat-admin]# pcs cluster stop
Another node is promoted to master


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Stopped
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Master overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3
Note that ovn-dbs-bundle-0 is usually the master container, but the master is now ovn-dbs-bundle-1.



Restart cluster on the previously stopped node


[root@overcloud-controller-2 heat-admin]# pcs cluster start
The master node does not change this time; the restarted node (ec-test-clt02) comes back as slave.


 Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Master overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3

Update the image of the PCS resource ovn-dbs-bundle


Note that this is not a real update: the image is the same but has a different tag.


[root@overcloud-controller-1 heat-admin]# docker images |grep ovn-n
satellite.localdomain:5000/osp13-containers-ovn-northd                                                 13.0-hotfix-bz1766410-v2-20200110.1578623869   e283e80ff8aa        3 months ago        869 MB
capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd                pcmklatest                                     e283e80ff8aa        3 months ago        869 MB
img=capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest
pcs resource bundle update ovn-dbs-bundle container image=$img
 Docker container set: ovn-dbs-bundle [capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0    (ocf::ovn:ovndb-servers):    Master overcloud-controller-2
   ovn-dbs-bundle-1    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-1
   ovn-dbs-bundle-2    (ocf::ovn:ovndb-servers):    Slave overcloud-controller-3

After the service has restarted, the OVN DBs are lost


[root@overcloud-controller-1 heat-admin]# docker ps |grep ovn-db
e56d5178a76b        capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest              "dumb-init --singl..."   About an hour ago   Up About an hour                            ovn-dbs-bundle-docker-1
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl show |wc -l
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list Logical_Switch_Port  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list acl  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list port_group  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Port_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Datapath_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list MAC_binding  |grep -c _uuid
0
# docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Address_Set  |grep -c _uuid
0
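
The same checks can be run as a loop. This is only a sketch: `count_rows` and `check_ovn_tables` are hypothetical helper names, the table list is the one used above, and `-it` is dropped from `docker exec` so the loop also works without a TTY.

```shell
# Count rows in `ovn-nbctl list` / `ovn-sbctl list` output read from stdin.
# grep -c exits non-zero when nothing matches, so `|| true` keeps the
# function usable under `set -e` while still printing 0.
count_rows() {
    grep -c _uuid || true
}

# Hypothetical wrapper: print the row count of each table checked above.
# $1 is the container name, e.g. ovn-dbs-bundle-docker-1.
check_ovn_tables() {
    container=$1
    for t in Logical_Switch_Port acl port_group; do
        printf 'nb %s: ' "$t"
        docker exec "$container" ovn-nbctl list "$t" | count_rows
    done
    for t in Port_binding Datapath_binding MAC_binding Address_Set; do
        printf 'sb %s: ' "$t"
        docker exec "$container" ovn-sbctl list "$t" | count_rows
    done
}

# Usage (not executed here):
#   check_ovn_tables ovn-dbs-bundle-docker-1
```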
Relevant log lines below:


Apr 23 08:43:45 overcloud-controller-3 crmd[42840]:  notice: Initiating stop operation ip-10.10.10.101_stop_0 on overcloud-controller-2
Apr 23 08:44:00 overcloud-controller-3 pengine[42839]:  notice:  * Start      ip-10.10.10.101                     (                      overcloud-controller-1 )
2020-04-23T08:44:01.019Z|00035|ovsdb_error|ERR|unexpected ovsdb error: Server ID check failed: Self replicating is not allowed
It seems the Internal API VIP (10.10.10.101) temporarily moved to ctl01 while the ovn-dbs-bundle there was still a slave (or had restarted as a slave), causing it to replicate from itself, which eventually resulted in the DB loss.
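
One way to spot this condition on a node is to compare the address the slave replicates from with the addresses configured locally. The sketch below is an assumption about how one might check, not part of the resource agent: `target_is_local` is a hypothetical helper, it handles only IPv4 targets of the form tcp:IP:PORT, and port 6641 in the usage line is the usual northbound DB port rather than a value taken from this report.

```shell
# Return 0 when the host part of an OVSDB target (tcp:IP:PORT) is found in
# the list of local addresses supplied on stdin, one address per line.
# IPv4 only; bracketed IPv6 targets are not handled in this sketch.
target_is_local() {
    host=$(printf '%s\n' "$1" | cut -d: -f2)
    grep -qxF "$host"
}

# Usage on a slave node (not executed here):
#   ip -o -4 addr show | awk '{ split($4, a, "/"); print a[1] }' \
#       | target_is_local tcp:10.10.10.101:6641 \
#       && echo "replication target is a local address"
```

If the check succeeds while the local server is still running as a slave, the server is pointed at itself, which matches the "Self replicating is not allowed" error in the log.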




Comment 3 Jakub Libosvar 2020-06-11 08:32:33 UTC
I see the customer uses openvswitch-2.9.0-97.el7fdp.x86_64

There was a patch in OVS 2.11 that alleviates the race condition of losing the database:
https://github.com/openvswitch/ovs/commit/ecf44dd3b26904edf480ada1c72a22fadb6b1825

The customer should update to 2.11.
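
A quick way to check whether a node already runs a new enough build, as a sketch: `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and the package name in the usage line is assumed from the NVR quoted above.

```shell
# Return 0 when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Usage (not executed here):
#   installed=$(rpm -q --qf '%{VERSION}\n' openvswitch)
#   version_ge "$installed" 2.11 || echo "openvswitch $installed predates 2.11"
```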

