Description of problem:

OVN can lose its database when the VIP is migrated off a down node onto an active node and the OVN DB server on that node tries to replicate from itself.

Reproduce steps

Find the controller node where OVN is master

  Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-2
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave overcloud-controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-3

Stop pcs on the controller where OVN is master

  [root@overcloud-controller-2 heat-admin]# pcs cluster stop

Another node is promoted to master

  Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master overcloud-controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-3

Note that ovn-dbs-bundle-0 is usually the master container, but the master has now changed to ovn-dbs-bundle-1.

Restart the cluster on the previously stopped node

  [root@overcloud-controller-2 heat-admin]# pcs cluster start

Now the master node does not change; ec-test-clt02 comes back as slave.

  Docker container set: ovn-dbs-bundle [satellite.localdomain:5000/osp13-containers-ovn-northd:13.0-hotfix-bz1766410-v2-20200110.1578623869]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Slave overcloud-controller-2
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master overcloud-controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-3

Update the image of the PCS resource ovn-dbs-bundle

Note that this is not a real update: the image is the same, but it has a different tag.

  [root@overcloud-controller-1 heat-admin]# docker images |grep ovn-n
  satellite.localdomain:5000/osp13-containers-ovn-northd 13.0-hotfix-bz1766410-v2-20200110.1578623869 e283e80ff8aa 3 months ago 869 MB
  capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd pcmklatest e283e80ff8aa 3 months ago 869 MB

  img=capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest
  pcs resource bundle update ovn-dbs-bundle container image=$img

  Docker container set: ovn-dbs-bundle [capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest]
    ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-2
    ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave overcloud-controller-1
    ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-3

After the service has restarted, the OVN DBs are lost

  [root@overcloud-controller-1 heat-admin]# docker ps |grep ovn-db
  e56d5178a76b capsule.localdomain:5000/production-osp_containers-osp13-containers-ovn-northd:pcmklatest "dumb-init --singl..." About an hour ago Up About an hour ovn-dbs-bundle-docker-1

  # docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl show |wc -l
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list Logical_Switch_Port |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list acl |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-nbctl list port_group |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Port_binding |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Datapath_binding |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list MAC_binding |grep -c _uuid
  0
  # docker exec -it ovn-dbs-bundle-docker-1 ovn-sbctl list Address_Set |grep -c _uuid
  0

Relevant log lines below

  Apr 23 08:43:45 overcloud-controller-3 crmd[42840]: notice: Initiating stop operation ip-10.10.10.101_stop_0 on overcloud-controller-2
  Apr 23 08:44:00 overcloud-controller-3 pengine[42839]: notice: * Start ip-10.10.10.101 ( overcloud-controller-1 )
  2020-04-23T08:44:01.019Z|00035|ovsdb_error|ERR|unexpected ovsdb error: Server ID check failed: Self replicating is not allowed

It seems that the Internal API VIP (10.10.10.101) temporarily moved to ctl01 while the ovn-dbs-bundle there was still a slave (or had been restarted as a slave), causing it to replicate from itself, which eventually resulted in the DB loss.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

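The row-count checks in the description can be wrapped in a small script so they are easy to repeat after each failover. This is only a minimal sketch: the function names count_rows and report_ovn_tables are hypothetical, the container name defaults to the one shown in the report, and the -it flags are dropped on the assumption that the script runs without a terminal. A count of 0 in a single table (for example MAC_binding) is not necessarily a fault; the report treats 0 across all tables as the sign of data loss.

```shell
#!/bin/sh
# Hypothetical helper: count the rows in "ovn-nbctl list" / "ovn-sbctl list"
# output read from stdin. Each row contains one _uuid line, as in the report.
count_rows() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c '_uuid' || true
}

# Hypothetical wrapper: print the row count of every table checked in the report.
# $1 is the container name; it defaults to the name shown in the report.
report_ovn_tables() {
    container="${1:-ovn-dbs-bundle-docker-1}"
    for table in Logical_Switch_Port acl port_group; do
        n=$(docker exec "$container" ovn-nbctl list "$table" | count_rows)
        echo "NB $table: $n"
    done
    for table in Port_binding Datapath_binding MAC_binding Address_Set; do
        n=$(docker exec "$container" ovn-sbctl list "$table" | count_rows)
        echo "SB $table: $n"
    done
}
```

After sourcing the file on the controller that hosts the master, running report_ovn_tables (optionally with another container name as its argument) prints one line per table.
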
I see the customer uses openvswitch-2.9.0-97.el7fdp.x86_64.

There was a patch in OVS 2.11 that alleviates the race condition of losing the database:

https://github.com/openvswitch/ovs/commit/ecf44dd3b26904edf480ada1c72a22fadb6b1825

The customer should update to 2.11.
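
To see quickly whether a node is still below the recommended version, something like the following could be used. It is a sketch under assumptions: version_ge and check_ovs_version are hypothetical names, the comparison relies on "sort -V" from GNU coreutils, and the 2.11 threshold is taken from the comment above rather than from any packaging metadata.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Uses "sort -V" (version sort, GNU coreutils) to order the two strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical check: compare an Open vSwitch version string against 2.11,
# the release named above as carrying the fix.
check_ovs_version() {
    ver="${1%%-*}"    # drop any "-release" suffix, e.g. 2.9.0-97.el7fdp -> 2.9.0
    if version_ge "$ver" "2.11"; then
        echo "openvswitch $ver: at or above 2.11"
    else
        echo "openvswitch $ver: older than 2.11, update recommended"
        return 1
    fi
}
```

On a controller this could be called as check_ovs_version "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' openvswitch)", assuming the package is named openvswitch as in the version string quoted above.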