Description of problem:

When controller-0 is reset and controller-1 is chosen as master, the IPAddr2 resource is moved to controller-1 (as there is a colocation constraint set). The ovsdb-servers on controller-1, which were previously running as standby, report their status as active as soon as the IPAddr2 is configured. The ovsdb_server_promote function then returns at L412 and we never record the active master (by running "$CRM_MASTER -N $host_name -v ${master_score}"). When notify is called with the "post-promote" op, since we did not record the master node, the check at L150 (https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L150) evaluates to false and we demote again. This results in a loop with pacemaker promoting controller-1 and controller-2 alternately.

Version-Release number of selected component (if applicable):
openvswitch-ovn-common-2.9.0-20.el7fdp.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Forcefully reset the Master node of the ovndb-servers resource on an OpenStack deployment with OVN deployed.

Actual results:
No other node (of those in standby mode) is successfully promoted to Master. It keeps failing, promoting/demoting in a loop.

Expected results:
One of the slave nodes should be promoted successfully to Master.

Additional info:
As a result of the bug, much neutron-related functionality does not work, including spawning new instances and many other operations.
Not sure if we want to make this a blocker, since the issue is only seen when the node is brought down ungracefully.
Submitted the patch to fix the issue - https://patchwork.ozlabs.org/patch/915289/
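To make the fix easier to follow, here is a minimal sketch of the idea, written in shell like the ovndb-servers.ocf agent. The ovsdb_server_promote name, the ovsdb_server_check_status helper and the CRM_MASTER/host_name/master_score variables are the ones referenced in the description above; the control flow below is a simplified assumption, not the literal upstream code - see the patchwork link for the real change.

  ovsdb_server_promote() {
      local rc

      ovsdb_server_check_status
      rc=$?

      if [ $rc -eq $OCF_RUNNING_MASTER ]; then
          # $OCF_RUNNING_MASTER comes from the standard OCF shell functions.
          # The ovsdb-servers already report "active" because the IPAddr2
          # resource has just moved to this node.  Before taking the early
          # return, this node must still be recorded as master; otherwise the
          # "post-promote" notify handler finds no recorded master and demotes
          # the node again, producing the promote/demote loop described in
          # this report.
          ${CRM_MASTER} -N ${host_name} -v ${master_score}
          return $OCF_SUCCESS
      fi

      # ... the normal promotion path (promote the servers to active and then
      # record the master score) follows here in the real agent ...
  }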
Does this only happen when we forcibly reset the master node? If a failure occurs in ovsdb-server (or the container is stopped), does the failover take place smoothly? Then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?
The patch is merged in OVS master and branch-2.9 (and is part of the 2.9.2 tag) - https://github.com/openvswitch/ovs/commit/c16e265713bef1c701bfce7608d68ab11695e286
I am assigning the BZ to the openvswitch component.
(In reply to Daniel Alvarez Sanchez from comment #5)
> Does this only happen when we forcibly reset the master node?

That's correct, at least in my tests, but I believe I saw a mention from Numan that he reproduces it with a clean shutdown of the ovsdb-server master through pacemaker.

> If a failure occurs in ovsdb-server (or the container is stopped), does the
> failover take place smoothly?

I do not think so. If the container is stopped outside of pacemaker (i.e. using the docker command), I believe it may behave the same way, i.e. with the buggy behaviour.

> Then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?

I managed to get to the point where the above bug could be hit because the bug in this BZ report was not 100% reproducible.
Backported in the internal build: openvswitch-2.9.0-40.el7fdn
Brew: http://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16484769

Please do not crosstag FDN builds.
We need this along with the next zstream release of OSP13. Will we pick this from upstream, or is it already backported downstream? Can we set the "Fixed in version" and status to MODIFIED?
Hi, it was fixed in openvswitch-2.9.0-40.el7fdn and FDP 18.06 is aligned to openvswitch-2.9.0-47.el7fdp, so it'll include the fix
We'll want to bump OVS (+OVN) for z2.
I reproduce the issue so I have to re-open it:

[root@controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec 11 12:55:36 2018
Last change: Mon Dec 10 13:56:29 2018 by root via cibadmin on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.10       (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.114  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.29 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.14 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.3.24 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.13 (ocf::heartbeat:IPaddr2):       Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp13/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Slave controller-1
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

forced reset the master node with virt-manager

OpenStack/13.0-RHEL-7/2018-12-07.1

[root@controller-1 ~]# rpm -qa | grep -i openvs
openvswitch-ovn-central-2.9.0-81.el7fdp.x86_64
openvswitch-2.9.0-81.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-81.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
openvswitch-ovn-common-2.9.0-81.el7fdp.x86_64
openstack-neutron-openvswitch-12.0.5-2.el7ost.noarch
python-openvswitch-2.9.0-81.el7fdp.x86_64
(In reply to Eran Kuris from comment #25)
> I reproduce the issue so I have to re-open it:

That reminds me it could be another bug - https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update the pacemaker package on the controller nodes to at least version pacemaker-1.1.19-8.el7_6.2.
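A hedged example of how that update could be applied on each controller node (the package glob is an assumption; exact repositories and versions depend on the deployment):

  # Update all pacemaker packages and confirm the installed version
  sudo yum update 'pacemaker*'
  rpm -q pacemaker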
(In reply to Marian Krcmarik from comment #26)
> (In reply to Eran Kuris from comment #25)
> > I reproduce the issue so I have to re-open it:
>
> That reminds me it could be another bug -
> https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update
> the pacemaker package on the controller nodes to at least version
> pacemaker-1.1.19-8.el7_6.2.

Which bug does it remind you of? You attached this bug's ID.
(In reply to Eran Kuris from comment #27)
> (In reply to Marian Krcmarik from comment #26)
> > (In reply to Eran Kuris from comment #25)
> > > I reproduce the issue so I have to re-open it:
> >
> > That reminds me it could be another bug -
> > https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update
> > the pacemaker package on the controller nodes to at least version
> > pacemaker-1.1.19-8.el7_6.2.
>
> Which bug does it remind you of? You attached this bug's ID.

Sorry, I meant https://bugzilla.redhat.com/show_bug.cgi?id=1654602.
You might want to check whether you hit any SELinux issues when you reproduced this issue.
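For example (a generic check, not something captured in this report), AVC denials hit around the reproduction window can be listed on the affected controller with ausearch:

  # List recent SELinux AVC denials (adjust the time window as needed)
  ausearch -m avc -ts recent
  # Optionally narrow down to OVS/OVN related denials
  ausearch -m avc -ts today | grep -iE 'ovs|ovn'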
According to our records, this should be resolved by openvswitch-2.9.0-83.el7fdp.1. This build is available now.
Verified on puddle 13.0-RHEL-7/2019-03-01.1 with openvswitch-2.9.0-97.el7fdp.x86_64.

Setup: environment with 3 controller and 2 compute nodes, fencing enabled.

Verified that after resetting the ovndb-servers master node another node is promoted to master. Tested with different reset types: "echo o > /proc/sysrq-trigger", force stop/reset by virt-manager, "pcs cluster stop controller-X", "sudo docker stop ovn-dbs-bundle-docker-0", "shutdown -h now". Checked that it is possible to create/delete a VM after each reset and that the VM has access to the network (an illustrative status check is sketched after the package lists below).

[heat-admin@controller-0 ~]$ rpm -qa | grep openvswitch
openvswitch-2.9.0-97.el7fdp.x86_64
openvswitch-ovn-central-2.9.0-97.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-97.el7fdp.x86_64
openstack-neutron-openvswitch-12.0.5-4.el7ost.noarch
python-openvswitch-2.9.0-97.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch
openvswitch-ovn-host-2.9.0-97.el7fdp.x86_64

[heat-admin@controller-0 ~]$ rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-remote-1.1.19-8.el7_6.4.x86_64
pacemaker-cli-1.1.19-8.el7_6.4.x86_64
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.4.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
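For illustration only (this exact command is an assumption, not copied from the verification run), the promotion after each reset can be confirmed from any surviving controller by looking at the ovn-dbs bundle in the cluster status and expecting exactly one replica in the Master role:

  # Show the ovn-dbs bundle replicas and their roles (Master/Slave/Stopped)
  sudo pcs status | grep -A 3 'ovn-dbs-bundle'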