Bug 1579025
Summary: | [OVN][RA] Forceful reset of OVN master node results in promotion/demotion loop of ovndb-servers on other nodes | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
Component: | openvswitch | Assignee: | Assaf Muller <amuller> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Roman Safronov <rsafrono> |
Severity: | urgent | Docs Contact: | |
Priority: | high | |
Version: | 13.0 (Queens) | CC: | amuller, apevec, chrisw, dalvarez, jschluet, lhh, majopela, mariel, mkrcmari, nusiddiq, ragiman, rhos-maint, rkhan, shdunne, srevivo, takito, tredaelli |
Target Milestone: | z4 | Keywords: | TestOnly, Triaged, ZStream |
Target Release: | 13.0 (Queens) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | openvswitch-2.9.0-70.el7fdp.1 | Doc Type: | Known Issue |
Doc Text: |
The OVN pacemaker Resource Agent (RA) script sometimes does not handle the promotion action properly when pacemaker tries to promote a slave node. This is seen when the ovsdb-servers report their status as master to the RA script once the master IP is moved to the node. The issue is fixed upstream.
When the issue occurs, the neutron server is unable to connect to the OVN North and South DB servers, and all Create/Update/Delete API calls to the neutron server fail.
Restarting the ovn-dbs-bundle resource resolves the issue. Run the following command on one of the controller nodes (a short usage sketch follows this table):
"pcs resource restart ovn-dbs-bundle"
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2019-03-15 10:33:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | 1658631 | |
Bug Blocks: | | |
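As a usage note for the workaround in the Doc Text above, a minimal sketch, run as root on any controller node; the `grep` pattern is only a convenience for locating the relevant resource, not part of the documented procedure:

    # Inspect the ovn-dbs replicas first: a healthy cluster shows exactly one
    # Master; replicas flapping between Slave and Master indicate this bug.
    pcs status | grep -A4 'ovn-dbs-bundle'

    # Documented workaround: restart the whole bundle resource.
    pcs resource restart ovn-dbs-bundle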
Description
Marian Krcmarik, 2018-05-16 19:42:35 UTC
Not sure if we want to make this a blocker, since the issue is seen only when the node is brought down ungracefully. Submitted the patch to fix the issue: https://patchwork.ozlabs.org/patch/915289/

Does this only happen when we forcibly reset the master node? If a failure occurred in ovsdb-server (or the container was stopped), would the failover take place smoothly? Then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?

The patch is merged in ovs master and branch 2.9 (and is part of the 2.9.2 tag): https://github.com/openvswitch/ovs/commit/c16e265713bef1c701bfce7608d68ab11695e286. I am assigning the BZ to the openvswitch component.

(In reply to Daniel Alvarez Sanchez from comment #5)
> Does this only happen when we forcibly reset the master node?

That's correct, at least in my tests, but I believe I saw a mention from Numan that he reproduces it with a clean shutdown of the ovsdb-server master through pacemaker.

> If a failure would occur in ovsdb-server (or the container is stopped), the failover takes place smoothly?

I do not think so. I believe that if the container is stopped outside of pacemaker (i.e. using the docker command) it may behave the same, i.e. with the buggy behaviour.

> then we'd hit https://bugzilla.redhat.com/show_bug.cgi?id=1578312?

I managed to get to the point where the above bug could be hit, because the bug in this BZ report was not 100% reproducible.

Backported in the internal build: openvswitch-2.9.0-40.el7fdn
Brew: http://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16484769

Please do not crosstag FDN builds.

We need this in the next zstream release of OSP13. Will we pick this from upstream, or is it already backported downstream? Can we set the "Fixed in version" and status to MODIFIED?

Hi, it was fixed in openvswitch-2.9.0-40.el7fdn, and FDP 18.06 is aligned to openvswitch-2.9.0-47.el7fdp, so it'll include the fix.

We'll want to bump OVS (+OVN) for z2.
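The failover behaviour discussed above can be inspected directly on the nodes. A minimal sketch, assuming an OSP13-style containerized deployment; the container name and socket paths are assumptions, and `ovsdb-server/sync-status` is the ovs-appctl command exposed by ovsdb-server's active-backup replication:

    # Pacemaker's view of the ovn-dbs replicas:
    pcs status | grep -A4 'ovn-dbs-bundle'

    # The ovsdb-servers' own view, queried inside the ovn-dbs container
    # (container name and socket paths are assumptions for this deployment):
    docker exec ovn-dbs-bundle-docker-0 \
        ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl ovsdb-server/sync-status
    docker exec ovn-dbs-bundle-docker-0 \
        ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status
    # Seeing "state: active" on more than one node, or a node flapping between
    # active and backup, matches the promotion/demotion loop in this report.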
I reproduced the issue, so I have to re-open it:

    [root@controller-1 ~]# pcs status
    Cluster name: tripleo_cluster
    Stack: corosync
    Current DC: controller-2 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
    Last updated: Tue Dec 11 12:55:36 2018
    Last change: Mon Dec 10 13:56:29 2018 by root via cibadmin on controller-0

    15 nodes configured
    46 resources configured

    Online: [ controller-0 controller-1 controller-2 ]
    GuestOnline: [ galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

    Full list of resources:

     Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
       rabbitmq-bundle-0  (ocf::heartbeat:rabbitmq-cluster):  Stopped
       rabbitmq-bundle-1  (ocf::heartbeat:rabbitmq-cluster):  Started controller-1
       rabbitmq-bundle-2  (ocf::heartbeat:rabbitmq-cluster):  Started controller-2
     Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
       galera-bundle-0  (ocf::heartbeat:galera):  Stopped
       galera-bundle-1  (ocf::heartbeat:galera):  Master controller-1
       galera-bundle-2  (ocf::heartbeat:galera):  Master controller-2
     Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
       redis-bundle-0  (ocf::heartbeat:redis):  Stopped
       redis-bundle-1  (ocf::heartbeat:redis):  Slave controller-1
       redis-bundle-2  (ocf::heartbeat:redis):  Slave controller-2
     ip-192.168.24.10  (ocf::heartbeat:IPaddr2):  Started controller-2
     ip-10.0.0.114  (ocf::heartbeat:IPaddr2):  Started controller-1
     ip-172.17.1.29  (ocf::heartbeat:IPaddr2):  Started controller-2
     ip-172.17.1.14  (ocf::heartbeat:IPaddr2):  Started controller-1
     ip-172.17.3.24  (ocf::heartbeat:IPaddr2):  Started controller-1
     ip-172.17.4.13  (ocf::heartbeat:IPaddr2):  Started controller-2
     Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
       haproxy-bundle-docker-0  (ocf::heartbeat:docker):  Stopped
       haproxy-bundle-docker-1  (ocf::heartbeat:docker):  Started controller-1
       haproxy-bundle-docker-2  (ocf::heartbeat:docker):  Started controller-2
     Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp13/openstack-ovn-northd:pcmklatest]
       ovn-dbs-bundle-0  (ocf::ovn:ovndb-servers):  Stopped
       ovn-dbs-bundle-1  (ocf::ovn:ovndb-servers):  Slave controller-1
       ovn-dbs-bundle-2  (ocf::ovn:ovndb-servers):  Slave controller-2
     Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
       openstack-cinder-volume-docker-0  (ocf::heartbeat:docker):  Started controller-2

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

Forced reset of the master node with virt-manager. Puddle: OpenStack/13.0-RHEL-7/2018-12-07.1

    [root@controller-1 ~]# rpm -qa | grep -i openvs
    openvswitch-ovn-central-2.9.0-81.el7fdp.x86_64
    openvswitch-2.9.0-81.el7fdp.x86_64
    openvswitch-ovn-host-2.9.0-81.el7fdp.x86_64
    openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
    openvswitch-ovn-common-2.9.0-81.el7fdp.x86_64
    openstack-neutron-openvswitch-12.0.5-2.el7ost.noarch
    python-openvswitch-2.9.0-81.el7fdp.x86_64

(In reply to Eran Kuris from comment #25)
> I reproduced the issue, so I have to re-open it:

That reminds me it could be another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update the pacemaker package on the controller nodes to at least version pacemaker-1.1.19-8.el7_6.2.
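For reference, the promotion/demotion loop reported above can be observed live while resetting the master. A sketch only, assuming a virtualized environment where the controllers are libvirt guests; the domain name is hypothetical, and the sysrq reset is one of the reset types used in this report:

    # 1. Identify the current ovn-dbs master as seen by pacemaker.
    pcs status | grep 'ovndb-servers.*Master'

    # 2. Forcefully reset that node, e.g. from the hypervisor
    #    (domain name is hypothetical):
    #        virsh destroy overcloud-controller-0
    #    or, equivalently, from inside the node itself:
    #        echo o > /proc/sysrq-trigger

    # 3. On a surviving controller, watch the remaining replicas. On affected
    #    builds they cycle between Slave and Master instead of settling:
    watch -n2 "pcs status | grep ovndb-servers"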
(In reply to Marian Krcmarik from comment #26)
> (In reply to Eran Kuris from comment #25)
> > I reproduced the issue, so I have to re-open it:
>
> That reminds me it could be another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1579025. If not, try to update the pacemaker package on the controller nodes to at least version pacemaker-1.1.19-8.el7_6.2.

Which bug does it remind you of? You attached this bug's own ID.

(In reply to Eran Kuris from comment #27)
> Which bug does it remind you of? You attached this bug's own ID.

Sorry, I meant https://bugzilla.redhat.com/show_bug.cgi?id=1654602.

You might want to check whether you hit any SELinux issues when you reproduced this issue.

According to our records, this should be resolved by openvswitch-2.9.0-83.el7fdp.1. This build is available now.

Verified on puddle 13.0-RHEL-7/2019-03-01.1 with openvswitch-2.9.0-97.el7fdp.x86_64.

Setup: environment with 3 controller and 2 compute nodes, fencing enabled.

Verified that after resetting the ovndb-server master node, another node is promoted to master. Tested with different reset types: "echo o >/proc/sysrq-trigger", force stop/reset via virt-manager, "pcs cluster stop controller-X", "sudo docker stop ovn-dbs-bundle-docker-0", and "shutdown -h now". Checked that it is possible to create/delete a VM after each reset and that the VM has access to the network.

    [heat-admin@controller-0 ~]$ rpm -qa | grep openvswitch
    openvswitch-2.9.0-97.el7fdp.x86_64
    openvswitch-ovn-central-2.9.0-97.el7fdp.x86_64
    openvswitch-ovn-common-2.9.0-97.el7fdp.x86_64
    openstack-neutron-openvswitch-12.0.5-4.el7ost.noarch
    python-openvswitch-2.9.0-97.el7fdp.x86_64
    openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch
    openvswitch-ovn-host-2.9.0-97.el7fdp.x86_64

    [heat-admin@controller-0 ~]$ rpm -qa | grep pacemaker
    pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
    pacemaker-libs-1.1.19-8.el7_6.4.x86_64
    pacemaker-remote-1.1.19-8.el7_6.4.x86_64
    pacemaker-cli-1.1.19-8.el7_6.4.x86_64
    ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
    pacemaker-1.1.19-8.el7_6.4.x86_64
    puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
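The per-reset verification flow above can be driven with a small helper. A sketch only, assuming a host where both `pcs` and the `openstack` CLI are usable; the overcloudrc path and the image, flavor, and network names are all assumptions about the test environment:

    #!/bin/bash
    # Hedged sketch of the verification flow above; run once after each
    # manual reset of the current master (sysrq, virt-manager, pcs cluster
    # stop, docker stop, shutdown -h now).
    set -e
    source ~/overcloudrc

    # Give pacemaker time to fail over, then confirm a replica was promoted.
    sleep 60
    pcs status | grep 'ovndb-servers.*Master' || echo "no master promoted!"

    # Exercise the control plane end to end: Create/Delete must succeed
    # (network-access checks from inside the VM are omitted here).
    openstack server create --image cirros --flavor m1.tiny \
        --network private test-vm --wait
    openstack server delete test-vm --wait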