Created attachment 1491282 [details]
logs

Description of problem:
When the ovn-dbs-bundle-docker-0 container (ovn-northd) is restarted on the master node (controller-0), Pacemaker promotes another controller (for example controller-1) to master. In that case controller-0 should return to "Slave" status, but instead ovn-dbs-bundle-0 ends up in "Stopped" status.

Version-Release number of selected component (if applicable):
OpenStack/14.0-RHEL-7/2018-10-02.2
puppet-ovn-13.3.1-0.20180907024738.b9a1e0b.el7ost.noarch
rhosp-openvswitch-ovn-common-2.10-0.1.el7ost.noarch
openvswitch2.10-ovn-central-2.10.0-0.20180810git58a7ce6.el7fdp.x86_64
rhosp-openvswitch-ovn-central-2.10-0.1.el7ost.noarch
openvswitch2.10-ovn-common-2.10.0-0.20180810git58a7ce6.el7fdp.x86_64

(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripl
python-tripleoclient-heat-installer-10.5.1-0.20180906012842.el7ost.noarch
ansible-tripleo-ipsec-9.0.1-0.20180827143021.d2b9234.el7ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180915144057.cb535e9.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.0rc1.el7ost.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20180906013709.daf9069.el7ost.noarch
openstack-tripleo-validations-9.3.1-0.20180831205306.el7ost.noarch
openstack-tripleo-common-containers-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
openstack-tripleo-image-elements-9.0.0-0.20180831210308.2dc678a.el7ost.noarch
openstack-tripleo-common-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
python-tripleoclient-10.5.1-0.20180906012842.el7ost.noarch
python2-tripleo-common-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
puppet-tripleo-9.3.1-0.20180831202649.8ec6c86.el7ost.noarch

[root@controller-0 ~]# docker ps | grep ovn
4f716dacbce1 192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmklatest "/bin/bash /usr/lo..." 27 minutes ago Up 27 minutes ovn-dbs-bundle-docker-0
d2d80f7c2730 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2018-10-01.1 "kolla_start" 3 days ago Up 3 days ovn_controller
2f6351ac2b48 192.168.24.1:8787/rhosp14/openstack-neutron-server-ovn:2018-10-01.1 "kolla_start" 3 days ago Up 3 days (healthy) neutron_api
6ffc3f666301 192.168.24.1:8787/rhosp14/openstack-nova-novncproxy:2018-10-01.1

How reproducible:
100%

Steps to Reproduce:
1. pcs status
2. docker ps | grep ovn
3. docker restart ovn-dbs-bundle-docker-0
4. pcs status
5. vi /var/log/containers/openvswitch/ovn-controller.log
6. vi /var/log/containers/openvswitch/ovn-northd.log.1
7. vi /var/log/containers/openvswitch/ovsdb-server-nb.log
8. vi /var/log/containers/openvswitch/ovsdb-server-sb.log
9. vi /var/log/containers/neutron/server.log
10. pcs status

Actual results:
ovn-dbs-bundle-0 is in Stopped status.

Expected results:
ovn-dbs-bundle-0 is in Slave status.

Additional info:
Logs & sos-report attached.
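For quick verification, the reproduction boils down to the following commands, run on the controller that currently hosts the ovn-dbs master (resource and container names are as seen in this environment and may differ elsewhere):

    # Note which ovn-dbs-bundle replica is currently Master.
    pcs status | grep -A 3 ovn-dbs-bundle

    # Restart the ovn-dbs bundle container on that controller.
    docker restart ovn-dbs-bundle-docker-0

    # Give Pacemaker time to promote another controller, then re-check.
    sleep 30
    pcs status | grep -A 3 ovn-dbs-bundle
    # Expected: ovn-dbs-bundle-0 returns as Slave.
    # Actual with this bug: ovn-dbs-bundle-0 stays Stopped.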
With "docker restart", since it doesn't stop the ovsdb-server's gracefully, the ovsdb-server pid files remain. And when the the service is started, ovn-ctl returns the status as "not-running" since it calls "pidfile_is_running" for the old file name. It requires a fix either in ovn-ctl to delete stale pid files before starting the services or delete the pidfiles in the actual binary before creating new one when --pidfile option is specified. I will propose a fix upstream.
Submitted the patch to fix this - https://patchwork.ozlabs.org/patch/981066/
Looks like this bug is also present in ovs 2.9, but the fix will look different. Is that true? I see in start_ovsdb__()

    local pid
    ...
    eval pid=\$DB_${DB}_PID
    ...
    if pidfile_is_running $pid; then
    ....
    fi

In that case, would a possible fix be:

    - test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
    + test -e "$pidfile" && ispid=`cat "$pidfile"` && pid_exists "$ispid"

?
(In reply to Aaron Conole from comment #5)
> Looks like this bug is also present in ovs 2.9, but the fix will look
> different. Is that true? I see in start_ovsdb__()
>
>     local pid
>     ...
>     eval pid=\$DB_${DB}_PID
>     ...
>     if pidfile_is_running $pid; then
>     ....
>     fi
>
> In that case, would a possible fix be:
>
>     - test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
>     + test -e "$pidfile" && ispid=`cat "$pidfile"` && pid_exists "$ispid"
>
> ?

I think the same fix will work there too. The issue is that the function 'pidfile_is_running' overwrites the caller's 'pid' variable. In the u/s fix I renamed the variable "pid" to "db_pid_file" in start_ovsdb__(). The fix is already committed on the u/s 2.9 branch - https://github.com/openvswitch/ovs/commit/4c7a432154a7b379cd97d26a51caaa155f35b449

I will backport it to 2.9 d/s.
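To make the scoping issue concrete, here is a minimal standalone sketch (not the actual ovs-lib/ovn-ctl code; the function bodies are simplified and the pid file path is made up). Because pidfile_is_running assigns to a variable named 'pid' without declaring it local, the assignment lands on the caller's dynamically scoped local 'pid', which holds the pid file path:

    #!/bin/sh
    # Simplified illustration of the variable clash; not real ovn-ctl code.

    pid_exists () {
        kill -0 "$1" 2>/dev/null
    }

    pidfile_is_running () {
        pidfile=$1
        # 'pid' is not local here, so this assignment overwrites the
        # caller's 'pid' with the number read from the pid file.
        test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
    } >/dev/null 2>&1

    start_db () {
        local pid
        pid=/tmp/example_db.pid      # caller uses 'pid' for the pid FILE path
        echo 12345 > "$pid"          # simulate a stale pid file left behind
        pidfile_is_running "$pid"
        echo "pid is now: $pid"      # prints "12345", no longer the file path
    }

    start_db

Renaming the variable on either side breaks the collision, which is what the upstream commit does in start_ovsdb__() and what the 'ispid' change above would do inside pidfile_is_running.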
The issue was fixed on OpenStack/14.0-RHEL-7/2018-11-22.2/

[root@controller-0 ~]# rpm -qa | grep openvsw
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch2.10-2.10.0-28.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045