Created attachment 1491282 [details]
logs

Description of problem:
When the ovn-dbs-bundle-docker-0 container (ovn-northd) is restarted on the master node (controller-0), Pacemaker promotes another controller (for example controller-1) to master. In that case controller-0 should return to "Slave" status, but instead ovn-dbs-bundle-0 ends up in "Stopped" status.

Version-Release number of selected component (if applicable):
OpenStack/14.0-RHEL-7/2018-10-02.2
puppet-ovn-13.3.1-0.20180907024738.b9a1e0b.el7ost.noarch
rhosp-openvswitch-ovn-common-2.10-0.1.el7ost.noarch
openvswitch2.10-ovn-central-2.10.0-0.20180810git58a7ce6.el7fdp.x86_64
rhosp-openvswitch-ovn-central-2.10-0.1.el7ost.noarch
openvswitch2.10-ovn-common-2.10.0-0.20180810git58a7ce6.el7fdp.x86_64

(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripl
python-tripleoclient-heat-installer-10.5.1-0.20180906012842.el7ost.noarch
ansible-tripleo-ipsec-9.0.1-0.20180827143021.d2b9234.el7ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180915144057.cb535e9.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.0rc1.el7ost.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20180906013709.daf9069.el7ost.noarch
openstack-tripleo-validations-9.3.1-0.20180831205306.el7ost.noarch
openstack-tripleo-common-containers-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
openstack-tripleo-image-elements-9.0.0-0.20180831210308.2dc678a.el7ost.noarch
openstack-tripleo-common-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
python-tripleoclient-10.5.1-0.20180906012842.el7ost.noarch
python2-tripleo-common-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch
puppet-tripleo-9.3.1-0.20180831202649.8ec6c86.el7ost.noarch

[root@controller-0 ~]# docker ps | grep ovn
4f716dacbce1 192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmklatest "/bin/bash /usr/lo..." 27 minutes ago Up 27 minutes ovn-dbs-bundle-docker-0
d2d80f7c2730 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2018-10-01.1 "kolla_start" 3 days ago Up 3 days ovn_controller
2f6351ac2b48 192.168.24.1:8787/rhosp14/openstack-neutron-server-ovn:2018-10-01.1 "kolla_start" 3 days ago Up 3 days (healthy) neutron_api
6ffc3f666301 192.168.24.1:8787/rhosp14/openstack-nova-novncproxy:2018-10-01.1

How reproducible:
100%

Steps to Reproduce:
1. pcs status
2. docker ps | grep ovn
3. docker restart ovn-dbs-bundle-docker-0
4. pcs status
5. vi /var/log/containers/openvswitch/ovn-controller.log
6. vi /var/log/containers/openvswitch/ovn-northd.log.1
7. vi /var/log/containers/openvswitch/ovsdb-server-nb.log
8. vi /var/log/containers/openvswitch/ovsdb-server-sb.log
9. vi /var/log/containers/neutron/server.log
10. pcs status

Actual results:
ovn-dbs-bundle-0 is in Stopped status.

Expected results:
ovn-dbs-bundle-0 is in Slave status.

Additional info:
Logs & sos-report attached.
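For quick verification, the reproduction boils down to the following commands, run on the controller that currently hosts the ovn-dbs master (resource and container names are as seen in this environment and may differ elsewhere):

    # Note which ovn-dbs-bundle replica is currently Master.
    pcs status | grep -A 3 ovn-dbs-bundle

    # Restart the ovn-dbs bundle container on that controller.
    docker restart ovn-dbs-bundle-docker-0

    # Give Pacemaker time to promote another controller, then re-check.
    sleep 30
    pcs status | grep -A 3 ovn-dbs-bundle
    # Expected: ovn-dbs-bundle-0 returns as Slave.
    # Actual with this bug: ovn-dbs-bundle-0 stays Stopped.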
With "docker restart", since it doesn't stop the ovsdb-server's gracefully, the ovsdb-server pid files remain. And when the the service is started, ovn-ctl returns the status as "not-running" since it calls "pidfile_is_running" for the old file name. It requires a fix either in ovn-ctl to delete stale pid files before starting the services or delete the pidfiles in the actual binary before creating new one when --pidfile option is specified. I will propose a fix upstream.
Submitted the patch to fix this - https://patchwork.ozlabs.org/patch/981066/
Looks like this bug is also present in ovs 2.9, but the fix will look different. Is that true? I see in start_ovsdb__()

    local pid
    ...
    eval pid=\$DB_${DB}_PID
    ...
    if pidfile_is_running $pid; then
    ....
    fi

In that case, would a possible fix be:

    - test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
    + test -e "$pidfile" && ispid=`cat "$pidfile"` && pid_exists "$ispid"

?
(In reply to Aaron Conole from comment #5)
> Looks like this bug is also present in ovs 2.9, but the fix will look
> different. Is that true? I see in start_ovsdb__()
>
>     local pid
>     ...
>     eval pid=\$DB_${DB}_PID
>     ...
>     if pidfile_is_running $pid; then
>     ....
>     fi
>
> In that case, would a possible fix be:
>
>     - test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
>     + test -e "$pidfile" && ispid=`cat "$pidfile"` && pid_exists "$ispid"
>
> ?

I think the same fix will work there too. The issue is that the function 'pidfile_is_running' overwrites the caller's 'pid' variable. In the u/s fix I renamed the variable "pid" to "db_pid_file" in start_ovsdb__(). The fix is already committed on the u/s 2.9 branch - https://github.com/openvswitch/ovs/commit/4c7a432154a7b379cd97d26a51caaa155f35b449

I will backport it to 2.9 d/s.
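To make the scoping issue concrete, here is a minimal standalone sketch (not the actual ovs-lib/ovn-ctl code; the function bodies are simplified and the pid file path is made up). Because pidfile_is_running assigns to a variable named 'pid' without declaring it local, the assignment lands on the caller's dynamically scoped local 'pid', which holds the pid file path:

    #!/bin/sh
    # Simplified illustration of the variable clash; not real ovn-ctl code.

    pid_exists () {
        kill -0 "$1" 2>/dev/null
    }

    pidfile_is_running () {
        pidfile=$1
        # 'pid' is not local here, so this assignment overwrites the
        # caller's 'pid' with the number read from the pid file.
        test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
    } >/dev/null 2>&1

    start_db () {
        local pid
        pid=/tmp/example_db.pid      # caller uses 'pid' for the pid FILE path
        echo 12345 > "$pid"          # simulate a stale pid file left behind
        pidfile_is_running "$pid"
        echo "pid is now: $pid"      # prints "12345", no longer the file path
    }

    start_db

Renaming the variable on either side breaks the collision, which is what the upstream commit does in start_ovsdb__() and what the 'ispid' change above would do inside pidfile_is_running.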
The issue was fixed on OpenStack/14.0-RHEL-7/2018-11-22.2/

[root@controller-0 ~]# rpm -qa | grep openvsw
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch2.10-2.10.0-28.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045