
Bug 1741057

Summary: /usr/share/openvswitch/scripts/ovn-ctl is fragile in the presence of empty .pid files
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn2.11
Version: RHEL 7.7
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Michele Baldessari <michele>
Assignee: Numan Siddique <nusiddiq>
QA Contact: haidong li <haili>
Docs Contact:
CC: ctrautma, fleitner, haili, jishi, kfida, nusiddiq
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-10-01 07:21:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michele Baldessari 2019-08-14 07:42:20 UTC
Description of problem:
Lmiccini and I have seen the following issue on OSP15 when using ovn2.11-2.11.0-19.el8fdp.x86_64. After a number of destructive tests on our control plane, one controller would constantly fail to start the ovn-dbs resource in pacemaker.

So we would see the following:
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped controller-0                     
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Master controller-2                      
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-1                       

Inside the failing container we could observe the resource agent looping, repeatedly calling ovn-ctl:
root       46671  3.6  0.0  12688  3612 ?        S    07:08   0:04  \_ /bin/bash /usr/lib/ocf/resource.d/ovn/ovndb-servers start                            
root       72599  0.0  0.0  12820  3892 ?        S    07:10   0:00      \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb                      
root       72609  0.0  0.0  12820  2836 ?        S    07:10   0:00          \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb                  
root       72610  0.0  0.0  39604  1300 ?        R    07:10   0:00              \_ ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status 
root       72611  0.0  0.0  20092   668 ?        R    07:10   0:00              \_ awk {if(NR==1) print $2}

Attempting to restart the resource would fail every time. The cause was that the container had existing but empty .pid files (likely left behind by the destructive testing):
ls -l /var/run/openvswitch/*pid
-rw-r--r--. 1 root root       0 Aug  8 23:01 ovnnb_db.pid       
-rw-r--r--. 1 root root       0 Aug  8 23:01 ovnsb_db.pid       


The problem seems to be that the code in ovs-lib and ovn-ctl is not robust when the .pid file is empty: with an empty $pid, pid_exists() runs 'test -d /proc/', which always succeeds because /proc itself is a directory, so the resource agent believes something is running when in reality nothing is:
pid_exists () {                                                             
    # This is better than "kill -0" because it doesn't require permission to
    # send a signal (so daemon_status in particular works as non-root).     
    test -d /proc/"$1"                                                      
}                                                                           
                                                                            
pidfile_is_running () {                                                     
    pidfile=$1                                                              
    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"         
} >/dev/null 2>&1                                                           
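
The false positive is easy to confirm by hand (a minimal sketch, not part of the original logs; the path /tmp/empty.pid is purely illustrative):

touch /tmp/empty.pid
pid=`cat /tmp/empty.pid`               # $pid is now the empty string
test -d /proc/"$pid" && echo "running" # expands to 'test -d /proc/', and /proc exists
running

So pidfile_is_running reports success for any empty .pid file as long as /proc is mounted.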

With the following change (or after removing the .pid files by hand) we were able to get the resource to start:
--- /usr/share/openvswitch/scripts/ovn-ctl.orig 2019-08-14 07:39:07.048088357 +0000
+++ /usr/share/openvswitch/scripts/ovn-ctl      2019-08-14 07:39:32.122117803 +0000
@@ -35,7 +35,7 @@
 
 pidfile_is_running () {
     pidfile=$1
-    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
+    test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && pid_exists "$pid"
 } >/dev/null 2>&1
 
 stop_nb_ovsdb() {

I guess '[ -s "$pidfile" ]' may be a bashism, so we might need something more portable.
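
A portable alternative sketch (assuming only POSIX sh; this is not necessarily what the upstream patch ended up doing) is to require a non-empty pid before probing /proc:

pidfile_is_running () {
    pidfile=$1
    # Treat a missing or empty pid file as "not running" instead of
    # letting an empty $pid fall through to 'test -d /proc/'.
    test -e "$pidfile" && pid=`cat "$pidfile"` && test -n "$pid" && pid_exists "$pid"
} >/dev/null 2>&1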

Comment 1 Michele Baldessari 2019-08-14 08:40:27 UTC
https://patchwork.ozlabs.org/patch/1146846/

Comment 3 Michele Baldessari 2019-08-14 15:52:18 UTC
http://patchwork.ozlabs.org/patch/1147111/ is the updated change after the initial feedback.

Comment 12 errata-xmlrpc 2019-10-01 07:21:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2943