Bug 1741057 - /usr/share/openvswitch/scripts/ovn-ctl is fragile in the presence of empty .pid files
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.11
Version: RHEL 7.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Numan Siddique
QA Contact: haidong li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-14 07:42 UTC by Michele Baldessari
Modified: 2020-01-14 20:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-01 07:21:33 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2943 0 None None None 2019-10-01 07:21:53 UTC

Description Michele Baldessari 2019-08-14 07:42:20 UTC
Description of problem:
Lmiccini and I have seen the following issue on OSP15 when using ovn2.11-2.11.0-19.el8fdp.x86_64. After a number of destructive tests on our control plane, one controller would consistently fail to start the ovn-dbs resource in Pacemaker.

So we would see the following:
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped controller-0                     
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Master controller-2                      
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-1                       

Inside the failing container we could see the resource agent calling ovn-ctl in a loop:
root       46671  3.6  0.0  12688  3612 ?        S    07:08   0:04  \_ /bin/bash /usr/lib/ocf/resource.d/ovn/ovndb-servers start                            
root       72599  0.0  0.0  12820  3892 ?        S    07:10   0:00      \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb                      
root       72609  0.0  0.0  12820  2836 ?        S    07:10   0:00          \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb                  
root       72610  0.0  0.0  39604  1300 ?        R    07:10   0:00              \_ ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status 
root       72611  0.0  0.0  20092   668 ?        R    07:10   0:00              \_ awk {if(NR==1) print $2}

Every attempt to restart the resource failed. The cause was that the container had pid files that existed but were empty (likely left behind by the destructive testing):
ls -l /var/run/openvswitch/*pid
-rw-r--r--. 1 root root       0 Aug  8 23:01 ovnnb_db.pid       
-rw-r--r--. 1 root root       0 Aug  8 23:01 ovnsb_db.pid       
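
A quick way to spot this condition is to list the zero-length pid files in the run directory. The following is only a minimal sketch: the helper name list_empty_pidfiles is hypothetical and is not part of ovn-ctl or ovs-lib, and the default directory is simply the one shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ovn-ctl or ovs-lib): print every
# zero-length *.pid file found directly inside the given directory.
list_empty_pidfiles () {
    dir=$1
    for f in "$dir"/*.pid; do
        # [ -f ] also skips the unexpanded pattern when nothing matches;
        # [ ! -s ] is true when the file has a size of zero.
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Directory taken from the listing above; pass another one as $1 if needed.
list_empty_pidfiles "${1:-/var/run/openvswitch}"
```

In the situation described here, this would be expected to print both ovnnb_db.pid and ovnsb_db.pid.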


The problem seems to be that the code in ovs-lib and ovn-ctl is not robust when the .pid file is empty. With an empty file, $pid is the empty string, so pid_exists() ends up testing /proc/ itself, which always exists. The resource agent therefore thinks a daemon is running when in reality nothing is:
pid_exists () {                                                             
    # This is better than "kill -0" because it doesn't require permission to
    # send a signal (so daemon_status in particular works as non-root).     
    test -d /proc/"$1"                                                      
}                                                                           
                                                                            
pidfile_is_running () {                                                     
    pidfile=$1                                                              
    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"         
} >/dev/null 2>&1                                                           
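
The behaviour can be reproduced outside the container with a scratch file. This is a sketch under a few assumptions: it copies the two helpers quoted above unchanged, uses mktemp (widely available but not POSIX) to create a stand-in for the real pid file, and needs a Linux system with /proc mounted. The names demo_pidfile and demo_result are made up for the illustration.

```shell
#!/bin/sh
# Copies of the two helpers quoted above, unchanged.
pid_exists () {
    test -d /proc/"$1"
}

pidfile_is_running () {
    pidfile=$1
    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
} >/dev/null 2>&1

# Hypothetical scratch file standing in for ovnsb_db.pid.
demo_pidfile=$(mktemp)
: > "$demo_pidfile"    # make sure the file is zero bytes long

# With an empty file, pid is "" and pid_exists tests /proc/ itself.
if pidfile_is_running "$demo_pidfile"; then
    demo_result="reported running"
else
    demo_result="reported stopped"
fi
echo "empty pid file: $demo_result"
rm -f "$demo_pidfile"
```

On a system with /proc this should report the empty pid file as running, which matches what the resource agent saw.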

With the following change (or after removing the .pid files by hand) we were able to get the resource to start:
--- /usr/share/openvswitch/scripts/ovn-ctl.orig 2019-08-14 07:39:07.048088357 +0000
+++ /usr/share/openvswitch/scripts/ovn-ctl      2019-08-14 07:39:32.122117803 +0000
@@ -35,7 +35,7 @@
 
 pidfile_is_running () {
     pidfile=$1
-    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
+    test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && pid_exists "$pid"
 } >/dev/null 2>&1
 
 stop_nb_ovsdb() {

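I guessed that '[ -s "$pidfile" ]' might be a bashism and that we would need something more generic; in fact -s is part of the POSIX test utility, so it should be portable to /bin/sh.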
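
For comparison, a stricter check could also reject pid files whose content is not a plain decimal number. The sketch below is not the change in the diff above and not necessarily what was merged upstream; the function name pidfile_is_running_strict is hypothetical. It relies only on POSIX sh features: the -s primary of test (true when the file exists and is not empty), read, and case.

```shell
#!/bin/sh
# Same helper as in ovs-lib, quoted earlier in this report.
pid_exists () {
    test -d /proc/"$1"
}

# Hypothetical stricter variant, for illustration only.
pidfile_is_running_strict () {
    pidfile=$1
    # -s is true only when the file exists and has a size greater than zero.
    test -s "$pidfile" || return 1
    # Read the first line; tolerate a missing trailing newline.
    IFS= read -r pid < "$pidfile" || test -n "$pid" || return 1
    # Reject an empty string or anything containing a non-digit.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    pid_exists "$pid"
} >/dev/null 2>&1
```

With this variant an empty or corrupted pid file is treated as "not running", so the start path is not blocked by leftovers from an unclean shutdown.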

Comment 1 Michele Baldessari 2019-08-14 08:40:27 UTC
https://patchwork.ozlabs.org/patch/1146846/

Comment 3 Michele Baldessari 2019-08-14 15:52:18 UTC
http://patchwork.ozlabs.org/patch/1147111/ is the updated change after initial feedback

Comment 12 errata-xmlrpc 2019-10-01 07:21:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2943

