Description of problem:

Lmiccini and I have seen the following issue on OSP15 when using ovn2.11-2.11.0-19.el8fdp.x86_64. After a number of destructive tests on our control plane, one controller would constantly fail to start the ovn-dbs resource in pacemaker. We would see the following:

podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
  ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped controller-0
  ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master controller-2
  ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1

Inside the failing container we could observe the resource agent looping on calls to ovn-ctl:

root 46671 3.6 0.0 12688 3612 ? S 07:08 0:04 \_ /bin/bash /usr/lib/ocf/resource.d/ovn/ovndb-servers start
root 72599 0.0 0.0 12820 3892 ? S 07:10 0:00   \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb
root 72609 0.0 0.0 12820 2836 ? S 07:10 0:00     \_ /bin/sh /usr/share/openvswitch/scripts/ovn-ctl status_ovnsb
root 72610 0.0 0.0 39604 1300 ? R 07:10 0:00       \_ ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status
root 72611 0.0 0.0 20092  668 ? R 07:10 0:00       \_ awk {if(NR==1) print $2}

Attempting to restart the resource would constantly fail. The cause was that the container had pid files that existed but were empty (likely due to some destructive testing):

ls -l /var/run/openvswitch/*pid
-rw-r--r--. 1 root root 0 Aug 8 23:01 ovnnb_db.pid
-rw-r--r--. 1 root root 0 Aug 8 23:01 ovnsb_db.pid

The problem seems to be that the code in ovs-lib and ovn-ctl is not robust when the .pid file is empty. pid_exists() always succeeds when $pid is empty, because test -d /proc/"" ends up checking /proc/ itself, so the resource agent thinks something is running when in reality nothing is:

pid_exists () {
    # This is better than "kill -0" because it doesn't require permission to
    # send a signal (so daemon_status in particular works as non-root).
    test -d /proc/"$1"
}

pidfile_is_running () {
    pidfile=$1
    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
} >/dev/null 2>&1

With the following change (or when we removed the .pid files by hand) we were able to get the resource to start:

--- /usr/share/openvswitch/scripts/ovn-ctl.orig 2019-08-14 07:39:07.048088357 +0000
+++ /usr/share/openvswitch/scripts/ovn-ctl 2019-08-14 07:39:32.122117803 +0000
@@ -35,7 +35,7 @@
 pidfile_is_running () {
     pidfile=$1
-    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
+    test -e "$pidfile" && [ -s "$pidfile" ] && pid=`cat "$pidfile"` && pid_exists "$pid"
 } >/dev/null 2>&1

 stop_nb_ovsdb() {

I guess '[ -s "$pidfile" ]' is a bashism and we need something more generic.
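
For anyone wanting to reproduce the failure mode outside the container, here is a minimal sketch. It copies pid_exists and pidfile_is_running as quoted in this report and adds a variant that rejects an empty pid. The name pidfile_is_running_strict is hypothetical and is not the upstream fix (see the patchwork links for that). It checks the pid with test -n rather than the file with -s; both operators are part of POSIX test, so neither should require bash. The sketch assumes mktemp is available and that /proc is mounted, as on Linux.

```shell
#!/bin/sh
# Sketch only. pid_exists and pidfile_is_running mirror the helpers quoted
# in this report; pidfile_is_running_strict is a hypothetical illustration.

pid_exists () {
    test -d /proc/"$1"
}

# Original logic: an empty pidfile gives pid="", so this tests /proc/ itself.
pidfile_is_running () {
    pidfile=$1
    test -e "$pidfile" && pid=`cat "$pidfile"` && pid_exists "$pid"
} >/dev/null 2>&1

# Hardened variant: require a non-empty pid before looking in /proc.
pidfile_is_running_strict () {
    pidfile=$1
    test -e "$pidfile" && pid=`cat "$pidfile"` && test -n "$pid" && pid_exists "$pid"
} >/dev/null 2>&1

# mktemp creates an empty file, standing in for the empty ovnsb_db.pid.
tmp=$(mktemp)

if pidfile_is_running "$tmp"; then
    echo "original: running"
else
    echo "original: not running"
fi

if pidfile_is_running_strict "$tmp"; then
    echo "strict: running"
else
    echo "strict: not running"
fi

rm -f "$tmp"
```

On a host with /proc mounted, the original helper should report the empty pidfile as running while the strict variant should not, which matches the behaviour the resource agent was seeing.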
https://patchwork.ozlabs.org/patch/1146846/
http://patchwork.ozlabs.org/patch/1147111/ is the updated change after initial feedback
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2943