Bug 1850778

Summary:

azure-lb: Resource fails intermittently due to nc output redirection to pidfile [RHEL 8]

Product:

Red Hat Enterprise Linux 8

Reporter:

Reid Wahl <nwahl>

Component:

resource-agents

Assignee:

Oyvind Albrigtsen <oalbrigt>

Status:

CLOSED ERRATA

QA Contact:

Brandon Perkins <bperkins>

Severity:

high

Docs Contact:

Priority:

high

Version:

8.2

CC:

agk, bperkins, cfeist, cluster-maint, fdinitto, toneata

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

resource-agents-4.1.1-68.el8

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1850779 1875792 1875793 1891081 (view as bug list)

Environment:

Last Closed:

2020-11-04 02:19:46 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1850779, 1875792, 1875793

Attachments:

Description	Flags
Example binary pid file	none

Description Reid Wahl 2020-06-24 20:59:49 UTC

Created attachment 1698709 [details]
Example binary pid file

Description of problem:

Resources using the azure-lb resource agent fail intermittently when binary data is received on the listened-to port.

The resource start operation includes the following code:
~~~
    cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port"
    if ! lb_monitor; then
        ocf_log debug "Starting $process: $cmd"
        # Execute the command as created above
        eval "$cmd & echo \$!" > $pidfile
~~~


The eval line runs $cmd in the background and echoes any output to $pidfile. The problem is that the output continues being echoed to $pidfile indefinitely after the start operation completes. `nc -l` prints any output it receives. For example:
~~~
[root@fastvm-rhel-7-6-21 ocf]# nc -l 5000
... prints nothing until input from node 2 ...
asdf

[root@fastvm-rhel-7-6-22 ~]# nc fastvm-rhel-7-6-21 5000
asdf
~~~


This doesn't create an immediate issue, because getpid() just greps for an all-numeric line in $pidfile and uses that as the PID during monitor operations.
~~~
getpid() {
        grep -o '[0-9]*' $1
}

lb_monitor() {
    if test -f "$pidfile"; then
        if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid; then
~~~


However, this becomes a major problem when binary data is received on $OCF_RESKEY_port and written to $pidfile. The grep command is run against the binary file, and the result is passed to `kill -s 0`. For example:
~~~
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: Binary: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: file: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: matches: arguments must be process or job IDs ]
~~~


It's unclear to me why this eval trick was used at all, and why `$cmd` and `echo $!` were chained together for redirection to $pidfile. It seems to me from some basic testing that "Spawn the background command and then on a separate line echo the resulting pid to $pidfile and exit the script" would work fine.
~~~
[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/test2.sh
#!/bin/bash
cmd="/bin/nc -l -k 5000"
pidfile=/tmp/nc.pid
$cmd &
echo $! > $pidfile
exit 0

[root@fastvm-rhel-7-6-21 ocf]# bash /tmp/test2.sh

[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/nc.pid
32602

[root@fastvm-rhel-7-6-21 ocf]# ps -ef | grep /bin/nc
root     32602     1  0 13:28 pts/0    00:00:00 /bin/nc -l -k 5000
~~~


This issue causes a major impact to SAP-on-Azure customers who encounter it.

I am attaching an example of a pid file that contains binary data and causes the azure-lb monitor operation to fail.

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-44.el8_2.3

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Place node 2 in standby for simplicity.

2. Create an azure-lb resource and check that it's running properly.

    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Started fastvm-rhel-8-0-23

    [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
    1902

    [root@fastvm-rhel-8-0-23 ~]# ps -ef | grep /bin/nc
    root        1902       1  0 13:45 ?        00:00:00 /usr/bin/nc -l -k 61000


3. Create a FIFO on node 2.

    [root@fastvm-rhel-8-0-24 ~]# tmpd=`mktemp -d`
    [root@fastvm-rhel-8-0-24 ~]# tmpf="$tmpd"/fifo
    [root@fastvm-rhel-8-0-24 ~]# mkfifo "$tmpf"
    [root@fastvm-rhel-8-0-24 ~]# echo $tmpf
    /tmp/tmp.aqaKgvWeYF/fifo


4. In another node 2 session, open an nc connection from node 2 to node 1.

    [root@fastvm-rhel-8-0-24 ~]# nc fastvm-rhel-8-0-23 61000 < /tmp/tmp.aqaKgvWeYF/fifo


5. Going back to the original node 2 session, create a file descriptor pointing to the FIFO.

    [root@fastvm-rhel-8-0-24 ~]# exec 3> "$tmpf"

6. Use the FIFO to send some text data to node 1's nc listener. Note that the output is echoed to the pidfile on node 1, and that the resource stays running.

    [root@fastvm-rhel-8-0-24 ~]# echo abcd >&3

    [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
    1902
    abcd
    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Started fastvm-rhel-8-0-23

7. Use the FIFO to send some binary data to node 1's nc listener.

    [root@fastvm-rhel-8-0-24 ~]# cat /usr/bin/grep >&3

    [root@fastvm-rhel-8-0-23 ~]# grep 1902 /var/run/lb.pid 
    Binary file /var/run/lb.pid matches

    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Stopped

-----

Actual results:

Receiving binary data over the azure-lb listener port causes the grep on the pidfile to print "Binary file /var/run/$OCF_RESOURCE_INSTANCE.pid matches", which in turn causes the monitor operation to fail (due to invalid arguments to the `kill -s 0` command).

-----

Expected results:

Receiving binary data over the azure-lb listener port has no effect on the status of the resource, provided that the nc process keeps running.

Comment 1 Reid Wahl 2020-06-25 22:50:28 UTC

I forgot to mention a pretty important point. When this happens, the start fails afterward. That's because lb_start()'s call to lb_monitor() returns failure (for the same reason the monitor failed in the first place), and so lb_start() spawns a new `nc` process. But the old `nc` process is still running and using the configured port.

~~~
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]:  notice: Result of stop operation for nc_PF2_02 on ADZW2RHNAF400: 0 (ok)
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]:  notice: nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]:  notice: nc_PF2_02_start_0:777:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 91: kill: (818) - No such process ]
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]:  notice: Result of start operation for nc_PF2_02 on ADZW2RHNAF400: 1 (unknown error)
~~~

If the issue has already happened on one node and then it happens on another node before the failure is manually cleaned up, the resource won't recovery automatically (due to start failures on each node).

Comment 2 Reid Wahl 2020-07-07 09:34:15 UTC

https://github.com/ClusterLabs/resource-agents/pull/1528

Comment 12 errata-xmlrpc 2020-11-04 02:19:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4605