Bug 1850779 - azure-lb: Resource fails intermittently due to nc output redirection to pidfile [rhel-7.9.z]
Summary: azure-lb: Resource fails intermittently due to nc output redirection to pidfile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.8
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Oyvind Albrigtsen
QA Contact: Brandon Perkins
URL:
Whiteboard:
Depends On: 1850778
Blocks: 1876964 1876965 1876966
 
Reported: 2020-06-24 21:06 UTC by Reid Wahl
Modified: 2023-12-15 18:16 UTC (History)
10 users (show)

Fixed In Version: resource-agents-4.1.1-61.el7_9.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1850778
: 1876964 1876965 1876966 (view as bug list)
Environment:
Last Closed: 2020-11-10 12:56:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Example binary pid file (13.24 KB, text/plain)
2020-06-24 21:18 UTC, Reid Wahl
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github ClusterLabs resource-agents pull 1528 0 None closed azure-lb: Don't redirect nc listener output to pidfile 2021-02-04 00:57:11 UTC
Red Hat Knowledge Base (Solution) 5185011 0 None None None 2020-06-25 23:07:35 UTC

Description Reid Wahl 2020-06-24 21:06:05 UTC
+++ This bug was initially created as a clone of Bug #1850778 +++

Description of problem:

Resources using the azure-lb resource agent fail intermittently when binary data is received on the listened-to port.

The resource start operation includes the following code:
~~~
    cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port"
    if ! lb_monitor; then
        ocf_log debug "Starting $process: $cmd"
        # Execute the command as created above
        eval "$cmd & echo \$!" > $pidfile
~~~


The eval line runs $cmd in the background and redirects the output of the whole eval'd command, including the echoed PID, to $pidfile. The problem is that the backgrounded nc listener inherits that redirection, so anything it prints continues to be written to $pidfile indefinitely after the start operation completes. `nc -l` prints any data it receives. For example:
~~~
[root@fastvm-rhel-7-6-21 ocf]# nc -l 5000
... prints nothing until input from node 2 ...
asdf

[root@fastvm-rhel-7-6-22 ~]# nc fastvm-rhel-7-6-21 5000
asdf
~~~
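The redirection behavior can be demonstrated without nc at all. In the following sketch (the echoed string and temp file are illustrative), the redirection on the eval applies to the backgrounded command as well, so both the echoed PID and the background job's output land in the same file:
~~~
# Sketch: the redirection on eval is inherited by the background job,
# so its output is written to the same file as the echoed PID.
out=$(mktemp)
eval "echo from-background & echo \$!" > "$out"
wait                      # let the background echo finish

# "$out" now contains both the PID line and "from-background",
# mirroring how nc's received data ends up in the pidfile.
cat "$out"
~~~
This mirrors the agent's `eval "$cmd & echo \$!" > $pidfile` line: the pidfile stays open as the listener's stdout for as long as nc runs.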


This doesn't cause an immediate failure, because getpid() simply greps for digit sequences in $pidfile and uses the result as the PID during monitor operations.
~~~
getpid() {
        grep -o '[0-9]*' $1
}

lb_monitor() {
    if test -f "$pidfile"; then
        if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid; then
~~~


However, this becomes a major problem when binary data is received on $OCF_RESKEY_port and written to $pidfile. The grep command is run against the binary file, and the result is passed to `kill -s 0`. For example:
~~~
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: Binary: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: file: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]:  notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: matches: arguments must be process or job IDs ]
~~~


It's unclear to me why this eval trick was used at all, and why `$cmd` and `echo $!` were chained together for redirection to $pidfile. It seems to me from some basic testing that "Spawn the background command and then on a separate line echo the resulting pid to $pidfile and exit the script" would work fine.
~~~
[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/test2.sh
#!/bin/bash
cmd="/bin/nc -l -k 5000"
pidfile=/tmp/nc.pid
$cmd &
echo $! > $pidfile
exit 0

[root@fastvm-rhel-7-6-21 ocf]# bash /tmp/test2.sh

[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/nc.pid
32602

[root@fastvm-rhel-7-6-21 ocf]# ps -ef | grep /bin/nc
root     32602     1  0 13:28 pts/0    00:00:00 /bin/nc -l -k 5000
~~~


This issue causes a major impact to SAP-on-Azure customers who encounter it.

I am attaching an example of a pid file that contains binary data and causes the azure-lb monitor operation to fail.

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-46.el7_8.2

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Place node 2 in standby for simplicity.

2. Create an azure-lb resource and check that it's running properly.

    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Started fastvm-rhel-8-0-23

    [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
    1902

    [root@fastvm-rhel-8-0-23 ~]# ps -ef | grep /bin/nc
    root        1902       1  0 13:45 ?        00:00:00 /usr/bin/nc -l -k 61000


3. Create a FIFO on node 2.

    [root@fastvm-rhel-8-0-24 ~]# tmpd=`mktemp -d`
    [root@fastvm-rhel-8-0-24 ~]# tmpf="$tmpd"/fifo
    [root@fastvm-rhel-8-0-24 ~]# mkfifo "$tmpf"
    [root@fastvm-rhel-8-0-24 ~]# echo $tmpf
    /tmp/tmp.aqaKgvWeYF/fifo


4. In another node 2 session, open an nc connection from node 2 to node 1.

    [root@fastvm-rhel-8-0-24 ~]# nc fastvm-rhel-8-0-23 61000 < /tmp/tmp.aqaKgvWeYF/fifo


5. Going back to the original node 2 session, create a file descriptor pointing to the FIFO.

    [root@fastvm-rhel-8-0-24 ~]# exec 3> "$tmpf"

6. Use the FIFO to send some text data to node 1's nc listener. Note that the output is echoed to the pidfile on node 1, and that the resource stays running.

    [root@fastvm-rhel-8-0-24 ~]# echo abcd >&3

    [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
    1902
    abcd
    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Started fastvm-rhel-8-0-23

7. Use the FIFO to send some binary data to node 1's nc listener.

    [root@fastvm-rhel-8-0-24 ~]# cat /usr/bin/grep >&3

    [root@fastvm-rhel-8-0-23 ~]# grep 1902 /var/run/lb.pid 
    Binary file /var/run/lb.pid matches

    [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
      * lb	(ocf::heartbeat:azure-lb):	Stopped

-----

Actual results:

Receiving binary data over the azure-lb listener port causes the grep on the pidfile to print "Binary file /var/run/$OCF_RESOURCE_INSTANCE.pid matches", which in turn causes the monitor operation to fail (due to invalid arguments to the `kill -s 0` command).

-----

Expected results:

Receiving binary data over the azure-lb listener port has no effect on the status of the resource, provided that the nc process keeps running.
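Per the linked upstream pull request ("azure-lb: Don't redirect nc listener output to pidfile"), the remedy is to keep the listener's stdout away from the pidfile entirely. A minimal sketch of that shape, using `sleep` as a stand-in for the nc listener command (the variable names here are illustrative):
~~~
# Sketch: background the listener with its output discarded, then write
# only the PID to the pidfile on a separate line.
pidfile=$(mktemp)
sleep 30 > /dev/null 2>&1 &     # stand-in for the backgrounded nc listener
echo $! > "$pidfile"            # the pidfile holds exactly one numeric line

grep -o '[0-9]*' "$pidfile"     # getpid() now always returns a clean PID
kill "$(cat "$pidfile")"        # clean up the stand-in listener
~~~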

Comment 2 Reid Wahl 2020-06-24 21:18:50 UTC
Created attachment 1698711 [details]
Example binary pid file

Comment 3 Reid Wahl 2020-06-25 22:50:40 UTC
I forgot to mention a pretty important point. When this happens, the start fails afterward. That's because lb_start()'s call to lb_monitor() returns failure (for the same reason the monitor failed in the first place), and so lb_start() spawns a new `nc` process. But the old `nc` process is still running and using the configured port.

~~~
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]:  notice: Result of stop operation for nc_PF2_02 on ADZW2RHNAF400: 0 (ok)
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]:  notice: nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]:  notice: nc_PF2_02_start_0:777:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 91: kill: (818) - No such process ]
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]:  notice: Result of start operation for nc_PF2_02 on ADZW2RHNAF400: 1 (unknown error)
~~~

If the issue has already occurred on one node and then occurs on another node before the failures are manually cleaned up, the resource won't recover automatically (due to start failures on each node).

Comment 20 errata-xmlrpc 2020-11-10 12:56:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5004

