+++ This bug was initially created as a clone of Bug #1850778 +++
Description of problem:
Resources using the azure-lb resource agent fail intermittently when binary data is received on the listened-to port.
The resource start operation includes the following code:
~~~
cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port"
if ! lb_monitor; then
ocf_log debug "Starting $process: $cmd"
# Execute the command as created above
eval "$cmd & echo \$!" > $pidfile
~~~
The eval line runs $cmd in the background and redirects the output of the whole compound command, including the echoed PID, to $pidfile. The problem is that nc's output keeps being written to $pidfile indefinitely after the start operation completes: `nc -l` prints to stdout any data it receives. For example:
~~~
[root@fastvm-rhel-7-6-21 ocf]# nc -l 5000
... prints nothing until input from node 2 ...
asdf
[root@fastvm-rhel-7-6-22 ~]# nc fastvm-rhel-7-6-21 5000
asdf
~~~
This doesn't create an immediate issue, because getpid() just extracts the numeric strings from $pidfile (`grep -o '[0-9]*'`) and uses the result as the PID during monitor operations.
~~~
getpid() {
    grep -o '[0-9]*' $1
}

lb_monitor() {
    if test -f "$pidfile"; then
        if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid; then
~~~
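For instance, a pidfile that has picked up plain-text data still yields the PID. This is just an illustration with assumed values:
~~~
# Assumed values for illustration: text data appended after the PID is
# harmless, because grep -o only emits the digit runs it finds.
printf '1902\nasdf\n' > /tmp/lb.pid
grep -o '[0-9]*' /tmp/lb.pid    # prints only "1902"
~~~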
However, this becomes a major problem when binary data is received on $OCF_RESKEY_port and written to $pidfile. grep then treats $pidfile as a binary file and prints "Binary file /var/run/$OCF_RESOURCE_INSTANCE.pid matches" instead of the PID, and those words are passed as arguments to `kill -s 0`. For example:
~~~
Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: Binary: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: file: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ]
Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: matches: arguments must be process or job IDs ]
~~~
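For reference, the grep/kill interaction can be simulated outside the cluster. This is only a sketch with assumed values: port 5000 and /tmp/nc.pid stand in for $OCF_RESKEY_port and the agent's $pidfile.
~~~
#!/bin/bash
pidfile=/tmp/nc.pid

# Same pattern as the agent's start code: the listener's stdout is
# redirected to the pidfile along with the echoed PID.
eval "nc -l -k 5000 & echo \$!" > "$pidfile"

# Simulate a remote client sending binary data to the listener.
nc localhost 5000 < /usr/bin/grep &
client=$!
sleep 2
kill $client 2>/dev/null

# Mimic getpid()/lb_monitor(): grep now prints
# "Binary file /tmp/nc.pid matches" instead of a PID, and kill rejects
# each of those words as an argument.
pid=$(grep -o '[0-9]*' "$pidfile")
kill -s 0 $pid

# (The nc listener started above is left running; kill it manually when done.)
~~~
Running this should produce the same "arguments must be process or job IDs" errors seen in the log above.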
It's unclear to me why this eval trick was used at all, and why `$cmd` and `echo $!` were chained together under a single redirection to $pidfile. Based on some basic testing, spawning the background command and then echoing the resulting PID to $pidfile on a separate line works fine:
~~~
[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/test2.sh
#!/bin/bash
cmd="/bin/nc -l -k 5000"
pidfile=/tmp/nc.pid
$cmd &
echo $! > $pidfile
exit 0
[root@fastvm-rhel-7-6-21 ocf]# bash /tmp/test2.sh
[root@fastvm-rhel-7-6-21 ocf]# cat /tmp/nc.pid
32602
[root@fastvm-rhel-7-6-21 ocf]# ps -ef | grep /bin/nc
root 32602 1 0 13:28 pts/0 00:00:00 /bin/nc -l -k 5000
~~~
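To illustrate what that could look like in the agent itself, here is a rough sketch reusing the variable names from the excerpt above. This is only an illustration of the suggestion, not necessarily the fix that was shipped:
~~~
# Sketch only -- same names as the agent excerpt above, not the actual upstream patch.
if ! lb_monitor; then
    ocf_log debug "Starting $process: $cmd"
    # Start nc in the background; discarding its stdout is an extra
    # precaution in this sketch so received data cannot leak anywhere.
    $cmd > /dev/null &
    # Write only the PID to the pidfile, on its own line.
    echo $! > "$pidfile"
fi
~~~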
This issue has a major impact on SAP-on-Azure customers who encounter it.
I am attaching an example of a pid file that contains binary data and causes the azure-lb monitor operation to fail.
-----
Version-Release number of selected component (if applicable):
resource-agents-4.1.1-46.el7_8.2
-----
How reproducible:
Always
-----
Steps to Reproduce:
1. Place node 2 in standby for simplicity.
2. Create an azure-lb resource and check that it's running properly.
[root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
* lb (ocf::heartbeat:azure-lb): Started fastvm-rhel-8-0-23
[root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
1902
[root@fastvm-rhel-8-0-23 ~]# ps -ef | grep /bin/nc
root 1902 1 0 13:45 ? 00:00:00 /usr/bin/nc -l -k 61000
3. Create a FIFO on node 2.
[root@fastvm-rhel-8-0-24 ~]# tmpd=`mktemp -d`
[root@fastvm-rhel-8-0-24 ~]# tmpf="$tmpd"/fifo
[root@fastvm-rhel-8-0-24 ~]# mkfifo "$tmpf"
[root@fastvm-rhel-8-0-24 ~]# echo $tmpf
/tmp/tmp.aqaKgvWeYF/fifo
4. In another node 2 session, open an nc connection from node 2 to node 1.
[root@fastvm-rhel-8-0-24 ~]# nc fastvm-rhel-8-0-23 61000 < /tmp/tmp.aqaKgvWeYF/fifo
5. Going back to the original node 2 session, open a file descriptor to the FIFO for writing.
[root@fastvm-rhel-8-0-24 ~]# exec 3> "$tmpf"
6. Use the FIFO to send some text data to node 1's nc listener. Note that the data is written to the pidfile on node 1, and that the resource stays running.
[root@fastvm-rhel-8-0-24 ~]# echo abcd >&3
[root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid
1902
abcd
[root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
* lb (ocf::heartbeat:azure-lb): Started fastvm-rhel-8-0-23
7. Use the FIFO to send some binary data to node 1's nc listener.
[root@fastvm-rhel-8-0-24 ~]# cat /usr/bin/grep >&3
[root@fastvm-rhel-8-0-23 ~]# grep 1902 /var/run/lb.pid
Binary file /var/run/lb.pid matches
[root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb
* lb (ocf::heartbeat:azure-lb): Stopped
-----
Actual results:
Receiving binary data over the azure-lb listener port causes the grep on the pidfile to print "Binary file /var/run/$OCF_RESOURCE_INSTANCE.pid matches", which in turn causes the monitor operation to fail (due to invalid arguments to the `kill -s 0` command).
-----
Expected results:
Receiving binary data over the azure-lb listener port has no effect on the status of the resource, provided that the nc process keeps running.
-----
I forgot to mention a pretty important point: when this happens, subsequent start operations fail as well. lb_start()'s call to lb_monitor() returns failure (for the same reason the monitor failed in the first place), so lb_start() spawns a new `nc` process, but the old `nc` process is still running and still bound to the configured port.
~~~
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of stop operation for nc_PF2_02 on ADZW2RHNAF400: 0 (ok)
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ]
Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 91: kill: (818) - No such process ]
Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of start operation for nc_PF2_02 on ADZW2RHNAF400: 1 (unknown error)
~~~
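A quick way to confirm the stale listener is to check which process still holds the port. This is only a sketch; 62502 is simply the port from the log excerpt above:
~~~
# 62502 is the port from the log excerpt above; adjust for the configured port.
ss -tlnp | grep :62502
# The old nc process still owns the socket even though the stop operation
# reported success.
~~~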
If the issue has already happened on one node and then it happens on another node before the failure is manually cleaned up, the resource won't recover automatically (due to start failures on each node).
-----
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5004