Bug 1850778
Summary: | azure-lb: Resource fails intermittently due to nc output redirection to pidfile [RHEL 8] | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> | ||||
Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> | ||||
Status: | CLOSED ERRATA | QA Contact: | Brandon Perkins <bperkins> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 8.2 | CC: | agk, bperkins, cfeist, cluster-maint, fdinitto, toneata | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | resource-agents-4.1.1-68.el8 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1850779 1875792 1875793 1891081 (view as bug list) | Environment: | |||||
Last Closed: | 2020-11-04 02:19:46 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1850779, 1875792, 1875793 | ||||||
Attachments: |
|
I forgot to mention a pretty important point. When this happens, the start fails afterward. That's because lb_start()'s call to lb_monitor() returns failure (for the same reason the monitor failed in the first place), and so lb_start() spawns a new `nc` process. But the old `nc` process is still running and using the configured port. ~~~ Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of stop operation for nc_PF2_02 on ADZW2RHNAF400: 0 (ok) Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ Ncat: bind to :::62502: Address already in use. QUITTING. ] Jun 23 11:03:48 ADZW2RHNAF400 lrmd[35611]: notice: nc_PF2_02_start_0:777:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 91: kill: (818) - No such process ] Jun 23 11:03:48 ADZW2RHNAF400 crmd[35614]: notice: Result of start operation for nc_PF2_02 on ADZW2RHNAF400: 1 (unknown error) ~~~ If the issue has already happened on one node and then it happens on another node before the failure is manually cleaned up, the resource won't recovery automatically (due to start failures on each node). Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4605 |
Created attachment 1698709 [details] Example binary pid file Description of problem: Resources using the azure-lb resource agent fail intermittently when binary data is received on the listened-to port. The resource start operation includes the following code: ~~~ cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port" if ! lb_monitor; then ocf_log debug "Starting $process: $cmd" # Execute the command as created above eval "$cmd & echo \$!" > $pidfile ~~~ The eval line runs $cmd in the background and echoes any output to $pidfile. The problem is that the output continues being echoed to $pidfile indefinitely after the start operation completes. `nc -l` prints any output it receives. For example: ~~~ [root@fastvm-rhel-7-6-21 ocf]# nc -l 5000 ... prints nothing until input from node 2 ... asdf [root@fastvm-rhel-7-6-22 ~]# nc fastvm-rhel-7-6-21 5000 asdf ~~~ This doesn't create an immediate issue, because getpid() just greps for an all-numeric line in $pidfile and uses that as the PID during monitor operations. ~~~ getpid() { grep -o '[0-9]*' $1 } lb_monitor() { if test -f "$pidfile"; then if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid; then ~~~ However, this becomes a major problem when binary data is received on $OCF_RESKEY_port and written to $pidfile. The grep command is run against the binary file, and the result is passed to `kill -s 0`. For example: ~~~ Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: Binary: arguments must be process or job IDs ] Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: file: arguments must be process or job IDs ] Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: /var/run/nc_PF2_02.pid: arguments must be process or job IDs ] Jun 24 09:13:03 node2 lrmd[35728]: notice: nc_PF2_02_monitor_10000:14214:stderr [ /usr/lib/ocf/resource.d/heartbeat/azure-lb: line 97: kill: matches: arguments must be process or job IDs ] ~~~ It's unclear to me why this eval trick was used at all, and why `$cmd` and `echo $!` were chained together for redirection to $pidfile. It seems to me from some basic testing that "Spawn the background command and then on a separate line echo the resulting pid to $pidfile and exit the script" would work fine. ~~~ [root@fastvm-rhel-7-6-21 ocf]# cat /tmp/test2.sh #!/bin/bash cmd="/bin/nc -l -k 5000" pidfile=/tmp/nc.pid $cmd & echo $! > $pidfile exit 0 [root@fastvm-rhel-7-6-21 ocf]# bash /tmp/test2.sh [root@fastvm-rhel-7-6-21 ocf]# cat /tmp/nc.pid 32602 [root@fastvm-rhel-7-6-21 ocf]# ps -ef | grep /bin/nc root 32602 1 0 13:28 pts/0 00:00:00 /bin/nc -l -k 5000 ~~~ This issue causes a major impact to SAP-on-Azure customers who encounter it. I am attaching an example of a pid file that contains binary data and causes the azure-lb monitor operation to fail. ----- Version-Release number of selected component (if applicable): resource-agents-4.1.1-44.el8_2.3 ----- How reproducible: Always ----- Steps to Reproduce: 1. Place node 2 in standby for simplicity. 2. Create an azure-lb resource and check that it's running properly. [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb * lb (ocf::heartbeat:azure-lb): Started fastvm-rhel-8-0-23 [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid 1902 [root@fastvm-rhel-8-0-23 ~]# ps -ef | grep /bin/nc root 1902 1 0 13:45 ? 00:00:00 /usr/bin/nc -l -k 61000 3. Create a FIFO on node 2. [root@fastvm-rhel-8-0-24 ~]# tmpd=`mktemp -d` [root@fastvm-rhel-8-0-24 ~]# tmpf="$tmpd"/fifo [root@fastvm-rhel-8-0-24 ~]# mkfifo "$tmpf" [root@fastvm-rhel-8-0-24 ~]# echo $tmpf /tmp/tmp.aqaKgvWeYF/fifo 4. In another node 2 session, open an nc connection from node 2 to node 1. [root@fastvm-rhel-8-0-24 ~]# nc fastvm-rhel-8-0-23 61000 < /tmp/tmp.aqaKgvWeYF/fifo 5. Going back to the original node 2 session, create a file descriptor pointing to the FIFO. [root@fastvm-rhel-8-0-24 ~]# exec 3> "$tmpf" 6. Use the FIFO to send some text data to node 1's nc listener. Note that the output is echoed to the pidfile on node 1, and that the resource stays running. [root@fastvm-rhel-8-0-24 ~]# echo abcd >&3 [root@fastvm-rhel-8-0-23 ~]# cat /var/run/lb.pid 1902 abcd [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb * lb (ocf::heartbeat:azure-lb): Started fastvm-rhel-8-0-23 7. Use the FIFO to send some binary data to node 1's nc listener. [root@fastvm-rhel-8-0-24 ~]# cat /usr/bin/grep >&3 [root@fastvm-rhel-8-0-23 ~]# grep 1902 /var/run/lb.pid Binary file /var/run/lb.pid matches [root@fastvm-rhel-8-0-23 ~]# pcs resource status | grep lb * lb (ocf::heartbeat:azure-lb): Stopped ----- Actual results: Receiving binary data over the azure-lb listener port causes the grep on the pidfile to print "Binary file /var/run/$OCF_RESOURCE_INSTANCE.pid matches", which in turn causes the monitor operation to fail (due to invalid arguments to the `kill -s 0` command). ----- Expected results: Receiving binary data over the azure-lb listener port has no effect on the status of the resource, provided that the nc process keeps running.