Bug 1937151

Summary: azure-lb: nc listener dies when attempting to write to stdout [RHEL 7.9.z]
Product: Red Hat Enterprise Linux 7
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: Brandon Perkins <bperkins>
Severity: urgent
Priority: urgent
Version: 7.9
CC: agk, bperkins, cfeist, cluster-maint, cluster-qe, fdinitto, lmiksik, oalbrigt, radeltch
Target Milestone: rc
Keywords: Regression, Triaged, ZStream
Target Release: 7.9
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Fixed In Version: resource-agents-4.1.1-61.el7_9.9
Clone Of: 1937142
Last Closed: 2021-06-08 22:29:53 UTC
Bug Blocks: 1937426, 1937427, 1937428

Description Reid Wahl 2021-03-10 01:42:39 UTC
+++ This bug was initially created as a clone of Bug #1937142 +++

Description of problem:

When the nc process created by an azure-lb resource attempts to write to stdout, it dies with a SIGPIPE error.

This can happen when random/garbage input is sent to the nc listener. For example:
~~~
    [root@fastvm-rhel-8-0-23 ~]# pcs resource debug-start my_nc 
    Operation start for my_nc (ocf:heartbeat:azure-lb) returned: 'ok' (0)

    [root@fastvm-rhel-8-0-23 ~]# date && ps -ef | grep 62000
    Tue Mar  9 16:19:13 PST 2021
    root        2838       1  0 16:19 pts/0    00:00:00 /usr/bin/nc -l -k 62000
    root        2845    1420  0 16:19 pts/0    00:00:00 grep --color=auto 62000
     
    [root@fastvm-rhel-8-0-24 ~]# date && echo test | nc node1 62000
    Tue Mar  9 16:19:27 PST 2021
     
    [root@fastvm-rhel-8-0-23 ~]# date && ps -ef | grep 62000
    Tue Mar  9 16:19:30 PST 2021
    root        2849    1420  0 16:19 pts/0    00:00:00 grep --color=auto 62000
~~~

If you want to see the SIGPIPE, you can make the following change to lb_start() temporarily, and then view the strace output after the process dies.
~~~
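    # Original command, disabled while debugging:
    #cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port"
    # Run nc under strace; each start writes a timestamped trace file to /tmp.
    cmd="strace -Tttvfs 1024 -o /tmp/nc.$(date +%Y%m%d-%H%M%S).out nc -l -k $OCF_RESKEY_port"
    if ! lb_monitor; then
        ocf_log debug "Starting $process: $cmd"
        # Execute the command as created above
        $cmd &
        # $! would be strace's PID here, so look up the PID of the nc child instead.
        #echo $! > $pidfile
        pid=$(ps -ef | grep "nc -l -k $OCF_RESKEY_port" | grep -v grep | grep -v strace | awk '{print $2}')
        echo "$pid" > $pidfile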
~~~

You'll find a failure like the following:
~~~
77954 15:13:33.003037 recvfrom(6, "GET / HTTP/1.0\r\n\r\n", 8192, 0, NULL, NULL) = 18 <0.000014>
77954 15:13:33.003179 write(1, "GET / HTTP/1.0\r\n\r\n", 18) = -1 EPIPE (Broken pipe) <0.000015>
77954 15:13:33.003245 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=77954, si_uid=0} ---
77954 15:13:33.003306 dup(2)            = 7 <0.000012>
77954 15:13:33.003360 fcntl(7, F_GETFL) = 0x1 (flags O_WRONLY) <0.000012>
77954 15:13:33.003414 close(7)          = 0 <0.000012>
77954 15:13:33.003507 write(2, "write: Broken pipe\n", 19) = -1 EPIPE (Broken pipe) <0.000012>
77954 15:13:33.003563 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=77954, si_uid=0} ---
77954 15:13:33.003741 exit_group(2)     = ?
77954 15:13:33.004444 +++ exited with 2 +++
~~~
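
To pick the failing writes out of a longer trace, something like the following may help. It is only a sketch: `print_epipe_writes` is a hypothetical helper name, and the sample file holds three lines copied from the trace above so the snippet runs without strace or nc.

```shell
#!/bin/sh
# print_epipe_writes FILE...: list write() calls that failed with EPIPE.
# (Hypothetical helper, not part of the resource agent.)
print_epipe_writes() {
    grep -E 'write\([0-9]+, .*= -1 EPIPE' "$@"
}

# Sample input: three lines taken from the strace output in this report.
trace=$(mktemp)
cat > "$trace" <<'EOF'
77954 15:13:33.003037 recvfrom(6, "GET / HTTP/1.0\r\n\r\n", 8192, 0, NULL, NULL) = 18 <0.000014>
77954 15:13:33.003179 write(1, "GET / HTTP/1.0\r\n\r\n", 18) = -1 EPIPE (Broken pipe) <0.000015>
77954 15:13:33.003507 write(2, "write: Broken pipe\n", 19) = -1 EPIPE (Broken pipe) <0.000012>
EOF

# Show the matching lines, then count them.
print_epipe_writes "$trace"
epipe_count=$(print_epipe_writes "$trace" | grep -c .)
rm -f "$trace"
```

Against real output from the modified agent, the helper would be pointed at the `/tmp/nc.*.out` files instead of the sample file.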

Apparently the Azure health probes don't cause anything to be written to stdout at the nc listener end; the listener simply processes the request and sends a response back to the client. So the nc process doesn't die when Azure sends probes. But it can die in situations like a port scan.

This is a sneaky regression introduced by commit d22700fc.
  - azure-lb: Don't redirect nc listener output to pidfile (https://github.com/ClusterLabs/resource-agents/commit/d22700fc)

Prior to that fix, all of nc's stdout was redirected to the pid file. This could cause the resource to fail and then be unable to restart if binary data was appended to the pid file (documented in BZ1850778). However, the fix for that issue made it so that the nc stdout is no longer redirected at all. By not redirecting stdout, we now get a SIGPIPE if an nc listener created by the resource does try to write to stdout.
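
The effect can be imitated without a cluster. The sketch below assumes the listener's stdout ends up attached to a pipe whose reading end has already been closed (the report does not say what stdout is attached to). The `listener` function is a stand-in for nc, not the real process, and the status file is only there to capture exit codes.

```shell
#!/bin/sh
# Stand-in for the nc listener: waits briefly, then echoes "received" data to stdout.
listener() {
    sleep 1
    echo "GET / HTTP/1.0"
}

status_file=$(mktemp)

# Case 1: stdout is a pipe whose reader (true) has exited before the write happens.
{ rc=0; ( listener ) 2>/dev/null || rc=$?; echo "$rc" > "$status_file"; } | true
unredirected_status=$(cat "$status_file")

# Case 2: same pipe, but the listener's stdout and stderr go to /dev/null.
{ rc=0; ( listener ) >/dev/null 2>&1 || rc=$?; echo "$rc" > "$status_file"; } | true
redirected_status=$(cat "$status_file")

rm -f "$status_file"
echo "unredirected=$unredirected_status redirected=$redirected_status"
```

In the first case the stand-in should end with a non-zero status because its write hits the closed pipe; in the second case it should exit cleanly because nothing is written to the pipe.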

The current bug is less catastrophic than the previous one, as it causes the resource to fail but doesn't prevent it from restarting. Still, this may cause a significant service disruption in SAP environments.

There is no known workaround besides manually editing the resource agent.
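
One possible shape for such a manual edit is to discard the listener's output when it is started, while still writing only the PID to the pid file. This is an assumption about a workable change, not the fix that shipped in the erratum. `start_listener` and the temporary pid file are hypothetical names, and a noisy `sh -c` command stands in for nc so the sketch runs anywhere.

```shell
#!/bin/sh
# Temporary pid file for the sketch; the agent uses its own $pidfile.
pidfile=$(mktemp)

# start_listener CMD [ARGS...]: hypothetical replacement for the "$cmd &" step.
start_listener() {
    # Discard stdout/stderr so a write can never hit a closed pipe,
    # and so listener output never reaches the pid file.
    "$@" >/dev/null 2>&1 &
    # Record only the PID of the background job.
    echo $! > "$pidfile"
}

# In the agent this might read:
#   start_listener "$OCF_RESKEY_nc" -l -k "$OCF_RESKEY_port"
# Here a stand-in command prints output and then idles.
start_listener sh -c 'echo noise; sleep 2'

pid=$(cat "$pidfile")
echo "pid file contains: $pid"

# Clean up the stand-in process and the temporary file.
kill "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
rm -f "$pidfile"
```

Passing the command as separate arguments avoids relying on word splitting of a single `$cmd` string, which is a design choice of this sketch rather than something the agent does.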

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-68.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create an azure-lb resource with port=62000 (for example).

    node1# pcs resource create --disabled lb azure-lb port=62000

2. Connect to the listener and send it some text data.

    node2# echo test | nc node1 62000

-----

Actual results:

The nc listener process on node1 dies. The resource monitor operation fails.

-----

Expected results:

The resource continues running without issue.

-----

Additional info:

We're going to want to zStream this pretty widely (8.1.z/8.2.z/8.3.z, and ideally back to 7.4.z). This is critical for SAP deployments on Azure.

I'll leave it up to engineering whether to pursue an 8.4 blocker or to pursue 8.4.z instead. The actual fix appears to be trivial.

Comment 19 errata-xmlrpc 2021-06-08 22:29:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2311