1937151 – azure-lb: nc listener dies when attempting to write to stdout [RHEL 7.9.z]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.9
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	7.9
Assignee:	Oyvind Albrigtsen
QA Contact:	Brandon Perkins
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1937426 1937427 1937428

Reported:	2021-03-10 01:42 UTC by Reid Wahl
Modified:	2021-06-08 22:30 UTC
CC List:	9 users
Fixed In Version:	resource-agents-4.1.1-61.el7_9.9
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1937142
Clones:	1937426 1937427 1937428
Environment:
Last Closed:	2021-06-08 22:29:53 UTC
Target Upstream Version:
Embargoed:
Flags:	pm-rhel: mirror+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ClusterLabs resource-agents pull 1620	0	None	open	azure-lb: Redirect stdout and stderr to /dev/null	2021-03-10 02:36:03 UTC
Red Hat Knowledge Base (Solution)	5871151	0	None	None	None	2021-03-10 02:17:30 UTC

Description Reid Wahl 2021-03-10 01:42:39 UTC

+++ This bug was initially created as a clone of Bug #1937142 +++

Description of problem:

When the nc process created by an azure-lb resource attempts to write to stdout, it dies with a SIGPIPE error.

This can happen when random/garbage input is sent to the nc listener. For example:
~~~
    [root@fastvm-rhel-8-0-23 ~]# pcs resource debug-start my_nc 
    Operation start for my_nc (ocf:heartbeat:azure-lb) returned: 'ok' (0)

    [root@fastvm-rhel-8-0-23 ~]# date && ps -ef | grep 62000
    Tue Mar  9 16:19:13 PST 2021
    root        2838       1  0 16:19 pts/0    00:00:00 /usr/bin/nc -l -k 62000
    root        2845    1420  0 16:19 pts/0    00:00:00 grep --color=auto 62000
     
    [root@fastvm-rhel-8-0-24 ~]# date && echo test | nc node1 62000
    Tue Mar  9 16:19:27 PST 2021
     
    [root@fastvm-rhel-8-0-23 ~]# date && ps -ef | grep 62000
    Tue Mar  9 16:19:30 PST 2021
    root        2849    1420  0 16:19 pts/0    00:00:00 grep --color=auto 62000
~~~

If you want to see the SIGPIPE, you can make the following change to lb_start() temporarily, and then view the strace output after the process dies.
~~~
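    # Original command, commented out for the duration of the test:
    #cmd="$OCF_RESKEY_nc -l -k $OCF_RESKEY_port"
    # Temporary replacement: run nc under strace, logging to a timestamped file.
    cmd="strace -Tttvfs 1024 -o /tmp/nc.$(date +%Y%m%d-%H%M%S).out nc -l -k $OCF_RESKEY_port"
    if ! lb_monitor; then
        ocf_log debug "Starting $process: $cmd"
        # Execute the command as created above
        $cmd &
        # $! would be the PID of strace here, so look up the nc PID instead.
        #echo $! > $pidfile
        pid=$(ps -ef | grep "nc -l -k $OCF_RESKEY_port" | grep -v grep | grep -v strace | awk '{print $2}')
        echo "$pid" > $pidfile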
~~~

You'll find a failure like the following:
~~~
77954 15:13:33.003037 recvfrom(6, "GET / HTTP/1.0\r\n\r\n", 8192, 0, NULL, NULL) = 18 <0.000014>
77954 15:13:33.003179 write(1, "GET / HTTP/1.0\r\n\r\n", 18) = -1 EPIPE (Broken pipe) <0.000015>
77954 15:13:33.003245 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=77954, si_uid=0} ---
77954 15:13:33.003306 dup(2)            = 7 <0.000012>
77954 15:13:33.003360 fcntl(7, F_GETFL) = 0x1 (flags O_WRONLY) <0.000012>
77954 15:13:33.003414 close(7)          = 0 <0.000012>
77954 15:13:33.003507 write(2, "write: Broken pipe\n", 19) = -1 EPIPE (Broken pipe) <0.000012>
77954 15:13:33.003563 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=77954, si_uid=0} ---
77954 15:13:33.003741 exit_group(2)     = ?
77954 15:13:33.004444 +++ exited with 2 +++
~~~
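
The trace shows write(1, ...) failing with EPIPE. As a general illustration of that failure mode (not of the resource agent itself), the following minimal sketch runs a writer whose stdout is a pipe with no remaining reader and prints the writer's exit status. The function name broken_pipe_status is hypothetical. With the default SIGPIPE disposition on Linux the status is typically 141 (128 + 13); if SIGPIPE is ignored in the calling environment, the writer instead sees a write error and exits with a different non-zero status.

```shell
#!/bin/sh
# Sketch: run a writer whose stdout is a pipe that no longer has a reader,
# and print the writer's exit status. broken_pipe_status is a made-up name.
broken_pipe_status() {
    {
        {
            # Wait until the reader (true) has exited, then try to write.
            sh -c 'sleep 1; echo payload'
            # Report the writer's status on fd 3, bypassing the broken pipe.
            echo "$?" >&3
        } | true
    } 3>&1
}

status=$(broken_pipe_status)
echo "writer exit status: $status"
```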

Apparently the Azure health probes don't cause anything to be written to stdout at the nc listener end; the listener simply processes the request and sends a response back to the client. So the nc process doesn't die when Azure sends probes. But it can die in situations like a port scan.

This is a sneaky regression introduced by commit d22700fc.
  - azure-lb: Don't redirect nc listener output to pidfile (https://github.com/ClusterLabs/resource-agents/commit/d22700fc)

Prior to that fix, all of nc's stdout was redirected to the pid file. This could cause the resource to fail and then be unable to restart if binary data was appended to the pid file (documented in BZ1850778). However, the fix for that issue made it so that the nc stdout is no longer redirected at all. By not redirecting stdout, we now get a SIGPIPE if an nc listener created by the resource does try to write to stdout.
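
The linked upstream pull request is titled "azure-lb: Redirect stdout and stderr to /dev/null". A minimal sketch of that idea, assuming a hypothetical helper start_detached and a caller-supplied pidfile path (neither is taken from the agent's code), could look like this:

```shell
#!/bin/sh
# Hypothetical helper (not the agent's actual code): start a command in the
# background with stdout and stderr discarded, and record its PID.
start_detached() {
    pidfile="$1"
    shift
    # With both streams on /dev/null, the child's output cannot hit a broken
    # pipe and nothing it prints ends up in the pidfile.
    "$@" >/dev/null 2>&1 &
    echo "$!" > "$pidfile"
}

# Example invocation (assumes nc is installed; the pidfile path is made up):
# start_detached /tmp/example-lb.pid nc -l -k 62000
```

Only the PID is written to the pidfile, which avoids the problem described in BZ1850778, while the listener's own output is discarded instead of being left attached to whatever stdout the agent was started with.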

The current bug is less catastrophic than the previous one, as it causes the resource to fail but doesn't prevent it from restarting. Still, this may cause a significant service disruption in SAP environments.

There is no known workaround besides manually editing the resource agent.

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-68.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create an azure-lb resource with port=62000 (for example) and start it (the session above uses `pcs resource debug-start`).

    node1# pcs resource create --disabled lb azure-lb port=62000

2. Connect to the listener and send it some text data.

    node2# echo test | nc node1 62000
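
To confirm the outcome on node1 after step 2, a check along these lines could be used. This is a sketch only: pid_alive is a hypothetical helper, the pidfile location is an assumption that must be supplied by the caller, and this is not the agent's monitor implementation.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the PID recorded in the given file
# belongs to a live process.
pid_alive() {
    pidfile="$1"
    # An empty or missing pidfile counts as "not running".
    [ -s "$pidfile" ] || return 1
    # kill -0 sends no signal; it only tests whether the process exists.
    kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Example (the path is an assumption, not taken from the report):
# pid_alive /path/to/listener.pid && echo "listener alive" || echo "listener gone"
```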

-----

Actual results:

The nc listener process on node1 dies. The resource monitor operation fails.

-----

Expected results:

The resource continues running without issue.

-----

Additional info:

We're going to want to zStream this pretty widely (8.1.z/8.2.z/8.3.z, and ideally back to 7.4.z). This is critical for SAP deployments on Azure.

I'll leave it up to engineering whether to pursue an 8.4 blocker or to pursue 8.4.z instead. The actual fix appears to be trivial.

Comment 19 errata-xmlrpc 2021-06-08 22:29:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2311
