> Description of problem:
Hi there, I was referred here by https://access.redhat.com/support/cases/#/case/01954815 .
We are using rsh/rshd as a transport on a ~1 PB storage cluster, and were running into a bug similar to the one fixed in netkit-rsh-0.16 (1999):
=== ChangeLog ===
17-Dec-1999:
Fix bug in rshd (hangs forever with zombie offspring).
Also fix problems with rlogind.
To be posted as patch 1, making netkit-rsh-0.16.1.
=== === ===
In our case, rsh would /sometimes/ hang at random on long data transfers:
=== pstree output ===
systemd,1 --system --deserialize 25
( ... )
..sh,45950 -c zfs send -R 'pool1/export/dataset@snap-2017-05-22' | rsh 'remote' "zfs receive -F 'pool1/mirror/dataset'"
..rsh,45952 remote zfs receive -F 'pool1/mirror/dataset'
..(rsh,45953)
=== === ===
Here (rsh,45953) denotes a zombie process.
At this stage the remote rshd process had already finished normally, while a local strace showed an indefinite wait:
=== === ===
root@local:/home/user# strace -p 45952
strace: Process 45952 attached
select(6, [5], NULL, NULL, NULL^Cstrace: Process 45952 detached
<detached ...>
root@local:/home/user#
=== === ===
Comparing the rsh/rshd code with the strace output of a successful (non-hanging) transfer made it clear that the select() above is waiting on the remote stderr channel: rshd never issues a close() on the remote stderr socket and only calls shutdown(SHUT_RDWR) on it.
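For readers less familiar with this pattern, here is a minimal sketch of why a select() with a NULL timeout can wait forever on a descriptor. It is not rsh code: it uses a local socketpair() in place of the stderr connection, a 100 ms timeout so that it terminates, and the helper names wait_readable() and demo_select_waits_for_eof() are made up for illustration. It only shows that select() reports nothing while the peer is open and idle, and reports the descriptor readable (with read() returning 0) once the peer end is closed.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns select()'s result: 1 = readable, 0 = timed out, -1 = error. */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &readfds, NULL, NULL, &tv);
}

/* Returns 0 if select() times out while the peer is open and idle, and
 * reports end-of-file once the peer end has been closed; 1 otherwise. */
int demo_select_waits_for_eof(void)
{
    int sv[2];
    char buf[8];
    int before, after;
    ssize_t n;

    /* sv[0] stands in for the server side, sv[1] for the client side. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    before = wait_readable(sv[1], 100);  /* peer open, nothing sent */
    close(sv[0]);                        /* last reference to the peer end */
    after = wait_readable(sv[1], 100);   /* EOF makes the fd readable */
    n = read(sv[1], buf, sizeof buf);    /* read() returns 0 at EOF */
    close(sv[1]);

    return (before == 0 && after == 1 && n == 0) ? 0 : 1;
}
```

Calling demo_select_waits_for_eof() from a small main() should return 0 on a typical Linux system. With a NULL timeout, as in the strace output above, the first wait would never return.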
Patching rshd code with a close() after shutdown() fixed the problem, and now I am looking for a package maintainer to pass the burden of maintaining the "patch".
Please note that the patch is at the very least harmless: (a) after shutdown(SHUT_RDWR) there is not much left to do with the socket, and (b) a subsequent close() is perfectly valid, and is in fact the recommended thing to do under normal circumstances.
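Points (a) and (b) can be checked outside rshd with a small standalone sketch. The code below is an illustration only, not the rshd patch: it uses a local socketpair() instead of a TCP connection, and the names shutdown_then_close() and demo_shutdown_then_close() are hypothetical. It writes a few bytes, shuts the socket down, closes it, and then confirms that the peer still receives the pending data followed by end-of-file.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of netkit-rsh): shut a socket down in both
 * directions, then release the descriptor. Returns 0 on success, -1 if
 * either call failed. close() is attempted even if shutdown() fails. */
static int shutdown_then_close(int fd)
{
    int rc = 0;

    if (shutdown(fd, SHUT_RDWR) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Returns 0 if the peer sees the pending data and then end-of-file after
 * shutdown_then_close(); a non-zero step number otherwise. */
int demo_shutdown_then_close(void)
{
    int sv[2];
    char buf[16];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    /* Send some "stderr" data, then shut down and close the sending end. */
    if (write(sv[0], "err\n", 4) != 4) {
        close(sv[0]);
        close(sv[1]);
        return 2;
    }
    if (shutdown_then_close(sv[0]) != 0) {
        close(sv[1]);
        return 3;
    }

    /* The peer first drains the 4 pending bytes ... */
    n = read(sv[1], buf, sizeof buf);
    if (n != 4) {
        close(sv[1]);
        return 4;
    }
    /* ... and then sees end-of-file. */
    n = read(sv[1], buf, sizeof buf);
    close(sv[1]);
    return (n == 0) ? 0 : 5;
}
```

On a typical Linux system demo_shutdown_then_close() should return 0, i.e. close() after shutdown(SHUT_RDWR) succeeds and the peer is not left waiting.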
Proposed patch for "stock" netkit-rsh-0.17:
=== === ===
diff -r 36e472b66db7 rshd/rshd.c
--- a/rshd/rshd.c Thu May 25 21:46:23 2017 +1000
+++ b/rshd/rshd.c Thu May 25 22:14:17 2017 +1000
@@ -46,6 +46,13 @@
"$Id: rshd.c,v 1.25 2000/07/23 04:16:24 dholland Exp $";
#include "../version.h"
+
+// this must be the first line, 'cause GCC includes stuff (via /usr/lib/gcc/...../include)
+// which turns that off
+#include <linux/limits.h> // probably there's a better, more portable way to include this
+// NB: using hard-coded ARG_MAX is not recommended anyway;
+// better use sth like sysconf(_SC_ARG_MAX) instead
+
/*
* remote shell server:
* [port]\0
@@ -203,6 +210,7 @@
if (cc <= 0) {
shutdown(sock, 2);
FD_CLR(pype, &readfrom);
+ close(sock); // <-- added this trying to get rid of the client hanging on reading stderr
guys--;
}
else write(sock, buf, cc);
=== === ===
Please note that the Red Hat version, rsh-0.17-76.el7_1.1, already has a proper ARG_MAX / sysconf patch, but to make a proper Red Hat-compatible patch I would probably need to get in touch with the package maintainer, in order to learn the recommended order of applying Red Hat patches ( see https://access.redhat.com/support/cases/#/case/01954815 ).
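For reference, the sysconf-based approach mentioned in the patch comment could look roughly like the sketch below. This is not the Red Hat patch, whose contents I have not reproduced here; the function name arg_max_bytes() and the fallback to _POSIX_ARG_MAX are my own assumptions for illustration.

```c
#include <limits.h>
#include <unistd.h>

/* Fallback in case <limits.h> does not expose _POSIX_ARG_MAX;
 * 4096 is the minimum value that POSIX requires. */
#ifndef _POSIX_ARG_MAX
#define _POSIX_ARG_MAX 4096
#endif

/* Hypothetical helper: query the argument-length limit at run time
 * instead of relying on a compile-time ARG_MAX constant. */
long arg_max_bytes(void)
{
    long limit = sysconf(_SC_ARG_MAX);

    /* sysconf() returns -1 if the limit is indeterminate or on error;
     * fall back to the POSIX minimum in that case. */
    if (limit < 0)
        limit = _POSIX_ARG_MAX;
    return limit;
}
```

A caller would then size its command buffer from arg_max_bytes() rather than from a constant pulled in via <linux/limits.h>.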
> Version-Release number of selected component (if applicable):
rsh-0.17-76.el7_1.1 (rshd.c)
> How reproducible:
Irregular. Can provide a core file and some additional information on request.
> Steps to Reproduce:
> 1.
Install the rsh package and use it for long data transfers on a 1 PB storage cluster. The error will manifest itself sooner or later.
> 2.
> 3.
> Actual results:
data transfer completes successfully, but rsh process hangs indefinitely
> Expected results:
rsh process exits after the data transfer
> Additional info:
See above
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:3057