1853102 – in.telnetd needs to tolerate temporary EIO errors. [rhel-7.9.z]


Bug 1853102 - in.telnetd needs to tolerate temporary EIO errors. [rhel-7.9.z]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	telnet
Sub Component:
Version:	7.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Michal Ruprich
QA Contact:	Patrik Moško
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1780662 1881335

Reported:	2020-07-02 00:38 UTC by Tetsuo Handa
Modified:	2023-12-15 18:22 UTC (History)
CC List:	6 users (show)
Fixed In Version:	telnet-0.17-66.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1881335 (view as bug list)
Environment:
Last Closed:	2020-11-10 13:04:04 UTC
Target Upstream Version:
Embargoed:

Attachments
Patch to mitigate temporary EIO error (1.20 KB, patch) 2020-07-02 00:43 UTC, Tetsuo Handa	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:5019	0	None	None	None	2020-11-10 13:04:10 UTC

Description Tetsuo Handa 2020-07-02 00:38:43 UTC

Description of problem:

As described in Bug 1299351, the /bin/login process temporarily closes all of its file descriptors before calling vhangup() and reopens them afterwards.
If /usr/sbin/in.telnetd reads from the pty master while the pty slave is temporarily closed, the read fails with an EIO error.
However, as a side effect of Bug 145636, the in.telnetd process closes the connection immediately upon an EIO error.
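
The EIO condition described above can be observed outside in.telnetd. The following is a minimal sketch, assuming Linux pty semantics; the helper names errno_after_slave_close and demo_expect_eio are hypothetical and are not part of in.telnetd or the attached patch. It opens a pty pair, closes the only slave descriptor, and reports the errno of a read() on the master. On Linux this is expected to be EIO; other systems may report end-of-file instead.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not in.telnetd code).
 * Opens a pty pair, closes the slave, then reads from the master.
 * Returns the errno reported by read(), 0 if read() did not fail,
 * or -1 if the pty pair could not be set up.
 */
int errno_after_slave_close(void)
{
    char buf[16];
    int err = 0;
    int master = posix_openpt(O_RDWR | O_NOCTTY);

    if (master < 0)
        return -1;
    if (grantpt(master) < 0 || unlockpt(master) < 0) {
        close(master);
        return -1;
    }

    const char *name = ptsname(master);
    int slave = name ? open(name, O_RDWR | O_NOCTTY) : -1;
    if (slave < 0) {
        close(master);
        return -1;
    }

    /* From here on nobody holds the slave open, like login around vhangup(). */
    close(slave);

    /* Non-blocking, so that a platform without the EIO behavior cannot hang. */
    int flags = fcntl(master, F_GETFL);
    if (flags != -1)
        fcntl(master, F_SETFL, flags | O_NONBLOCK);

    if (read(master, buf, sizeof(buf)) < 0)
        err = errno;

    close(master);
    return err;
}

/* Usage sketch: aborts unless the master reported EIO (assumes Linux). */
void demo_expect_eio(void)
{
    assert(errno_after_slave_close() == EIO);
}
```

In the real failure the slave is reopened shortly afterwards by login, which is why the error is only temporary from the point of view of in.telnetd.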



Version-Release number of selected component (if applicable):

telnet-server-0.17-65.el7_8.x86_64
Any environment whose /bin/login closes all file descriptors before calling vhangup().



How reproducible:

This race condition is timing dependent, but I think it is not difficult to reproduce.



Steps to Reproduce:

(1) Install xinetd, telnet-server, telnet and strace packages.
(2) Create /etc/xinetd.d/telnet with the following content, in order to widen the race
    window by making the in.telnetd process and the login process run slower.

----------
service telnet
{
        socket_type             = stream
        protocol                = tcp
        wait                    = no
        user                    = root
        server                  = /usr/bin/strace
        server_args             = -ttf -o /tmp/strace.log /usr/sbin/in.telnetd
        disable                 = no
        flags                   = IPv4
}
----------

(3) Restart xinetd service in order to reload /etc/xinetd.d/telnet file.
(4) Connect to the telnet server using the following command line. Note that
    echo '' is there to send garbage data into the race window.

      (echo ''; sleep 3) | telnet 127.0.0.1



Actual results:

The in.telnetd process closes the connection before reaching the login: prompt.

----------
$ (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64

Connection closed by foreign host.
----------



Expected results:

The in.telnetd process closes the connection after reaching the login: prompt.

----------
$ (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64

localhost login: Connection closed by foreign host.
----------



Additional info:

If /usr/bin/strace fails to widen this race window, you can instead try /bin/login built with the following patch applied.

----------
--- a/login-utils/login.c
+++ b/login-utils/login.c
@@ -397,20 +397,21 @@ static void init_tty(struct login_contex
 	/* Kill processes left on this tty */
 	tcsetattr(0, TCSANOW, &ttt);
 
 	/*
 	 * Let's close file decriptors before vhangup
 	 * https://lkml.org/lkml/2012/6/5/145
 	 */
 	close(STDIN_FILENO);
 	close(STDOUT_FILENO);
 	close(STDERR_FILENO);
+	sleep(5);
 
 	signal(SIGHUP, SIG_IGN);	/* so vhangup() wont kill us */
 	vhangup();
 	signal(SIGHUP, SIG_DFL);
 
 	/* open stdin,stdout,stderr to the tty */
 	open_tty(cxt->tty_path);
 
 	/* restore tty modes */
 	tcsetattr(0, TCSAFLUSH, &tt);
----------

Comment 2 Tetsuo Handa 2020-07-02 00:43:45 UTC

Created attachment 1699584 [details]
Patch to mitigate temporary EIO error

A different telnetd implementation (BusyBox) mitigates this problem by tolerating temporary EIO errors for 10 ms
( https://git.busybox.net/busybox/commit/networking/telnetd.c?id=39b18196f89a6f595d47c2a9c3a62c50d413c054 ).

Since unexpected delays between close() and open() can happen (due to, e.g., context switching,
direct memory reclaim from a page fault, or antivirus software's on-access scanning), we should consider
retrying for a longer period than BusyBox's version does.

Since Bug 145636 did not describe steps to reproduce, we don't know how to trigger a permanent EIO error
while the child process is still alive. Because the in.telnetd process terminates automatically via
signal(SIGCHLD, cleanup), I consider it unlikely that we would hit a permanent EIO error while the child
process is still alive. Therefore, I consider the risk of retrying for a longer period to be quite small.
An example mitigation patch for RHEL's version is attached.
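
The retry idea discussed in this comment can be sketched as follows. This is an illustration only, not the attached patch: the function name read_tolerating_eio and the parameters max_retries and delay_ms are hypothetical, and the actual retry count and delay used in the patch are not stated in this report. Only EIO is treated as possibly temporary; any other result is returned immediately.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the attached patch).
 * Reads from fd; if read() fails with EIO, waits delay_ms milliseconds
 * and tries again, up to max_retries extra attempts. Returns whatever
 * the last read() returned, with errno left as set by that read().
 */
ssize_t read_tolerating_eio(int fd, void *buf, size_t len,
                            int max_retries, int delay_ms)
{
    assert(buf != NULL && max_retries >= 0 && delay_ms >= 0);

    for (int attempt = 0; ; attempt++) {
        ssize_t n = read(fd, buf, len);

        /* Success, EOF, a non-EIO error, or retries exhausted: give up. */
        if (n >= 0 || errno != EIO || attempt >= max_retries)
            return n;

        /* poll() with no descriptors is used here as a millisecond sleep. */
        poll(NULL, 0, delay_ms);
    }
}
```

With max_retries = 100 and delay_ms = 10, for example, a caller would tolerate roughly one second of EIO before treating the error as permanent; those numbers are assumptions chosen for illustration.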

Comment 5 Michal Ruprich 2020-08-06 10:42:18 UTC

(In reply to Tetsuo Handa from comment #0)
> Actual results:
> 
> in.telnetd process closes connection before reaching login: prompt.
> 
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> 
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
> 
> Connection closed by foreign host.
> ----------
> 
> 
> 
> Expected results:
> 
> in.telnetd process closes connection after reaching login: prompt.
> 
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> 
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
> 
> localhost login: Connection closed by foreign host.
> ----------
Hi Tetsuo, 

the expected result is visible when you use systemctl instead of xinetd to start telnetd.

# systemctl start telnet.socket
# ss -tlnup
.....
LISTEN   0   128   [::]:23   [::]:*   users:(("systemd",pid=1,fd=44))
.....
# (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1158.el7.x86_64 on an x86_64

ci-vm-10-0-138-196 login: Connection closed by foreign host.

The login is reached with the reproducer under systemctl. Can you use this instead of xinetd? Would that solve the problem you are having?

Thanks and regards,
Michal Ruprich

Comment 6 Michal Ruprich 2020-08-06 10:44:37 UTC

Hi Masaharu,

maybe the suggestion from comment #5 might help your customer as well?

Thanks and regards,
Michal Ruprich

Comment 7 Tetsuo Handa 2020-08-06 11:21:48 UTC

Using systemctl does not help. If you copy /usr/lib/systemd/system/telnet@.service to /etc/systemd/system/telnet@.service and apply the following modification, the same result is observed:

  -ExecStart=-/usr/sbin/in.telnetd
  +ExecStart=-/usr/bin/strace -ttf -o /tmp/strace.log /usr/sbin/in.telnetd

I'm using strace only to increase the frequency of this failure for explanation and testing purposes.
The customer is not running under strace.

The customer says that the failure rate is a few percent, and that avoiding this failure on the server side is important because it is impossible to implement retry logic on the client side.
(I created a public Bugzilla entry on behalf of the customer. I expect that the customer has already created a RH support case with the details.)

Comment 10 Michal Ruprich 2020-08-11 11:11:05 UTC

Hi Tetsuo,

I think that the patch seems reasonable. Just one thing, why did you use poll(NULL, 0, 10)? Why not use a simple sleep(0.01)? I am just wondering what might be better at this point but I probably don't see a difference between those two.

Thanks and regards,
Michal

Comment 11 Tetsuo Handa 2020-08-11 11:18:41 UTC

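Because, unlike the sleep(1) command, the sleep(3) library function takes an integer number of seconds, so sleep(0.01) would be converted to sleep(0) and would not wait at all.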

  unsigned int sleep(unsigned int seconds);
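
The difference can be sketched in a few lines. This is an illustration under stated assumptions: the helper names sleep_ms, seconds_seen_by_sleep and elapsed_ms_of_sleep are hypothetical and do not come from the patch. With the prototype above in scope, the argument 0.01 is converted to unsigned int, which yields 0, whereas poll() with no descriptors takes its timeout in milliseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: pause for ms milliseconds using poll() with no fds.
 * Returns 0 when the timeout expires, -1 if interrupted (errno == EINTR). */
int sleep_ms(int ms)
{
    assert(ms >= 0);
    return poll(NULL, 0, ms);
}

/* Hypothetical helper: the value sleep(3) would receive after the implicit
 * conversion of a fractional argument to unsigned int. */
unsigned int seconds_seen_by_sleep(double requested)
{
    return (unsigned int)requested;
}

/* Hypothetical helper: measure how long sleep_ms(ms) took, in milliseconds. */
long elapsed_ms_of_sleep(int ms)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    sleep_ms(ms);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) * 1000L +
           (end.tv_nsec - start.tv_nsec) / 1000000L;
}
```

nanosleep() or usleep() would be other ways to get a sub-second delay; poll(NULL, 0, 10) is simply the form the patch under discussion uses.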

Comment 20 errata-xmlrpc 2020-11-10 13:04:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (telnet bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5019

Comment 21 Red Hat Bugzilla 2023-09-18 00:21:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
