Bug 1853102

Summary:

in.telnetd needs to tolerate temporary EIO errors. [rhel-7.9.z]

Product:

Red Hat Enterprise Linux 7

Reporter:

Tetsuo Handa <penguin-kernel>

Component:

telnet

Assignee:

Michal Ruprich <mruprich>

Status:

CLOSED ERRATA

QA Contact:

Patrik Moško <pmosko>

Severity:

medium

Priority:

urgent

Version:

7.8

CC:

ctpm-oss-app-prm, jreznik, mkawada, omejzlik, penguin-kernel, pmosko

Keywords:

Patch, Reproducer, TestCaseProvided, Triaged, ZStream

Hardware:

Unspecified

OS:

Unspecified

Fixed In Version:

telnet-0.17-66.el7

Clones:

1881335

Last Closed:

2020-11-10 13:04:04 UTC

Type:

Bug

Bug Blocks:

1780662, 1881335

Attachments:

Patch to mitigate temporary EIO error (flags: none)

Description Tetsuo Handa 2020-07-02 00:38:43 UTC

Description of problem:

As described in Bug 1299351, the /bin/login process temporarily closes its standard file descriptors (stdin, stdout, and stderr) before calling vhangup(), and reopens the tty afterwards.
If /usr/sbin/in.telnetd reads from the pty master while the pty slave is temporarily closed, the in.telnetd process gets an EIO error.
However, as a side effect of Bug 145636, the in.telnetd process immediately closes the connection upon an EIO error.
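
The failure mode can be observed without telnetd at all. The following C sketch is Linux-specific, and the function name pty_eio_window is hypothetical. It opens a pty pair, closes the only slave descriptor as login does before vhangup(), and reads the master: Linux returns EIO while no slave descriptor is open, and the error goes away once the slave is reopened. This is why the EIO seen by in.telnetd is only temporary. Other kernels may report this condition differently.

```c
#define _GNU_SOURCE /* posix_openpt(), grantpt(), unlockpt(), ptsname() */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (Linux behavior assumed):
 *  1. open a pty master and its slave;
 *  2. close the only slave fd, as login closes stdin/stdout/stderr
 *     before vhangup();
 *  3. read the master: with no slave fd open, read() fails with EIO;
 *  4. reopen the slave, as login does afterwards, and read again:
 *     no EIO now; the master is non-blocking and has no data, so EAGAIN.
 * The errno of each read is stored in *err_closed and *err_reopened
 * (0 if that read unexpectedly succeeded).
 * Returns 0 on success, -1 if the pty pair could not be set up.
 */
int pty_eio_window(int *err_closed, int *err_reopened)
{
    char buf[64];
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;

    const char *slave_path = NULL;
    int flags = fcntl(master, F_GETFL);
    if (flags < 0 || fcntl(master, F_SETFL, flags | O_NONBLOCK) < 0 ||
        grantpt(master) < 0 || unlockpt(master) < 0 ||
        (slave_path = ptsname(master)) == NULL) {
        close(master);
        return -1;
    }

    int slave = open(slave_path, O_RDWR | O_NOCTTY);
    if (slave < 0) {
        close(master);
        return -1;
    }

    close(slave); /* race window opens: no slave fd is open */
    *err_closed = read(master, buf, sizeof buf) < 0 ? errno : 0;

    slave = open(slave_path, O_RDWR | O_NOCTTY); /* race window closes */
    if (slave < 0) {
        close(master);
        return -1;
    }
    *err_reopened = read(master, buf, sizeof buf) < 0 ? errno : 0;

    close(slave);
    close(master);
    return 0;
}
```

On a Linux system where ptys are available, the first stored errno is expected to be EIO and the second EAGAIN.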



Version-Release number of selected component (if applicable):

telnet-server-0.17-65.el7_8.x86_64
Any environment whose /bin/login closes its file descriptors before calling vhangup().



How reproducible:

This race condition is timing-dependent, but I think it is not difficult to reproduce.



Steps to Reproduce:

(1) Install the xinetd, telnet-server, telnet, and strace packages.
(2) Create /etc/xinetd.d/telnet with the following content, in order to widen the race
    window by making the in.telnetd and login processes run slower.

----------
service telnet
{
        socket_type             = stream
        protocol                = tcp
        wait                    = no
        user                    = root
        server                  = /usr/bin/strace
        server_args             = -ttf -o /tmp/strace.log /usr/sbin/in.telnetd
        disable                 = no
        flags                   = IPv4
}
----------

(3) Restart the xinetd service in order to reload the /etc/xinetd.d/telnet file.
(4) Connect to the telnet server using the following command line. Note that
    echo '' is there to send some garbage data into the race window.

      (echo ''; sleep 3) | telnet 127.0.0.1



Actual results:

The in.telnetd process closes the connection before the login: prompt is reached.

----------
$ (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64

Connection closed by foreign host.
----------



Expected results:

The in.telnetd process closes the connection after the login: prompt is reached.

----------
$ (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64

localhost login: Connection closed by foreign host.
----------



Additional info:

If /usr/bin/strace fails to widen the race window, you can instead try a /bin/login built with the following patch applied.

----------
--- a/login-utils/login.c
+++ b/login-utils/login.c
@@ -397,20 +397,21 @@ static void init_tty(struct login_contex
 	/* Kill processes left on this tty */
 	tcsetattr(0, TCSANOW, &ttt);
 
 	/*
 	 * Let's close file decriptors before vhangup
 	 * https://lkml.org/lkml/2012/6/5/145
 	 */
 	close(STDIN_FILENO);
 	close(STDOUT_FILENO);
 	close(STDERR_FILENO);
+	sleep(5);
 
 	signal(SIGHUP, SIG_IGN);	/* so vhangup() wont kill us */
 	vhangup();
 	signal(SIGHUP, SIG_DFL);
 
 	/* open stdin,stdout,stderr to the tty */
 	open_tty(cxt->tty_path);
 
 	/* restore tty modes */
 	tcsetattr(0, TCSAFLUSH, &tt);
----------

Comment 2 Tetsuo Handa 2020-07-02 00:43:45 UTC

Created attachment 1699584
Patch to mitigate temporary EIO error

Busybox's telnetd mitigates this problem by tolerating temporary EIO errors for 10 ms
( https://git.busybox.net/busybox/commit/networking/telnetd.c?id=39b18196f89a6f595d47c2a9c3a62c50d413c054 ).

Since unexpected delays between close() and open() can occur (due to, e.g., context switching,
direct memory reclaim on page faults, or on-access scanning by antivirus software), we should consider
retrying for a longer period than busybox does.

Because Bug 145636 did not describe steps to reproduce, we don't know how to trigger a permanent EIO error
while the child process is still alive. Since the in.telnetd process terminates automatically via
signal(SIGCHLD, cleanup) once the child exits, I consider it unlikely that we hit a permanent EIO error
while the child process is still alive. Therefore, I consider the risk of retrying for a longer period to be quite small.
An example mitigation patch for RHEL's version is attached.
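
The retry idea above can be sketched as follows, under stated assumptions: read() on the pty master is retried while it keeps failing with EIO, with a 10 ms pause between attempts (the poll(NULL, 0, 10) idiom discussed in comment 10), and the server gives up only after a retry budget is exhausted. The helper name read_tolerating_eio and the max_retries parameter are hypothetical. This is not the attached RHEL patch, whose contents and exact retry period are not reproduced in this report.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read(2) from the pty master, but treat EIO as
 * temporary for up to max_retries further attempts, pausing 10 ms between
 * attempts (total tolerance roughly max_retries * 10 ms).
 * Data, end-of-file, and any error other than EIO are returned at once.
 * If EIO persists after the budget is spent, returns -1 with errno == EIO,
 * so the caller can close the connection exactly as before.
 */
ssize_t read_tolerating_eio(int fd, void *buf, size_t len, int max_retries)
{
    for (int attempt = 0;; attempt++) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EIO || attempt >= max_retries)
            return n;
        /* poll() with no descriptors serves as a millisecond sleep. */
        poll(NULL, 0, 10);
    }
}
```

With max_retries = 1 the tolerance is roughly one 10 ms pause, close to the busybox behavior; larger values lengthen the window, as argued above. Because non-EIO results pass straight through, the existing handling of EOF and other errors is unchanged.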

Comment 5 Michal Ruprich 2020-08-06 10:42:18 UTC

(In reply to Tetsuo Handa from comment #0)
> Actual results:
> 
> in.telnetd process closes connection before reaching login: prompt.
> 
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> 
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
> 
> Connection closed by foreign host.
> ----------
> 
> 
> 
> Expected results:
> 
> in.telnetd process closes connection after reaching login: prompt.
> 
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> 
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
> 
> localhost login: Connection closed by foreign host.
> ----------
Hi Tetsuo, 

the expected result is visible when you use systemctl instead of xinetd to start telnetd.

# systemctl start telnet.socket
# ss -tlnup
.....
LISTEN   0   128   [::]:23   [::]:*   users:(("systemd",pid=1,fd=44))
.....
# (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1158.el7.x86_64 on an x86_64

ci-vm-10-0-138-196 login: Connection closed by foreign host.

With the reproducer under systemctl, the login: prompt is reached. Can you use this instead of xinetd? Would that solve the problem you are having?

Thanks and regards,
Michal Ruprich

Comment 6 Michal Ruprich 2020-08-06 10:44:37 UTC

Hi Masaharu,

could the suggestion from comment #5 help your customer as well?

Thanks and regards,
Michal Ruprich

Comment 7 Tetsuo Handa 2020-08-06 11:21:48 UTC

Using systemctl does not help. If you create /etc/systemd/system/telnet@.service from /usr/lib/systemd/system/telnet@.service with the following modification, the same result is observed.

  -ExecStart=-/usr/sbin/in.telnetd
  +ExecStart=-/usr/bin/strace -ttf -o /tmp/strace.log /usr/sbin/in.telnetd

I'm using strace to increase the frequency of this failure for explanation and testing purposes.
The customer is not running under strace.

The customer reports that this failure occurs a few percent of the time, and avoiding it on the server side is important because retry logic cannot be implemented on the client side.
(I created a public Bugzilla entry on behalf of the customer. I expect that the customer already created a RH support case for details.)

Comment 10 Michal Ruprich 2020-08-11 11:11:05 UTC

Hi Tetsuo,

I think the patch is reasonable. Just one thing: why did you use poll(NULL, 0, 10)? Why not simply sleep(0.01)? I am just wondering which would be better here, but I probably don't see a difference between the two.

Thanks and regards,
Michal

Comment 11 Tetsuo Handa 2020-08-11 11:18:41 UTC

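Because, unlike the sleep(1) command, which accepts fractional values such as 0.01, the sleep(3) library function accepts only whole "seconds" as an unsigned int, so sleep(0.01) would be truncated to sleep(0).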

  unsigned int sleep(unsigned int seconds);
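
The difference can be checked directly. The C sketch below times both calls with CLOCK_MONOTONIC; the helper names are illustrative. sleep(3) takes an unsigned int, so 0.01 becomes 0 and the call returns without pausing. By contrast, poll() with no descriptors and a 10 ms timeout blocks for at least about 10 ms unless a signal interrupts it. nanosleep(2) would be another sub-second option.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* poll() with no descriptors: blocks for at least timeout_ms milliseconds
 * unless interrupted by a signal. Returns the measured duration in ns. */
long long elapsed_ns_poll(int timeout_ms)
{
    long long start = now_ns();
    poll(NULL, 0, timeout_ms);
    return now_ns() - start;
}

/* sleep(3) is declared as unsigned int sleep(unsigned int seconds), so the
 * argument 0.01 is converted to 0: this is just sleep(0). */
long long elapsed_ns_sleep_fraction(void)
{
    long long start = now_ns();
    sleep((unsigned int)0.01); /* what sleep(0.01) amounts to */
    return now_ns() - start;
}
```

elapsed_ns_poll(10) is expected to report at least 10,000,000 ns, while elapsed_ns_sleep_fraction() returns almost immediately.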

Comment 20 errata-xmlrpc 2020-11-10 13:04:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (telnet bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5019

Comment 21 Red Hat Bugzilla 2023-09-18 00:21:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days