Bug 1853102
Summary: | in.telnetd needs to tolerate temporary EIO errors. [rhel-7.9.z]
---|---
Product: | Red Hat Enterprise Linux 7
Component: | telnet
Version: | 7.8
Status: | CLOSED ERRATA
Severity: | medium
Priority: | urgent
Reporter: | Tetsuo Handa <penguin-kernel>
Assignee: | Michal Ruprich <mruprich>
QA Contact: | Patrik Moško <pmosko>
CC: | ctpm-oss-app-prm, jreznik, mkawada, omejzlik, penguin-kernel, pmosko
Target Milestone: | rc
Keywords: | Patch, Reproducer, TestCaseProvided, Triaged, ZStream
Hardware: | Unspecified
OS: | Unspecified
Fixed In Version: | telnet-0.17-66.el7
Last Closed: | 2020-11-10 13:04:04 UTC
Type: | Bug
Bug Blocks: | 1780662, 1881335

Description
Tetsuo Handa
2020-07-02 00:38:43 UTC
Created attachment 1699584
Patch to mitigate temporary EIO error

A different version of telnetd mitigates this problem by tolerating temporary EIO errors for 10 ms ( https://git.busybox.net/busybox/commit/networking/telnetd.c?id=39b18196f89a6f595d47c2a9c3a62c50d413c054 ).

Since unexpected delays between close() and open() can happen (due to e.g. context switching, direct memory reclaim from a page fault, or antivirus software's on-access scanning), we should consider retrying for a longer period than busybox's version does.

Since Bug 145636 did not describe steps to reproduce, we don't know how to trigger a permanent EIO error while the child process is still alive. Since the in.telnetd process will automatically terminate due to signal(SIGCHLD, cleanup), I consider it unlikely that we will hit a permanent EIO error while the child process is still alive. Therefore, I consider the risk of retrying for a longer period to be quite small.

An example mitigation patch for RHEL's version is attached.

(In reply to Tetsuo Handa from comment #0)
> Actual results:
>
> in.telnetd process closes connection before reaching login: prompt.
>
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
>
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
>
> Connection closed by foreign host.
> ----------
>
>
>
> Expected results:
>
> in.telnetd process closes connection after reaching login: prompt.
>
> ----------
> $ (echo ''; sleep 3) | telnet 127.0.0.1
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
>
> Kernel 3.10.0-1127.13.1.el7.x86_64 on an x86_64
>
> localhost login: Connection closed by foreign host.
> ----------

Hi Tetsuo,

the expected result is visible when you use systemctl instead of xinetd to start telnetd.

# systemctl start telnet.socket
# ss -tlnup
.....
LISTEN 0 128 [::]:23 [::]:* users:(("systemd",pid=1,fd=44))
.....
# (echo ''; sleep 3) | telnet 127.0.0.1
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

Kernel 3.10.0-1158.el7.x86_64 on an x86_64
ci-vm-10-0-138-196 login: Connection closed by foreign host.

The login prompt is reached with the reproducer under systemctl. Can you use this instead of xinetd? Would that solve the problem you are having?

Thanks and regards,
Michal Ruprich

Hi Masaharu,

maybe the suggestion from comment #5 might help your customer as well?

Thanks and regards,
Michal Ruprich

Use of systemctl does not help. If you create /etc/systemd/system/telnet@.service from /usr/lib/systemd/system/telnet@.service with the modification

-ExecStart=-/usr/sbin/in.telnetd
+ExecStart=-/usr/bin/strace -ttf -o /tmp/strace.log /usr/sbin/in.telnetd

the same result will be observed.

I'm using strace in order to drive up the frequency of this failure for explanation/testing purposes. The customer is not running under strace. The customer says that the frequency of this failure is a few percent, and that avoiding this failure on the server side is important because it is impossible to implement retry logic on the client side.

(I created a public Bugzilla entry on behalf of the customer. I expect that the customer has already created a RH support case for details.)

Hi Tetsuo,

I think that the patch seems reasonable. Just one thing: why did you use poll(NULL, 0, 10)? Why not use a simple sleep(0.01)? I am just wondering which might be better at this point, but I probably don't see a difference between the two.

Thanks and regards,
Michal

Because, unlike sleep(1), sleep(3) accepts "seconds".

unsigned int sleep(unsigned int seconds);

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (telnet bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5019

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
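
For readers who want to see the shape of the mitigation discussed in this thread, here is a minimal sketch in C. It is not the attached patch and not the busybox commit; the helper name read_tolerating_eio, the read_fn callback type, and the fake_read test double are all hypothetical names introduced for illustration. The idea is only what the comments above describe: while a read fails with EIO, pause for a few milliseconds with poll(NULL, 0, ms) and try again, giving up after a bounded total wait. The callback parameter exists so the behaviour can be exercised without a real pty.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Same signature as read(2), so read itself can be passed in. */
typedef ssize_t (*read_fn)(int fd, void *buf, size_t len);

/*
 * Hypothetical helper (an assumption, not the attached patch): call do_read
 * and, while it fails with EIO, wait step_ms and try again, for at most
 * max_wait_ms of waiting in total. Any other result is returned at once.
 */
ssize_t read_tolerating_eio(read_fn do_read, int fd, void *buf, size_t len,
                            int max_wait_ms, int step_ms)
{
    assert(do_read != NULL && max_wait_ms >= 0 && step_ms > 0);

    int waited_ms = 0;
    for (;;) {
        ssize_t n = do_read(fd, buf, len);
        if (n >= 0 || errno != EIO || waited_ms >= max_wait_ms)
            return n; /* on failure, errno is still the one set by do_read */
        /* poll() with no descriptors acts as a millisecond sleep.
         * sleep(0.01) would not: the argument is converted to
         * unsigned int, so it would become sleep(0). */
        poll(NULL, 0, step_ms);
        waited_ms += step_ms;
    }
}

/* Test double (hypothetical): fails with EIO a set number of times,
 * then delivers a single byte. */
int fake_eio_remaining;

ssize_t fake_read(int fd, void *buf, size_t len)
{
    (void)fd;
    if (fake_eio_remaining > 0) {
        fake_eio_remaining--;
        errno = EIO;
        return -1;
    }
    if (len == 0)
        return 0;
    ((char *)buf)[0] = 'x';
    return 1;
}
```

With a real descriptor the call would look like read_tolerating_eio(read, ptyfd, buf, sizeof buf, 1000, 10); the 1000 ms and 10 ms values are illustrative only and are not taken from the patch. Bounding the total wait keeps a permanent EIO from turning into an endless loop, which matches the concern raised above about permanent versus temporary errors.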