Bug 1335670
| Field | Value |
|---|---|
| Summary | rlogin fails to connect rsh-server after updating util-linux-ng from version 2.17.2-12.18.el6 to 2.17.2-12.24.el6 |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | Tiago M. De Rizzo <tmilsond> |
| Component | rsh |
| Assignee | Michal Ruprich <mruprich> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | qe-baseos-daemons |
| Severity | urgent |
| Docs Contact | |
| Priority | urgent |
| Version | 6.8 |
| CC | apmukher, asakure, baitken, cww, dkaylor, eric.brunet, fkrska, gerardba, james-p, jcastran, kabbott, kzak, lampe, lmiksik, luhliari, martin.moore, moscow1789, mruprich, psklenar, rick.beldin, riehecky, sean, snavale, thozza, tmilsond, todoleza, toneata, zpytela |
| Target Milestone | rc |
| Keywords | Patch, Regression, Reopened, ZStream |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1450821 1571314 |
| Environment | |
| Last Closed | 2018-06-21 08:41:05 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1269194, 1450821, 1571314 |
| Attachments | |

Description
Tiago M. De Rizzo
2016-05-12 19:57:05 UTC
It's an ugly hack, but replacing /bin/login with the previous binary allows rsh logins to function again. This has the potential for major disruption if a site has software like GPFS that is configured to rely on rsh for its functionality. If this update had rolled out to PROD, our compute cluster would have basically gone offline.

Kernel version?

For us, the system we first noticed it on is 2.6.32-431.23.3.el6.x86_64. I've postponed updating any other systems to 6.8. Out of curiosity, what does the kernel have to do with RSH functionality?

You need kernel >= 2.6.32-622.el6, see https://bugzilla.redhat.com/show_bug.cgi?id=1308660

It's all about tty initialization voodoo, kernel & login(1) dance together in a very dark place to make it usable for users... *ugh*

That means we'll have to schedule reboots as part of the bulk update process. Not impossible, but when dealing with a compute cluster, people can get a little cranky about all the nodes needing to be rebooted. Thanks so much for the response. :)

I've just come across this problem with new installs of 6.8 (and upgrades to 6.8) on machines using NIS for authentication. We don't have this issue with 6.7, i.e. a new install of 6.8 with kernel 2.6.32-642.1.1.el6 and util-linux-ng-2.17.2-12.24 has the problem.

I can 'fix' the issue by using /bin/login from 6.7 (util-linux-ng-2.17.2-12.18).

We don't see this issue when using sssd (AD back end) for authentication.

I can also fix the problem by rebuilding util-linux-ng-2.17.2-12.24 without the 'util-linux-ng-2.17-login-vhangup.patch'.

Yes, there is a dependence on the kernel (unfortunately not required by the util-linux-ng rpm -- reported as bug #1349192), but it's really strange that the authentication method may affect this issue.

Sean, did you try it with a reboot? Does it solve the problem? If yes, we can close this as a duplicate of bug #1349192.

(In reply to Karel Zak from comment #13)
> Yes, there is dependence on kernel (unfortunately not required by
> util-linux-ng rpm -- reported as bug #1349192), but it's really strange that
> authentication method may affect this issue.

I don't have access to bug #1349192 - what is the dependency on kernel version? I'm using kernel 2.6.32-642.1.1.el6 (the latest).

For me, a reboot doesn't help ...

The authentication method _may be_ a red herring - rlogin doesn't always fail - so it just might have been a coincidence that it worked on a box using sssd at that time ... I've since 'downgraded' /bin/login on all my installs, so can't confirm this at the moment.

> Sean, did you try it with reboot? Does it solve the problem?
Ahhh, yes. Apologies. I intended to reply back, but it never actually happened. ;)
Updating and rebooting to use the latest running kernel, in my case that's 2.6.32-642.1.1.el6.x86_64, fixed the issue I was seeing with RSH failing.
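
For anyone scripting the update, the kernel prerequisite quoted earlier in this bug (2.6.32-622.el6 or later) can be checked against the running kernel before the new /bin/login is rolled out. What follows is only a sketch in C under stated assumptions: the function names are made up for illustration, and it assumes the RHEL 6 release string layout `major.minor.patch-build...` seen in the kernel versions quoted here. It checks the documented prerequisite only and says nothing about the rlogind race discussed later in this bug.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Kernel release split into its numeric parts, e.g. 2.6.32-642. */
struct krel {
    int major, minor, patch, build;
};

/* Minimum kernel named in this bug report: 2.6.32-622.el6. */
static const struct krel REQUIRED = { 2, 6, 32, 622 };

/* Parse e.g. "2.6.32-642.1.1.el6.x86_64"; only the first build number
 * after the dash is used (assumption about the RHEL 6 naming scheme).
 * Returns 1 on success, 0 if the string does not match the layout. */
static int parse_krel(const char *s, struct krel *k)
{
    return sscanf(s, "%d.%d.%d-%d",
                  &k->major, &k->minor, &k->patch, &k->build) == 4;
}

/* Hypothetical helper: returns 1 if release >= 2.6.32-622,
 * 0 if it is older, -1 if the string cannot be parsed. */
int kernel_meets_login_requirement(const char *release)
{
    struct krel k;

    assert(release != NULL);
    if (!parse_krel(release, &k))
        return -1;
    if (k.major != REQUIRED.major)
        return k.major > REQUIRED.major;
    if (k.minor != REQUIRED.minor)
        return k.minor > REQUIRED.minor;
    if (k.patch != REQUIRED.patch)
        return k.patch > REQUIRED.patch;
    return k.build >= REQUIRED.build;
}

/* Same check against the kernel that is actually running,
 * which is what matters until the machine has been rebooted. */
int running_kernel_meets_login_requirement(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return kernel_meets_login_requirement(u.release);
}
```

The check uses uname() rather than the installed kernel package on purpose: a machine that has the new kernel installed but has not been rebooted is still running the old one.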
What is the kernel bug that 'corrects' this issue? Seems like this warrants util-linux-ng having a dep on kernel... What is the dep reported in #1349192?

*** Bug 1349641 has been marked as a duplicate of this bug. ***

(In reply to rick.beldin from comment #17)
> What is the kernel bug that 'corrects' this issue? Seems like this warrants
> util-linux-ng having a dep on kernel...

Bug #1308660: we have backported the way the tty is initialized. Now close() is required before vhangup(), and to make it fully usable it's also necessary to disable the kernel TTY packet mode reset (kernel upstream commit b81273a132177edd806476b953f6afeb17b786d5, Jan 2013).

I've been doing a bit more testing ...

When using sssd on the remote host for auth, rlogin works most of the time - fails maybe 5% of the time.

When using NIS on the remote host for auth, rlogin fails all the time.

strace'ing the login process on the remote host shows that /bin/login gets killed by a SIGHUP when in a pam function, e.g.

    9094 stat("/etc/pam.d", <unfinished ...>
    9094 <... stat resumed> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
    9094 open("/etc/pam.d/remote", O_RDONLY <unfinished ...>
    9094 <... open resumed> ) = 3
    9094 fstat(3, <unfinished ...>
    9094 <... fstat resumed> {st_mode=S_IFREG|0644, st_size=613, ...}) = 0
    9094 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
    9094 <... mmap resumed> ) = 0x7f2afd78a000
    9094 read(3, <unfinished ...>
    9094 <... read resumed> "#%PAM-1.0\nauth required pam_securetty.so\nauth include password-auth\naccount required pam_nologin.so\n"..., 4096) = 613
    9094 open("/lib64/security/pam_securetty.so", O_RDONLY <unfinished ...>
    9094 <... open resumed> ) = 4
    9094 read(4, <unfinished ...>
    9094 <... read resumed> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\10\0\0\0\0\0\0@\0\0\0\0\0\0\0p!\0\0\0\0\0\0\0\0\0\0@\0008\0\7\0@\0\32\0\31\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\204\21\0\0\0\0\0\0\204\21\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0"..., 832) = 832
    9094 fstat(4, <unfinished ...>
    9094 <... fstat resumed> {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
    9094 mmap(NULL, 2105464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0 <unfinished ...>
    9094 <... mmap resumed> ) = 0x7f2afcf26000
    9094 mprotect(0x7f2afcf28000, 2093056, PROT_NONE <unfinished ...>
    9094 <... mprotect resumed> ) = 0
    9094 --- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
    9094 +++ killed by SIGHUP +++

I've also found I can 'fix' the problem by commenting out just this line in login.c:

    close(STDERR_FILENO);

just before it does the vhangup(), i.e. if stderr is left open, rlogin works ...

(In reply to James Pearson from comment #20)
> I've been doing a bit more testing ...
>
> When using sssd on the remote host for auth, rlogin works most of the time -
> fails may be 5% of the time
>
> When using NIS on the remote host for auth, rlogin fails all the time

I overlooked a possible significant difference when doing these tests ... the box running sssd was a bare-metal workstation - the box running NIS was a virtual host (using VMware) - swapping between NIS and sssd for auth on these two platforms now leads me to believe the problem has nothing to do with the auth method, but the underlying platform ... so I now think:

When the remote host is a bare-metal host, rlogin works most of the time.

When the remote host is a virtual (VMware) host, rlogin fails all the time.

Could 'something' at the hypervisor level be affecting what is going on here?

Note: rlogin to a remote bare-metal host still does fail occasionally ...

I think VM vs bare-metal is significant only in that it indicates that there's some timing issue involved.
Slowing down the login increases the likelihood of hitting the problem. In this case, the additional overhead in the hypervisor is affecting the timing.

I can duplicate the problem between two bare-metal systems in an HPE lab with the following configuration:

Client: ProLiant DL160 G6 running RHEL 6.6.
Server: ProLiant DL980 G7 originally running RHEL 6.6, updated to 6.8.

The accounts are using local authentication only (no NIS, sssd, etc.)

Before updating the server to 6.8, rlogin from the client worked 100% of the time. After the update, rlogin fails about 40%-50% of the time. However, if I strace xinetd on the server (to see what its children rlogind and login are doing), the failure rate increases to 80%-90%. Apparently the additional overhead of tracing is also affecting the timing.

I have a customer who reports the same behavior between two VMs. With normal testing, his failure rate is about 70-80%. With strace of xinetd, his failure rate increases to 100%.

Note that we use the same code in upstream (=all distros) and Fedora. It's strange that nothing has been reported yet.

Martin, please copy & paste the output from:

    # uname -a; rpm -q xinetd util-linux-ng rsh-server

I'll try it tomorrow. Thanks.

Interesting ... adding a usleep(10) just after the vhangup() in login.c dramatically increases the rlogin failure rate on my bare-metal box ...

However, when stderr is not closed before the vhangup(), rlogin works every time - even with a sleep of 10 (or more) seconds ...

Karel, here's the requested output:

    # uname -a; rpm -q xinetd util-linux-ng rsh-server
    Linux dl980g7h08u36.alf1.global.tslabs.hpecorp.net 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
    xinetd-2.3.14-40.el6.x86_64
    util-linux-ng-2.17.2-12.24.el6.x86_64
    rsh-server-0.17-64.el6.x86_64

Created attachment 1172908
Sosreport of system
Hello,

We also can not, for the moment, downgrade the package due to: https://rhn.redhat.com/errata/RHBA-2016-0911.html
which says the fix is included in the util-linux-ng-2.17.2-12.24 package.

Thanks,
John Castranio
Red Hat

From my point of view it seems like a kernel issue: if I copy rlogind and login from RHEL6.8 to Fedora then all works as expected.

The problem is that select() for the master side (=rlogind) of the terminal reports the terminal file descriptor as readable, but read() returns EIO at the time when we close()+vhangup() the slave side. It seems more recent kernels do not have this behaviour (I have doubts we can backport all the new tty stuff).

**BUT** it seems that telnetd with the same kernel and login(1) is able to detect this problem and ignore these initial EIO errors (see the 10-year-old telnetd bugfix for bug #145636). It's not elegant, but it may be good enough for rlogind too.

Created attachment 1173540
bugfix patch
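
The behaviour described in the preceding comment (select() reports the master side as readable, read() then fails with EIO while login does close() + vhangup() on the slave side) suggests the kind of tolerance telnetd is said to have. The sketch below is not the attached patch. It is a hypothetical illustration in C of treating EIO as transient only until the first byte has arrived from the slave, with a retry cap so that a genuinely dead terminal is still detected. The names `pty_state`, `classify_master_read`, `read_master` and the limit `MAX_EARLY_EIO` are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical cap on how many early EIO results are tolerated. */
#define MAX_EARLY_EIO 5

enum pty_action {
    PTY_DATA,   /* bytes were read, forward them to the network side */
    PTY_RETRY,  /* transient condition, go back to select() */
    PTY_CLOSE   /* EOF or a real error, tear the session down */
};

struct pty_state {
    int seen_data;  /* set once the slave side has produced any output */
    int early_eio;  /* EIO results seen before the first byte arrived */
};

/* Decide what to do with the result of a read() on the pty master.
 * n is the return value of read(), err the errno value if n < 0. */
enum pty_action classify_master_read(struct pty_state *st, ssize_t n, int err)
{
    assert(st != NULL);

    if (n > 0) {
        st->seen_data = 1;
        return PTY_DATA;
    }
    if (n < 0 && (err == EINTR || err == EAGAIN))
        return PTY_RETRY;
    /* EIO before any data: assume login is still re-initialising the
     * slave side, but only up to MAX_EARLY_EIO times. */
    if (n < 0 && err == EIO && !st->seen_data &&
        st->early_eio < MAX_EARLY_EIO) {
        st->early_eio++;
        return PTY_RETRY;
    }
    return PTY_CLOSE;
}

/* Thin wrapper: perform the read() and classify its result. */
enum pty_action read_master(int fd, void *buf, size_t len,
                            struct pty_state *st, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);
    int err = (n < 0) ? errno : 0;

    *nread = n;
    return classify_master_read(st, n, err);
}
```

A caller would loop on select(), call read_master() when the master is readable, forward the data on PTY_DATA, return to select() on PTY_RETRY (ideally after a short pause, since select() may keep reporting the descriptor as readable), and close the connection on PTY_CLOSE. Keeping the decision in a function that does no I/O makes the policy easy to exercise without a real pty.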
That patch appears to fix the problem for me - thanks

Although the patch works, there are still problems when using util-linux-ng-2.17.2-12.24.el6 on a pre-6.8 kernel ...

i.e. after updating to 6.8, but before rebooting, rlogin will appear to 'hang' - nothing is echoed or displayed in the terminal - but it will still accept input and run commands ...

Once rebooted, all is OK - which is a pain in our case, as a reboot could be a significant time after an update ...

Created attachment 1174030
Hacky patch to allow login to work with older kernels ...
For what it's worth, here is a hacky patch to /bin/login that allows rlogin to work with pre-6.8 kernels ...
Well, I don't think we want such a login(1) patch ;-) What we need is to add a proper Requires: to the spec file to make it obvious that a reboot is required. (already requested by bug #1349192)

I didn't really expect the login patch to be 'accepted' - but it's going to help us when we move from 6.7 to 6.8 in the next week or so :-)

The attached 'bugfix patch' does not solve the problem. Putting the target machine under heavy load and then connecting to it with rlogin often enough still gets you the occasional immediate "connection closed".

Hello, just want to let you know that introducing a dependency in the spec file has some disadvantages for those who are running virtual containers (OpenVZ, LXC etc). The kernel (and kernel-firmware) package is unneeded, unwanted and masked to save some space when there are lots of such installs.

p.s. While rebuilding the package to remove the dependency, found this bug report. Didn't know somebody is still using rlogin in the XXI century...

You are right: This dependency doesn't fix the bug, nor does it do any good otherwise. It just documents that there is a dependency. (RH messed up the backports, see above.) Those who automatically say rlogin is from the middle ages don't need it, but still have a new problem. Those who use rlogin in appropriate circumstances for good still suffer from this bug, because RH won't fix it.

Regarding NEEDINFO: Use two machines A and B with 6.8, all updates applied, latest kernel booted. Configure passwordless rlogin (~/.rhosts) from A to B, put B under heavy load and repeat

    A> rlogin B
    B> exit

It takes at most a few dozen tries before you see the dreaded "connection closed".

It's obviously a race condition. Hence the "put B under heavy load" in the reproducer to increase the likelihood of hitting the bug.

Hello, I will look into this soon but right now this is not a priority unfortunately. As soon as I have some more info, I will let you know.
Regards, Michal

Is this possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=658217 ?

Hi, guided by https://www.centos.org/forums/viewtopic.php?t=58248&start=10 I found a workaround that works under load: added 'nice = 5' in /etc/xinetd.d/rlogin and restarted xinetd. It'd be nice to see a proper bugfix.
thanks, Gerard

I have verified that the bug still exists in RHEL 6.9, util-linux-ng-2.17.2-12.28.el6.x86_64. The same behavior as before: rlogin will work a few times and then start getting "connection closed".

Michal, My customer would be more than happy to test.
Karl

I'll also be happy to test any potential fixes. Additionally, I've tried the workaround suggested in comment 50 and it didn't appear to have any effect.
Martin

Created attachment 1270759
Testing package #1
Providing testing package with a possible fix.
Hi Karl and Martin, any update on the testing of the package?
Thanks, Michal

Hi Michal, It looks promising. I installed the test package on the 6.9 server I've been using, and now I can't get rlogin connections to fail. Without the test package they were failing about 50% of the time.
Martin

Created attachment 1274104
Patch used in testing package
The test package appears to work fine for me as well - also now not able to get rlogin to fail