Bug 1335670

Summary:

rlogin fails to connect to rsh-server after updating util-linux-ng from version 2.17.2-12.18.el6 to 2.17.2-12.24.el6

Product:

Red Hat Enterprise Linux 6

Reporter:

Tiago M. De Rizzo <tmilsond>

Component:

rsh

Assignee:

Michal Ruprich <mruprich>

Status:

CLOSED CURRENTRELEASE

QA Contact:

qe-baseos-daemons

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

6.8

CC:

apmukher, asakure, baitken, cww, dkaylor, eric.brunet, fkrska, gerardba, james-p, jcastran, kabbott, kzak, lampe, lmiksik, luhliari, martin.moore, moscow1789, mruprich, psklenar, rick.beldin, riehecky, sean, snavale, thozza, tmilsond, todoleza, toneata, zpytela

Target Milestone:

Keywords:

Patch, Regression, Reopened, ZStream

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1450821 1571314

Environment:

Last Closed:

2018-06-21 08:41:05 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1269194, 1450821, 1571314

Attachments:

Description	Flags
bugfix patch	none
Hacky patch to allow login to work with older kernels ...	none
Testing package #1	none
Patch used in testing package	none

Description Tiago M. De Rizzo 2016-05-12 19:57:05 UTC

Description of problem:

rlogin fails to connect to rsh-server after updating util-linux-ng and its dependencies from version 2.17.2-12.18.el6 to 2.17.2-12.24.el6.

Version-Release number of selected component (if applicable):

Affected package:
----------------
util-linux-ng-2.17.2-12.24.el6.x86_64
libuuid-2.17.2-12.24.el6.x86_64
libblkid-2.17.2-12.24.el6.x86_64

How reproducible:

1. Install RHEL 6.7

2. Install and configure rsh and rsh-server as described in the articles:

- https://access.redhat.com/solutions/7321
- https://access.redhat.com/solutions/1601

3. Test the connection between client and server.

4. Update the package:

yum update util-linux-ng-2.17.2-12.24.el6.x86_64

5. Test the connection again; it is now closed immediately.

Actual results:

Before update (working connection):
-----------------------------------

[root@host1 ~]# rlogin host2
Last login: Thu May 12 16:00:40 from host1
[root@host2 ~]#

After update (closed connection):
---------------------------------
[root@host1 ~]# rlogin host2
rlogin: connection closed.

Expected results:

The rlogin connection stays open, as it did before the update. Instead, it is closed right after it is established.

The xinetd log on the server also shows a connection duration of 0 seconds:
---
May 12 16:33:09 host2 xinetd[1872]: START: login pid=2203 from=::ffff:192.168.124.58
May 12 16:33:09 host2 xinetd[1872]: EXIT: login status=0 pid=2203 duration=0(sec)
---
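
For anyone triaging this from logs, here is a minimal sketch that picks out login sessions that ended after 0 seconds. It assumes the xinetd log format shown above; the function name and the regular expression are illustrative only and are not part of xinetd. A very short but successful session would also be reported, so treat a hit as a hint rather than proof.

```python
import re

# Matches xinetd EXIT records such as:
#   May 12 16:33:09 host2 xinetd[1872]: EXIT: login status=0 pid=2203 duration=0(sec)
EXIT_RE = re.compile(
    r"xinetd\[\d+\]: EXIT: (?P<service>\S+) status=(?P<status>\d+) "
    r"pid=(?P<pid>\d+) duration=(?P<duration>\d+)\(sec\)"
)


def zero_duration_exits(lines, service="login"):
    """Return the pids of `service` sessions that xinetd logged as lasting 0 seconds."""
    pids = []
    for line in lines:
        m = EXIT_RE.search(line)
        if m and m.group("service") == service and int(m.group("duration")) == 0:
            pids.append(int(m.group("pid")))
    return pids


sample = [
    "May 12 16:33:09 host2 xinetd[1872]: START: login pid=2203 from=::ffff:192.168.124.58",
    "May 12 16:33:09 host2 xinetd[1872]: EXIT: login status=0 pid=2203 duration=0(sec)",
]
print(zero_duration_exits(sample))
```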

Additional info:

First noticed on a full system update from RHEL 6.7 to RHEL 6.8.

Comment 6 Sean Johnson 2016-06-03 14:21:42 UTC

It's an ugly hack, but replacing /bin/login with the previous binary allows rsh logins to function again. 

This has the potential for major disruption at any site running software like GPFS that is configured to rely on rsh for its functionality. Had this update rolled out to PROD, our compute cluster would basically have gone offline.

Comment 8 Karel Zak 2016-06-06 13:10:58 UTC

Kernel version?

Comment 9 Sean Johnson 2016-06-07 15:17:42 UTC

For us, the system we first noticed it on runs kernel 2.6.32-431.23.3.el6.x86_64. I've postponed updating any other systems to 6.8.

Out of curiosity, what does the kernel have to do with RSH functionality?

Comment 10 Karel Zak 2016-06-07 18:41:03 UTC

You need kernel >= 2.6.32-622.el6, see https://bugzilla.redhat.com/show_bug.cgi?id=1308660 

It's all about tty initialization voodoo: the kernel and login(1) dance together in a very dark place to make the terminal usable for users...
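
A rough sketch of checking a kernel release string against the minimum quoted above (2.6.32-622.el6). This is a simplified numeric comparison that assumes RHEL-style release strings; it is not RPM's own version comparison, and the helper names are made up for illustration. It only tests the stated requirement and says nothing about whether rlogin will actually work.

```python
import re

MINIMUM_KERNEL = "2.6.32-622.el6"  # taken from the comment above


def kernel_release_key(release):
    """Turn '2.6.32-431.23.3.el6.x86_64' into (2, 6, 32, 431, 23, 3)."""
    version, _, rest = release.partition("-")
    # Keep only the leading numeric part of the release, before the dist tag.
    numeric_release = re.match(r"[\d.]*", rest).group(0).strip(".")
    parts = version.split(".") + [p for p in numeric_release.split(".") if p]
    return tuple(int(p) for p in parts)


def meets_minimum(release, minimum=MINIMUM_KERNEL):
    """True if `release` is at least `minimum` under this simplified ordering."""
    return kernel_release_key(release) >= kernel_release_key(minimum)


for rel in ("2.6.32-431.23.3.el6.x86_64", "2.6.32-642.1.1.el6.x86_64"):
    print(rel, meets_minimum(rel))
```

On a live system the running kernel's release string is available from os.uname().release.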

Comment 11 Sean Johnson 2016-06-07 22:11:42 UTC

*ugh*

That means we'll have to schedule reboots as part of the bulk update process. Not impossible, but when dealing with a compute cluster, people can get a little cranky about all the nodes needing to be rebooted. 

Thanks so much for the response. :)

Comment 12 James Pearson 2016-06-22 13:51:34 UTC

I've just come across this problem with new installs of 6.8 (and upgrades to 6.8) on machines using NIS for authentication

We don't have this issue with 6.7

i.e. a new install of 6.8 with kernel 2.6.32-642.1.1.el6 and util-linux-ng-2.17.2-12.24 has the problem

I can 'fix' the issue by using /bin/login from 6.7 (util-linux-ng-2.17.2-12.18)

We don't see this issue when using sssd (AD back end) for authentication

I can also fix the problem by rebuilding util-linux-ng-2.17.2-12.24 without the 'util-linux-ng-2.17-login-vhangup.patch'

Comment 13 Karel Zak 2016-06-23 11:56:34 UTC

Yes, there is a dependency on the kernel (unfortunately not required by the util-linux-ng rpm -- reported as bug #1349192), but it's really strange that the authentication method may affect this issue.

Comment 14 Karel Zak 2016-06-23 11:58:00 UTC

Sean, did you try it with reboot? Does it solve the problem? If yes, we can close this as a duplicate of bug #1349192.

Comment 15 James Pearson 2016-06-23 13:22:07 UTC

(In reply to Karel Zak from comment #13)
> Yes, there is a dependency on the kernel (unfortunately not required by
> the util-linux-ng rpm -- reported as bug #1349192), but it's really strange
> that the authentication method may affect this issue.

I don't have access to bug #1349192 - what is the dependency on kernel version?

I'm using kernel 2.6.32-642.1.1.el6 (the latest)

For me, a reboot doesn't help ...

The authentication method _may be_ a red herring - rlogin doesn't always fail - so it just might have been a coincidence that it worked on a box using sssd at that time ... I've since 'downgraded' /bin/login on all my installs, so can't confirm this at the moment

Comment 16 Sean Johnson 2016-06-23 14:29:25 UTC

> Sean, did you try it with reboot? Does it solve the problem?

Ahhh, yes. Apologies. I intended to reply back, but it never actually happened. ;)

Updating and rebooting to use the latest running kernel, in my case that's 2.6.32-642.1.1.el6.x86_64, fixed the issue I was seeing with RSH failing.

Comment 17 rick.beldin@hpe.com 2016-06-23 20:21:45 UTC

What is the kernel bug that 'corrects' this issue?  Seems like this warrants util-linux-ng having a dep on kernel... 

What is the dep reported in #1349192?

Comment 18 Karel Zak 2016-06-23 20:34:29 UTC

*** Bug 1349641 has been marked as a duplicate of this bug. ***

Comment 19 Karel Zak 2016-06-23 20:44:04 UTC

(In reply to rick.beldin from comment #17)
> What is the kernel bug that 'corrects' this issue?  Seems like this warrants
> util-linux-ng having a dep on kernel... 

Bug #1308660. We have backported the way the tty is initialized: close() is now required before vhangup(), and to make this fully usable it is also necessary to disable the kernel TTY packet mode reset (kernel upstream commit b81273a132177edd806476b953f6afeb17b786d5, Jan 2013).

Comment 20 James Pearson 2016-06-24 15:07:58 UTC

I've been doing a bit more testing ...

When using sssd on the remote host for auth, rlogin works most of the time - it fails maybe 5% of the time

When using NIS on the remote host for auth, rlogin fails all the time

strace'ing the login process on the remote host shows that /bin/login gets killed by a SIGHUP while in a PAM function, e.g.

9094  stat("/etc/pam.d",  <unfinished ...>
9094  <... stat resumed> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
9094  open("/etc/pam.d/remote", O_RDONLY <unfinished ...>
9094  <... open resumed> )              = 3
9094  fstat(3,  <unfinished ...>
9094  <... fstat resumed> {st_mode=S_IFREG|0644, st_size=613, ...}) = 0
9094  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
9094  <... mmap resumed> )              = 0x7f2afd78a000
9094  read(3,  <unfinished ...>
9094  <... read resumed> "#%PAM-1.0\nauth       required     pam_securetty.so\nauth       include      password-auth\naccount    required     pam_nologin.so\n"..., 4096) = 613
9094  open("/lib64/security/pam_securetty.so", O_RDONLY <unfinished ...>
9094  <... open resumed> )              = 4
9094  read(4,  <unfinished ...>
9094  <... read resumed> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\10\0\0\0\0\0\0@\0\0\0\0\0\0\0p!\0\0\0\0\0\0\0\0\0\0@\0008\0\7\0@\0\32\0\31\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\204\21\0\0\0\0\0\0\204\21\0\0\0\0\0\0\0\0 \0\0\0\0\0\1\0\0\0\6\0\0\0"..., 832) = 832
9094  fstat(4,  <unfinished ...>
9094  <... fstat resumed> {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
9094  mmap(NULL, 2105464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0 <unfinished ...>
9094  <... mmap resumed> )              = 0x7f2afcf26000
9094  mprotect(0x7f2afcf28000, 2093056, PROT_NONE <unfinished ...>
9094  <... mprotect resumed> )          = 0
9094  --- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
9094  +++ killed by SIGHUP +++

I've also found I can 'fix' the problem by commenting out just the line in login.c:

       close(STDERR_FILENO);

just before it does the vhangup()

i.e. if stderr is left open, rlogin works ...

Comment 21 James Pearson 2016-06-27 10:54:32 UTC

(In reply to James Pearson from comment #20)
> I've been doing a bit more testing ...
> 
> When using sssd on the remote host for auth, rlogin works most of the time -
> it fails maybe 5% of the time
> 
> When using NIS on the remote host for auth, rlogin fails all the time

I overlooked a possibly significant difference when doing these tests: the box running sssd was a bare-metal workstation, while the box running NIS was a virtual host (using VMware). After swapping between NIS and sssd for auth on these two platforms, I now believe the problem has nothing to do with the auth method and everything to do with the underlying platform. So I now think:

When the remote host is bare-metal host, rlogin works most of the time 

When the remote host is a virtual (VMware) host, rlogin fails all the time

Could 'something' at the hypervisor level be affecting what is going on here?

Note: rlogin to a remote bare-metal host still does fail occasionally ...

Comment 22 Martin Moore 2016-06-27 12:33:03 UTC

I think VM vs bare-metal is significant only in that it indicates that there's some timing issue involved.  Slowing down the login increases the likelihood of hitting the problem.  In this case, the additional overhead in the hypervisor is affecting the timing.

I can duplicate the problem between two bare-metal systems in an HPE lab with the following configuration:

Client: ProLiant DL160 G6 running RHEL 6.6.
Server: ProLiant DL980 G7 originally running RHEL 6.6, updated to 6.8.
The accounts are using local authentication only (no NIS, sssd, etc.)

Before updating the server to 6.8, rlogin from the client worked 100% of the time.  After the update, rlogin fails about 40%-50% of the time.

However, if I strace xinetd on the server (to see what its children rlogind and login are doing), the failure rate increases to 80%-90%.  Apparently the additional overhead of tracing is also affecting the timing.

I have a customer who reports the same behavior between two VMs.  With normal testing, his failure rate is about 70-80%.  With strace of xinetd, his failure rate increases to 100%.

Comment 23 Karel Zak 2016-06-27 13:01:03 UTC

Note that we use the same code in upstream (=all distros) and Fedora. It's strange that nothing has been reported yet.

Martin, please copy & paste the output of:

 # uname -a; rpm -q xinetd util-linux-ng rsh-server

I'll try it tomorrow.  Thanks.

Comment 24 James Pearson 2016-06-27 13:08:22 UTC

Interesting ... adding a usleep(10) just after the vhangup() in login.c dramatically increases the rlogin failure rate on my bare-metal box ...

However, if stderr is not closed before the vhangup(), rlogin works every time - even with a sleep of 10 (or more) seconds ...

Comment 25 Martin Moore 2016-06-27 13:10:10 UTC

Karel, here's the requested output:

# uname -a; rpm -q xinetd util-linux-ng rsh-server
Linux dl980g7h08u36.alf1.global.tslabs.hpecorp.net 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
xinetd-2.3.14-40.el6.x86_64
util-linux-ng-2.17.2-12.24.el6.x86_64
rsh-server-0.17-64.el6.x86_64

Comment 27 jcastran 2016-06-27 14:46:29 UTC

Created attachment 1172908 [details]
Sosreport of system

Comment 28 jcastran 2016-06-27 16:04:38 UTC

Hello,

We also can not, for the moment, downgrade the package due to:

   https://rhn.redhat.com/errata/RHBA-2016-0911.html

Which says the fix is included in the util-linux-ng-2.17.2-12.24 package.

Thanks,
John Castranio
Red Hat

Comment 30 Karel Zak 2016-06-28 19:08:00 UTC

From my point of view it looks like a kernel issue: if I copy rlogind and login from RHEL 6.8 to Fedora, everything works as expected.

The problem is that select() on the master side of the terminal (=rlogind) reports the terminal file descriptor as readable, but read() returns EIO at the moment when we close()+vhangup() the slave side.

It seems more recent kernels do not have this behaviour (I doubt we can backport all the new tty stuff).

**BUT**

It seems that telnetd, with the same kernel and login(1), is able to detect this problem and ignore these initial EIO errors (see the 10-year-old telnetd bugfix for bug #145636). It's not elegant, but it may be good enough for rlogind too.
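
To illustrate the general idea of tolerating EIO early in a session, here is a minimal sketch. It is not the telnetd fix and not the attached patch: read_once is a hypothetical zero-argument callable standing in for a read on the pty master, the retry limit is arbitrary, and a real daemon would return to select() between attempts instead of retrying immediately.

```python
import errno


def read_skipping_initial_eio(read_once, max_initial_eio=10):
    """Call read_once() until it returns data, tolerating a few early EIO errors.

    read_once is a hypothetical wrapper around a read on the pty master; it is
    expected to raise OSError on failure. Any error other than EIO, or more
    than max_initial_eio EIO errors, is re-raised to the caller.
    """
    eio_seen = 0
    while True:
        try:
            return read_once()
        except OSError as exc:
            if exc.errno != errno.EIO or eio_seen >= max_initial_eio:
                raise
            eio_seen += 1


def make_fake_reader(failures, payload=b"login: "):
    """Build a stand-in reader that fails with EIO `failures` times, then succeeds."""
    state = {"left": failures}

    def read_once():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError(errno.EIO, "Input/output error")
        return payload

    return read_once


print(read_skipping_initial_eio(make_fake_reader(3)))
```

The limit matters: without it, a terminal that has really gone away would keep the loop spinning forever.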

Comment 31 Karel Zak 2016-06-28 19:10:08 UTC

Created attachment 1173540 [details]
bugfix patch

Comment 32 James Pearson 2016-06-29 09:27:31 UTC

That patch appears to fix the problem for me - thanks

Comment 33 James Pearson 2016-06-29 11:17:59 UTC

Although the patch works, there are still problems when using util-linux-ng-2.17.2-12.24.el6 on a pre-6.8 kernel ...

i.e. after updating to 6.8, but before rebooting, rlogin will appear to 'hang' - nothing is echoed or displayed in the terminal - but it will still accept input and run commands ...

Once rebooted, all is OK - which is a pain in our case, as a reboot could be a significant time after an update ...

Comment 34 James Pearson 2016-06-29 16:04:07 UTC

Created attachment 1174030 [details]
Hacky patch to allow login to work with older kernels ...

For what it's worth, here is a hacky patch to /bin/login that allows rlogin to work with pre-6.8 kernels ...

Comment 35 Karel Zak 2016-06-30 08:59:17 UTC

Well, I don't think we want such a login(1) patch ;-) What we need is to add a proper Requires: to the spec file to make it obvious that a reboot is required (already requested by bug #1349192).

Comment 36 James Pearson 2016-06-30 21:27:28 UTC

I didn't really expect the login patch to be 'accepted' - but it's going to help us when we move from 6.7 to 6.8 in the next week or so :-)

Comment 38 Michael Lampe 2016-07-21 15:48:40 UTC

The attached 'bugfix patch' does not solve the problem. Putting the target machine under heavy load and then connecting often enough still gets you the occasional immediate "connection closed".

Comment 40 Konstantin 2016-12-07 18:13:52 UTC

Hello, just want to let you know that introducing a dependency in the spec file has a disadvantage for those who are running virtual containers (OpenVZ, LXC etc). The kernel (and kernel-firmware) package is unneeded, unwanted and masked to save some space when there are lots of such installs.

p.s.

While rebuilding the package to remove the dependency, I found this bug report.
Didn't know somebody is still using rlogin in the XXI century...

Comment 41 Michael Lampe 2016-12-07 20:57:52 UTC

You are right: This dependency doesn't fix the bug, nor does it do any good otherwise.

It just documents that there is a dependency. (RH messed up the backports, see above.)

Those who automatically say rlogin is from the middle ages don't need it, but still have a new problem.

Those who use rlogin in appropriate circumstances, for good reasons, still suffer from this bug, because RH won't fix it.

Comment 43 Michael Lampe 2016-12-09 16:57:26 UTC

Regarding NEEDINFO: Use two machines A and B with 6.8, all updates applied, latest kernel booted. Configure passwordless rlogin (~/.rhosts) from A to B, put B under heavy load and repeat

A> rlogin B
B> exit

It takes at most a few dozen tries before you see the dreaded "connection closed". It's obviously a race condition. Hence the "put B under heavy load" in the reproducer to increase the likelihood of hitting the bug.
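
The repeat-until-it-fails loop can be sketched as below. This is only a counting harness: attempt is a hypothetical callable that performs one rlogin-and-exit cycle and returns True when the login succeeded; how to drive rlogin and detect success is deliberately left out, and the demo uses canned outcomes instead of real logins.

```python
def tries_until_failure(attempt, max_tries=100):
    """Return the 1-based number of the first failed attempt, or None if none failed."""
    for n in range(1, max_tries + 1):
        if not attempt():
            return n
    return None


# Stand-in for real logins: succeed 11 times, then fail once.
outcomes = iter([True] * 11 + [False])
print(tries_until_failure(lambda: next(outcomes)))
```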

Comment 47 Michal Ruprich 2017-02-14 12:00:50 UTC

Hello,

I will look into this soon but right now this is not a priority unfortunately. As soon as I have some more info, I will let you know.

Regards,

Michal

Comment 49 Pat Riehecky 2017-03-13 15:28:20 UTC

Is this possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=658217  ?

Comment 50 Gerard Bernabeu Altayo 2017-03-14 23:49:09 UTC

Hi,

Guided by https://www.centos.org/forums/viewtopic.php?t=58248&start=10 I found a workaround that works under load: I added 'nice = 5' to /etc/xinetd.d/rlogin and restarted xinetd.

It'd be nice to see a proper bugfix.

thanks,
 Gerard

Comment 53 Martin Moore 2017-03-30 18:29:38 UTC

I have verified that the bug still exists in RHEL 6.9, util-linux-ng-2.17.2-12.28.el6.x86_64.  The same behavior as before: rlogin will work for a few times and then start getting "connection closed".

Comment 58 Karl Abbott 2017-04-10 13:01:18 UTC

Michal,

My customer would be more than happy to test.

Karl

Comment 59 Martin Moore 2017-04-10 16:57:52 UTC

I'll also be happy to test any potential fixes.

Additionally, I've tried the workaround suggested in comment 50 and it didn't appear to have any effect.

Martin

Comment 60 Michal Ruprich 2017-04-11 11:56:48 UTC

Created attachment 1270759 [details]
Testing package #1

Providing testing package with a possible fix.

Comment 61 Michal Ruprich 2017-04-19 07:31:34 UTC

Hi Karl and Martin,

any update on the testing of the package?

Thanks,
Michal

Comment 64 Martin Moore 2017-04-20 14:58:55 UTC

Hi Michal,

It looks promising.  I installed the test package on the 6.9 server I've been using, and now I can't get rlogin connections to fail.  Without the test package they were failing about 50% of the time.

Martin

Comment 65 Michal Ruprich 2017-04-26 07:28:42 UTC

Created attachment 1274104 [details]
Patch used in testing package

Comment 66 James Pearson 2017-04-26 21:35:44 UTC

The test package appears to work fine for me as well - I am now also unable to get rlogin to fail