Bug 1809618 - [RHVH 4.4] Install fail - Error reading SSH protocol banner
Summary: [RHVH 4.4] Install fail - Error reading SSH protocol banner
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: redhat-virtualization-host
Version: 4.4.0
Hardware: x86_64
OS: Linux
high
urgent
Target Milestone: ovirt-4.4.1
: ---
Assignee: Nobody
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-03 14:31 UTC by Roni
Modified: 2020-03-30 21:11 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-19 17:56:51 UTC
oVirt Team: External
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Roni 2020-03-03 14:31:24 UTC
Description of problem:
Reprovision v4.4 RHVH fails with error
"Error reading SSH protocol banner[Errno 104] Connection reset by peer"

Version-Release number of selected component (if applicable):
rhvh-4.4.0.14-0.2020022

How reproducible:
Install the rhvh v4.4 image on a host

Steps to Reproduce:
1. From Jenkins go to 'infra_reprovision_job'
2. Enter the host FQDN
3. choose rhvh-4.4 and run the job

Actual results:
Reprovision fails with error:
"Error reading SSH protocol banner[Errno 104] Connection reset by peer"

Expected results:
RHVH should be successfully installed on host

Additional info:
1. It takes 25-sec delay to login to host
2. The problem does not reproduce with rhel8.2

Comment 16 Sandro Bonazzola 2020-03-04 08:13:58 UTC
Targeting to 4.3.9 since we have similar report upstream on 4.3.8.

Comment 23 Sandro Bonazzola 2020-03-12 12:35:47 UTC
Moving to 4.4.0 since QE couldn't reproduce on 4.3.9.

Comment 24 Sandro Bonazzola 2020-03-19 07:22:38 UTC
Closing as not a bug for RHV-H. It's a known issue when using python-paramiko and spawning too many connections, see more details here: https://github.com/paramiko/paramiko/issues/1181
Jenkins job connecting to the newly deployed RHV-H instance need to be fixed instead.

Comment 28 Michal Skrivanek 2020-03-19 09:09:17 UTC
Investigated in a reproduced baremetal environment provided by Roni:


you could have noticed this message once it logs you in:
""System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8).””

and a stuck rdma module

   1206 ?        D      0:00  |   \_ /sbin/modprobe -q -- rdma-client-rdma_cm

and a bunch of 
[ 1353.113715] INFO: task modprobe:1206 blocked for more than 120 seconds.
[ 1353.129705]       Not tainted 4.18.0-189.el8.x86_64 #1
[ 1353.140503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1353.169899] modprobe        D    0  1206    491 0x80004080
[ 1353.169902] Call Trace:
[ 1353.169913]  ? __schedule+0x24f/0x650
…
in system logs.

And also:
[root@lynx10 ~]# systemctl status
● lynx10.domain.com
    State: starting

your system is not fully booted up and refusing logins. I took the liberty and killed the stuck rdma module and it continued to boot normally, once it finished I can log in without any delay

Is that the issue you’re observing in that jenkins env?

Comment 29 Michal Skrivanek 2020-03-19 09:18:14 UTC
for the record - comment #38 is with rhvh-4.4.0.14-0.20200318.0+1

I don't see anything in oVirt/RHV to be done about it.
I suggest to follow up with RHEL QE about rdma module/kernel. In the meantime....not sure if the workaround to wait is a good one since the system is not fully booted up and services are not started

Maybe try to reproduce with bare RHEL 

keeping it open for clarification

Comment 30 Roni 2020-03-19 09:32:35 UTC
The problem does not reproduce with REHL, it reproduces with RHVH only
We try it on some different hosts and they all got the same banner timeout problem

Comment 31 Michal Skrivanek 2020-03-19 09:37:49 UTC
for the record - comment #38 is with rhvh-4.4.0.14-0.20200318.0+1

I don't see anything in oVirt/RHV to be done about it.
I suggest to follow up with RHEL QE about rdma module/kernel. In the meantime....not sure if the workaround to wait is a good one since the system is not fully booted up and services are not started

Maybe try to reproduce with bare RHEL on matching nightly version

keeping it open for clarification

Comment 32 Yuval Turgeman 2020-03-19 09:53:21 UTC
This really has nothing to do with ssh - it looks like dbus is denying access when it is asked to create a new session, looking at journalctl, we get this:

when trying to login from ssh:
pam_systemd(sshd:session): Failed to create session: Connection timed out

When just trying to su -:
pam_systemd(su:session): Failed to create session: Connection timed out


Drilling a little further down, we get this:


[root@lynx09 dbus-1]# dbus-monitor --system |grep -i err &
[1] 3197
[root@lynx09 dbus-1]# time su -c "exit"
error time=1584611418.018323 sender=org.freedesktop.DBus -> destination=:1.6 error_name=org.freedesktop.DBus.Error.AccessDenied reply_serial=32
   string "Rejected send message, 2 matched rules; type="signal", sender=":1.6" (uid=998 pid=976 comm="/usr/lib/polkit-1/polkitd --no-debug " label="system_u:system_r:policykit_t:s0") interface="org.freedesktop.PolicyKit1.Authority" member="Changed" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=0 pid=1103 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")"
error time=1584611418.018600 sender=org.freedesktop.DBus -> destination=:1.6 error_name=org.freedesktop.DBus.Error.AccessDenied reply_serial=33
   string "Rejected send message, 2 matched rules; type="signal", sender=":1.6" (uid=998 pid=976 comm="/usr/lib/polkit-1/polkitd --no-debug " label="system_u:system_r:policykit_t:s0") interface="org.freedesktop.PolicyKit1.Authority" member="Changed" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=0 pid=1103 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")"

real	0m25.042s
user	0m0.004s
sys	0m0.012s
[root@lynx09 dbus-1]#

Comment 33 Michal Skrivanek 2020-03-19 17:56:51 UTC
which makes some sense when you're not booted yet, as regular login is not supposed to be enabled until it's finished.


Manually it also didn't reproduce elsewhere, most notably it works ok for Node QE, so likely it's environmental too.
Also, as noted by Yuval in email thread, seems to be resolved in latest kernel.

closing again, I don't believe there's anything to track

Comment 34 Roni 2020-03-29 16:22:27 UTC
Summary: SSH was not started correctly due to a stuck RDMA driver 
(SSH error: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8).")
The stuck occurs due to deadlock in the Infiniband kernel module that was triggered by the RDMA code in Gluster
The workaround: enable rdma.service during kickstart

Comment 35 Michal Schmidt 2020-03-30 21:11:02 UTC
I filed bug 1818986 for the RDMA module kernel deadlock. I am nominating it for RHEL 8.2 Zstream. Feel free to add a comment there to justify the Zstream request. Thanks!


Note You need to log in before you can comment on or make changes to this bug.