Description of problem: Reprovision v4.4 RHVH fails with error "Error reading SSH protocol banner[Errno 104] Connection reset by peer" Version-Release number of selected component (if applicable): rhvh-4.4.0.14-0.2020022 How reproducible: Install the rhvh v4.4 image on a host Steps to Reproduce: 1. From Jenkins go to 'infra_reprovision_job' 2. Enter the host FQDN 3. choose rhvh-4.4 and run the job Actual results: Reprovision fails with error: "Error reading SSH protocol banner[Errno 104] Connection reset by peer" Expected results: RHVH should be successfully installed on host Additional info: 1. It takes 25-sec delay to login to host 2. The problem does not reproduce with rhel8.2
Targeting to 4.3.9 since we have similar report upstream on 4.3.8.
Moving to 4.4.0 since QE couldn't reproduce on 4.3.9.
Closing as not a bug for RHV-H. It's a known issue when using python-paramiko and spawning too many connections, see more details here: https://github.com/paramiko/paramiko/issues/1181 Jenkins job connecting to the newly deployed RHV-H instance need to be fixed instead.
Investigated in a reproduced baremetal environment provided by Roni: you could have noticed this message once it logs you in: ""System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8).”” and a stuck rdma module 1206 ? D 0:00 | \_ /sbin/modprobe -q -- rdma-client-rdma_cm and a bunch of [ 1353.113715] INFO: task modprobe:1206 blocked for more than 120 seconds. [ 1353.129705] Not tainted 4.18.0-189.el8.x86_64 #1 [ 1353.140503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1353.169899] modprobe D 0 1206 491 0x80004080 [ 1353.169902] Call Trace: [ 1353.169913] ? __schedule+0x24f/0x650 … in system logs. And also: [root@lynx10 ~]# systemctl status ● lynx10.domain.com State: starting your system is not fully booted up and refusing logins. I took the liberty and killed the stuck rdma module and it continued to boot normally, once it finished I can log in without any delay Is that the issue you’re observing in that jenkins env?
for the record - comment #38 is with rhvh-4.4.0.14-0.20200318.0+1 I don't see anything in oVirt/RHV to be done about it. I suggest to follow up with RHEL QE about rdma module/kernel. In the meantime....not sure if the workaround to wait is a good one since the system is not fully booted up and services are not started Maybe try to reproduce with bare RHEL keeping it open for clarification
The problem does not reproduce with REHL, it reproduces with RHVH only We try it on some different hosts and they all got the same banner timeout problem
for the record - comment #38 is with rhvh-4.4.0.14-0.20200318.0+1 I don't see anything in oVirt/RHV to be done about it. I suggest to follow up with RHEL QE about rdma module/kernel. In the meantime....not sure if the workaround to wait is a good one since the system is not fully booted up and services are not started Maybe try to reproduce with bare RHEL on matching nightly version keeping it open for clarification
This really has nothing to do with ssh - it looks like dbus is denying access when it is asked to create a new session, looking at journalctl, we get this: when trying to login from ssh: pam_systemd(sshd:session): Failed to create session: Connection timed out When just trying to su -: pam_systemd(su:session): Failed to create session: Connection timed out Drilling a little further down, we get this: [root@lynx09 dbus-1]# dbus-monitor --system |grep -i err & [1] 3197 [root@lynx09 dbus-1]# time su -c "exit" error time=1584611418.018323 sender=org.freedesktop.DBus -> destination=:1.6 error_name=org.freedesktop.DBus.Error.AccessDenied reply_serial=32 string "Rejected send message, 2 matched rules; type="signal", sender=":1.6" (uid=998 pid=976 comm="/usr/lib/polkit-1/polkitd --no-debug " label="system_u:system_r:policykit_t:s0") interface="org.freedesktop.PolicyKit1.Authority" member="Changed" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=0 pid=1103 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")" error time=1584611418.018600 sender=org.freedesktop.DBus -> destination=:1.6 error_name=org.freedesktop.DBus.Error.AccessDenied reply_serial=33 string "Rejected send message, 2 matched rules; type="signal", sender=":1.6" (uid=998 pid=976 comm="/usr/lib/polkit-1/polkitd --no-debug " label="system_u:system_r:policykit_t:s0") interface="org.freedesktop.PolicyKit1.Authority" member="Changed" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=0 pid=1103 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")" real 0m25.042s user 0m0.004s sys 0m0.012s [root@lynx09 dbus-1]#
which makes some sense when you're not booted yet, as regular login is not supposed to be enabled until it's finished. Manually it also didn't reproduce elsewhere, most notably it works ok for Node QE, so likely it's environmental too. Also, as noted by Yuval in email thread, seems to be resolved in latest kernel. closing again, I don't believe there's anything to track
Summary: SSH was not started correctly due to a stuck RDMA driver (SSH error: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8).") The stuck occurs due to deadlock in the Infiniband kernel module that was triggered by the RDMA code in Gluster The workaround: enable rdma.service during kickstart
I filed bug 1818986 for the RDMA module kernel deadlock. I am nominating it for RHEL 8.2 Zstream. Feel free to add a comment there to justify the Zstream request. Thanks!