Description of problem:
Adding an fc28 host to an fc28 engine fails with:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 1943, in shutdown
    h.flush()
  File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()
  File "/tmp/ovirt-iPUTtKf1PF/pythonlib/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG13

When a host is added, the engine starts the host-deploy process and sends a tar archive to the host over ssh. While testing this error I ran strace on the host to trace the connection and found that the tar command never finishes; after about 10-20 minutes the installation fails with an unexpected connection termination. I am not sure why this happens. One idea is that on Fedora we use apache-sshd 0.14, while on CentOS we bundle version 0.12 ourselves. I tried replacing the Fedora package with the CentOS 0.12 version, but the error remains the same.

Steps to Reproduce:
1. Install the engine on fc28 with python2 (python2-otopi) and default settings.
2. Remove the line "HostKey /etc/ssh/ssh_host_ecdsa_key" from /etc/ssh/sshd_config on both the engine and the host (both fc28). This is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1591801
3. Fix the broken links in the host-deploy pythonlib, in /usr/share/ovirt-host-deploy/interface-3/pythonlib/. The links are broken because python3-otopi/ovirt_host_mgmt/ovirt_host_deploy are not installed. On the engine side: dnf install python3-otopi, python-ovirt-host-deploy
4. Log into the engine, click Compute -> Hosts -> New, and add an fc28 host.

Actual results:
The installation task runs for about 10-20 minutes, then fails.

Expected results:
Host add and host-deploy installation finish successfully.

Additional info:
Attaching the engine log, host-deploy log, and strace file. In the strace output, search for 10:13:39 to find the line where the read is stopped/resumed.
Created attachment 1476864
engine/host/strace log files
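
For reference, the "RuntimeError: SIG13" at the end of the traceback is what a signal handler of the form shown in otopi/main.py produces when the process receives SIGPIPE (signal 13 on Linux). Below is a minimal sketch of that mechanism only. It assumes nothing about the rest of otopi: the function name provoke_sigpipe is hypothetical, and the signal is sent with os.kill instead of a real write to a closed pipe so that the result is deterministic.

```python
import os
import signal


def _signal(signum, frame):
    # Same shape as the handler line shown in the traceback
    # (otopi/main.py, line 53); everything else here is illustrative.
    raise RuntimeError("SIG%s" % signum)


def provoke_sigpipe():
    """Install the handler, send SIGPIPE to this process, return the error text."""
    previous = signal.signal(signal.SIGPIPE, _signal)
    try:
        # Sending the signal directly stands in for a write to a pipe
        # whose reading end has already gone away.
        os.kill(os.getpid(), signal.SIGPIPE)
        # Older interpreters may run the Python-level handler slightly
        # later, so give the interpreter a few bytecodes to deliver it.
        for _ in range(1000):
            pass
        return None
    except RuntimeError as exc:
        return str(exc)
    finally:
        # Restore whatever disposition was active before the demo.
        signal.signal(signal.SIGPIPE, previous)


print(provoke_sigpipe())
```

This suggests the Python traceback is a symptom of the peer closing the connection and not necessarily the root cause, which fits the tar transfer stalling until the session is terminated.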
A few fixes to the description:

1. The bug could be on either the engine side or the host side, so here is the relevant engine log:

2018-08-19 10:13:39,422+03 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler6) [] Rescheduling DEFAULT.org.ovirt.engine.core.bll.gluster.GlusterSyncJob.refreshLightWeightData#-9223372036854775801 as there is no unfired trigger.
2018-08-19 10:13:39,559+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [193f8c23] Error during deploy dialog
2018-08-19 10:13:39,559+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [193f8c23] Exception: java.io.IOException: Unexpected connection termination
    at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:390) [otopi.jar:]
    at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:407) [otopi.jar:]
    at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.threadMain(VdsDeployBase.java:302) [bll.jar:]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_162]
2018-08-19 10:13:39,561+03 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] SSH error running command root.17.42:'umask 0077; MYTMP="$(TMPDIR="${OVIRT_TMPDIR}" mktemp -d -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; tar --warning=no-timestamp -C "${MYTMP}" -x && "${MYTMP}"/ovirt-host-deploy DIALOG/dialect=str:machine DIALOG/customization=bool:True': TimeLimitExceededException: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,561+03 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHDialog] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,569+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Timeout during host 10.35.17.42 install: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,569+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,591+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] EVENT_ID: VDS_INSTALL_IN_PROGRESS_ERROR(511), An error has occurred during installation of Host temp: Processing stopped due to timeout.
2018-08-19 10:13:39,591+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Host installation failed for host 'be1302ef-fc7e-4851-8baf-1dfa2e16ab5d', 'temp': SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,591+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,609+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] EVENT_ID: VDS_INSTALL_FAILED(505), Host temp installation failed. SSH session timeout host 'root.17.42'.

2. In "Steps to Reproduce", step 3, we need to install python3-ovirt-host-deploy.
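
To pull only the failing deploy out of a full engine.log, something like the sketch below may help. It filters ERROR entries by the correlation id seen in the excerpt for item 1 (193f8c23). The line layout in the regular expression is an assumption inferred from that excerpt, not taken from the engine's logging configuration, and the names LINE_RE and errors_for are hypothetical.

```python
import re

# Assumed layout of an engine.log line, inferred from the excerpt:
# "<date> <time>,<ms><tz> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s*"
    r"(?P<msg>.*)$"
)


def errors_for(lines, correlation_id):
    """Return (timestamp, logger, message) for ERROR lines with this correlation id."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Continuation lines such as Java stack frames do not match.
            continue
        if match.group("level") == "ERROR" and match.group("corr") == correlation_id:
            hits.append((match.group("ts"), match.group("logger"), match.group("msg")))
    return hits


# Two lines copied from the excerpt, used here as sample input.
sample = [
    "2018-08-19 10:13:39,559+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] "
    "(VdsDeploy) [193f8c23] Error during deploy dialog",
    "2018-08-19 10:13:39,559+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] "
    "(VdsDeploy) [193f8c23] Exception: java.io.IOException: Unexpected connection termination",
]

for ts, logger, msg in errors_for(sample, "193f8c23"):
    print(ts, logger, msg)
```

With a real log, the lines argument can be the open file object; the timestamps of the hits can then be matched against the strace output.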
fixed in: https://gerrit.ovirt.org/#/c/94106/
The fix in https://gerrit.ovirt.org/#/c/94106/ was reverted, so we need a different patch. Discussion: https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/YFXG2TVBXC4ZNTAYYBIUOFXNO33IGIYU/#QMRM2INTCRDPT7GPF24EEPNJAZRP4CUQ