Bug 1618984 - Host deploy from fc28 engine on fc28 host fails, ssh connection terminated
Summary: Host deploy from fc28 engine on fc28 host fails, ssh connection terminated
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Yuval Turgeman
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: oVirt_on_Fedora
Reported: 2018-08-19 08:38 UTC by Gal Zaidman
Modified: 2019-01-23 10:54 UTC

Fixed In Version: ovirt-engine-4.3.0_rc
Clone Of:
Environment:
Last Closed: 2019-01-23 10:54:38 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.3+


Attachments
engine/host/strace log files (636.58 KB, application/x-gzip)
2018-08-19 08:40 UTC, Gal Zaidman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 95154 0 master MERGED tar: use GNU's record size when creating tar files 2020-11-02 13:24:46 UTC

Description Gal Zaidman 2018-08-19 08:38:56 UTC
Description of problem:

Adding an fc28 host to an fc28 engine fails with:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 1943, in shutdown
    h.flush()
  File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()
  File "/tmp/ovirt-iPUTtKf1PF/pythonlib/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG13
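
For readers unfamiliar with the "SIG13" message: the last frame of the traceback is a signal handler that turns a delivered signal into a RuntimeError, and signal 13 is SIGPIPE on Linux. A minimal sketch of that behaviour follows. Only the raise line is taken from the traceback; the function name demo_sigpipe_message and everything around the raise are assumptions for illustration, not otopi's actual code.

```python
import os
import signal
import time


def _signal(signum, frame):
    # Same statement as the one quoted in the traceback (otopi/main.py, line 53).
    raise RuntimeError("SIG%s" % signum)


def demo_sigpipe_message():
    """Install the handler for SIGPIPE, send SIGPIPE to this process,
    and return the message of the RuntimeError the handler raises.

    Hypothetical helper, for illustration only.
    """
    previous = signal.signal(signal.SIGPIPE, _signal)
    try:
        os.kill(os.getpid(), signal.SIGPIPE)
        # Give the interpreter a moment to run the Python-level handler
        # in case it has not run already.
        time.sleep(0.05)
        return None
    except RuntimeError as exc:
        return str(exc)
    finally:
        # Restore whatever disposition was in place before the demo.
        if previous is not None:
            signal.signal(signal.SIGPIPE, previous)


if __name__ == "__main__":
    print(demo_sigpipe_message())
```

In the report the signal arrives while the logging module is flushing its stream at interpreter exit, which would be consistent with the other end of the ssh pipe already being gone.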

When a host is added, the engine starts the host-deploy process and sends a tar archive to the host over ssh. While testing this error I ran strace on the host to trace the connection and found that the tar command never finishes; after about 10-20 minutes the installation fails with an unexpected connection termination. I am not sure why this happens. One idea is that on Fedora we use apache-sshd 0.14, while on CentOS we bundle version 0.12 ourselves.
I tried replacing the Fedora package with the CentOS 0.12 version, but the error remains the same.

Steps to Reproduce:

1. Install the engine on fc28, with python2 (python2-otopi) and default settings.

2. Remove the line "HostKey /etc/ssh/ssh_host_ecdsa_key" from /etc/ssh/sshd_config
on both the engine and the host (both fc28).
This is a workaround for bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1591801

3. Fix the broken links in the host-deploy pythonlib, in:
/usr/share/ovirt-host-deploy/interface-3/pythonlib/
The links are broken because python3-otopi/ovirt_host_mgmt/ovirt_host_deploy are not installed. On the engine side, install python3-otopi and python-ovirt-host-deploy with dnf.

4. Log into the engine, click Compute -> Hosts -> New, and add an fc28 host.
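
To check step 3 before and after installing the packages, the broken links can be listed with a short script. This is a minimal sketch; find_broken_symlinks is a hypothetical helper name, and only the directory path comes from the step above.

```python
import os


def find_broken_symlinks(root):
    """Return the sorted paths of symlinks under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories show up in dirnames, all other links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    # Directory named in step 3.
    for link in find_broken_symlinks("/usr/share/ovirt-host-deploy/interface-3/pythonlib/"):
        print(link, "->", os.readlink(link))
```

An empty result after installing the packages suggests the links now resolve.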

Actual results:
The Installing task runs for about 10-20 minutes, then fails.

Expected results:
Host add and host deploy installation finish successfully.

Additional info:
Attaching the engine log, host deploy log, and strace file.
In the strace, search for 10:13:39 to get to the line where the read is stopped/resumed.

Comment 1 Gal Zaidman 2018-08-19 08:40:06 UTC
Created attachment 1476864
engine/host/strace log files

Comment 2 Gal Zaidman 2018-08-19 08:49:39 UTC
A few fixes to the description:
1. The bug could be on either the engine side or the host side, so here is the relevant engine log:
2018-08-19 10:13:39,422+03 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler6) [] Rescheduling DEFAULT.org.ovirt.engine.core.bll.gluster.GlusterSyncJob.refreshLightWeightData#-9223372036854775801 as there is no unfired trigger.

2018-08-19 10:13:39,559+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [193f8c23] Error during deploy dialog
2018-08-19 10:13:39,559+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [193f8c23] Exception: java.io.IOException: Unexpected connection termination
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:390) [otopi.jar:]
        at org.ovirt.otopi.dialog.MachineDialogParser.nextEvent(MachineDialogParser.java:407) [otopi.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.threadMain(VdsDeployBase.java:302) [bll.jar:]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_162]

2018-08-19 10:13:39,561+03 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] SSH error running command root.17.42:'umask 0077; MYTMP="$(TMPDIR="${OVIRT_TMPDIR}" mktemp -d -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; tar --warning=no-timestamp -C "${MYTMP}" -x &&  "${MYTMP}"/ovirt-host-deploy DIALOG/dialect=str:machine DIALOG/customization=bool:True': TimeLimitExceededException: SSH session timeout host 'root.17.42'
2018-08-19 10:13:39,561+03 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHDialog] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'

2018-08-19 10:13:39,569+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Timeout during host 10.35.17.42 install: SSH session timeout host 'root.17.42'

2018-08-19 10:13:39,569+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'

2018-08-19 10:13:39,591+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] EVENT_ID: VDS_INSTALL_IN_PROGRESS_ERROR(511), An error has occurred during installation of Host temp: Processing stopped due to timeout.

2018-08-19 10:13:39,591+03 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Host installation failed for host 'be1302ef-fc7e-4851-8baf-1dfa2e16ab5d', 'temp': SSH session timeout host 'root.17.42'

2018-08-19 10:13:39,591+03 DEBUG [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.17.42'

2018-08-19 10:13:39,609+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-8) [193f8c23] EVENT_ID: VDS_INSTALL_FAILED(505), Host temp installation failed. SSH session timeout host 'root.17.42'.


2. In "Steps to Reproduce", step 3 also needs python3-ovirt-host-deploy to be installed.

Comment 3 Gal Zaidman 2018-09-05 07:30:35 UTC
fixed in:
https://gerrit.ovirt.org/#/c/94106/

Comment 4 Gal Zaidman 2018-10-23 10:57:34 UTC
The fix in:
https://gerrit.ovirt.org/#/c/94106/
was reverted, so we need a different patch. Discussion:

https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/YFXG2TVBXC4ZNTAYYBIUOFXNO33IGIYU/#QMRM2INTCRDPT7GPF24EEPNJAZRP4CUQ
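
The gerrit change linked from this bug (95154) is titled "tar: use GNU's record size when creating tar files". The bug does not spell out the mechanism, so the following is only a sketch of what that record size means: GNU tar's default record is 20 blocks of 512 bytes, i.e. 10240 bytes, and an archive written with it has a length that is a multiple of 10240. The helper names build_tar, is_record_aligned and pad_to_record are hypothetical. This is Python's tarfile module, not the engine's Java code, and the link to the hanging "tar -x" on the host is an assumption, not something stated here.

```python
import io
import tarfile

# tarfile.RECORDSIZE is 20 * 512 = 10240, the same as GNU tar's default record.
GNU_RECORD_SIZE = tarfile.RECORDSIZE


def build_tar(files):
    """Build an uncompressed tar archive in memory from a {name: bytes} dict."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def is_record_aligned(archive, record_size=GNU_RECORD_SIZE):
    """True if the archive length is a whole number of records."""
    return len(archive) % record_size == 0


def pad_to_record(archive, record_size=GNU_RECORD_SIZE):
    """Append NUL bytes until the archive length is a multiple of record_size."""
    remainder = len(archive) % record_size
    if remainder:
        archive += b"\0" * (record_size - remainder)
    return archive


if __name__ == "__main__":
    archive = build_tar({"hello.txt": b"hi"})
    print(len(archive), is_record_aligned(archive))
```

Python's tarfile already pads to the record size when the archive is closed, so build_tar output is aligned; pad_to_record shows the padding step on its own for an archive that is not.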

