Bug 1341204

Summary: Host deploy fails (from time to time!) on KeyExchange signature verification failed
Product: [oVirt] ovirt-engine Reporter: Yaniv Kaul <ykaul>
Component: Host-DeployAssignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE QA Contact: Aleksei Slaikovskii <aslaikov>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.6.7CC: aslaikov, bugs, mgoldboi, mperina, oourfali, pkliczew, stirabos
Target Milestone: ovirt-4.0.0-rc2Flags: rule-engine: ovirt-4.0.0+
mgoldboi: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+
Target Release: 4.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1347337 (view as bug list) Environment:
Last Closed: 2016-08-17 14:36:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1367107    
Bug Blocks: 1341652, 1347337    
Attachments:
Description Flags
engine.log
none
server.log none

Description Yaniv Kaul 2016-05-31 13:06:44 UTC
Created attachment 1163225 [details]
engine.log

Description of problem:
running ovirt-system tests on Lago, I'm getting (from time to time!) failure to install a host. 
Engine log:
2016-05-31 08:03:16,884 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [GetVdsByHostName] compiled
2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] Compiled stored procedure. Call string is [{call getstorage_poolsbyv
dsgroupid(?)}]
2016-05-31 08:03:16,896 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-10) [5f91f170] SqlCall for procedure [Getstorage_poolsByVdsGroupId] compiled
2016-05-31 08:03:16,906 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connecting 'root.201.3'
2016-05-31 08:03:16,987 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-10) [5f91f170] Connect error: java.io.IOException: SSH session closed during connection 'root.201.3'
        at org.ovirt.engine.core.uutils.ssh.SSHClient.connect(SSHClient.java:309) [uutils.jar:]
        at org.ovirt.engine.core.bll.utils.EngineSSHClient.connect(EngineSSHClient.java:55) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canConnect(AddVdsCommand.java:451) [bll.jar:]
        at org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand.canDoAction(AddVdsCommand.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalCanDoAction(CommandBase.java:791) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:372) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:484) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:466) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:419) [bll.jar:]

...
2016-05-31 08:03:16,994 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] Failed to establish session with host 'lago_basic_suite_3_6_host1': SSH session closed during connection 'root.201.3'
2016-05-31 08:03:16,995 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [5f91f170] CanDoAction of action 'AddVds' failed for user admin@internal. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server 192.168.201.3,VDS_CANNOT_CONNECT_TO_SERVER
2016-05-31 08:03:17,014 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-10) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]




Server log:
2016-05-31 08:03:16,948 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Client session created
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Server version string: SSH-2.0-OpenSSH_6.6.1
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: server->client aes128-ctr hmac-sha2-256 none
2016-05-31 08:03:16,955 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-1) Kex: client->server aes128-ctr hmac-sha2-256 none
2016-05-31 08:03:16,983 WARN  [org.apache.sshd.client.session.ClientSessionImpl] (pool-10-thread-2) Exception caught: org.apache.sshd.common.SshException: KeyExchange signature verification failed
        at org.apache.sshd.client.kex.AbstractDHGClient.next(AbstractDHGClient.java:132)
        at org.apache.sshd.common.session.AbstractSession.doHandleMessage(AbstractSession.java:393)
        at org.apache.sshd.common.session.AbstractSession.handleMessage(AbstractSession.java:295)
        at org.apache.sshd.client.session.ClientSessionImpl.handleMessage(ClientSessionImpl.java:266)
        at org.apache.sshd.common.session.AbstractSession.decode(AbstractSession.java:720)
        at org.apache.sshd.common.session.AbstractSession.messageReceived(AbstractSession.java:277)
        at org.apache.sshd.common.AbstractSessionIoHandler.messageReceived(AbstractSessionIoHandler.java:54)
        at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:188)
        at org.apache.sshd.common.io.nio2.Nio2Session$1.completed(Nio2Session.java:174)
        at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126) [rt.jar:1.8.0_91]
        at sun.nio.ch.Invoker$2.run(Invoker.java:218) [rt.jar:1.8.0_91]
        at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]


And indeed on the host:
sshd[9772]: error: Received disconnect from 192.168.201.2: 3: KeyExchange signature verification failed [preauth]

Version-Release number of selected component (if applicable):
ovirt-engine-3.6.7-1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch

How reproducible:
Sometimes.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
ovirt-host-deploy-1.4.1-1.el6.noarch
ovirt-image-uploader-3.6.0-1.el6.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.6.7-1.el6.noarch
ovirt-engine-tools-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.6.7-1.el6.noarch
ovirt-engine-backend-3.6.7-1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.7-1.el6.noarch
ovirt-setup-lib-1.0.1-1.el6.noarch
ebay-cors-filter-1.0.1-0.1.ovirt.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch
ovirt-engine-cli-3.6.2.0-1.el6.noarch
ovirt-iso-uploader-3.6.0-1.el6.noarch
ovirt-engine-wildfly-overlay-8.0.5-1.el6.noarch
ovirt-vmconsole-proxy-1.0.2-1.el6.noarch
ovirt-engine-wildfly-8.2.1-1.el6.x86_64
ovirt-engine-tools-backup-3.6.7-1.el6.noarch
ovirt-engine-websocket-proxy-3.6.7-1.el6.noarch
ovirt-engine-webadmin-portal-3.6.7-1.el6.noarch
ovirt-engine-vmconsole-proxy-helper-3.6.7-1.el6.noarch
ovirt-engine-setup-3.6.7-1.el6.noarch
ovirt-engine-restapi-3.6.7-1.el6.noarch
ovirt-engine-dbscripts-3.6.7-1.el6.noarch
ovirt-engine-3.6.7-1.el6.noarch
ovirt-engine-extensions-api-impl-3.6.7-1.el6.noarch
ovirt-engine-lib-3.6.7-1.el6.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el6.noarch
ovirt-vmconsole-1.0.2-1.el6.noarch
ovirt-engine-setup-base-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.6.7-1.el6.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.7-1.el6.noarch
ovirt-engine-userportal-3.6.7-1.el6.noarch

Comment 1 Yaniv Kaul 2016-05-31 13:07:05 UTC
Created attachment 1163226 [details]
server.log

Comment 2 Simone Tiraboschi 2016-05-31 14:14:18 UTC
Oved, looks here:
https://issues.jboss.org/browse/ENTESB-2101
They were facing a similar issue in JBoss Fuse 6.2.
It seams somehow related to an older version of org.apache.sshd.sshd-core.

Paolo Antinori said that requiring version v.0.12.0 of org.apache.sshd.sshd-core seams enough to prevent it.

In 3.6 we still have:

      <dependency>
        <groupId>org.apache.sshd</groupId>
        <artifactId>sshd-core</artifactId>
        <version>${sshd-core.version}</version>
      </dependency>
...
<sshd-core.version>0.11.0</sshd-core.version>

Comment 3 Yaniv Kaul 2016-05-31 15:24:27 UTC
Perhaps it's https://issues.apache.org/jira/browse/SSHD-330 .

I'm wondering why not to upgrade to 0.14. While I agree much newer versions (which I believe we upgraded to in 4.0?) probably a more scary to upgrade to in a 3.6.x, we need to look at the fixes in 0.13, 0.14 and understand if they might affect us:
0.13: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12327342
0.14: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310849&version=12329012

BTW, I'm pretty sure it was not 1 out of 256 (my failures), but other than two failures today, I could not reproduce it yet... I'll see if it does.

Comment 4 Simone Tiraboschi 2016-05-31 16:01:58 UTC
We are building with 0.11.0 also on 4.0 RC

Comment 5 Martin Perina 2016-06-01 11:45:45 UTC
I just successfully installed several hosts with 0.12.0 version of apache-sshd and no code changes were required for this upgrade, so we should use apache-sshd 0.12.0 for 3.6/4.0.
Independently on this bug we have already started to create patches to use latest stable version of apache-sshd 1.2.0. But there were significant API changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1

Comment 6 Yaniv Kaul 2016-06-01 13:28:54 UTC
(In reply to Martin Perina from comment #5)
> I just successfully installed several hosts with 0.12.0 version of
> apache-sshd and no code changes were required for this upgrade, so we should
> use apache-sshd 0.12.0 for 3.6/4.0.

I installed ~20-50 hosts. Yet 2 in the same day failed. So we need to run it A LOT.

> Independently on this bug we have already started to create patches to use
> latest stable version of apache-sshd 1.2.0. But there were significant API
> changes between 0.11 and 1.2.0, so for this upgrade will be targeted to 4.1

+1.

Comment 9 Aleksei Slaikovskii 2016-08-16 12:03:09 UTC
I've installed 100 hosts (4 ones for 25 times) on ovirt-engine 4.0.2.6-0.1.el7ev using python sdk.
None of those had any SSH problems.