Description of problem: During what looks like a storage+network outage, the engine connects to a host, the connection did not really succeed (no subscribe, connect frames) and apparently got stuck. The caps command timed out but the engine never tried to connect again (assuming stuck previous connection that never really completed). See: 1. After engine powers up, attempts to connect: 2019-07-01 14:29:13,395-04 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP 2. Fails after 20s 2019-07-01 14:29:33,434-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-3) [] Unable to RefreshCapabilities: ConnectException: Connection timeout 2019-07-01 14:29:33,449-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = FQDN, VdsIdAndVdsVDSCommandParametersBase:{hostId='7a6ff2b0-9142-40ff-8082-76c3995b8543', vds='Host[HV03,7a6ff2b0-9142-40ff-8082-76c3995b8543]'})' execution failed: java.rmi.ConnectException: Connection timeout 3. Connect again, same problem: 2019-07-01 14:29:51,527-04 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP 2019-07-01 14:30:01,545-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = EV03, VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78-fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})' execution failed: java.rmi.ConnectException: Connection timeout 4. But now there are no more connection attempts, just timeouts every ~3min (vdsTimeout?) 2019-07-01 14:33:21,728-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues 2019-07-01 14:36:44,883-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-42) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues 2019-07-01 14:40:08,061-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues 5. In VDSM logs we only see accepting the connection (no subscribe or anything), assuming the connection is hung: 2019-07-01 14:29:16,529-0400 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33062 (protocoldetector:61) 2019-07-01 14:29:54,660-0400 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33104 (protocoldetector:61) Not a single GetCapabilities arrived on VDSM. PROBLEM: - engine did not try to connect again after vdsTimeouts? - after restart it worked again Version-Release number of selected component (if applicable): vdsm-4.30.17-1.el7ev.x86_64 rhvm-4.3.4.3-0.1.el7.noarch How reproducible: Unknown
(In reply to Germano Veit Michel from comment #0) > 3. Connect again, same problem: > > 2019-07-01 14:29:51,527-04 INFO > [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) > [] Connecting to FQDN/IP > 2019-07-01 14:30:01,545-04 ERROR > [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] > (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command > 'GetCapabilitiesAsyncVDSCommand(HostName = EV03, > VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78- > fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})' > execution failed: java.rmi.ConnectException: Connection timeout Sorry, this vdsTimeout above is not related to this host. The related vdsTimeout is the first on item 4.
Seen again, now on an older 4.2.8: ovirt-engine-4.2.8.5-0.1.el7ev.noarch vdsm-jsonrpc-java-1.4.15-1.el7ev.noarch
This bug I opened to investigate the connection issues on the engine side. It seems to have given up connecting to VDSM, or is stuck somewhere and not timing out to try again. The storage pool locking problem on the host I already reported in BZ1724009
(In reply to Germano Veit Michel from comment #10) > This bug I opened to investigate the connection issues on the engine side. > It seems to have given up connecting to VDSM, or is stuck somewhere and not > timing out to try again. > > The storage pool locking problem on the host I already reported in BZ1724009 Germano, are you OK to close this bug as duplicate of BZ1710818? We have been trying to fix issue 'connection stucked' there, so not sure what else we could do with this bug unless we have clear reproduction steps
(In reply to Martin Perina from comment #11) > (In reply to Germano Veit Michel from comment #10) > > This bug I opened to investigate the connection issues on the engine side. > > It seems to have given up connecting to VDSM, or is stuck somewhere and not > > timing out to try again. > > > > The storage pool locking problem on the host I already reported in BZ1724009 > > Germano, are you OK to close this bug as duplicate of BZ1710818? We have > been trying to fix issue 'connection stucked' there, so not sure what else > we could do with this bug unless we have clear reproduction steps If you and Piotr think that BZ probably solves this problem as well of course I must agree :)
Closing as duplicate of BZ1710818, we have fixed some engine to host reconnection issues in 4.3.5. But fee free to reopen if this issue can be reproduced *** This bug has been marked as a duplicate of bug 1710818 ***