Bug 1726871

Summary: Connection to host did not complete, timed out, but engine did not try to connect again.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED DUPLICATE
QA Contact: meital avital <mavital>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: mperina, pkliczew, Rhev-m-bugs, tnisan
Target Milestone: ---
Flags: lsvaty: testing_plan_complete-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-21 06:01:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2019-07-03 23:58:36 UTC
Description of problem:

During what looks like a combined storage and network outage, the engine connected to a host, but the connection never really succeeded (no SUBSCRIBE or CONNECT frames) and apparently got stuck. The GetCapabilities (caps) command timed out, but the engine never tried to connect again, presumably because it treated the stuck previous connection, which never really completed, as still in progress. See:

1. After the engine powers up, it attempts to connect:

2019-07-01 14:29:13,395-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP

2. Fails after 20s

2019-07-01 14:29:33,434-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-3) [] Unable to RefreshCapabilities: ConnectException: Connection timeout
2019-07-01 14:29:33,449-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = FQDN, VdsIdAndVdsVDSCommandParametersBase:{hostId='7a6ff2b0-9142-40ff-8082-76c3995b8543', vds='Host[HV03,7a6ff2b0-9142-40ff-8082-76c3995b8543]'})' execution failed: java.rmi.ConnectException: Connection timeout

3. Connect again, same problem:

2019-07-01 14:29:51,527-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP
2019-07-01 14:30:01,545-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = EV03, VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78-fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})' execution failed: java.rmi.ConnectException: Connection timeout

4. But now there are no more connection attempts, just timeouts every ~3 minutes (vdsTimeout?):

2019-07-01 14:33:21,728-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2019-07-01 14:36:44,883-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-42) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2019-07-01 14:40:08,061-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues

5. In the VDSM logs we only see the connection being accepted (no SUBSCRIBE frame or anything else), so presumably the connection is hung:

2019-07-01 14:29:16,529-0400 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33062 (protocoldetector:61)
2019-07-01 14:29:54,660-0400 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33104 (protocoldetector:61)

Not a single GetCapabilities request arrived on VDSM.
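
For comparison, a healthy engine-to-VDSM connection would normally continue past the accept roughly as follows (an illustrative sketch only; the destination name is a placeholder and none of these frames appear in the logs above):

    TCP/SSL connection accepted by VDSM on port 54321   (the "Accepted connection" lines above)
    engine -> vdsm: STOMP CONNECT   (accept-version, heart-beat)
    vdsm -> engine: STOMP CONNECTED
    engine -> vdsm: STOMP SUBSCRIBE to <responses destination>
    engine -> vdsm: STOMP SEND carrying {"jsonrpc": "2.0", "method": "Host.getCapabilities", ...}

Here the exchange apparently stopped right after the accept, which is consistent with the missing SUBSCRIBE frames and the absent GetCapabilities requests.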

PROBLEM:
- The engine apparently did not try to connect again after the vdsTimeouts (see the illustrative sketch below).
- After an engine restart it worked again.
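
For illustration only (this is not the actual ReactorClient code), a minimal Java sketch of the suspected state: if an in-flight connection attempt is never cleared after its handshake stalls, every later request waits on the same half-open connection and only produces message timeouts, without any new "Connecting to ..." lines:

    // Hypothetical sketch, not the real implementation: a connect guard that is
    // never reset after a stalled handshake suppresses all further reconnects.
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    class StuckConnectionSketch {
        // In-flight handshake; in the suspected failure it never completes.
        private volatile CompletableFuture<Void> connectFuture;

        CompletableFuture<Void> connect() {
            if (connectFuture != null) {
                // A previous attempt still looks "in progress", so no new TCP
                // connection is opened and no new "Connecting to ..." is logged.
                return connectFuture;
            }
            connectFuture = new CompletableFuture<>();  // started, but never completed
            return connectFuture;
        }

        void getCapabilities() throws Exception {
            // Every caller waits on the same half-open connection and times out
            // ("Message timeout ..."); connectFuture is never cleared, so the
            // next call hits the same stale attempt instead of reconnecting.
            connect().get(180, TimeUnit.SECONDS);  // ~3 minutes, roughly the interval seen above
        }
    }

Under that assumption, restarting the engine clears the stale attempt, which would match the observation that everything worked again after a restart.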

Version-Release number of selected component (if applicable):
vdsm-4.30.17-1.el7ev.x86_64
rhvm-4.3.4.3-0.1.el7.noarch

How reproducible:
Unknown

Comment 2 Germano Veit Michel 2019-07-04 00:02:01 UTC
(In reply to Germano Veit Michel from comment #0)
> 3. Connect again, same problem:
> 
> 2019-07-01 14:29:51,527-04 INFO 
> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor)
> [] Connecting to FQDN/IP
> 2019-07-01 14:30:01,545-04 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
> (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command
> 'GetCapabilitiesAsyncVDSCommand(HostName = EV03,
> VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78-
> fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})'
> execution failed: java.rmi.ConnectException: Connection timeout

Sorry, the vdsTimeout quoted above is not related to this host. The related vdsTimeout is the first one in item 4.

Comment 5 Germano Veit Michel 2019-07-19 04:11:35 UTC
Seen again, now on an older 4.2.8:

ovirt-engine-4.2.8.5-0.1.el7ev.noarch
vdsm-jsonrpc-java-1.4.15-1.el7ev.noarch

Comment 10 Germano Veit Michel 2019-07-25 23:16:42 UTC
This bug I opened to investigate the connection issues on the engine side. It seems to have given up connecting to VDSM, or is stuck somewhere and not timing out to try again.

The storage pool locking problem on the host I already reported in BZ1724009

Comment 11 Martin Perina 2019-07-26 07:27:15 UTC
(In reply to Germano Veit Michel from comment #10)
> This bug I opened to investigate the connection issues on the engine side.
> It seems to have given up connecting to VDSM, or is stuck somewhere and not
> timing out to try again.
> 
> The storage pool locking problem on the host I already reported in BZ1724009

Germano, are you OK with closing this bug as a duplicate of BZ1710818? We have been trying to fix the 'stuck connection' issue there, so I'm not sure what else we could do with this bug unless we have clear reproduction steps.

Comment 13 Germano Veit Michel 2019-07-28 22:59:02 UTC
(In reply to Martin Perina from comment #11)
> (In reply to Germano Veit Michel from comment #10)
> > This bug I opened to investigate the connection issues on the engine side.
> > It seems to have given up connecting to VDSM, or is stuck somewhere and not
> > timing out to try again.
> > 
> > The storage pool locking problem on the host I already reported in BZ1724009
> 
> Germano, are you OK with closing this bug as a duplicate of BZ1710818? We
> have been trying to fix the 'stuck connection' issue there, so I'm not sure
> what else we could do with this bug unless we have clear reproduction steps.

If you and Piotr think that BZ probably solves this problem as well, then of course I agree :)

Comment 14 Martin Perina 2019-08-21 06:01:39 UTC
Closing as a duplicate of BZ1710818; we have fixed some engine-to-host reconnection issues in 4.3.5. But feel free to reopen if this issue can be reproduced.

*** This bug has been marked as a duplicate of bug 1710818 ***