Bug 1726871 - Connection to host did not complete, timed out, but engine did not try to connect again.
Keywords:
Status: CLOSED DUPLICATE of bug 1710818
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Piotr Kliczewski
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-03 23:58 UTC by Germano Veit Michel
Modified: 2020-08-03 15:39 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-21 06:01:39 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4292981 0 None None None 2019-07-19 05:16:34 UTC

Description Germano Veit Michel 2019-07-03 23:58:36 UTC
Description of problem:

During what looks like a combined storage and network outage, the engine opened a connection to a host, but the connection never really completed (no SUBSCRIBE or CONNECT frames) and apparently got stuck. The capabilities (caps) command then timed out, yet the engine never tried to connect again, presumably because it considered the stuck, never-completed connection still usable. See:

1. After the engine powers up, it attempts to connect:

2019-07-01 14:29:13,395-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP

2. The attempt fails after 20 seconds:

2019-07-01 14:29:33,434-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-3) [] Unable to RefreshCapabilities: ConnectException: Connection timeout
2019-07-01 14:29:33,449-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = FQDN, VdsIdAndVdsVDSCommandParametersBase:{hostId='7a6ff2b0-9142-40ff-8082-76c3995b8543', vds='Host[HV03,7a6ff2b0-9142-40ff-8082-76c3995b8543]'})' execution failed: java.rmi.ConnectException: Connection timeout
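
The 20-second gap between the "Connecting to" line and the ConnectException matches a connection-establishment timeout on the engine side (presumably vdsConnectionTimeout; treat the exact config key and its default as assumptions). As a rough illustration of what times out at this layer (not the actual vdsm-jsonrpc-java code), a non-blocking connect with a deadline looks roughly like this:

import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class ConnectWithDeadline {
    // 20s simply mirrors the gap seen in the log above.
    private static final long CONNECT_TIMEOUT_MS = 20_000;

    public static SocketChannel connect(String host, int port) throws Exception {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        if (channel.connect(new InetSocketAddress(host, port))) {
            return channel; // connected immediately
        }
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_CONNECT);
            // Wait for the TCP handshake to finish, but no longer than the deadline.
            if (selector.select(CONNECT_TIMEOUT_MS) == 0 || !channel.finishConnect()) {
                channel.close();
                throw new ConnectException("Connection timeout");
            }
        }
        return channel; // TLS and STOMP handshakes still have to happen on top of this.
    }
}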

3. Connect again, same problem:

2019-07-01 14:29:51,527-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to FQDN/IP
2019-07-01 14:30:01,545-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = EV03, VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78-fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})' execution failed: java.rmi.ConnectException: Connection timeout

4. But from here on there are no more connection attempts, only timeouts roughly every 3 minutes (vdsTimeout?):

2019-07-01 14:33:21,728-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2019-07-01 14:36:44,883-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-42) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2019-07-01 14:40:08,061-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM HV03 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
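
The ~3 minute cadence fits a per-request message timeout rather than new connection attempts (if I remember correctly vdsTimeout defaults to 180 seconds, which would match; treat the exact value as an assumption). The important detail is that a message timeout only fails the pending call: by itself it does not close or re-open the transport, which would explain repeated VDS_BROKER_COMMAND_FAILURE events without a single new "Connecting to" line. A minimal, hypothetical sketch of that pattern (not the engine's actual request tracking code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical request tracker, for illustration only.
public class PendingCallTracker {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public CompletableFuture<String> send(String request, long timeoutSec) {
        CompletableFuture<String> response = new CompletableFuture<>();
        // (hypothetical) the request frame is written on the already-established channel here

        // If no response arrives in time, only this call fails.
        // Nothing in this path closes the channel or schedules a reconnect,
        // so a half-open connection keeps producing "Message timeout" forever.
        timer.schedule(() -> {
            response.completeExceptionally(
                    new TimeoutException("Message timeout which can be caused by communication issues"));
        }, timeoutSec, TimeUnit.SECONDS);
        return response;
    }
}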

5. In the VDSM logs we only see the connection being accepted (no SUBSCRIBE or any other frame), which suggests the connection is hung:

2019-07-01 14:29:16,529-0400 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33062 (protocoldetector:61)
2019-07-01 14:29:54,660-0400 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:<IP>:33104 (protocoldetector:61)

Not a single GetCapabilities request arrived at VDSM.
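
For reference, right after the TCP accept a healthy client would complete the TLS handshake and then send a STOMP CONNECT frame followed by SUBSCRIBE frame(s) before any JSON-RPC request such as Host.getCapabilities goes out, and VDSM logs those steps; their complete absence is what points at a hung connection. A rough sketch of that handshake (headers and destination names are placeholders, not the exact ones vdsm-jsonrpc-java uses):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Illustrative STOMP handshake only, with placeholder destinations.
public class StompHandshakeSketch {
    public static void handshake(Socket socket) throws Exception {
        String connectFrame = "CONNECT\naccept-version:1.2\nhost:vdsm\n\n\0";
        String subscribeFrame = "SUBSCRIBE\nid:0\ndestination:/queue/responses\nack:auto\n\n\0";
        OutputStream out = socket.getOutputStream();
        out.write(connectFrame.getBytes(StandardCharsets.UTF_8));
        out.write(subscribeFrame.getBytes(StandardCharsets.UTF_8));
        out.flush();
        // Only after this handshake do JSON-RPC requests go out as STOMP SEND frames.
    }
}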

PROBLEM:
- the engine did not try to connect again after the vdsTimeouts? (see the watchdog sketch below)
- after a restart of the engine it worked again
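
What appears to be missing on the engine side is a watchdog that treats a connection which has produced no inbound frames for longer than the message timeout as dead and forces a fresh connect. A minimal sketch of the idea, purely illustrative (the interface and the 180-second value are assumptions, and this is not what was eventually fixed in BZ1710818):

import java.time.Duration;
import java.time.Instant;

// Illustrative watchdog only; not the engine's actual code.
public class StaleConnectionWatchdog {
    private volatile Instant lastFrameReceived = Instant.now();
    private final Duration maxSilence = Duration.ofSeconds(180); // matches the ~3 min cadence above

    public void onFrameReceived() {
        lastFrameReceived = Instant.now();
    }

    // Called periodically by a scheduler.
    public void check(HalfOpenAwareClient client) {
        if (Duration.between(lastFrameReceived, Instant.now()).compareTo(maxSilence) > 0) {
            // The peer accepted the TCP connection but nothing ever came back:
            // drop the suspect channel so the next request starts from a clean connect.
            client.closeAndReconnect();
        }
    }

    // Hypothetical client interface, not part of vdsm-jsonrpc-java.
    public interface HalfOpenAwareClient {
        void closeAndReconnect();
    }
}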

Version-Release number of selected component (if applicable):
vdsm-4.30.17-1.el7ev.x86_64
rhvm-4.3.4.3-0.1.el7.noarch

How reproducible:
Unknown

Comment 2 Germano Veit Michel 2019-07-04 00:02:01 UTC
(In reply to Germano Veit Michel from comment #0)
> 3. Connect again, same problem:
> 
> 2019-07-01 14:29:51,527-04 INFO 
> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor)
> [] Connecting to FQDN/IP
> 2019-07-01 14:30:01,545-04 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
> (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command
> 'GetCapabilitiesAsyncVDSCommand(HostName = EV03,
> VdsIdAndVdsVDSCommandParametersBase:{hostId='ae68e955-0646-4c74-9a78-
> fc7753e779c7', vds='Host[EV03,ae68e955-0646-4c74-9a78-fc7753e779c7]'})'
> execution failed: java.rmi.ConnectException: Connection timeout

Sorry, the vdsTimeout quoted above is not related to this host; it is for a different host (EV03). The relevant vdsTimeout is the first one in item 4.

Comment 5 Germano Veit Michel 2019-07-19 04:11:35 UTC
Seen again, now on an older 4.2.8:

ovirt-engine-4.2.8.5-0.1.el7ev.noarch
vdsm-jsonrpc-java-1.4.15-1.el7ev.noarch

Comment 10 Germano Veit Michel 2019-07-25 23:16:42 UTC
I opened this bug to investigate the connection issues on the engine side. The engine seems to have given up connecting to VDSM, or is stuck somewhere and not timing out, so it never tries again.

I already reported the storage pool locking problem on the host in BZ1724009.

Comment 11 Martin Perina 2019-07-26 07:27:15 UTC
(In reply to Germano Veit Michel from comment #10)
> This bug I opened to investigate the connection issues on the engine side.
> It seems to have given up connecting to VDSM, or is stuck somewhere and not
> timing out to try again.
> 
> The storage pool locking problem on the host I already reported in BZ1724009

Germano, are you OK with closing this bug as a duplicate of BZ1710818? We have been trying to fix the 'stuck connection' issue there, so I'm not sure what else we could do with this bug unless we have clear reproduction steps.

Comment 13 Germano Veit Michel 2019-07-28 22:59:02 UTC
(In reply to Martin Perina from comment #11)
> (In reply to Germano Veit Michel from comment #10)
> > This bug I opened to investigate the connection issues on the engine side.
> > It seems to have given up connecting to VDSM, or is stuck somewhere and not
> > timing out to try again.
> > 
> > The storage pool locking problem on the host I already reported in BZ1724009
> 
> Germano, are you OK to close this bug as duplicate of BZ1710818? We have
> been trying to fix issue 'connection stucked' there, so not sure what else
> we could do with this bug unless we have clear reproduction steps

If you and Piotr think that BZ probably solves this problem as well, then of course I must agree :)

Comment 14 Martin Perina 2019-08-21 06:01:39 UTC
Closing as a duplicate of BZ1710818; we have fixed some engine-to-host reconnection issues in 4.3.5. But feel free to reopen if this issue can be reproduced.

*** This bug has been marked as a duplicate of bug 1710818 ***

