Bug 1408870 - Flood of endless exceptions in engine.log when host is non-responsive (unknown cause)
Summary: Flood of endless exceptions in engine.log when host is non-responsive (unknown cause)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Yevgeny Zaspitsky
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-27 21:17 UTC by Yaniv Kaul
Modified: 2017-12-20 10:59 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:59:50 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-4.2+


Attachments
engine log (180.03 KB, application/x-xz) 2016-12-27 21:17 UTC, Yaniv Kaul
vdsm log (16.86 KB, text/plain) 2016-12-27 21:18 UTC, Yaniv Kaul
engine log (4.78 MB, text/plain) 2017-01-19 16:48 UTC, Yaniv Kaul


Links
System ID Status Summary Last Updated
oVirt gerrit 69485 MERGED Add cause exception upon re-throwing IOException in ReactorClient 2019-12-02 17:33:36 UTC
oVirt gerrit 69891 MERGED Version bump 2019-12-02 17:33:36 UTC
oVirt gerrit 69893 MERGED Version bump 2019-12-02 17:33:36 UTC
oVirt gerrit 72194 MERGED engine: handle TransportRunTimeException in PollVDSCommand 2019-12-02 17:33:36 UTC

Description Yaniv Kaul 2016-12-27 21:17:39 UTC
Created attachment 1235539 [details]
engine log

Description of problem:
Please see the attached engine.log - for some unknown reason, the connection failed during host add. The host is responsive to SSH. However, it looks like the Engine is endlessly trying to communicate, flooding the log with repetitive exceptions. That's bound to cause log rotation and more.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20161227112055.gitdd6737f.el7.centos.noarch

How reproducible:
Not sure.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yaniv Kaul 2016-12-27 21:18:20 UTC
Created attachment 1235540 [details]
vdsm log

Comment 2 Yaniv Kaul 2016-12-27 21:19:59 UTC
First failure:
2016-12-27 16:01:08,193-05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7f3923b7] Connecting to lago-basic-suite-master-host1/192.168.201.4
2016-12-27 16:01:08,197-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [5e53d5b1] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2016-12-27 16:01:08,197-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [5e53d5b1] Exception: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) [rt.jar:1.8.0_111]
        at java.util.concurrent.FutureTask.get(FutureTask.java:206) [rt.jar:1.8.0_111]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:68) [vdsbroker.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]
        at org.ovirt.engine.core.bll.host.HostConnectivityChecker.check(HostConnectivityChecker.java:21) [bll.jar:]
        at org.ovirt.engine.core.bll.network.NetworkConfigurator.awaitVdsmResponse(NetworkConfigurator.java:123) [bll.jar:]

Comment 3 Yaniv Kaul 2016-12-28 08:03:13 UTC
Note: this host-non-responsive state only happens when I try to add the host using the v4 API - not sure how that makes a difference. See the code at https://gerrit.ovirt.org/#/c/65285/9/basic-suite-master/test-scenarios/002_bootstrap.py for how I perform it. When I use v3, all is well.

Comment 4 Oved Ourfali 2016-12-29 07:40:01 UTC
IIUC it happens only when adding hosts. Am I correct?
As you can see above, this is part of the host setup networks poller, which polls until the host network is up.

Moving this to the network team to investigate. However, as this happens only when adding a host, I don't think it should be high severity, so consider reducing it. Up to you.
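
The poller mentioned above is, in essence, a retry loop: a lightweight Poll command is sent until the host answers or a timeout expires, and each failed attempt produces an ERROR pair like the ones in comment #2. A minimal sketch of that pattern in Java (hypothetical names, not the actual HostSetupNetworkPoller implementation):

import java.util.concurrent.TimeUnit;

public class SetupNetworksPoller {
    // Hypothetical probe interface; the real code issues a PollVDSCommand.
    interface Probe {
        boolean poll() throws Exception;
    }

    // Retry the probe until it succeeds or the timeout expires.
    static boolean awaitHost(Probe probe, long timeoutSeconds) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        while (System.nanoTime() < deadline) {
            try {
                if (probe.poll()) {
                    return true; // host answered
                }
            } catch (Exception e) {
                // Each failed attempt gets logged; with a short poll interval
                // and a host that never answers, this is what floods engine.log.
            }
            TimeUnit.SECONDS.sleep(2); // back off between attempts
        }
        return false; // gave up; the caller should fail the add-host flow
    }
}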

Comment 5 Yevgeny Zaspitsky 2017-01-03 11:33:02 UTC
IIUC comment #2 (engine.log, I guess) refers to lago-basic-suite-master-host1, whereas the vdsm.log is from lago-basic-suite-master-host0 (note the different number at the end).

The engine log includes connectivity issues for both hosts from different flows (GetCapabilitiesVDSCommand/HostSetupNetworkPoller and GetAllVmStatsVDSCommand/PollVmStatsRefresher), which makes me think there is a networking issue on the engine host side; it certainly isn't an issue tied to a specific flow.

org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:140) - the vdsm-jsonrpc-java client code catches IOException and throws another exception, but doesn't include the original cause, which makes understanding the real problem nearly impossible.

As far as I can understand from the logs, the issue is in creating the connection rather than in maintaining an open one.
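
The patch referenced here (gerrit 69485, linked above) amounts to passing the caught IOException along as the cause when re-throwing. A minimal, self-contained Java illustration of the pattern, with a hypothetical exception class standing in for the client's own:

import java.io.IOException;

// Hypothetical stand-in for the client's transport exception.
class TransportException extends RuntimeException {
    TransportException(String message, Throwable cause) {
        super(message, cause);
    }
}

public class CauseChainingDemo {
    static void connect() throws IOException {
        // Simulate the low-level failure seen in this bug.
        throw new IOException("No route to host");
    }

    public static void main(String[] args) {
        try {
            connect();
        } catch (IOException e) {
            // Passing 'e' as the cause preserves the original stack trace,
            // which then appears as "Caused by: java.io.IOException: ..."
            // instead of being silently dropped.
            throw new TransportException("Connection issues during send request", e);
        }
    }
}

Run as-is, the stack trace ends with "Caused by: java.io.IOException: No route to host", which is precisely the detail missing from the traces in comment #2.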

Comment 6 Dan Kenigsberg 2017-01-04 10:15:29 UTC
Piotr, would you take this patch in? It may give valuable info the next time this bug occurs.

Comment 7 Piotr Kliczewski 2017-01-04 11:02:08 UTC
Patch acked, but it needs to be verified. Once it is, I am going to merge it.

Comment 8 Dan Kenigsberg 2017-01-04 11:43:40 UTC
Do you have any insight into the problem seen here?

Comment 9 Piotr Kliczewski 2017-01-04 12:41:54 UTC
We have additional info about what happened, so this stack trace is not that critical. In most situations we need debug logs anyway.

Comment 10 Yaniv Kaul 2017-01-09 11:15:36 UTC
(In reply to Piotr Kliczewski from comment #9)
> We have additional info about what happened, so this stack trace is not
> that critical. In most situations we need debug logs anyway.

Piotr - debug log of what? Engine? VDSM? What is needed here?

Comment 11 Piotr Kliczewski 2017-01-09 11:32:28 UTC
(In reply to Yaniv Kaul from comment #10)
> (In reply to Piotr Kliczewski from comment #9)
> > We have additional info about what happened, so this stack trace is not
> > that critical. In most situations we need debug logs anyway.
> 
> Piotr - debug log of what? Engine? VDSM? What is needed here?

The concern was that we may lose a bit of information with the patch. My point is that the stack trace itself is not the only info we need to diagnose complex issues. To understand them, we need the engine log at debug level anyway.

Comment 12 Yaniv Kaul 2017-01-09 11:59:11 UTC
(In reply to Piotr Kliczewski from comment #11)
> (In reply to Yaniv Kaul from comment #10)
> > (In reply to Piotr Kliczewski from comment #9)
> > > We have additional info about what happened, so this stack trace is not
> > > that critical. In most situations we need debug logs anyway.
> > 
> > Piotr - debug log of what? Engine? VDSM? What is needed here?
> 
> The concern was that we may lose a bit of information with the patch. My
> point is that the stack trace itself is not the only info we need to
> diagnose complex issues. To understand them, we need the engine log at
> debug level anyway.

I was under the impression that ovirt-system-tests sets it to run in debug. If this is not the case, please send a patch.

Comment 13 Piotr Kliczewski 2017-01-09 12:25:35 UTC
The patch was already merged, but we need to update the version, which I am going to do today.

Comment 14 Dan Kenigsberg 2017-01-19 14:56:24 UTC
We've added some more debug information to the exception traceback. If this ever reproduces, please add a fresh log so we can dig a bit deeper.

Comment 15 Yaniv Kaul 2017-01-19 16:48:05 UTC
Created attachment 1242530 [details]
engine log

Comment 16 Yaniv Kaul 2017-01-19 16:48:33 UTC
(In reply to Yaniv Kaul from comment #15)
> Created attachment 1242530 [details]
> engine log

Reproduced on Master.

Comment 18 Martin Perina 2017-02-07 10:10:41 UTC
It seems strange to me that at "2017-02-06 08:51:36,926-05" we were able to successfully connect to the host over SSH, but at "2017-02-06 08:51:37,022-05", less than a second later, we receive "No route to host". That would indicate that we have some routing and/or firewall issue ...

Anyway, Dan, shouldn't this "No route to host" exception be handled in the setup-networks part of the code and fail the whole installation of the host?

Comment 19 Piotr Kliczewski 2017-02-13 08:42:03 UTC
"No route to host" is low level java exception which means that there are issues with network connectivity. At the time the exception occurred setupNetworks is not running and there is no way to handle it there. We need to check networking setting before progressing.

@Martin ssh succeeded to connect at 2017-01-19 11:38:46,636-05 and we saw the exception at 2017-01-19 11:41:02,564-05 which means that networking (routing) stopped working in between.

Comment 20 Martin Perina 2017-02-13 10:59:13 UTC
I've just noticed that NoRouteToHostException is also raised when the default firewall rules are still present on the host (which means all disallowed packets are dropped), so are we sure that the firewall is properly reconfigured when the host-deploy flow is finished? If not (and the firewall rules are not yet reloaded), then it might fail PollVdsCommand ...
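
For what it's worth, both readings fit how the JDK surfaces this condition: java.net.NoRouteToHostException is thrown on connect() for genuine routing failures and also when a firewall REJECT rule answers with ICMP host-prohibited, while a plain DROP rule usually shows up as a connect timeout instead. A small probe sketch (placeholder host and port) that separates the cases:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;

public class ProbeHost {
    public static void main(String[] args) {
        // Placeholder address; substitute the host and port under test.
        InetSocketAddress target = new InetSocketAddress("192.168.201.4", 54321);
        try (Socket socket = new Socket()) {
            socket.connect(target, 2000); // 2-second connect timeout
            System.out.println("connected");
        } catch (NoRouteToHostException e) {
            // Raised for missing routes and for firewall REJECT rules
            // that respond with ICMP host-prohibited.
            System.out.println("no route / firewall reject: " + e.getMessage());
        } catch (IOException e) {
            // Timeouts, connection refused, etc. surface here instead.
            System.out.println("other I/O failure: " + e.getMessage());
        }
    }
}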

Comment 21 Michael Burman 2017-07-04 14:00:02 UTC
Hi

Please describe how this report should be tested and the steps to reproduce.

Comment 22 Yaniv Kaul 2017-07-04 15:10:30 UTC
(In reply to Michael Burman from comment #21)
> Hi
> 
> Please describe how this report should be tested and the steps to reproduce.

Honestly, no idea.

Comment 23 Dan Kenigsberg 2017-07-05 08:12:22 UTC
Nobody really knows. The only verification possible is to grep for org.ovirt.engine.core.vdsbroker.TransportRunTimeException in engine.log during tons of add-host attempts. Close the bug if you don't find a flood of this exception.

Comment 24 Michael Burman 2017-07-06 13:45:05 UTC
Dan

I have added host vdsm-4.20.1-147.git53e5704.el7.centos.x86_64 to a 4.2.0-0.0.master.20170705100427.gitb0385be.el7.centos engine. The add host was successful, but the grep from the engine log returns (this is from only one add-host attempt):

2017-07-06 14:27:49,100+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-37) [25878b26] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 14:27:49,101+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-37) [25878b26] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='8c929bc9-8d51-456f-86ef-41d5d1ab4d62'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:24:03,356+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-16) [2f45cf2b] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:24:03,357+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-16) [2f45cf2b] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f2c828bb-999d-41ff-9fc8-b0d187253f88'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:25:29,185+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [46ae7408] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:25:29,186+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [46ae7408] Command 'PollVDSCommand(HostName = silver-vdsb.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='a47a2dd7-a36d-4553-b349-32eba8dff96b'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request

Is it ok?

Comment 25 Meni Yakove 2017-07-10 07:52:31 UTC
Danken believes that this is fine, but Piotr, would you agree to review this, now that Yevgeny is gone?

Comment 26 Piotr Kliczewski 2017-07-10 07:56:19 UTC
Yes, it is OK; those exceptions are expected. We could reduce the number of occurrences by increasing the ping timeout.

Comment 27 Sandro Bonazzola 2017-12-20 10:59:50 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

