Created attachment 1235539 [details]
engine log

Description of problem:
Please see the attached engine.log - for some unknown reason, the connection failed during host add. The host is responsive to SSH. However, it looks like the engine is endlessly retrying communication, flooding the engine log with repetitive exceptions. That's bound to cause log rotation and more.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20161227112055.gitdd6737f.el7.centos.noarch

How reproducible:
Not sure.
Created attachment 1235540 [details]
vdsm log
First failure:

2016-12-27 16:01:08,193-05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7f3923b7] Connecting to lago-basic-suite-master-host1/192.168.201.4
2016-12-27 16:01:08,197-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [5e53d5b1] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2016-12-27 16:01:08,197-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [5e53d5b1] Exception: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) [rt.jar:1.8.0_111]
	at java.util.concurrent.FutureTask.get(FutureTask.java:206) [rt.jar:1.8.0_111]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:68) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]
	at org.ovirt.engine.core.bll.host.HostConnectivityChecker.check(HostConnectivityChecker.java:21) [bll.jar:]
	at org.ovirt.engine.core.bll.network.NetworkConfigurator.awaitVdsmResponse(NetworkConfigurator.java:123) [bll.jar:]
Note: this host-non-responsive issue only happens when I add the host using API v4 - not sure how that makes a difference. See the code at https://gerrit.ovirt.org/#/c/65285/9/basic-suite-master/test-scenarios/002_bootstrap.py for how I perform it. When I use v3, all is well.
IIUC it happens only when adding hosts. Am I correct? As you can see above, this is part of the host setup networks poller, which polls until the host network is up. Moving this to the network team to investigate. However, since this happens only when adding a host, I don't think it should be on high severity, so consider reducing the severity. Up to you.
IIUC comment #2 (the engine.log, I guess) refers to lago-basic-suite-master-host1, whereas the vdsm.log is from lago-basic-suite-master-host0 (note the different number at the end).

The engine log includes connectivity issues for both of the hosts from different flows (GetCapabilitiesVDSCommand/HostSetupNetworkPoller and GetAllVmStatsVDSCommand/PollVmStatsRefresher). This makes me think that there is a networking issue on the engine host side, and it is certainly not an issue related to a specific flow.

org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:140) - the vdsm-jsonrpc-java client code catches IOException and throws another exception, but doesn't include the original cause, which makes understanding the real problem nearly impossible.

As far as I can understand from the logs, the issue is in creating the connection rather than in maintaining an open connection.
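For illustration of the fix being discussed, here is a minimal sketch (not the actual vdsm-jsonrpc-java code; `ConnectExample` and `ClientConnectionException` are made-up names) of a connect wrapper that passes the caught IOException along as the cause, so the root failure survives into the engine log instead of being swallowed:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectExample {
    // Stand-in for whatever exception type the jsonrpc client actually throws.
    static class ClientConnectionException extends Exception {
        ClientConnectionException(String message, Throwable cause) {
            super(message, cause); // chaining keeps the original stack trace
        }
    }

    static Socket connect(String host, int port, int timeoutMillis)
            throws ClientConnectionException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return socket;
        } catch (IOException e) {
            try { socket.close(); } catch (IOException ignored) { }
            // Wrap, but pass 'e' as the cause instead of dropping it,
            // so e.g. a NoRouteToHostException stays visible to the caller.
            throw new ClientConnectionException(
                    "Connection issues during send request", e);
        }
    }
}
```

With the cause attached, `getCause()` (and the logged stack trace) shows the underlying java.net exception rather than only the generic wrapper message.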
Piotr, would you take this patch in? It may give valuable info the next time this bug occurs.
Patch acked, but it needs to be verified. Once it is, I am going to merge it.
Do you have an insight regarding the problem seen here?
We have additional info on what happened, so this stack trace is not that critical. In most situations we need debug logs anyway.
(In reply to Piotr Kliczewski from comment #9)
> We have additional info on what happened, so this stack trace is not that
> critical. In most situations we need debug logs anyway.

Piotr - debug logs of what? Engine? VDSM? What is needed here?
(In reply to Yaniv Kaul from comment #10)
> Piotr - debug logs of what? Engine? VDSM? What is needed here?

The concern was that we may lose a bit of information with the patch. My point is that the stack trace itself is not the only info we need to diagnose complex issues. To understand them we need the engine log in debug anyway.
(In reply to Piotr Kliczewski from comment #11)
> The concern was that we may lose a bit of information with the patch. My
> point is that the stack trace itself is not the only info we need to
> diagnose complex issues. To understand them we need the engine log in
> debug anyway.

I was under the impression that ovirt-system-tests sets it to run in debug. If this is not the case, please send a patch.
The patch was already merged but we need to update the version which I am going to do today.
We've added some more debug information to the exception traceback. If this ever reproduces, please add a fresh log so we can dig a bit deeper.
Created attachment 1242530 [details]
engine log
(In reply to Yaniv Kaul from comment #15)
> Created attachment 1242530 [details]
> engine log

Reproduced on master.
(In reply to Yaniv Kaul from comment #16)
> Reproduced on master.

And again - see http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/42/artifact/exported-artifacts/basic_suite_master__logs/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
It seems strange to me that at 2017-02-06 08:51:36,926-05 we were able to connect to the host successfully over SSH, but at 2017-02-06 08:51:37,022-05, one second later, we received "No route to host". That would indicate that we have some routing and/or firewall issue ...

Anyway Dan, shouldn't this "No route to host" exception be handled in the setup networks part of the code and fail the whole installation of the host?
"No route to host" is a low-level Java exception which means that there are issues with network connectivity. At the time the exception occurred, setupNetworks was not running, so there is no way to handle it there. We need to check networking settings before progressing.

@Martin: ssh succeeded to connect at 2017-01-19 11:38:46,636-05 and we saw the exception at 2017-01-19 11:41:02,564-05, which means that networking (routing) stopped working in between.
I've just noticed that NoRouteToHostException is also raised when the default firewall rules are still present on the host (which means dropping all disallowed packets). So are we sure that the firewall is properly reconfigured by the time the host-deploy flow is finished? If not (and the firewall rules are not yet reloaded), then it might fail PollVDSCommand ...
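The ambiguity above can be shown with a small probe sketch (illustrative only; `ProbeExample` is a made-up name, not oVirt code). A firewall REJECT rule typically answers with icmp-host-prohibited, which the JDK surfaces as NoRouteToHostException - the same exception type as a genuine routing problem - so from the engine side the two cases are indistinguishable:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ProbeExample {
    // Classify a failed TCP connect attempt. Note that a firewall REJECT
    // (icmp-host-prohibited) and a real routing problem both land in the
    // NoRouteToHostException branch - the caller cannot tell them apart.
    static String classify(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "connected";
        } catch (NoRouteToHostException e) {
            return "no route (routing problem or firewall REJECT)";
        } catch (ConnectException e) {
            return "refused (host reachable, port closed)";
        } catch (SocketTimeoutException e) {
            return "timeout (possibly a firewall DROP)";
        } catch (IOException e) {
            return "other I/O error: " + e;
        }
    }
}
```

A DROP rule, by contrast, silently discards the packet and usually shows up as a connect timeout rather than "No route to host".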
Hi,

Please describe how this report should be tested and the steps to reproduce.
(In reply to Michael Burman from comment #21)
> Please tell how this report should be tested and steps to reproduce.

Honestly, no idea.
Nobody really knows. The only verification possible is to grep for org.ovirt.engine.core.vdsbroker.TransportRunTimeException in engine.log during tons of add-host attempts. Close the bug if you don't find a flood of this exception.
Dan, I have added host vdsm-4.20.1-147.git53e5704.el7.centos.x86_64 to a 4.2.0-0.0.master.20170705100427.gitb0385be.el7.centos engine. The add host was successful, but the grep from the engine log returns (this is from only one add-host attempt):

2017-07-06 14:27:49,100+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-37) [25878b26] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 14:27:49,101+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-37) [25878b26] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='8c929bc9-8d51-456f-86ef-41d5d1ab4d62'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:24:03,356+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-16) [2f45cf2b] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:24:03,357+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-16) [2f45cf2b] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f2c828bb-999d-41ff-9fc8-b0d187253f88'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:25:29,185+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [46ae7408] Error: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request
2017-07-06 16:25:29,186+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-2) [46ae7408] Command 'PollVDSCommand(HostName = silver-vdsb.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='a47a2dd7-a36d-4553-b349-32eba8dff96b'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.TransportRunTimeException: Connection issues during send request

Is it ok?
Danken believes that this is fine, but Piotr, would you agree to review this, now that Yevgeny is gone?
Yes, it is OK, those exceptions are expected. We could reduce the number of occurrences by increasing ping timeout.
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.