Bug 1320128 - Host setup fails - network disconnection causes SetupNetwork not to be sent (or not received by host)
Summary: Host setup fails - network disconnection causes SetupNetwork not to be sent (...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: 3.6.3.3
Hardware: x86_64
OS: Linux
urgent
urgent with 2 votes
Target Milestone: ovirt-3.6.6
: 3.6.6.2
Assignee: Martin Mucha
QA Contact: Meni Yakove
URL:
Whiteboard:
: 1315824 1326847 1331344 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-03-22 11:19 UTC by Yaniv Kaul
Modified: 2016-05-30 10:56 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-30 10:56:22 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
rule-engine: planning_ack+
danken: devel_ack+
myakove: testing_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1311317 0 unspecified CLOSED Unattended Ovirt host deploy fails on second host when the engine already imported the hosted-engine storage domain 2021-02-22 00:41:40 UTC
oVirt gerrit 55855 0 master MERGED vdsbroker: wait for reconnection 2016-04-14 07:00:30 UTC
oVirt gerrit 55973 0 ovirt-engine-3.6 MERGED vdsbroker: wait for reconnection 2016-04-14 13:02:56 UTC
oVirt gerrit 56279 0 master MERGED vdsbroker: check connection identifier during policy reset 2016-05-01 07:38:39 UTC
oVirt gerrit 56647 0 master MERGED jsonrpc: version bump 2016-05-01 09:56:23 UTC
oVirt gerrit 56649 0 ovirt-engine-3.6 MERGED jsonrpc: version bump 2016-05-02 18:41:25 UTC
oVirt gerrit 56908 0 ovirt-engine-3.6.6 MERGED jsonrpc: version bump 2016-05-02 20:23:31 UTC

Internal Links: 1311317 1315824 1331481

Description Yaniv Kaul 2016-03-22 11:19:41 UTC
Description of problem:
Since the log rotate for engine log is defined with both minsize and 'daily', it means that it has to match both - more than a day and more than minsize (20M). Since a day hasn't passed since I've installed engine, yet I'm getting communication issues with one of the hosts, my engine log keeps growing with communication exceptions. It's now at 320MB and keeps growing. The repetitive exception: 
2016-03-22 11:16:55,533 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-8-thread-4) [602b6a1e] Command 'PollVDSCommand(HostName = lago_basic_suite_3_6_host0, VdsIdVDSComma
ndParametersBase:{runAsync='true', hostId='b99902a4-92a2-4642-a6ad-ed98f389a94e'})' execution failed: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: Conne
ction issues during send request
2016-03-22 11:16:55,533 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [7fdd752] Connecting to /192.168.200.6
2016-03-22 11:16:58,531 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-8-thread-4) [602b6a1e] Error: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: Connec
tion issues during send request
2016-03-22 11:16:58,531 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-8-thread-4) [602b6a1e] Exception: java.util.concurrent.ExecutionException: org.ovirt.engine.core.vd
sbroker.xmlrpc.XmlRpcRunTimeException: Connection issues during send request
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) [rt.jar:1.7.0_79]
        at java.util.concurrent.FutureTask.get(FutureTask.java:202) [rt.jar:1.7.0_79]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:72) [vdsbroker.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:30) [bll.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.invokeSetupNetworksCommand(HostSetupNetworksCommand.java:423) [bll.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.executeCommand(HostSetupNetworksCommand.java:256) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1215) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1359) [bll.jar:]

...


Version-Release number of selected component (if applicable):
[root@lago_basic_suite_3_6_engine ~]# rpm -qa |grep ovirt
ovirt-engine-sdk-python-3.6.3.0-1.el6.noarch
ovirt-setup-lib-1.0.1-1.el6.noarch
ovirt-host-deploy-1.4.1-1.el6.noarch
ovirt-engine-dbscripts-3.6.3.4-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.6.3.4-1.el6.noarch
ovirt-engine-userportal-3.6.3.4-1.el6.noarch
ovirt-engine-jboss-as-7.1.1-1.el6.x86_64
ovirt-engine-extension-aaa-jdbc-1.0.6-1.el6.noarch
ovirt-engine-cli-3.6.2.0-1.el6.noarch
ovirt-image-uploader-3.6.0-1.el6.noarch
ovirt-engine-setup-base-3.6.3.4-1.el6.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.6.3.4-1.el6.noarch
ovirt-engine-wildfly-8.2.1-1.el6.x86_64
ovirt-vmconsole-proxy-1.0.0-1.el6.noarch
ovirt-engine-websocket-proxy-3.6.3.4-1.el6.noarch
ebay-cors-filter-1.0.1-0.1.ovirt.el6.noarch
ovirt-engine-setup-3.6.3.4-1.el6.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.3.4-1.el6.noarch
ovirt-engine-restapi-3.6.3.4-1.el6.noarch
ovirt-engine-tools-3.6.3.4-1.el6.noarch
ovirt-engine-webadmin-portal-3.6.3.4-1.el6.noarch
ovirt-engine-lib-3.6.3.4-1.el6.noarch
ovirt-engine-extensions-api-impl-3.6.3.4-1.el6.noarch
ovirt-iso-uploader-3.6.0-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.6.3.4-1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch
ovirt-vmconsole-1.0.0-1.el6.noarch
ovirt-engine-wildfly-overlay-8.0.4-1.el6.noarch
ovirt-engine-vmconsole-proxy-helper-3.6.3.4-1.el6.noarch
ovirt-engine-backend-3.6.3.4-1.el6.noarch
ovirt-engine-3.6.3.4-1.el6.noarch


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Oved Ourfali 2016-03-22 11:23:27 UTC
This is part of the setup networks.
I guess they can hide the exception, if needed.
Moving to network.

Comment 2 Yaniv Kaul 2016-03-27 09:54:46 UTC
I'm also seeing:
2016-03-27 05:33:32,409 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-10-thread-2) [705b4264] Error: null
2016-03-27 05:33:32,410 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-10-thread-2) [705b4264] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc.<init>(StatusForXmlRpc.java:20) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc.<init>(StatusOnlyReturnForXmlRpc.java:13) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:66) [vdsbroker.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:31) [bll.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.invokeSetupNetworksCommand(HostSetupNetworksCommand.java:433) [bll.jar:]
        at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.executeCommand(HostSetupNetworksCommand.java:294) [bll.jar:]

Comment 3 Yaniv Kaul 2016-03-27 09:55:23 UTC
Above causes host installation to fail. Setting as High and AutomationBlocker as it affects testing in Lago.

Comment 5 Martin Mucha 2016-04-05 08:07:22 UTC
Hi, I need a help with deciding how to resolve this bug. As one of comments were private, I'm sticking to that policy as well.

There are three exceptions mentioned in comments. All of them are nothing new, but I think first two (in order I mention them) should be fixed. Log size is influenced by error in poller, which was fixed by https://gerrit.ovirt.org/#/c/54644/ and backported to 3.6 and 3.6.5

• comment 4: Exception text from this command:

 "Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues" 

leads to FutureVDSCommand. Any(literally) exception catched here except for TimeoutException is only logged and (probably null) return value is returned. If this is source of excessive logging (I think it is), then it needs to be refactored and it affects whole project. It's not possible to 'swallow' this exception in Poller (network code), as it's logged by superclass FutureVDSCommand. And that is reached via calling FutureVDSCommandType.Poll.

• source of NPE exception from comment 2:
it's very same issue as comment 4. 'Something bad' happens and FutureVDSCommand logs it. We cannot influence it in poller. 'Something bad' here probably means, that response for  FutureVDSCommandType.Poll does not contain message key in innerMap parameter. See:
org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc#StatusForXmlRpc(java.util.Map<java.lang.String,java.lang.Object>)

• and finally exception from comment 1:
it's very same issue as comment 4. 'Something bad' happens and FutureVDSCommand logs it. We cannot influence it in poller. 'Something bad' here probably means, that getBroker().poll() failed with some exception. That's probably ok, as there might be some error during run of HostSetupNetworksCommand

Comment 7 Yaniv Kaul 2016-04-05 09:08:04 UTC
(In reply to Martin Mucha from comment #5)
> Hi, I need a help with deciding how to resolve this bug. As one of comments
> were private, I'm sticking to that policy as well.
> 
> There are three exceptions mentioned in comments. All of them are nothing
> new, but I think first two (in order I mention them) should be fixed. Log
> size is influenced by error in poller, which was fixed by
> https://gerrit.ovirt.org/#/c/54644/ and backported to 3.6 and 3.6.5
> 

I think we are looking at this from the wrong direction. The question we should ask ourselves is why, while Engine is sending a setup network command to the host, the host never receives it. My feeling is that it is caused by the *2nd* disconnect (which I'm not seeing when all is well). In comment 4 , it's:
JsonRpc (StompReactor)::ERROR::2016-03-28 14:51:36,220::betterAsyncore::124::vds.dispatcher::(recv) SSL error during reading data: unexpected eof

- and as you can see, this is NOT happening when all is well.

Comment 8 Yaniv Kaul 2016-04-13 21:03:38 UTC
*** Bug 1326847 has been marked as a duplicate of this bug. ***

Comment 9 Red Hat Bugzilla Rules Engine 2016-04-14 12:35:25 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 10 Charlie Inglese 2016-04-14 18:05:21 UTC
Is there any possibility of having this fixed in the 3.6.x release as well? This breaks our deployment currently, and waiting for 4.x would be a show stopper for us.

Comment 11 Yaniv Kaul 2016-04-14 19:51:25 UTC
(In reply to Charlie Inglese from comment #10)
> Is there any possibility of having this fixed in the 3.6.x release as well?
> This breaks our deployment currently, and waiting for 4.x would be a show
> stopper for us.

Definitely. However, our practice is to always fix for the next release before back-porting to existing release. We will fix it promptly for 3.6.x as soon as it's completed in 4.0.

Comment 12 Sven Kieske 2016-04-22 08:18:19 UTC
(In reply to Yaniv Kaul from comment #11)
> (In reply to Charlie Inglese from comment #10)
> > Is there any possibility of having this fixed in the 3.6.x release as well?
> > This breaks our deployment currently, and waiting for 4.x would be a show
> > stopper for us.
> 
> Definitely. However, our practice is to always fix for the next release
> before back-porting to existing release. We will fix it promptly for 3.6.x
> as soon as it's completed in 4.0.

But this does not imply that you wait for 4.0 release (which will be months away)
until you backport to 3.6 ?

this is also a showstopper for me (see BZ 1329317 ) and breaking any deployment.

Thanks

Sven

Comment 13 Sven Kieske 2016-04-22 08:22:49 UTC
I just saw this was already merged in 3.6.

Is there a snapshot release which I might be able to test?

I need this fix asap.

Thanks

Sven

Comment 14 Yaniv Kaul 2016-04-24 12:17:01 UTC
(In reply to Sven Kieske from comment #12)
> (In reply to Yaniv Kaul from comment #11)
> > (In reply to Charlie Inglese from comment #10)
> > > Is there any possibility of having this fixed in the 3.6.x release as well?
> > > This breaks our deployment currently, and waiting for 4.x would be a show
> > > stopper for us.
> > 
> > Definitely. However, our practice is to always fix for the next release
> > before back-porting to existing release. We will fix it promptly for 3.6.x
> > as soon as it's completed in 4.0.
> 
> But this does not imply that you wait for 4.0 release (which will be months
> away)
> until you backport to 3.6 ?

Of course not. As soon as it's in 4.0 and verified, we backport.

> 
> this is also a showstopper for me (see BZ 1329317 ) and breaking any
> deployment.

How often does it happen? We see it as a race that happens from time to time and that re-installation solves it. We now suspect that IPv6 on the host causes it to be more apparent - and are looking into it as well.

> 
> Thanks
> 
> Sven

Comment 15 Sven Kieske 2016-04-27 13:27:36 UTC
Hi,

I redeployed today the same hosts and still get some errors.

I agree it doesn't happen everytime (I was able to reinstall host from GUI), but it totally breaks automation.

Here's an excerpt from engine.log:

grep -A26 "Connection issues during send request" /var/log/ovirt-engine/engine.log
2016-04-27 10:49:22,859 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-77) [] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: Connection issues during send request, log id: 18ba111
2016-04-27 10:49:22,859 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-77) [] Exception: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: Connection issues during send request
	at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcIIrsServer.getStoragePoolInfo(JsonRpcIIrsServer.java:398) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:31) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:161) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData.proceedStoragePoolStats(IrsProxyData.java:336) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed(IrsProxyData.java:238) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) [:1.8.0_71]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_71]
	at java.lang.reflect.Method.invoke(Method.java:497) [rt.jar:1.8.0_71]
	at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
	at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.waitForConnect(SSLStompClient.java:107) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.sendMessage(SSLStompClient.java:77) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:73) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:68) [vdsbroker.jar:]
	... 15 more


You can find the version information regarding engine and vdsm here:

https://bugzilla.redhat.com/show_bug.cgi?id=1329317#c20

(I still don't know if I'm facing 1, 2 or more bugs here).


You can also find complete logs in the above mentioned BZ (tagged as the "new" logs).

HTH

Sven

Comment 16 Piotr Kliczewski 2016-04-29 08:50:27 UTC
I see from description that this BZ is related to or the same as BZ #1329317.

Comment 17 Dan Kenigsberg 2016-05-01 12:36:18 UTC
*** Bug 1331344 has been marked as a duplicate of this bug. ***

Comment 18 Charlie Inglese 2016-05-09 12:36:49 UTC
*** Bug 1315824 has been marked as a duplicate of this bug. ***

Comment 19 Michael Burman 2016-05-15 11:59:08 UTC
Verified on - 3.6.6.2-0.1.el6 and vdsm-4.17.28-0.el7ev.noarch


Note You need to log in before you can comment on or make changes to this bug.