Bug 1388364 - VDSNetworkException: Timeout during xml-rpc call when adding host
Summary: VDSNetworkException: Timeout during xml-rpc call when adding host
Keywords:
Status: CLOSED DUPLICATE of bug 1388778
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: 4.0.5.2
Hardware: x86_64
OS: Linux
unspecified
urgent vote
Target Milestone: ovirt-4.0.5
: ---
Assignee: Dan Kenigsberg
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-10-25 07:56 UTC by Michael Burman
Modified: 2020-06-11 13:02 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-30 09:24:25 UTC
oVirt Team: Network
gklein: ovirt-4.0.z?
gklein: blocker?
mburman: planning_ack?
mburman: devel_ack?
mburman: testing_ack?


Attachments (Terms of Use)
logs (529.23 KB, application/x-gzip)
2016-10-25 07:56 UTC, Michael Burman
no flags Details
detailed engine.log (2.35 MB, application/x-gzip)
2016-10-27 09:39 UTC, Moti Asayag
no flags Details

Description Michael Burman 2016-10-25 07:56:36 UTC
Created attachment 1213739 [details]
logs

Description of problem:
VDSNetworkException: Timeout during xml-rpc call when adding host.

When adding new host or changing the management network from VM to non-VM(bridge less) we failing with the next error exception:

2016-10-25 10:02:50,589 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVD
SCommandParametersBase:{runAsync='true', hostId='9020e220-a5db-4393-9fe3-a595dd0faa38'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call

This will end up with failure to configure management network on the host, after few minutes the host will be auto recovered and will go up, BUT, the management network won't be persisted, which will cause the host not to come up after reboot. 

Version-Release number of selected component (if applicable):
4.0.5.2-0.2.el7ev

How reproducible:
100

Steps to Reproduce:
1. Add new host to rhv-m 4.0.5.2-0.2.el7ev

Actual results:
- Failed to configure management network on the host
- Host was auto recovered after few minutes
- Management network wasn't persisted
- After reboot host didn't recovered and 'ovirtmgmt' wasn't restored

Expected results:
Should work as expected

Additional info:

Comment 1 Dan Kenigsberg 2016-10-25 10:17:17 UTC
2016-10-25 10:01:45,363 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] START, HostSetupNetworksVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, HostSetupNetworksVdsCommandParameters:{runAsync='true', hostId='9020e220-a5db-4393-9fe3-a595dd0faa38', vds='Host[orchid-vds2.qa.lab.tlv.redhat.com,9020e220-a5db-4393-9fe3-a595dd0faa38]', rollbackOnFailure='true', connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true', bonding='false', networkName='ovirtmgmt', nicName='enp4s0', vlan='null', mtu='0', vmNetwork='true', stp='false', properties='null', ipv4BootProtocol='DHCP', ipv4Address='null', ipv4Netmask='null', ipv4Gateway='null', ipv6BootProtocol='NONE', ipv6Address='null', ipv6Prefix='null', ipv6Gateway='null', switchType='LEGACY'}]', removedNetworks='[]', bonds='[]', removedBonds='[]'}), log id: 76324159
2016-10-25 10:01:45,369 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] FINISH, HostSetupNetworksVDSCommand, log id: 76324159
2016-10-25 10:01:45,398 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Policy reset required for network reconfiguration
2016-10-25 10:01:45,589 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to orchid-vds2.qa.lab.tlv.redhat.com/10.35.128.15
2016-10-25 10:01:50,345 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='9020e220-a5db-4393-9fe3-a595dd0faa38'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2016-10-25 10:01:50,346 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Timeout waiting for VDSM response: Internal timeout occured
2016-10-25 10:01:52,851 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Command 'PollVDSCommand(HostName = orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='9020e220-a5db-4393-9fe3-a595dd0faa38'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call



Vdsm receives the request, and acts upon it after (quite a long) minute
jsonrpc.Executor/3::DEBUG::2016-10-25 10:01:45,855::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'ovirtmgmt': {'ipv6autoconf': False, 'nic': 'enp4s0', 'mtu': 1500, 'switch': 'legacy', 'dhcpv6': False, 'STP': 'no', 'bridged': 'true', 'defaultRoute': True, 'bootproto': 'dhcp'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}
...
jsonrpc.Executor/3::DEBUG::2016-10-25 10:02:41,802::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
jsonrpc.Executor/3::INFO::2016-10-25 10:02:41,802::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.setupNetworks succeeded in 55.94 seconds

and answers pings

jsonrpc.Executor/0::DEBUG::2016-10-25 10:02:43,177::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.ping' in bridge with {}
jsonrpc.Executor/0::DEBUG::2016-10-25 10:02:43,177::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.ping' in bridge with True

7 seconds before Engine declated timeout.

Which version of vdsm-jsonrpc-java was used here?

Comment 2 Michael Burman 2016-10-25 10:26:30 UTC
vdsm-jsonrpc-java-1.2.6-1.el7ev.noarch

The one that coming from http://bob.eng.lab.tlv.redhat.com/builds/4.0/4.0.5-3/el$releasever

Comment 3 Dan Kenigsberg 2016-10-26 05:27:29 UTC
Piotr, could you take a look? It seems awfully familiar.

Comment 4 Piotr Kliczewski 2016-10-26 07:46:20 UTC
Dan based on the logs I see that:

Engine connected to vdsm:
2016-10-25 10:01:42,400 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to orchid-vds2.qa.lab.tlv.redhat.com/10.35.128.15

engine initiated setup networks:
2016-10-25 10:01:45,149 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Lock Acquired to object 'EngineLock:{exclusiveLocks='[HOST_NETWORK9020e220-a5db-4393-9fe3-a595dd0faa38=<HOST_NETWORK, ACTION_TYPE_FAILED_SETUP_NETWORKS_OR_REFRESH_IN_PROGRESS>]', sharedLocks='null'}'

vdsm received setupNetworks call:
jsonrpc.Executor/3::DEBUG::2016-10-25 10:01:45,855::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'ovirtmgmt': {'ipv6autoconf': False, 'nic': 'enp4s0', 'mtu': 1500, 'switch': 'legacy', 'dhcpv6': False, 'STP': 'no', 'bridged': 'true', 'defaultRoute': True, 'bootproto': 'dhcp'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}

vdsm responsed to the call:
jsonrpc.Executor/3::DEBUG::2016-10-25 10:02:41,802::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
jsonrpc.Executor/3::INFO::2016-10-25 10:02:41,802::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.setupNetworks succeeded in 55.94 seconds

engine triggered policy reset after the setupNetworks:
2016-10-25 10:02:58,158 ERROR [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (org.ovirt.thread.pool-6-thread-45) [409fafa2] Command 'org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Policy reset (Failed with error VDS_NETWORK_ERROR and code 5022)

engine declared host as nonResponsive:
2016-10-25 10:02:58,160 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [409fafa2] Host 'orchid-vds2.qa.lab.tlv.redhat.com' is not responding.


The timeout you mentioned is poll timeout not verb timeout. We know that after post second policy reset it seems to be OK. We know that setupNetworks unlocked the host and VdsManager declared it as not responsive. I wonder whether we have a race between setupNetworks and host monitoring.

Comment 5 Moti Asayag 2016-10-27 09:39:24 UTC
Created attachment 1214543 [details]
detailed engine.log

Attached is the detailed engine.log which contains both engine in TRACE log level and jsonrpc in DEBUG.

Comment 6 Michal Skrivanek 2016-10-27 12:46:16 UTC
(In reply to Piotr Kliczewski from comment #4)
> I wonder whether we have a race between setupNetworks and host monitoring.

We very well might have, but then this is not 100% reproducible as the description says. So I wonder...does it really reproduce all the time in manual scenarios?

Comment 7 Michael Burman 2016-10-27 12:50:53 UTC
It is 100% reproducible for me at least)

Comment 8 Gil Klein 2016-10-27 13:10:36 UTC
(In reply to Michal Skrivanek from comment #6)
> (In reply to Piotr Kliczewski from comment #4)
> > I wonder whether we have a race between setupNetworks and host monitoring.
> 
> We very well might have, but then this is not 100% reproducible as the
> description says. So I wonder...does it really reproduce all the time in
> manual scenarios?
I don't think it is reproducible on all environments, but I just got one more report from RHEV Storage QE that it was reproduced in a different environment.

Comment 9 Michael Burman 2016-10-27 14:56:07 UTC
Looks like it the same issue - https://bugzilla.redhat.com/show_bug.cgi?id=1388778

Comment 10 Piotr Kliczewski 2016-10-28 09:41:29 UTC
I would say it is not the same issue but BZ #1388778 fixes this issue.

Comment 11 Yaniv Kaul 2016-10-30 09:24:25 UTC
(In reply to Piotr Kliczewski from comment #10)
> I would say it is not the same issue but BZ #1388778 fixes this issue.

So we can close-duplicate, as I've added a note there to check this scenario when verifying it.

*** This bug has been marked as a duplicate of bug 1388778 ***


Note You need to log in before you can comment on or make changes to this bug.