Bug 1262051 - Host moves from 'up' to 'connecting' and back to 'up' from time to time
Summary: Host moves from 'up' to 'connecting' and back to 'up' from time to time
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: 3.6.1
Assignee: Marcin Mirecki
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-10 17:39 UTC by Meni Yakove
Modified: 2016-04-20 01:32 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 01:32:51 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
engine logs (642.94 KB, application/x-bzip)
2015-09-10 17:40 UTC, Meni Yakove
hosts logs (host_mixed_1 | 10.35.128.28) (156.53 KB, application/x-bzip)
2015-09-10 17:41 UTC, Meni Yakove
engine, vdsm and supervdsm logs - after fix (263.78 KB, application/zip)
2016-01-12 13:59 UTC, Meni Yakove


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 48754 0 master MERGED engine: update setupNetworks heartbeat policy only for mgmt networks 2020-06-03 16:08:53 UTC
oVirt gerrit 49508 0 ovirt-engine-3.6 MERGED engine: update setupNetworks heartbeat policy only for mgmt networks 2020-06-03 16:08:53 UTC
oVirt gerrit 49641 0 ovirt-engine-3.6.1 MERGED engine: update setupNetworks heartbeat policy only for mgmt networks 2020-06-03 16:08:53 UTC

Description Meni Yakove 2015-09-10 17:39:08 UTC
Description of problem:
When running automation tests, the host moves from 'up' status to 'connecting' and then, after a few seconds, back to 'up'.

Version-Release number of selected component (if applicable):
vdsm-4.17.6-1.el7ev.noarch
rhevm-3.6.0-0.15.master.el6.noarch


Steps to Reproduce:
1. Run network automation

Actual results:
The host moves to 'connecting' status and then back to 'up'.

Expected results:
The host should stay in 'up' status.

Comment 1 Meni Yakove 2015-09-10 17:40:43 UTC
Created attachment 1072306 [details]
engine logs

Comment 2 Meni Yakove 2015-09-10 17:41:26 UTC
Created attachment 1072308 [details]
hosts logs (host_mixed_1 | 10.35.128.28)

Comment 3 Piotr Kliczewski 2015-09-11 07:07:04 UTC
From what I can see in the logs, there was a communication disruption caused by:

2015-09-10 19:44:14,463 INFO  [org.ovirt.engine.core.bll.MultipleActionsRunner] (ajp-/127.0.0.1:8702-3) [658cfcc] ParallelMultipleActionsRunner of type 'PersistentSetupNetworks' invoked with no actions
2015-09-10 19:44:17,977 INFO  [org.ovirt.engine.core.bll.network.host.SetupNetworksCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Running command: SetupNetworksCommand internal: false. Entities affected :  ID: 7ac23e28-ae4f-4c03-8b6f-4904fa439db5 Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN
2015-09-10 19:44:17,981 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] START, SetupNetworksVDSCommand(HostName = host_mixed_1, SetupNetworksVdsCommandParameters:{runAsync='true', hostId='7ac23e28-ae4f-4c03-8b6f-4904fa439db5', force='false', checkConnectivity='true', connectivityTimeout='60', networks='[Network:{id='b9e8aa84-65be-424e-acff-5c8a52fe19d0', description='null', comment='null', subnet='null', gateway='null', type='null', vlanId='2', stp='false', dataCenterId='0abffcd5-d9f7-49d9-875c-01a9c66a8d5a', mtu='0', vmNetwork='true', cluster='NetworkCluster:{id='NetworkClusterId:{clusterId='null', networkId='null'}', status='OPERATIONAL', display='false', required='false', migration='false', management='false', gluster='false'}', providedBy='null', label='null', qosId='null'}]', bonds='[]', interfaces='[Nic:{id='null', name='enp1s0f1.2', vdsId='null', networkName='case2_net1', bootProtocol='STATIC_IP', address='1.1.1.1', subnet='255.255.255.0', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='null', bondName='null', speed='null', labels='null'}, Nic:{id='e1a79fad-8884-45e6-93bb-ec73b60f7ff4', name='enp1s0f0', vdsId='null', networkName='ovirtmgmt', bootProtocol='DHCP', address='10.35.128.28', subnet='255.255.255.0', gateway='10.35.128.254', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:58', bondName='null', speed='null', labels='null'}, Nic:{id='3dc26820-7ffc-40d1-bee9-c5a14aa962eb', name='enp1s0f1', vdsId='null', networkName='null', bootProtocol='NONE', address='', subnet='', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:59', bondName='null', speed='null', labels='[]'}]', removedNetworks='[]', removedBonds='[]'}), log id: 176cd21a
2015-09-10 19:44:17,981 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Host 'host_mixed_1' ('7ac23e28-ae4f-4c03-8b6f-4904fa439db5') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks
2015-09-10 19:44:17,995 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] FINISH, SetupNetworksVDSCommand, log id: 176cd21a


After this command there was no communication, and vdsm was restarted at:

MainThread::DEBUG::2015-09-10 19:50:37,644::vdsm::68::vds::(sigtermHandler) Received signal 15

As a result, many commands which were running before started to time out, which caused short communication disruptions.

Dan, can you please take a look?

Comment 4 Martin Mucha 2015-09-14 08:43:31 UTC
I don't know how we ended up here, but from the given message and the related code it seems that SetupNetworks was invoked on a host which does not have 'SupportedClusterVersions' set, so we cannot determine whether it supports 'defaultRoute'. We therefore assume it is not supported, and the default route is not set. This can cause the engine and vdsm to lose connectivity with each other. Can you please confirm that the problematic host does not have SupportedClusterVersions set, and if it does not, whether that is a valid state?
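
For illustration, a minimal Python sketch of the guard described above (the real engine code is Java, in SetupNetworksVDSCommand, and all names here are hypothetical): when the host's supported cluster versions are unknown, the engine cannot tell whether 'defaultRoute' is supported, so it omits it from the SetupNetworks parameters.

import logging
from dataclasses import dataclass, field
from typing import List

log = logging.getLogger(__name__)

@dataclass
class Host:
    name: str
    supported_cluster_versions: List[str] = field(default_factory=list)

def build_setup_networks_params(host, networks):
    # Hypothetical sketch only; it mirrors the WARN line quoted in comment 3.
    params = {"networks": networks, "checkConnectivity": True}
    if not host.supported_cluster_versions:
        # Capabilities unknown -> assume 'defaultRoute' is unsupported and omit it.
        log.warning("Host '%s' doesn't contain Supported Cluster Versions, "
                    "'defaultRoute' will not be sent", host.name)
    else:
        params["defaultRoute"] = True
    return params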

Comment 5 Piotr Kliczewski 2015-09-14 09:16:05 UTC
Can you please check the setting?

Comment 6 Martin Mucha 2015-09-14 10:07:13 UTC
It seems that it may be related to bug 1209570.

Comment 7 Martin Mucha 2015-09-15 13:40:47 UTC
I do not know how to reproduce this, but I submitted a patch trying to fix 46181; maybe this bug will be fixed by it as well.

Comment 8 Yaniv Lavi 2015-10-07 09:30:07 UTC
Can you try to recreate with the new setup networks?

Comment 9 Meni Yakove 2015-10-14 09:47:24 UTC
This is with the new API.

Comment 11 Gil Klein 2015-12-01 06:53:16 UTC
Raising the priority and setting the automation blocker flag back. This is blocking tier 1 testing in CI.

Comment 12 Eyal Edri 2015-12-09 09:05:22 UTC
Part of the 3rd ovirt-3.6.1 milestone build.

Comment 13 Meni Yakove 2016-01-12 13:59:44 UTC
Created attachment 1113953 [details]
engine, vdsm and supervdsm logs - after fix

Comment 14 Meni Yakove 2016-01-12 14:00:04 UTC
With this fix, one of our sync network tests fails with:
2016-01-12 15:36:06,776 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
00:10:18.405 	Status: 400
00:10:18.405 	Reason: Bad Request
00:10:18.405 	Detail: [Resource unavailable]

vdsm, supervdsm and engine logs are attached.

mmirecki, can you please take a look?

Comment 15 Marcin Mirecki 2016-01-13 09:43:12 UTC
The problem in the logs:
engine.log:
2016-01-12 15:36:06,727 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-5) [hosts_syncAction_b869458d-fe35-4e50] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40

The related entry in vdsm.log:
jsonrpc.Executor/6::WARNING::2016-01-12 15:36:04,688::API::1487::vds::(setupNetworks) concurrent network verb already executing

This is the result of the following operation failing:
self._cif._networkSemaphore.acquire(blocking=False)

The operation failed to acquire a semaphore needed to execute the required changes, which means that another operation modifying the networks was already in progress.
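
For clarity, a minimal Python sketch of that pattern (the real code lives in vdsm's API.py and differs in detail): setupNetworks attempts a non-blocking acquire of a shared semaphore, and if another network verb already holds it, the call is rejected with 'Resource unavailable' (code 40) instead of waiting.

import logging
import threading

log = logging.getLogger("vds")

# One shared semaphore guards all network-modifying verbs (sketch only).
_network_semaphore = threading.Semaphore(1)

def setup_networks(networks, bondings, options):
    # Non-blocking acquire: do not wait for a concurrent network verb to finish.
    if not _network_semaphore.acquire(blocking=False):
        log.warning("concurrent network verb already executing")
        # Matches the engine-side error: "Resource unavailable, code = 40".
        return {"status": {"code": 40, "message": "Resource unavailable"}}
    try:
        # Apply the requested network changes here.
        pass
    finally:
        _network_semaphore.release()
    return {"status": {"code": 0, "message": "Done"}}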

Comment 16 Meni Yakove 2016-01-17 13:10:49 UTC
OK, I will investigate and open a new bug if needed.

