Bug 1262051

Summary: Host moves from 'up' to 'connecting' and back to 'up' from time to time
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.6.0
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Reporter: Meni Yakove <myakove>
Assignee: Marcin Mirecki <mmirecki>
QA Contact: Meni Yakove <myakove>
CC: alkaplan, danken, gklein, lsurette, mmirecki, mmucha, myakove, oourfali, rbalakri, Rhev-m-bugs, srevivo, ykaul, ylavi
Target Milestone: ovirt-3.6.1
Target Release: 3.6.1
Keywords: Automation, AutomationBlocker, Regression
Type: Bug
Doc Type: Bug Fix
oVirt Team: Network
Last Closed: 2016-04-20 01:32:51 UTC
Attachments:
  engine logs
  hosts logs (host_mixed_1 | 10.35.128.28)
  engine, vdsm and supervdsm logs - after fix

Description Meni Yakove 2015-09-10 17:39:08 UTC
Description of problem:
When running automation tests, the host moves from 'up' status to 'connecting' and then, after a few seconds, back to 'up'.

Version-Release number of selected component (if applicable):
vdsm-4.17.6-1.el7ev.noarch
rhevm-3.6.0-0.15.master.el6.noarch


Steps to Reproduce:
1. Run network automation

Actual results:
Host moves to 'connecting' status

Expected results:
Host should stay in 'up' status

Comment 1 Meni Yakove 2015-09-10 17:40:43 UTC
Created attachment 1072306 [details]
engine logs

Comment 2 Meni Yakove 2015-09-10 17:41:26 UTC
Created attachment 1072308 [details]
hosts logs (host_mixed_1 | 10.35.128.28)

Comment 3 Piotr Kliczewski 2015-09-11 07:07:04 UTC
From what I can see in the logs, there was a communication disruption caused by:

2015-09-10 19:44:14,463 INFO  [org.ovirt.engine.core.bll.MultipleActionsRunner] (ajp-/127.0.0.1:8702-3) [658cfcc] ParallelMultipleActionsRunner of type 'PersistentSetupNetworks' invoked with no actions
2015-09-10 19:44:17,977 INFO  [org.ovirt.engine.core.bll.network.host.SetupNetworksCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Running command: SetupNetworksCommand internal: false. Entities affected :  ID: 7ac23e28-ae4f-4c03-8b6f-4904fa439db5 Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN
2015-09-10 19:44:17,981 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] START, SetupNetworksVDSCommand(HostName = host_mixed_1, SetupNetworksVdsCommandParameters:{runAsync='true', hostId='7ac23e28-ae4f-4c03-8b6f-4904fa439db5', force='false', checkConnectivity='true', connectivityTimeout='60', networks='[Network:{id='b9e8aa84-65be-424e-acff-5c8a52fe19d0', description='null', comment='null', subnet='null', gateway='null', type='null', vlanId='2', stp='false', dataCenterId='0abffcd5-d9f7-49d9-875c-01a9c66a8d5a', mtu='0', vmNetwork='true', cluster='NetworkCluster:{id='NetworkClusterId:{clusterId='null', networkId='null'}', status='OPERATIONAL', display='false', required='false', migration='false', management='false', gluster='false'}', providedBy='null', label='null', qosId='null'}]', bonds='[]', interfaces='[Nic:{id='null', name='enp1s0f1.2', vdsId='null', networkName='case2_net1', bootProtocol='STATIC_IP', address='1.1.1.1', subnet='255.255.255.0', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='null', bondName='null', speed='null', labels='null'}, Nic:{id='e1a79fad-8884-45e6-93bb-ec73b60f7ff4', name='enp1s0f0', vdsId='null', networkName='ovirtmgmt', bootProtocol='DHCP', address='10.35.128.28', subnet='255.255.255.0', gateway='10.35.128.254', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:58', bondName='null', speed='null', labels='null'}, Nic:{id='3dc26820-7ffc-40d1-bee9-c5a14aa962eb', name='enp1s0f1', vdsId='null', networkName='null', bootProtocol='NONE', address='', subnet='', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:59', bondName='null', speed='null', labels='[]'}]', removedNetworks='[]', removedBonds='[]'}), log id: 176cd21a
2015-09-10 19:44:17,981 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Host 'host_mixed_1' ('7ac23e28-ae4f-4c03-8b6f-4904fa439db5') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks
2015-09-10 19:44:17,995 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] FINISH, SetupNetworksVDSCommand, log id: 176cd21a


After this command there was no communication, and vdsm was restarted at:

MainThread::DEBUG::2015-09-10 19:50:37,644::vdsm::68::vds::(sigtermHandler) Received signal 15

As a result, many commands that were running before started to time out, which caused short communication disruptions.
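
For context, the checkConnectivity='true' and connectivityTimeout='60' parameters in the SetupNetworksVDSCommand above describe the safety net for exactly this kind of disruption: the new network configuration is applied, and if the engine does not re-establish contact within the timeout, the previous configuration is rolled back. A minimal sketch of that contract, in Python with hypothetical names (not vdsm's actual internals):

    import time

    CONNECTIVITY_TIMEOUT = 60  # seconds; matches connectivityTimeout='60' above

    def setup_networks_with_check(apply_config, rollback_config, last_engine_contact):
        """apply_config/rollback_config are callables; last_engine_contact()
        returns the timestamp of the most recent message from the engine."""
        started = time.time()
        apply_config()
        # Wait for the engine to prove it can still reach us over the new config.
        while time.time() - started < CONNECTIVITY_TIMEOUT:
            if last_engine_contact() > started:
                return True   # connectivity confirmed; keep the new config
            time.sleep(1)
        rollback_config()     # no contact within the timeout; restore the old config
        return False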

Dan, can you please take a look?

Comment 4 Martin Mucha 2015-09-14 08:43:31 UTC
I don't know how we ended up here, but from the given message and the related code it seems that SetupNetworks was invoked on a host that does not have 'SupportedClusterVersions' set. In that case we cannot determine whether the host supports 'defaultRoute', so we assume it does not, and the default route is not set. This can leave the engine and vdsm unable to see each other. Can you please confirm that the problematic host does not have SupportedClusterVersions set, and if so, whether that is a valid state?
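
As a rough illustration of the decision described above (a sketch only; the real engine code is Java and these names are hypothetical):

    # If the host's supported cluster versions are unknown, the engine cannot
    # tell whether the host understands 'defaultRoute', so it omits the flag,
    # matching the WARN line quoted in comment 3. The host may then be left
    # without a default route, breaking engine-vdsm connectivity.
    def build_setup_networks_params(host, params):
        if not host.supported_cluster_versions:
            params.pop('defaultRoute', None)
        return params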

Comment 5 Piotr Kliczewski 2015-09-14 09:16:05 UTC
Can you please check the setting?

Comment 6 Martin Mucha 2015-09-14 10:07:13 UTC
It seems that it may be related to bug 1209570.

Comment 7 Martin Mucha 2015-09-15 13:40:47 UTC
I do not know how to reproduce this, but I have submitted a patch (46181) to try to fix it; maybe this bug will be fixed by it as well.

Comment 8 Yaniv Lavi 2015-10-07 09:30:07 UTC
Can you try to recreate with the new setup networks?

Comment 9 Meni Yakove 2015-10-14 09:47:24 UTC
This is with the new API.

Comment 11 Gil Klein 2015-12-01 06:53:16 UTC
Raising the priority and setting the automation blocker flag again. This is blocking tier 1 testing in CI.

Comment 12 Eyal Edri 2015-12-09 09:05:22 UTC
Part of the 3rd ovirt-3.6.1 milestone build.

Comment 13 Meni Yakove 2016-01-12 13:59:44 UTC
Created attachment 1113953 [details]
engine, vdsm and supervdsm logs - after fix

Comment 14 Meni Yakove 2016-01-12 14:00:04 UTC
With this fix, one of our network sync tests fails with:
2016-01-12 15:36:06,776 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
00:10:18.405 	Status: 400
00:10:18.405 	Reason: Bad Request
00:10:18.405 	Detail: [Resource unavailable]

vdsm, supervdsm and engine logs are attached.

mmirecki, can you please take a look?

Comment 15 Marcin Mirecki 2016-01-13 09:43:12 UTC
The problem in the logs:
engine.log:
2016-01-12 15:36:06,727 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-5) [hosts_syncAction_b869458d-fe35-4e50] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40

The related entry in vdsm.log:
jsonrpc.Executor/6::WARNING::2016-01-12 15:36:04,688::API::1487::vds::(setupNetworks) concurrent network verb already executing

This is the result of the following operation failing:
self._cif._networkSemaphore.acquire(blocking=False)

The operation failed to acquire the semaphore needed to execute the required changes, which means that another operation modifying the networks was already in progress.
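
For illustration, a minimal sketch of this guard in Python (assumed shape only; vdsm's actual code differs): the semaphore is acquired non-blocking, so a second setupNetworks arriving while one is still in flight fails immediately with 'Resource unavailable' (code 40) instead of queueing behind it.

    import threading

    _network_semaphore = threading.Semaphore(1)

    def setup_networks(networks, bondings, options):
        if not _network_semaphore.acquire(blocking=False):
            # logged as 'concurrent network verb already executing'
            return {'status': {'code': 40, 'message': 'Resource unavailable'}}
        try:
            return _apply_network_changes(networks, bondings, options)  # hypothetical helper
        finally:
            _network_semaphore.release()

    def _apply_network_changes(networks, bondings, options):
        # stand-in for the actual network reconfiguration work
        return {'status': {'code': 0, 'message': 'Done'}}

A caller that hits this error can simply serialize its setupNetworks requests, or retry once the in-flight operation completes, which is consistent with the test failure in comment 14.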

Comment 16 Meni Yakove 2016-01-17 13:10:49 UTC
OK, I will investigate and open a new bug if needed.