Bug 1262051 - Host move from 'up' to 'connecting' and back to 'up' from time to time
Host move from 'up' to 'connecting' and back to 'up' from time to time
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.6.0
x86_64 Linux
high Severity urgent
: ovirt-3.6.1
: 3.6.1
Assigned To: Marcin Mirecki
Meni Yakove
: Automation, AutomationBlocker, Regression
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-09-10 13:39 EDT by Meni Yakove
Modified: 2016-04-19 21:32 EDT (History)
13 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-19 21:32:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
engine logs (642.94 KB, application/x-bzip)
2015-09-10 13:40 EDT, Meni Yakove
no flags Details
hosts logs (host_mixed_1 | 10.35.128.28) (156.53 KB, application/x-bzip)
2015-09-10 13:41 EDT, Meni Yakove
no flags Details
engine, vdsm and superbdsm logs - after fix (263.78 KB, application/zip)
2016-01-12 08:59 EST, Meni Yakove
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 48754 master MERGED engine: update setupNetworks heartbeat policy only for mgmt networks Never
oVirt gerrit 49508 ovirt-engine-3.6 MERGED engine: update setupNetworks heartbeat policy only for mgmt networks Never
oVirt gerrit 49641 ovirt-engine-3.6.1 MERGED engine: update setupNetworks heartbeat policy only for mgmt networks Never

  None (edit)
Description Meni Yakove 2015-09-10 13:39:08 EDT
Description of problem:
When running automation tests the host is move from up status to connecting and then after few seconds to up again. 

Version-Release number of selected component (if applicable):
vdsm-4.17.6-1.el7ev.noarch
rhevm-3.6.0-0.15.master.el6.noarch


Steps to Reproduce:
1. Run network automation

Actual results:
Host should stay in up status

Expected results:
Host move to connecting status
Comment 1 Meni Yakove 2015-09-10 13:40:43 EDT
Created attachment 1072306 [details]
engine logs
Comment 2 Meni Yakove 2015-09-10 13:41:26 EDT
Created attachment 1072308 [details]
hosts logs (host_mixed_1 | 10.35.128.28)
Comment 3 Piotr Kliczewski 2015-09-11 03:07:04 EDT
From what I can see in the logs there was communication disruption caused by:

2015-09-10 19:44:14,463 INFO  [org.ovirt.engine.core.bll.MultipleActionsRunner] (ajp-/127.0.0.1:8702-3) [658cfcc] ParallelMultipleActionsRunner of type 'PersistentSetupNetworks' invoked with no actions
2015-09-10 19:44:17,977 INFO  [org.ovirt.engine.core.bll.network.host.SetupNetworksCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Running command: SetupNetworksCommand internal: false. Entities affected :  ID: 7ac23e28-ae4f-4c03-8b6f-4904fa439db5 Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN
2015-09-10 19:44:17,981 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] START, SetupNetworksVDSCommand(HostName = host_mixed_1, SetupNetworksVdsCommandParameters:{runAsync='true', hostId='7ac23e28-ae4f-4c03-8b6f-4904fa439db5', force='false', checkConnectivity='true', connectivityTimeout='60', networks='[Network:{id='b9e8aa84-65be-424e-acff-5c8a52fe19d0', description='null', comment='null', subnet='null', gateway='null', type='null', vlanId='2', stp='false', dataCenterId='0abffcd5-d9f7-49d9-875c-01a9c66a8d5a', mtu='0', vmNetwork='true', cluster='NetworkCluster:{id='NetworkClusterId:{clusterId='null', networkId='null'}', status='OPERATIONAL', display='false', required='false', migration='false', management='false', gluster='false'}', providedBy='null', label='null', qosId='null'}]', bonds='[]', interfaces='[Nic:{id='null', name='enp1s0f1.2', vdsId='null', networkName='case2_net1', bootProtocol='STATIC_IP', address='1.1.1.1', subnet='255.255.255.0', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='null', bondName='null', speed='null', labels='null'}, Nic:{id='e1a79fad-8884-45e6-93bb-ec73b60f7ff4', name='enp1s0f0', vdsId='null', networkName='ovirtmgmt', bootProtocol='DHCP', address='10.35.128.28', subnet='255.255.255.0', gateway='10.35.128.254', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:58', bondName='null', speed='null', labels='null'}, Nic:{id='3dc26820-7ffc-40d1-bee9-c5a14aa962eb', name='enp1s0f1', vdsId='null', networkName='null', bootProtocol='NONE', address='', subnet='', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:59', bondName='null', speed='null', labels='[]'}]', removedNetworks='[]', removedBonds='[]'}), log id: 176cd21a
2015-09-10 19:44:17,981 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Host 'host_mixed_1' ('7ac23e28-ae4f-4c03-8b6f-4904fa439db5') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks
2015-09-10 19:44:17,995 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] FINISH, SetupNetworksVDSCommand, log id: 176cd21a


After this command there was no communication and vdsm was rebooted at:

MainThread::DEBUG::2015-09-10 19:50:37,644::vdsm::68::vds::(sigtermHandler) Received signal 15

As a result many command which were running before started to time out which caused short communication disruptions.

Dan can you please take a look?
Comment 4 Martin Mucha 2015-09-14 04:43:31 EDT
I don't know how did we end up here, but from given message & related code it seems, that SetupNetworks was invoked on Host which does not have set 'SupportedClusterVersionsSet' so we cannot determine if it supports 'defaultRoute', thus we're assuming that it's not supported, and therefore default route is not set. This can cause that engine and vdsm cannot see each other. Please can you confirm, that problematic host does not have SupportedClusterVersions set, and if it does not have it set, is this valid state?
Comment 5 Piotr Kliczewski 2015-09-14 05:16:05 EDT
Can you please check the setting?
Comment 6 Martin Mucha 2015-09-14 06:07:13 EDT
it seems that it may be related to 1209570.
Comment 7 Martin Mucha 2015-09-15 09:40:47 EDT
I do not know how to induce this, but I submitted patch trying to fix 46181, maybe this bug will be fix by this as well.
Comment 8 Yaniv Lavi 2015-10-07 05:30:07 EDT
Can you try to recreate with the new setup networks?
Comment 9 Meni Yakove 2015-10-14 05:47:24 EDT
This is with the new API
Comment 11 Gil Klein 2015-12-01 01:53:16 EST
Raising up the priority and setting back the automation blocker flag. This is blocking tier 1 testing in CI.
Comment 12 Eyal Edri 2015-12-09 04:05:22 EST
part of the 3rd ovirt-3.6.1 milestone build.
Comment 13 Meni Yakove 2016-01-12 08:59 EST
Created attachment 1113953 [details]
engine, vdsm and superbdsm logs - after fix
Comment 14 Meni Yakove 2016-01-12 09:00:04 EST
With this fix one of our sync network test fail with:
2016-01-12 15:36:06,776 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
00:10:18.405 	Status: 400
00:10:18.405 	Reason: Bad Request
00:10:18.405 	Detail: [Resource unavailable]

vdsm, supervdsm and engine logs are attached

mmirecki can you please take a look
Comment 15 Marcin Mirecki 2016-01-13 04:43:12 EST
The problem in the logs:
engine.log:
2016-01-12 15:36:06,727 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-5) [hosts_syncAction_b869458d-fe35-4e50] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40

The related entry in vdsm.log:
jsonrpc.Executor/6::WARNING::2016-01-12 15:36:04,688::API::1487::vds::(setupNetworks) concurrent network verb already executing

This is the result of failing the following operation:
self._cif._networkSemaphore.acquire(blocking=False)

The operation failed to aquire a semaphore needed to execute the required changes, which means that there was another operation modyfying the networks already in progress.
Comment 16 Meni Yakove 2016-01-17 08:10:49 EST
ok, I will investigate and open a new bug if needed

Note You need to log in before you can comment on or make changes to this bug.