Bug 1262051
Summary: | Host move from 'up' to 'connecting' and back to 'up' from time to time | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Meni Yakove <myakove> | ||||||||
Component: | ovirt-engine | Assignee: | Marcin Mirecki <mmirecki> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Meni Yakove <myakove> | ||||||||
Severity: | urgent | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 3.6.0 | CC: | alkaplan, danken, gklein, lsurette, mmirecki, mmucha, myakove, oourfali, rbalakri, Rhev-m-bugs, srevivo, ykaul, ylavi | ||||||||
Target Milestone: | ovirt-3.6.1 | Keywords: | Automation, AutomationBlocker, Regression | ||||||||
Target Release: | 3.6.1 | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2016-04-20 01:32:51 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
Meni Yakove
2015-09-10 17:39:08 UTC
Created attachment 1072306 [details]
engine logs
Created attachment 1072308 [details]
hosts logs (host_mixed_1 | 10.35.128.28)
From what I can see in the logs there was communication disruption caused by: 2015-09-10 19:44:14,463 INFO [org.ovirt.engine.core.bll.MultipleActionsRunner] (ajp-/127.0.0.1:8702-3) [658cfcc] ParallelMultipleActionsRunner of type 'PersistentSetupNetworks' invoked with no actions 2015-09-10 19:44:17,977 INFO [org.ovirt.engine.core.bll.network.host.SetupNetworksCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Running command: SetupNetworksCommand internal: false. Entities affected : ID: 7ac23e28-ae4f-4c03-8b6f-4904fa439db5 Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN 2015-09-10 19:44:17,981 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] START, SetupNetworksVDSCommand(HostName = host_mixed_1, SetupNetworksVdsCommandParameters:{runAsync='true', hostId='7ac23e28-ae4f-4c03-8b6f-4904fa439db5', force='false', checkConnectivity='true', connectivityTimeout='60', networks='[Network:{id='b9e8aa84-65be-424e-acff-5c8a52fe19d0', description='null', comment='null', subnet='null', gateway='null', type='null', vlanId='2', stp='false', dataCenterId='0abffcd5-d9f7-49d9-875c-01a9c66a8d5a', mtu='0', vmNetwork='true', cluster='NetworkCluster:{id='NetworkClusterId:{clusterId='null', networkId='null'}', status='OPERATIONAL', display='false', required='false', migration='false', management='false', gluster='false'}', providedBy='null', label='null', qosId='null'}]', bonds='[]', interfaces='[Nic:{id='null', name='enp1s0f1.2', vdsId='null', networkName='case2_net1', bootProtocol='STATIC_IP', address='1.1.1.1', subnet='255.255.255.0', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='null', bondName='null', speed='null', labels='null'}, Nic:{id='e1a79fad-8884-45e6-93bb-ec73b60f7ff4', name='enp1s0f0', vdsId='null', networkName='ovirtmgmt', bootProtocol='DHCP', address='10.35.128.28', subnet='255.255.255.0', gateway='10.35.128.254', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:58', bondName='null', speed='null', labels='null'}, Nic:{id='3dc26820-7ffc-40d1-bee9-c5a14aa962eb', name='enp1s0f1', vdsId='null', networkName='null', bootProtocol='NONE', address='', subnet='', gateway='null', mtu='0', bridged='false', type='0', networkImplementationDetails='null', qos='null', customProperties='null', macAddress='00:25:90:c6:3d:59', bondName='null', speed='null', labels='[]'}]', removedNetworks='[]', removedBonds='[]'}), log id: 176cd21a 2015-09-10 19:44:17,981 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] Host 'host_mixed_1' ('7ac23e28-ae4f-4c03-8b6f-4904fa439db5') doesn't contain Supported Cluster Versions, therefore 'defaultRoute' will not be sent via the SetupNetworks 2015-09-10 19:44:17,995 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-7) [host_nics_syncAction_ddd7cac6-5fde-4f08] FINISH, SetupNetworksVDSCommand, log id: 176cd21a After this command there was no communication and vdsm was rebooted at: MainThread::DEBUG::2015-09-10 19:50:37,644::vdsm::68::vds::(sigtermHandler) Received signal 15 As a result many command which were running before started to time out which caused short communication disruptions. Dan can you please take a look? I don't know how did we end up here, but from given message & related code it seems, that SetupNetworks was invoked on Host which does not have set 'SupportedClusterVersionsSet' so we cannot determine if it supports 'defaultRoute', thus we're assuming that it's not supported, and therefore default route is not set. This can cause that engine and vdsm cannot see each other. Please can you confirm, that problematic host does not have SupportedClusterVersions set, and if it does not have it set, is this valid state? Can you please check the setting? it seems that it may be related to 1209570. I do not know how to induce this, but I submitted patch trying to fix 46181, maybe this bug will be fix by this as well. Can you try to recreate with the new setup networks? This is with the new API Raising up the priority and setting back the automation blocker flag. This is blocking tier 1 testing in CI. part of the 3rd ovirt-3.6.1 milestone build. Created attachment 1113953 [details]
engine, vdsm and superbdsm logs - after fix
With this fix one of our sync network test fail with: 2016-01-12 15:36:06,776 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected: 00:10:18.405 Status: 400 00:10:18.405 Reason: Bad Request 00:10:18.405 Detail: [Resource unavailable] vdsm, supervdsm and engine logs are attached mmirecki can you please take a look The problem in the logs: engine.log: 2016-01-12 15:36:06,727 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-5) [hosts_syncAction_b869458d-fe35-4e50] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40 The related entry in vdsm.log: jsonrpc.Executor/6::WARNING::2016-01-12 15:36:04,688::API::1487::vds::(setupNetworks) concurrent network verb already executing This is the result of failing the following operation: self._cif._networkSemaphore.acquire(blocking=False) The operation failed to aquire a semaphore needed to execute the required changes, which means that there was another operation modyfying the networks already in progress. ok, I will investigate and open a new bug if needed |