Bug 1184454
Summary: | [engine-backend] requests doesn't reach vdsm | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Elad <ebenahar> | ||||||
Component: | ovirt-engine | Assignee: | Piotr Kliczewski <pkliczew> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 3.5.0 | CC: | bazulay, ebenahar, eedri, gklein, laravot, lpeer, lsurette, oourfali, pkliczew, pstehlik, rbalakri, Rhev-m-bugs, sherold, srevivo, ykaul | ||||||
Target Milestone: | ovirt-3.6.0-rc | Keywords: | Regression, ZStream | ||||||
Target Release: | 3.6.0 | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | first 3.6.0 build | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1185818 (view as bug list) | Environment: | |||||||
Last Closed: | 2016-04-20 01:37:13 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1164311, 1185818 | ||||||||
Attachments: |
|
I checked the logs and here is what is going on: Engine: ActivateStorageDomain runs connectStorageServer which seems to be OK. DeactivationStorageDomain fails since there is still lock on the object. 2015-01-21 11:43:10,785 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be 2015-01-21 11:43:10,819 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be 2015-01-21 11:43:10,821 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e 2015-01-21 11:52:10,523 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] Lock Acquired to object EngineLock [exclusiveLocks= key: 82f18d42-9627-4a59-b5bb-408672b9d818 value: STORAGE , sharedLocks= key: 00000002-0002-0002-0002-0000000000eb value: POOL ] 2015-01-21 11:52:10,534 WARN [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] CanDoAction of action DeactivateStorageDomainWithOvfUpdate failed. Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,ACTION_TYPE_FAILED_OBJECT_LOCKED vdsm: Sending connectStorage server response was last message send/received to/from the engine. Thread-89::DEBUG::2015-01-21 11:43:10,607::task::993::Storage.TaskManager.Task::(_decref) Task=`aa668261-811f-4053-822e-4ae5e50fe34b`::ref 0 aborting False Thread-89::DEBUG::2015-01-21 11:43:10,607::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connectStorageServer' in bridge with [{'status': 0, 'id': u'0e800f12-b27f-468e-af21-8970698c53d8'}] Thread-89::DEBUG::2015-01-21 11:43:10,608::stompReactor::163::yajsonrpc.StompServer::(send) Sending response I can see plenty of: Thread-60::WARNING::2015-01-21 11:40:20,045::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_5 Traceback (most recent call last): File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles sd.DOMAIN_META_DATA)) File "/usr/share/vdsm/storage/outOfProcess.py", line 120, in glob return self._iop.glob(pattern) File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob return self._sendCommand("glob", {"pattern": pattern}, self.timeout) File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand raise Timeout(os.strerror(errno.ETIMEDOUT)) Timeout: Connection timed out and can see: Thread-30::DEBUG::2015-01-21 11:43:10,835::misc::751::Storage.SamplingMethod::(__call__) Returning last result Thread-90::DEBUG::2015-01-21 11:43:10,837::__init__::298::IOProcessClient::(_run) Starting IOProcess... Thread-91::DEBUG::2015-01-21 11:43:10,854::__init__::298::IOProcessClient::(_run) Starting IOProcess... Thread-92::DEBUG::2015-01-21 11:43:10,869::__init__::298::IOProcessClient::(_run) Starting IOProcess... Thread-93::DEBUG::2015-01-21 11:43:10,886::__init__::298::IOProcessClient::(_run) Starting IOProcess... Thread-94::DEBUG::2015-01-21 11:43:10,901::__init__::298::IOProcessClient::(_run) Starting IOProcess... Piotr, those errors are because of blocked storage (the tested scenario) - but are not the cause of the issue. The issue here is that the communication between the host and the engine get stuck, which leads for infinite call that doesn't even reach vdsm. 2015-01-21 11:43:10,821 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START , ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5b b-408672b9d818), log id: 7be2212e never reach vdsm and get stuck forever which causes to not be able to execute spm calls. as we've seen this issue before after getting this message[1], moving to infra for further investigation. [1]2015-01-21 11:37:56,406 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing correction for my last comment as it came out unclear, it seems that requests (both spm and non spm) do not reach vdsm. What are the exact steps to reproduce? Does it happen all the time? It seems like some of the requests are not reaching vdsm and some of them are. I can see that vdsm crashed at INFO::2015-01-21 11:34:03,23 and we reconnected (vds) at DEBUG::2015-01-21 11:37:55,950. Because of the crash we initiated master reconstruct and made sure to stop irs connection. In the mean time we attempted to ActivateStorageDomainCommand which never triggered reconnection to vdsm. Can you please provide server.log? I would like to take a look whether there are no issues in the log. Provided offline I do not see anything in server.log which could cause the issue. There is no attempt to reconnect irs connection and I only can see in the log as below. During activation we run connectHostsInUpToDomainStorageServer which according to logs was successful and returned {0e800f12-b27f-468e-af21-8970698c53d8=0}. Next step is syncStorageDomainInfo and after that there is ActivateStorageDomain which never happened. 2015-01-21 11:43:10,702 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] ActivateStorage Domain. Before Connect all hosts to pool. Time:1/21/15 11:43 AM 2015-01-21 11:43:10,781 INFO [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] Running command: ConnectStorageToVdsCommand internal: true. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN 2015-01-21 11:43:10,785 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be 2015-01-21 11:43:10,819 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be 2015-01-21 11:43:10,821 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e Created attachment 982464 [details]
Engine Thread dump
oved, i assume this is for the zero day release? can you update TR? Using JsonRpc, Tested storage domain activation flow several times over block and file, all succeeded. Used: rhevm-3.6.0.2-0.1.el6.noarch vdsm-4.17.10-5.el7ev.noarch |
Created attachment 982303 [details] logs from engine and host Description of problem: Activate storage domain request doesn't reach vdsm for some reason. The system becomes kind of stale. Also, the operation is not terminated with a timeout by engine. I'm working with JSON. Version-Release number of selected component (if applicable): rhev 3.5 vt13.8 How reproducible: Happens occasionally with JSON-RPC Steps to Reproduce: 1. Tried to activate a storage domain in the DC Actual results: The request to activate the storage domain is not sent to vdsm. Here is the request: 2015-01-21 11:43:00,898 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-8) [685eb9e3] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000 002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 20dfeaa4-948d-4cea-b430-a54c70505c77), log id: 5df64c09 VDSM doesn't get this request. There is no timeout for the storage domain activation. The storage domains get stuck in status 'locked'. The operation is not terminated by engine. Expected results: The request should get to vdsm. If the request doesn't get to vdsm, it should fail with a timeout. Additional info: logs from engine and host