Bug 1184454

Summary: [engine-backend] requests doesn't reach vdsm
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: ovirt-engineAssignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.5.0CC: bazulay, ebenahar, eedri, gklein, laravot, lpeer, lsurette, oourfali, pkliczew, pstehlik, rbalakri, Rhev-m-bugs, sherold, srevivo, ykaul
Target Milestone: ovirt-3.6.0-rcKeywords: Regression, ZStream
Target Release: 3.6.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: first 3.6.0 build Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1185818 (view as bug list) Environment:
Last Closed: 2016-04-20 01:37:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1164311, 1185818    
Attachments:
Description Flags
logs from engine and host
none
Engine Thread dump none

Description Elad 2015-01-21 12:50:16 UTC
Created attachment 982303 [details]
logs from engine and host

Description of problem:
Activate storage domain request doesn't reach vdsm for some reason. The system becomes kind of stale.
Also, the operation is not terminated with a timeout by engine.

I'm working with JSON. 

Version-Release number of selected component (if applicable):
rhev 3.5 vt13.8

How reproducible:
Happens occasionally with JSON-RPC

Steps to Reproduce:
1. Tried to activate a storage domain in the DC


Actual results:
The request to activate the storage domain is not sent to vdsm. 
Here is the request:

2015-01-21 11:43:00,898 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-8) [685eb9e3] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000
002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 20dfeaa4-948d-4cea-b430-a54c70505c77), log id: 5df64c09

VDSM doesn't get this request. 
There is no timeout for the storage domain activation.
The storage domains get stuck in status 'locked'.
The operation is not terminated by engine.

Expected results:
The request should get to vdsm.
If the request doesn't get to vdsm, it should fail with a timeout.

Additional info: logs from engine and host

Comment 2 Piotr Kliczewski 2015-01-21 13:23:34 UTC
I checked the logs and here is what is going on:

Engine:

ActivateStorageDomain runs connectStorageServer which seems to be OK. DeactivationStorageDomain fails since there is still lock on the object.

2015-01-21 11:43:10,785 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be
2015-01-21 11:43:10,819 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be
2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e
2015-01-21 11:52:10,523 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] Lock Acquired to object EngineLock [exclusiveLocks= key: 82f18d42-9627-4a59-b5bb-408672b9d818 value: STORAGE
, sharedLocks= key: 00000002-0002-0002-0002-0000000000eb value: POOL
]
2015-01-21 11:52:10,534 WARN  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] CanDoAction of action DeactivateStorageDomainWithOvfUpdate failed. Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,ACTION_TYPE_FAILED_OBJECT_LOCKED


vdsm:

Sending connectStorage server response was last message send/received to/from the engine.

Thread-89::DEBUG::2015-01-21 11:43:10,607::task::993::Storage.TaskManager.Task::(_decref) Task=`aa668261-811f-4053-822e-4ae5e50fe34b`::ref 0 aborting False
Thread-89::DEBUG::2015-01-21 11:43:10,607::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connectStorageServer' in bridge with [{'status': 0, 'id': u'0e800f12-b27f-468e-af21-8970698c53d8'}]
Thread-89::DEBUG::2015-01-21 11:43:10,608::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

I can see plenty of:

Thread-60::WARNING::2015-01-21 11:40:20,045::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_5
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 120, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

and can see:

Thread-30::DEBUG::2015-01-21 11:43:10,835::misc::751::Storage.SamplingMethod::(__call__) Returning last result
Thread-90::DEBUG::2015-01-21 11:43:10,837::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-91::DEBUG::2015-01-21 11:43:10,854::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-92::DEBUG::2015-01-21 11:43:10,869::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-93::DEBUG::2015-01-21 11:43:10,886::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-94::DEBUG::2015-01-21 11:43:10,901::__init__::298::IOProcessClient::(_run) Starting IOProcess...

Comment 3 Liron Aravot 2015-01-21 13:50:56 UTC
Piotr, those errors are because of blocked storage (the tested scenario) - but are not the cause of the issue.
The issue here is that the communication between the host and the engine get stuck, which leads for infinite call that doesn't even reach vdsm.


2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START
, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5b
b-408672b9d818), log id: 7be2212e


never reach vdsm and get stuck forever which causes to not be able to execute spm calls.

as we've seen this issue before after getting this message[1], moving to infra for further investigation.

[1]2015-01-21 11:37:56,406 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing

Comment 4 Liron Aravot 2015-01-21 14:06:30 UTC
correction for my last comment as it came out unclear, it seems that requests (both spm and non spm) do not reach vdsm.

Comment 5 Oved Ourfali 2015-01-21 14:18:50 UTC
What are the exact steps to reproduce?
Does it happen all the time?

Comment 6 Piotr Kliczewski 2015-01-21 15:01:54 UTC
It seems like some of the requests are not reaching vdsm and some of them are.

I can see that vdsm crashed at INFO::2015-01-21 11:34:03,23 and we reconnected (vds) at DEBUG::2015-01-21 11:37:55,950.

Because of the crash we initiated master reconstruct and made sure to stop irs connection.

In the mean time we attempted to ActivateStorageDomainCommand which never triggered reconnection to vdsm.

Can you please provide server.log? I would like to take a look whether there are no issues in the log.

Comment 7 Elad 2015-01-21 15:07:31 UTC
Provided offline

Comment 8 Piotr Kliczewski 2015-01-21 15:34:10 UTC
I do not see anything in server.log which could cause the issue.

There is no attempt to reconnect irs connection and I only can see in the log as below. During activation we run connectHostsInUpToDomainStorageServer which according to logs was successful and returned {0e800f12-b27f-468e-af21-8970698c53d8=0}.

Next step is syncStorageDomainInfo and after that there is ActivateStorageDomain which never happened. 

2015-01-21 11:43:10,702 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] ActivateStorage Domain. Before Connect all hosts to pool. Time:1/21/15 11:43 AM
2015-01-21 11:43:10,781 INFO  [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] Running command: ConnectStorageToVdsCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2015-01-21 11:43:10,785 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be
2015-01-21 11:43:10,819 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be
2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e

Comment 9 Piotr Kliczewski 2015-01-21 17:37:22 UTC
Created attachment 982464 [details]
Engine Thread dump

Comment 11 Eyal Edri 2015-01-25 09:19:03 UTC
oved, i assume this is for the zero day release? can you update TR?

Comment 15 Elad 2015-11-05 09:43:09 UTC
Using JsonRpc, Tested storage domain activation flow several times over block and file, all succeeded.

Used:
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.10-5.el7ev.noarch