Bug 1184454 - [engine-backend] requests doesn't reach vdsm
Summary: [engine-backend] requests doesn't reach vdsm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
unspecified
urgent
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Piotr Kliczewski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: rhev35gablocker 1185818
TreeView+ depends on / blocked
 
Reported: 2015-01-21 12:50 UTC by Elad
Modified: 2016-04-20 01:37 UTC (History)
15 users (show)

Fixed In Version: first 3.6.0 build
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1185818 (view as bug list)
Environment:
Last Closed: 2016-04-20 01:37:13 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs from engine and host (1.47 MB, application/x-gzip)
2015-01-21 12:50 UTC, Elad
no flags Details
Engine Thread dump (162.56 KB, text/plain)
2015-01-21 17:37 UTC, Piotr Kliczewski
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 37200 0 ovirt-3.5 MERGED make sure that we have data to remove from. Never
oVirt gerrit 37224 0 master MERGED make sure that we have data to remove from. Never
oVirt gerrit 37268 0 ovirt-engine-3.5 MERGED jsonrpc: version bump Never

Description Elad 2015-01-21 12:50:16 UTC
Created attachment 982303 [details]
logs from engine and host

Description of problem:
Activate storage domain request doesn't reach vdsm for some reason. The system becomes kind of stale.
Also, the operation is not terminated with a timeout by engine.

I'm working with JSON. 

Version-Release number of selected component (if applicable):
rhev 3.5 vt13.8

How reproducible:
Happens occasionally with JSON-RPC

Steps to Reproduce:
1. Tried to activate a storage domain in the DC


Actual results:
The request to activate the storage domain is not sent to vdsm. 
Here is the request:

2015-01-21 11:43:00,898 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-8) [685eb9e3] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000
002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 20dfeaa4-948d-4cea-b430-a54c70505c77), log id: 5df64c09

VDSM doesn't get this request. 
There is no timeout for the storage domain activation.
The storage domains get stuck in status 'locked'.
The operation is not terminated by engine.

Expected results:
The request should get to vdsm.
If the request doesn't get to vdsm, it should fail with a timeout.

Additional info: logs from engine and host

Comment 2 Piotr Kliczewski 2015-01-21 13:23:34 UTC
I checked the logs and here is what is going on:

Engine:

ActivateStorageDomain runs connectStorageServer which seems to be OK. DeactivationStorageDomain fails since there is still lock on the object.

2015-01-21 11:43:10,785 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be
2015-01-21 11:43:10,819 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be
2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e
2015-01-21 11:52:10,523 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] Lock Acquired to object EngineLock [exclusiveLocks= key: 82f18d42-9627-4a59-b5bb-408672b9d818 value: STORAGE
, sharedLocks= key: 00000002-0002-0002-0002-0000000000eb value: POOL
]
2015-01-21 11:52:10,534 WARN  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (ajp-/127.0.0.1:8702-9) [385e95ba] CanDoAction of action DeactivateStorageDomainWithOvfUpdate failed. Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,ACTION_TYPE_FAILED_OBJECT_LOCKED


vdsm:

Sending connectStorage server response was last message send/received to/from the engine.

Thread-89::DEBUG::2015-01-21 11:43:10,607::task::993::Storage.TaskManager.Task::(_decref) Task=`aa668261-811f-4053-822e-4ae5e50fe34b`::ref 0 aborting False
Thread-89::DEBUG::2015-01-21 11:43:10,607::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connectStorageServer' in bridge with [{'status': 0, 'id': u'0e800f12-b27f-468e-af21-8970698c53d8'}]
Thread-89::DEBUG::2015-01-21 11:43:10,608::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

I can see plenty of:

Thread-60::WARNING::2015-01-21 11:40:20,045::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_5
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 120, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

and can see:

Thread-30::DEBUG::2015-01-21 11:43:10,835::misc::751::Storage.SamplingMethod::(__call__) Returning last result
Thread-90::DEBUG::2015-01-21 11:43:10,837::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-91::DEBUG::2015-01-21 11:43:10,854::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-92::DEBUG::2015-01-21 11:43:10,869::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-93::DEBUG::2015-01-21 11:43:10,886::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-94::DEBUG::2015-01-21 11:43:10,901::__init__::298::IOProcessClient::(_run) Starting IOProcess...

Comment 3 Liron Aravot 2015-01-21 13:50:56 UTC
Piotr, those errors are because of blocked storage (the tested scenario) - but are not the cause of the issue.
The issue here is that the communication between the host and the engine get stuck, which leads for infinite call that doesn't even reach vdsm.


2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START
, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5b
b-408672b9d818), log id: 7be2212e


never reach vdsm and get stuck forever which causes to not be able to execute spm calls.

as we've seen this issue before after getting this message[1], moving to infra for further investigation.

[1]2015-01-21 11:37:56,406 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing

Comment 4 Liron Aravot 2015-01-21 14:06:30 UTC
correction for my last comment as it came out unclear, it seems that requests (both spm and non spm) do not reach vdsm.

Comment 5 Oved Ourfali 2015-01-21 14:18:50 UTC
What are the exact steps to reproduce?
Does it happen all the time?

Comment 6 Piotr Kliczewski 2015-01-21 15:01:54 UTC
It seems like some of the requests are not reaching vdsm and some of them are.

I can see that vdsm crashed at INFO::2015-01-21 11:34:03,23 and we reconnected (vds) at DEBUG::2015-01-21 11:37:55,950.

Because of the crash we initiated master reconstruct and made sure to stop irs connection.

In the mean time we attempted to ActivateStorageDomainCommand which never triggered reconnection to vdsm.

Can you please provide server.log? I would like to take a look whether there are no issues in the log.

Comment 7 Elad 2015-01-21 15:07:31 UTC
Provided offline

Comment 8 Piotr Kliczewski 2015-01-21 15:34:10 UTC
I do not see anything in server.log which could cause the issue.

There is no attempt to reconnect irs connection and I only can see in the log as below. During activation we run connectHostsInUpToDomainStorageServer which according to logs was successful and returned {0e800f12-b27f-468e-af21-8970698c53d8=0}.

Next step is syncStorageDomainInfo and after that there is ActivateStorageDomain which never happened. 

2015-01-21 11:43:10,702 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] ActivateStorage Domain. Before Connect all hosts to pool. Time:1/21/15 11:43 AM
2015-01-21 11:43:10,781 INFO  [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] Running command: ConnectStorageToVdsCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2015-01-21 11:43:10,785 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] START, ConnectStorageServerVDSCommand(HostName = blond-vdsf.qa.lab, HostId = ff0bc0f6-0b4b-4734-b8b3-e3de13200de0, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 0e800f12-b27f-468e-af21-8970698c53d8, connection: 10.35.160.108:/RHEV/elad/6, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 300dc6be
2015-01-21 11:43:10,819 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-16) [192c0a96] FINISH, ConnectStorageServerVDSCommand, return: {0e800f12-b27f-468e-af21-8970698c53d8=0}, log id: 300dc6be
2015-01-21 11:43:10,821 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-17) [63d8a56a] START, ActivateStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000eb, ignoreFailoverLimit = false, storageDomainId = 82f18d42-9627-4a59-b5bb-408672b9d818), log id: 7be2212e

Comment 9 Piotr Kliczewski 2015-01-21 17:37:22 UTC
Created attachment 982464 [details]
Engine Thread dump

Comment 11 Eyal Edri 2015-01-25 09:19:03 UTC
oved, i assume this is for the zero day release? can you update TR?

Comment 15 Elad 2015-11-05 09:43:09 UTC
Using JsonRpc, Tested storage domain activation flow several times over block and file, all succeeded.

Used:
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.10-5.el7ev.noarch


Note You need to log in before you can comment on or make changes to this bug.