Bug 1102829 - [vdsm] After removal of an FC-connected device from LUN masking on the storage server, getDeviceList fails with a timeout, which causes the storage domain to become inactive
Keywords:
Status: CLOSED DUPLICATE of bug 1104801
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-05-29 15:53 UTC by Elad
Modified: 2017-02-23 22:07 UTC
CC: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-06-11 18:52:17 UTC
oVirt Team: Storage


Attachments
engine and vdsm logs (443.67 KB, application/x-gzip), 2014-05-29 15:53 UTC, Elad
/var/log/messages (112.16 KB, application/x-gzip), 2014-05-29 18:09 UTC, Elad
downstream (715.69 KB, application/x-gzip), 2014-06-01 10:33 UTC, Elad
domainMonitor (17.43 KB, text/plain), 2014-06-01 12:46 UTC, Elad

Description Elad 2014-05-29 15:53:42 UTC
Created attachment 900419 [details]
engine and vdsm logs

Description of problem:
When editing a fibre-channel storage domain after a LUN that is not part of the storage domain's volume group has been removed from LUN masking on the storage server, the engine's getDeviceList sync task fails with a 3-minute timeout. This causes the engine to set the FC domain to inactive, even though all of its PVs are accessible from the host.
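
To illustrate the failure mode only (hypothetical names and Python 3, not vdsm's actual code): a device scan blocks as soon as it touches a dead LUN whose I/O is being queued, so a caller with a fixed timeout gives up even though every other device responds. A minimal sketch:

import subprocess

def get_device_list(timeout=180):
    # Stand-in for the engine's 3-minute sync-task timeout. If any
    # visible device queues I/O forever, the scan never returns and
    # the caller times out, although the other devices are healthy.
    try:
        return subprocess.check_output(["multipath", "-ll"],
                                       timeout=timeout).decode()
    except subprocess.TimeoutExpired:
        raise RuntimeError("device scan timed out after %s seconds"
                           % timeout)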

Version-Release number of selected component (if applicable):
ovirt-3.5-alpha-1
ovirt-engine-3.5.0-0.0.master.20140519181229.gitc6324d4.el6.noarch
vdsm-4.14.1-340.gitedb02ba.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
Have a host with an FC HBA connected and logged in to a storage server. Expose several LUNs to it.
1. Create an FC storage domain that resides on one of the LUNs and wait for it to become active.
2. From the storage server side, remove from LUN masking a LUN that is not a PV in the storage domain's VG.
3. Click 'edit' on the FC storage domain.


Actual results:
Engine sends GetDeviceListVDSCommand to vdsm and, because of the LUN that was removed from LUN masking, the operation hangs in vdsm:

Command is sent to vdsm:

2014-05-29 17:46:07,429 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) START, GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP), log id: 6627fae9


Thread-13::INFO::2014-05-29 17:46:06,535::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=2, options={})


getDeviceList fails with a timeout on the engine:

2014-05-29 17:49:07,440 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException


2014-05-29 17:49:07,443 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp--127.0.0.1-8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)


The FC domain is reported to be in problem:

2014-05-29 17:51:53,713 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-15) domain 67c5fa81-99de-4436-9de3-335f3457e060:data1 in problem. vds: green-vdsb

ERROR in vdsm.log:

Thread-37::ERROR::2014-05-29 18:00:05,679::task::866::TaskManager.Task::(_setError) Task=`8e25b0ed-199b-49be-acbe-65bd56868f54`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 593, in spmStop
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1332, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()


Thread-37::ERROR::2014-05-29 18:00:05,695::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Resource timeout: ()', 'code': 851}}
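
The ResourceTimeout above is spmStop failing to take the exclusive storage-pool lock within its timeout, presumably because the lock is still held by the stuck storage flow. Schematically (a simplified Python 3 sketch, not the actual resourceManager code):

import threading

class ResourceTimeout(Exception):
    pass

def get_exclusive_lock(lock, timeout):
    # While the current holder is blocked on hung storage, acquire()
    # cannot succeed, so the verb fails with 'Resource timeout'
    # (code 851), as in the dispatcher line above.
    if not lock.acquire(timeout=timeout):
        raise ResourceTimeout()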



FC domain moves to inactive.


LVM commands on vdsm hang.
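
The usual reason lvm hangs in this situation (an assumption here, not confirmed from these logs) is a multipath device that has lost all of its paths but still carries the queue_if_no_path feature, so the reads issued by the lvm scan block indefinitely. Such devices can be spotted with a sketch like this:

import subprocess

def queuing_multipath_devices():
    # 'dmsetup table' prints one line per device-mapper device; a
    # multipath target listing 'queue_if_no_path' among its features
    # will queue I/O forever once all of its paths have failed.
    out = subprocess.check_output(["dmsetup", "table"]).decode()
    return [line.split(":", 1)[0] for line in out.splitlines()
            if "queue_if_no_path" in line]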


Expected results:
Not sure what should be done here; perhaps we need some kind of mechanism that detects the removal of a device that was connected over FC.
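
One possible shape for such a mechanism, as a sketch only (it assumes the kernel moves a removed device's sysfs state away from 'running', e.g. to 'offline'; the real fix may look nothing like this):

import glob

def non_running_scsi_devices():
    # Walk every SCSI device the host knows about and report the ones
    # the kernel no longer considers 'running'.
    bad = []
    for path in glob.glob("/sys/class/scsi_device/*/device/state"):
        with open(path) as f:
            state = f.read().strip()
        if state != "running":
            bad.append((path.split("/")[4], state))  # (H:C:T:L, state)
    return bad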

Additional info: engine and vdsm logs

Comment 1 Nir Soffer 2014-05-29 16:04:12 UTC
Tested with this patch: http://gerrit.ovirt.org/27122

To continue with this bug, we must know:
- How reproducible is this?
- Does it happen also with latest stable release?

Comment 2 Nir Soffer 2014-05-29 16:07:16 UTC
We also need /var/log/messages, which will probably explain why getDeviceList and lvm commands block for 5 minutes.

Comment 3 Elad 2014-05-29 18:09:02 UTC
Created attachment 900476 [details]
/var/log/messages

/var/log/messages attached

Comment 4 Elad 2014-05-29 19:49:08 UTC
(In reply to Nir Soffer from comment #1)
> Tested with this patch: http://gerrit.ovirt.org/27122
Tested also with the vdsm version mentioned in the description, which is ovirt-3.5-alpha-1.1 build
> To continue with this bug, we must know:
> - How reproducible is this?
Happened every time
> - Does it happen also with latest stable release?
Will test it and report the results

Comment 5 Nir Soffer 2014-05-29 22:01:44 UTC
(In reply to Elad from comment #4)
> Happened every time
How many times?

Comment 6 Elad 2014-06-01 10:33:29 UTC
Created attachment 901199 [details]
downstream

(In reply to Elad from comment #4)
> (In reply to Nir Soffer from comment #1)
> > Tested with this patch: http://gerrit.ovirt.org/27122
> Tested also with the vdsm version mentioned in the description, which is
> ovirt-3.5-alpha-1.1 build
> > To continue with this bug, we must know:
> > - How reproducible is this?
> Happened every time
> > - Does it happen also with latest stable release?
Happened also with rhev3.4-av9.3


2014-06-01 11:54:31,546 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = de7ea426-5a00-400d-b540-61a543c57210, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-06-01 11:54:31,546 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, GetDeviceListVDSCommand, log id: 30ea8bb8
2014-06-01 11:54:31,547 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)


Attaching the logs (downstream)




(In reply to Nir Soffer from comment #5)
> (In reply to Elad from comment #4)
> > Happened every time
> How many times?
1 time with http://gerrit.ovirt.org/27122
1 time with ovirt-3.5-alpha-1.1
1 time with rhev3.4-av9.3

Comment 7 Elad 2014-06-01 12:46:01 UTC
Created attachment 901218 [details]
domainMonitor

Attaching a file that describes the domain monitoring for the DC domain while getDeviceList is stuck in vdsm.

Comment 8 Elad 2014-06-02 21:03:07 UTC
Happened every time while testing with an EMC VNX storage server.
When testing with another storage server, EMC XtremIO, removal of a LUN from LUN masking did not cause getDeviceList to fail; the removed LUNs were reported as faulty by vdsm to the engine.
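
A possible explanation for the difference between the two arrays (an assumption, not verified from these logs): they get different multipath defaults, such as no_path_retry, so on the VNX the unmapped LUN keeps queuing I/O while on the XtremIO it fails fast, which lets vdsm report the LUN as faulty instead of hanging. The dmsetup sketch in the description would distinguish the two cases.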

I suspect this issue occurs for the same reason as bug 1098769.

Comment 9 Nir Soffer 2014-06-11 18:52:17 UTC

*** This bug has been marked as a duplicate of bug 1104801 ***

