Created attachment 900419 [details]
engine and vdsm logs

Description of problem:
When trying to edit a fibre-channel storage domain after a LUN that is not part of the storage domain's volume group has been removed from LUN masking on the storage server, engine fails the getDeviceList sync task because of a 3-minute timeout. This causes engine to set the FC domain to inactive, even though all its PVs are accessible from the host.

Version-Release number of selected component (if applicable):
ovirt-3.5-alpha-1
ovirt-engine-3.5.0-0.0.master.20140519181229.gitc6324d4.el6.noarch
vdsm-4.14.1-340.gitedb02ba.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
Have a host with an FC HBA connected and logged in to a storage server. Expose several LUNs to it.
1. Create an FC storage domain that resides on one of the LUNs, and wait for it to become active.
2. On the storage server side, remove from LUN masking a LUN that does not participate in the storage domain's VG as a PV.
3. Click 'edit' for the FC storage domain.

Actual results:
Engine sends GetDeviceListVDSCommand to vdsm, and because of the LUN that was removed from LUN masking, the operation hangs in vdsm.

Command is sent to vdsm:

2014-05-29 17:46:07,429 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) START, GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP), log id: 6627fae9

Thread-13::INFO::2014-05-29 17:46:06,535::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=2, options={})

getDeviceList fails with a timeout on engine:

2014-05-29 17:49:07,440 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-29 17:49:07,443 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp--127.0.0.1-8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

FC domain is reported in problem:

2014-05-29 17:51:53,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-15) domain 67c5fa81-99de-4436-9de3-335f3457e060:data1 in problem. vds: green-vdsb

ERROR in vdsm.log:

Thread-37::ERROR::2014-05-29 18:00:05,679::task::866::TaskManager.Task::(_setError) Task=`8e25b0ed-199b-49be-acbe-65bd56868f54`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 593, in spmStop
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1332, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
Thread-37::ERROR::2014-05-29 18:00:05,695::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Resource timeout: ()', 'code': 851}}

FC domain moves to inactive.
LVM commands on vdsm hang.
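
As a quick check of the 3-minute figure, the two engine log timestamps quoted above (the START line and the "execution failed" line) can be subtracted. A minimal Python sketch; the timestamps are copied from the log, and the variable names are only illustrative:

```python
from datetime import datetime

# Engine log timestamp format: date, time, comma, milliseconds.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Copied from the GetDeviceListVDSCommand START and "execution failed" lines.
start = datetime.strptime("2014-05-29 17:46:07,429", FMT)
failed = datetime.strptime("2014-05-29 17:49:07,440", FMT)

# Time engine waited for vdsm before giving up.
elapsed = (failed - start).total_seconds()
print(f"GetDeviceListVDSCommand waited {elapsed:.3f} s")
```

The difference is about 180 seconds, which matches the 3-minute timeout on the engine side.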
Expected results:
Not sure what should be done here. Maybe we need to find some kind of mechanism that detects the removal of a device that was connected by FC.

Additional info:
engine and vdsm logs
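
One possible shape for such a mechanism, as a sketch only: run each per-device check in its own process with a timeout, so that a single unreachable LUN is classified instead of blocking the whole device listing. The `probe` helper and the stand-in commands below are hypothetical and are not vdsm code. A real check that is blocked in uninterruptible I/O may also not exit when killed, so this illustrates only the classification step.

```python
import subprocess
import sys


def probe(cmd, timeout):
    """Run cmd and classify the result as 'ok', 'faulty' or 'timeout'.

    Hypothetical helper: cmd stands in for whatever per-device check
    would really be used.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The check did not finish in time; treat the device as stuck.
        return "timeout"
    return "ok" if result.returncode == 0 else "faulty"


# Stand-in commands that only simulate the three outcomes.
healthy = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
hanging = [sys.executable, "-c", "import time; time.sleep(5)"]

for name, cmd, limit in (("healthy", healthy, 30),
                         ("failing", failing, 30),
                         ("hanging", hanging, 0.5)):
    print(name, probe(cmd, limit))
```

With a per-device result like this, a caller could report one device as problematic and still return the rest of the list before the engine-side timeout expires.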
Tested with this patch: http://gerrit.ovirt.org/27122

To continue with this bug, we must know:
- How reproducible is this?
- Does it happen also with the latest stable release?
We also need /var/log/messages, which will probably explain why getDeviceList and lvm commands block for 5 minutes.
Created attachment 900476 [details]
/var/log/messages

/var/log/messages attached
(In reply to Nir Soffer from comment #1)
> Tested with this patch: http://gerrit.ovirt.org/27122

Tested also with the vdsm version mentioned in the description, which is the ovirt-3.5-alpha-1.1 build.

> To continue with this bug, we must know:
> - How reproducible is this?

Happened every time.

> - Does it happen also with latest stable release?

Will test it and report the results.
(In reply to Elad from comment #4)
> Happened every-time

How many times?
Created attachment 901199 [details]
downstream

(In reply to Elad from comment #4)
> (In reply to Nir Soffer from comment #1)
> > Tested with this patch: http://gerrit.ovirt.org/27122
> Tested also with the vdsm version mentioned in the description, which is
> ovirt-3.5-alpha-1.1 build
> > To continue with this bug, we must know:
> > - How reproducible is this?
> Happened every-time
> > - Does it happen also with latest stable release?

Happened also with rhev3.4-av9.3:

2014-06-01 11:54:31,546 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = de7ea426-5a00-400d-b540-61a543c57210, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-06-01 11:54:31,546 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, GetDeviceListVDSCommand, log id: 30ea8bb8
2014-06-01 11:54:31,547 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

Attaching the logs (downstream)

(In reply to Nir Soffer from comment #5)
> (In reply to Elad from comment #4)
> > Happened every-time
> How many times?

1 time with http://gerrit.ovirt.org/27122
1 time with ovirt-3.5-alpha-1.1
1 time with rhev3.4-av9.3
Created attachment 901218 [details]
domainMonitor

Attaching a file that describes the domain monitoring for the DC domain while getDeviceList is stuck in vdsm.
Happened every time while testing with an EMC-VNX storage server. When testing with another storage server, EMC-XtremIO, removal of a LUN from LUN masking didn't cause getDeviceList to fail; the removed LUNs were reported as faulty by vdsm to engine.

I think this issue has the same cause as bug 1098769.
*** This bug has been marked as a duplicate of bug 1104801 ***