Bug 1102829
| Summary: | [vdsm] After a removal of a FC connected device from LUN masking in the storage server, getDeviceList fails with a timeout, which causes to storage domain to become inactive | | |
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Elad <ebenahar> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED DUPLICATE | QA Contact: | Aharon Canan <acanan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5 | CC: | acanan, acathrow, amureini, bazulay, bugs, ebenahar, gklein, iheim, mgoldboi, nsoffer, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-11 18:52:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Tested with this patch: http://gerrit.ovirt.org/27122

To continue with this bug, we must know:
- How reproducible is this?
- Does it happen also with the latest stable release?

We also need /var/log/messages, which will probably explain why getDeviceList and lvm commands block for 5 minutes.

Created attachment 900476 [details]
/var/log/messages

/var/log/messages attached
(In reply to Nir Soffer from comment #1)
> Tested with this patch: http://gerrit.ovirt.org/27122

Tested also with the vdsm version mentioned in the description, which is ovirt-3.5-alpha-1.1 build

> To continue with this bug, we must know:
> - How reproducible is this?

Happened every-time

> - Does it happen also with latest stable release?

Will test it and report about the results

(In reply to Elad from comment #4)
> Happened every-time

How many times?

Created attachment 901199 [details]
downstream

(In reply to Elad from comment #4)
> (In reply to Nir Soffer from comment #1)
> > Tested with this patch: http://gerrit.ovirt.org/27122
> Tested also with the vdsm version mentioned in the description, which is
> ovirt-3.5-alpha-1.1 build
> > To continue with this bug, we must know:
> > - How reproducible is this?
> Happened every-time
> > - Does it happen also with latest stable release?

Happened also with rhev3.4-av9.3

2014-06-01 11:54:31,546 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = de7ea426-5a00-400d-b540-61a543c57210, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-06-01 11:54:31,546 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, GetDeviceListVDSCommand, log id: 30ea8bb8
2014-06-01 11:54:31,547 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

Attaching the logs (downstream)

(In reply to Nir Soffer from comment #5)
> (In reply to Elad from comment #4)
> > Happened every-time
> How many times?

1 time with http://gerrit.ovirt.org/27122
1 time with ovirt-3.5-alpha-1.1
1 time with rhev3.4-av9.3

Created attachment 901218 [details]
domainMonitor
Attaching a file that describes the domain monitoring for the DC domain while getDeviceList is stuck in vdsm.

Happened every time while testing with an EMC-VNX storage server. When testing with another storage server, EMC-XtremIO, removal of a LUN from LUN masking didn't cause getDeviceList to fail; the removed LUNs were reported as faulty by vdsm to engine.

For some reason, I think this issue has the same cause as bug 1098769.

*** This bug has been marked as a duplicate of bug 1104801 ***
Created attachment 900419 [details]
engine and vdsm logs

Description of problem:
When trying to edit a Fibre Channel storage domain after a LUN that is not part of the storage domain's volume group has been removed from LUN masking on the storage server, the engine fails the getDeviceList sync task because of a 3-minute timeout. This causes the engine to set the FC domain to inactive, even though all its PVs are accessible from the host.

Version-Release number of selected component (if applicable):
ovirt-3.5-alpha-1
ovirt-engine-3.5.0-0.0.master.20140519181229.gitc6324d4.el6.noarch
vdsm-4.14.1-340.gitedb02ba.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
Have a host with an FC HBA connected and logged in to a storage server. Expose several LUNs to it.
1. Create an FC storage domain that resides on one of the LUNs and wait for it to become active.
2. On the storage server side, remove a LUN which doesn't participate in the storage domain's VG as a PV.
3. Click 'edit' for the FC storage domain.

Actual results:
Engine sends GetDeviceListVDSCommand to vdsm and, because of the LUN which was removed from LUN masking, vdsm hangs on the operation.

Command is sent to vdsm:

2014-05-29 17:46:07,429 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) START, GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP), log id: 6627fae9

Thread-13::INFO::2014-05-29 17:46:06,535::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=2, options={})

getDeviceList fails with a timeout on engine:

2014-05-29 17:49:07,440 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp--127.0.0.1-8702-1) Command GetDeviceListVDSCommand(HostName = green-vdsb, HostId = fe319ceb-7307-466a-99fe-f54e8115923a, storageType=FCP) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-29 17:49:07,443 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp--127.0.0.1-8702-1) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022) : org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

FC domain is reported in problem:

2014-05-29 17:51:53,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-15) domain 67c5fa81-99de-4436-9de3-335f3457e060:data1 in problem. vds: green-vdsb

ERROR in vdsm.log:

Thread-37::ERROR::2014-05-29 18:00:05,679::task::866::TaskManager.Task::(_setError) Task=`8e25b0ed-199b-49be-acbe-65bd56868f54`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 593, in spmStop
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1332, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
Thread-37::ERROR::2014-05-29 18:00:05,695::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Resource timeout: ()', 'code': 851}}

FC domain moves to inactive.
LVM commands on vdsm hang.
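To confirm the 3-minute figure from the engine log, the gap between the START and "execution failed" entries can be computed from their timestamps. The following is only a minimal sketch: the helper names `parse_engine_timestamp` and `elapsed_seconds` are made up for illustration, and the two sample lines are shortened copies of the engine.log entries quoted above.

```python
from datetime import datetime

# Engine log timestamps look like "2014-05-29 17:46:07,429".
ENGINE_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_engine_timestamp(line):
    """Parse the leading timestamp of an engine.log line (hypothetical helper)."""
    # The timestamp is the first 23 characters: date, space, time, comma, millis.
    return datetime.strptime(line[:23], ENGINE_TS_FORMAT)


def elapsed_seconds(start_line, end_line):
    """Return the seconds elapsed between two engine.log lines."""
    delta = parse_engine_timestamp(end_line) - parse_engine_timestamp(start_line)
    return delta.total_seconds()


# Shortened copies of the START and failure entries from this report.
start = "2014-05-29 17:46:07,429 INFO [...GetDeviceListVDSCommand] START"
failed = "2014-05-29 17:49:07,440 ERROR [...GetDeviceListVDSCommand] execution failed"

print(elapsed_seconds(start, failed))
```

For the two entries in this report the gap is just over 180 seconds, which matches the 3-minute timeout mentioned in the description.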
Expected results:
Not sure what should be done here. Maybe we need to find some kind of mechanism that detects the removal of a device that was connected by FC.

Additional info:
engine and vdsm logs
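One possible shape for the detection mechanism suggested under Expected results is to read the SCSI device state that the Linux kernel exposes in sysfs (`/sys/block/<dev>/device/state`) and flag devices that are not `running`. This is a rough sketch and not vdsm code: the function names are hypothetical, and nothing in this report establishes that a LUN removed from masking on a given array actually shows up as non-running, so treat that as an assumption to verify.

```python
import os


def scsi_device_state(name, sysfs_root="/sys/block"):
    """Return the SCSI state string for a block device, or None if unreadable."""
    path = os.path.join(sysfs_root, name, "device", "state")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def non_running_devices(sysfs_root="/sys/block"):
    """Map device name to state for SCSI devices whose state is not 'running'.

    Assumption: a removed or unreachable LUN is reported with a state such
    as 'offline'; this depends on the array and transport.
    """
    result = {}
    for name in sorted(os.listdir(sysfs_root)):
        state = scsi_device_state(name, sysfs_root)
        # Devices without a state file (e.g. dm-*, loop*) are skipped.
        if state is not None and state != "running":
            result[name] = state
    return result
```

Calling `non_running_devices()` on a host returns a dict such as `{"sdb": "offline"}` when a device is in that state. The `sysfs_root` parameter exists only so the check can be exercised against a fake directory tree.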