Bug 844180
Summary: | formatStorageDomain fails due to devices not being discovered in time | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Daniel Paikov <dpaikov> | ||||||
Component: | vdsm | Assignee: | Yeela Kaplan <ykaplan> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Dafna Ron <dron> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 6.2 | CC: | abaron, aburden, bazulay, chetan, danken, dron, hateya, iheim, ilvovsky, lpeer, oourfali, syeghiay, ykaul, zdover | ||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | storage | ||||||||
Fixed In Version: | vdsm-4.9.6-41.0 | Doc Type: | Bug Fix | ||||||
Doc Text: |
Previously, deleting an iSCSI domain sometimes failed due to a race between discovering devices after connecting to the iSCSI target and trying to delete the domain. Now, formatStorageDomain performs a scan to make certain that all devices are detected, and iSCSI domains can be successfully deleted.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2012-12-04 19:03:58 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 766281, 820486 | ||||||||
Attachments: |
|
(In reply to comment #0) > Created attachment 601047 [details] > vdsm.log > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 > seconds. This is too slow and the GUI times out before presenting a New > Domain dialog. > > real 1m23.188s > user 0m0.085s > sys 0m0.043s getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work, and last call was at Thread-12656::DEBUG::2012-07-29 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep for a minute and so till Thread-12656::DEBUG::2012-07-29 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). is it a locking issue (not appears to be related to storage or any lvm operations). (In reply to comment #2) > (In reply to comment #0) > > Created attachment 601047 [details] > > vdsm.log > > > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 > > seconds. This is too slow and the GUI times out before presenting a New > > Domain dialog. > > > > real 1m23.188s > > user 0m0.085s > > sys 0m0.043s > > getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29 > 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work, > and last call was at Thread-12656::DEBUG::2012-07-29 > 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep > for a minute and so till Thread-12656::DEBUG::2012-07-29 > 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). > > is it a locking issue (not appears to be related to storage or any lvm > operations). No, looks like either a supervdsm issue or forceIScsiScan is taking too long. For some reason we're lacking logging there so we need to add some to forceIScsiScan and reproduce. (In reply to comment #3) > (In reply to comment #2) > > (In reply to comment #0) > > > Created attachment 601047 [details] > > > vdsm.log > > > > > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 > > > seconds. This is too slow and the GUI times out before presenting a New > > > Domain dialog. > > > > > > real 1m23.188s > > > user 0m0.085s > > > sys 0m0.043s > > > > getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29 > > 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work, > > and last call was at Thread-12656::DEBUG::2012-07-29 > > 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep > > for a minute and so till Thread-12656::DEBUG::2012-07-29 > > 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). > > > > is it a locking issue (not appears to be related to storage or any lvm > > operations). > > No, looks like either a supervdsm issue or forceIScsiScan is taking too long. > For some reason we're lacking logging there so we need to add some to > forceIScsiScan and reproduce. well, reducing severity, it appears that scsi-rescan takes too long (appears to be storage problem): [root@orange-vdse ~]# for p in qls /sys/class/scsi_host/host*/; do basename $p; time echo "- - -" > $p/scan; done ls real 0m0.000s user 0m0.000s sys 0m0.000s host0 real 0m0.161s user 0m0.000s sys 0m0.000s host1 real 0m0.000s user 0m0.000s sys 0m0.000s host2 real 0m0.225s user 0m0.000s sys 0m0.035s host3 real 1m22.011s user 0m0.000s sys 0m0.000s host4 real 0m0.000s user 0m0.000s sys 0m0.000s however, let's make sure we are adding logs for such cases, I would change BZ title. (In reply to comment #5) > (In reply to comment #3) > > (In reply to comment #2) > > > (In reply to comment #0) > > > > Created attachment 601047 [details] > > > > vdsm.log > > > > > > > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 > > > > seconds. This is too slow and the GUI times out before presenting a New > > > > Domain dialog. > > > > > > > > real 1m23.188s > > > > user 0m0.085s > > > > sys 0m0.043s > > > > > > getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29 > > > 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work, > > > and last call was at Thread-12656::DEBUG::2012-07-29 > > > 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep > > > for a minute and so till Thread-12656::DEBUG::2012-07-29 > > > 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). > > > > > > is it a locking issue (not appears to be related to storage or any lvm > > > operations). > > > > No, looks like either a supervdsm issue or forceIScsiScan is taking too long. > > For some reason we're lacking logging there so we need to add some to > > forceIScsiScan and reproduce. > > well, reducing severity, it appears that scsi-rescan takes too long (appears > to be storage problem): > > [root@orange-vdse ~]# for p in qls /sys/class/scsi_host/host*/; do basename > $p; time echo "- - -" > $p/scan; done > ls > > real 0m0.000s > user 0m0.000s > sys 0m0.000s > host0 > > real 0m0.161s > user 0m0.000s > sys 0m0.000s > host1 > > real 0m0.000s > user 0m0.000s > sys 0m0.000s > host2 > > real 0m0.225s > user 0m0.000s > sys 0m0.035s > host3 > > real 1m22.011s > user 0m0.000s > sys 0m0.000s > host4 Actually only one scsi host is problematic here, we should limit the amount of time we wait and return whatever we have after the timeout. loggin is also important though. > > real 0m0.000s > user 0m0.000s > sys 0m0.000s > > however, let's make sure we are adding logs for such cases, I would change > BZ title. Need to document the new vdsm config options. Yeela, please provide a short explanations on which parameters have been added and what their function is. since the iscsiscan was suppose to solve the multipath race bug I am assigning this bug back on devel. during remove of several domains one domain failed on format due to the multipath race bug. attaching logs Created attachment 634474 [details]
logs
I would like to clarify things up a bit, we decided to move this bug back to assigned per our agreement where the patch solves this bug should workaround multipath race we are facing in the various scenarios, thus, we found out that it still fail in format storage domain scenario. Ayal - i need some kind of a fix for the multipath race, this is a blocker for the version, there is no point of opening a new bug since we have several already, either we should call the forceScsiRescan on every connectStorageServer or drop the disconnect storage server on engine side. in any case, its your call. The fix here is to scan in formatStorageDomain in case device is not visible. a patch not related to the original bug that is meant to solve the multipath race in formatStorageDomin: http://gerrit.ovirt.org/#/c/8921/ verified on vdsm-4.9.6-41.0.el6_3.x86_64 and si24 Ayal, Please read the Doc Text field and let me know if I've correctly summarized this fix. Thanks! Zac Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1508.html The patch of comment 8 attempted to warn on bad config values, but instead - would raise an exception. See http://gerrit.ovirt.org/#/c/9690/ Dan, you are most definitely right. Thank you for the patch. |
Created attachment 601047 [details] vdsm.log getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 seconds. This is too slow and the GUI times out before presenting a New Domain dialog. real 1m23.188s user 0m0.085s sys 0m0.043s