844180 – formatStorageDomain fails due to devices not being discovered in time

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 844180 - formatStorageDomain fails due to devices not being discovered in time

Summary: formatStorageDomain fails due to devices not being discovered in time

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	6.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Yeela Kaplan
QA Contact:	Dafna Ron
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:	766281 820486
TreeView+	depends on / blocked

Reported:	2012-07-29 15:37 UTC by Daniel Paikov
Modified:	2022-07-09 05:37 UTC (History)
CC List:	14 users (show)
Fixed In Version:	vdsm-4.9.6-41.0
Doc Type:	Bug Fix
Doc Text:	Previously, deleting an iSCSI domain sometimes failed due to a race between discovering devices after connecting to the iSCSI target and trying to delete the domain. Now, formatStorageDomain performs a scan to make certain that all devices are detected, and iSCSI domains can be successfully deleted.
Clone Of:
Environment:
Last Closed:	2012-12-04 19:03:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
vdsm.log (983.94 KB, application/x-compressed-tar) 2012-07-29 15:37 UTC, Daniel Paikov	no flags	Details
logs (613.98 KB, application/x-gzip) 2012-10-28 10:00 UTC, Dafna Ron	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2012:1508	0	normal	SHIPPED_LIVE	Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update	2012-12-04 23:48:05 UTC

Description Daniel Paikov 2012-07-29 15:37:30 UTC

Created attachment 601047 [details]
vdsm.log

getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83 seconds. This is too slow and the GUI times out before presenting a New Domain dialog.

real    1m23.188s
user    0m0.085s
sys     0m0.043s

Comment 2 Haim 2012-07-29 16:52:29 UTC

(In reply to comment #0)
> Created attachment 601047 [details]
> vdsm.log
> 
> getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83
> seconds. This is too slow and the GUI times out before presenting a New
> Domain dialog.
> 
> real    1m23.188s
> user    0m0.085s
> sys     0m0.043s

getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work, and last call was at Thread-12656::DEBUG::2012-07-29 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep for a minute and so till Thread-12656::DEBUG::2012-07-29 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). 

is it a locking issue (not appears to be related to storage or any lvm operations).

Comment 3 Ayal Baron 2012-07-29 20:31:52 UTC

(In reply to comment #2)
> (In reply to comment #0)
> > Created attachment 601047 [details]
> > vdsm.log
> > 
> > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83
> > seconds. This is too slow and the GUI times out before presenting a New
> > Domain dialog.
> > 
> > real    1m23.188s
> > user    0m0.085s
> > sys     0m0.043s
> 
> getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29
> 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work,
> and last call was at Thread-12656::DEBUG::2012-07-29
> 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep
> for a minute and so till Thread-12656::DEBUG::2012-07-29
> 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). 
> 
> is it a locking issue (not appears to be related to storage or any lvm
> operations).

No, looks like either a supervdsm issue or forceIScsiScan is taking too long.
For some reason we're lacking logging there so we need to add some to forceIScsiScan and reproduce.

Comment 5 Haim 2012-07-30 16:10:44 UTC

(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #0)
> > > Created attachment 601047 [details]
> > > vdsm.log
> > > 
> > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83
> > > seconds. This is too slow and the GUI times out before presenting a New
> > > Domain dialog.
> > > 
> > > real    1m23.188s
> > > user    0m0.085s
> > > sys     0m0.043s
> > 
> > getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29
> > 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work,
> > and last call was at Thread-12656::DEBUG::2012-07-29
> > 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep
> > for a minute and so till Thread-12656::DEBUG::2012-07-29
> > 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). 
> > 
> > is it a locking issue (not appears to be related to storage or any lvm
> > operations).
> 
> No, looks like either a supervdsm issue or forceIScsiScan is taking too long.
> For some reason we're lacking logging there so we need to add some to
> forceIScsiScan and reproduce.

well, reducing severity, it appears that scsi-rescan takes too long (appears to be storage problem):

[root@orange-vdse ~]#  for p in qls /sys/class/scsi_host/host*/; do basename $p; time echo "- - -" > $p/scan; done
ls

real    0m0.000s
user    0m0.000s
sys     0m0.000s
host0

real    0m0.161s
user    0m0.000s
sys     0m0.000s
host1

real    0m0.000s
user    0m0.000s
sys     0m0.000s
host2

real    0m0.225s
user    0m0.000s
sys     0m0.035s
host3

real    1m22.011s
user    0m0.000s
sys     0m0.000s
host4

real    0m0.000s
user    0m0.000s
sys     0m0.000s

however, let's make sure we are adding logs for such cases, I would change BZ title.

Comment 6 Yeela Kaplan 2012-07-31 08:04:36 UTC

(In reply to comment #5)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > (In reply to comment #0)
> > > > Created attachment 601047 [details]
> > > > vdsm.log
> > > > 
> > > > getDeviceList on an FCP host (also logged into some iSCSI targets) takes 83
> > > > seconds. This is too slow and the GUI times out before presenting a New
> > > > Domain dialog.
> > > > 
> > > > real    1m23.188s
> > > > user    0m0.085s
> > > > sys     0m0.043s
> > > 
> > > getDeviceList was initiated at Thread-12656::DEBUG::2012-07-29
> > > 18:32:36,046::BindingXMLRPC::164::vds::(wrapper), thread started to work,
> > > and last call was at Thread-12656::DEBUG::2012-07-29
> > > 18:32:36,071::misc::1090::SamplingMethod::(__call__), then it went to sleep
> > > for a minute and so till Thread-12656::DEBUG::2012-07-29
> > > 18:33:58,474::__init__::1164::Storage.Misc.excCmd::(_log). 
> > > 
> > > is it a locking issue (not appears to be related to storage or any lvm
> > > operations).
> > 
> > No, looks like either a supervdsm issue or forceIScsiScan is taking too long.
> > For some reason we're lacking logging there so we need to add some to
> > forceIScsiScan and reproduce.
> 
> well, reducing severity, it appears that scsi-rescan takes too long (appears
> to be storage problem):
> 
> [root@orange-vdse ~]#  for p in qls /sys/class/scsi_host/host*/; do basename
> $p; time echo "- - -" > $p/scan; done
> ls
> 
> real    0m0.000s
> user    0m0.000s
> sys     0m0.000s
> host0
> 
> real    0m0.161s
> user    0m0.000s
> sys     0m0.000s
> host1
> 
> real    0m0.000s
> user    0m0.000s
> sys     0m0.000s
> host2
> 
> real    0m0.225s
> user    0m0.000s
> sys     0m0.035s
> host3
> 
> real    1m22.011s
> user    0m0.000s
> sys     0m0.000s
> host4

Actually only one scsi host is problematic here, we should limit the amount of time we wait and return whatever we have after the timeout.
loggin is also important though.

> 
> real    0m0.000s
> user    0m0.000s
> sys     0m0.000s
> 
> however, let's make sure we are adding logs for such cases, I would change
> BZ title.

Comment 8 Yeela Kaplan 2012-07-31 15:30:08 UTC

http://gerrit.ovirt.org/#/c/6796/

Comment 10 Ayal Baron 2012-08-22 10:30:02 UTC

Need to document the new vdsm config options.
Yeela, please provide a short explanations on which parameters have been added and what their function is.

Comment 14 Dafna Ron 2012-10-28 10:00:01 UTC

since the iscsiscan was suppose to solve the multipath race bug I am assigning this bug back on devel. 

during remove of several domains one domain failed on format due to the multipath race bug. 

attaching logs

Comment 15 Dafna Ron 2012-10-28 10:00:32 UTC

Created attachment 634474 [details]
logs

Comment 16 Haim 2012-10-28 10:23:03 UTC

I would like to clarify things up a bit, we decided to move this bug back to assigned per our agreement where the patch solves this bug should workaround multipath race we are facing in the various scenarios, thus, we found out that it still fail in format storage domain scenario.

Ayal - i need some kind of a fix for the multipath race, this is a blocker for the version, there is no point of opening a new bug since we have several already, either we should call the forceScsiRescan on every connectStorageServer or drop the disconnect storage server on engine side.
in any case, its your call.

Comment 17 Ayal Baron 2012-10-30 02:00:14 UTC

The fix here is to scan in formatStorageDomain in case device is not visible.

Comment 18 Yeela Kaplan 2012-10-30 09:37:56 UTC

a patch not related to the original bug that is meant to solve the multipath race in formatStorageDomin:

http://gerrit.ovirt.org/#/c/8921/

Comment 19 Dafna Ron 2012-11-06 14:01:52 UTC

verified on vdsm-4.9.6-41.0.el6_3.x86_64 and si24

Comment 20 Zac Dover 2012-11-08 06:19:31 UTC

Ayal,

Please read the Doc Text field and let me know if I've correctly summarized this fix.

Thanks!

Zac

Comment 22 errata-xmlrpc 2012-12-04 19:03:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html

Comment 23 Dan Kenigsberg 2012-12-07 20:54:02 UTC

The patch of comment 8 attempted to warn on bad config values, but instead - would raise an exception. See http://gerrit.ovirt.org/#/c/9690/

Comment 24 Yeela Kaplan 2012-12-09 07:20:54 UTC

Dan, you are most definitely right. Thank you for the patch.

Note You need to log in before you can comment on or make changes to this bug.