1258632 – Two instances of UpdateStorageDomainCommand/ExtendSANStorageDomainCommand executed concurrently

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.5.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-3.6.0-rc3
Target Release:	3.6.0
Assignee:	Fred Rolland
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1261531 1265906

Reported:	2015-08-31 20:37 UTC by Gordon Watson
Modified:	2019-10-10 10:08 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1261531 1265906
Environment:
Last Closed:	2016-03-09 21:12:37 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:0376	normal	SHIPPED_LIVE	Red Hat Enterprise Virtualization Manager 3.6.0	2016-03-10 01:20:52 UTC
oVirt gerrit	46082	master	MERGED	engine: Add lock on Extend SD command	Never
oVirt gerrit	46212	ovirt-engine-3.6	MERGED	engine: Add lock on Extend SD command	Never

Description Gordon Watson 2015-08-31 20:37:57 UTC

Description of problem:

The customer tried to extend a storage domain and the sequence appears to have been executed twice concurrently.

As a result, the SD/VG ended up with extraneous "unknown device" PVs, which caused the SD to become inactive.

At face value this looks the same as the problem described in BZ 1133938, which was fixed in 3.5.0. However, the problem described here occurred on 3.5.3.



Version-Release number of selected component (if applicable):

RHEV 3.5.3
RHEV-H 6.6 (20150512.0) hosts


How reproducible:

N/A


Steps to Reproduce:
1.
2.
3.

Actual results:

The sequence UpdateStorageDomainCommand/ConnectAllHostsToLunCommand/ExtendSANStorageDomainCommand was executed twice, concurrently, for the same request to extend the storage domain.


Expected results:

The sequence should have been executed only once.


Additional info:

Comment 6 Nir Soffer 2015-08-31 21:27:38 UTC

This must be protected on the vdsm side by taking the appropriate locks.
I think we should try to reproduce this manually using vdsClient.

Comment 7 Nir Soffer 2015-08-31 21:30:08 UTC

(In reply to Gordon Watson from comment #0)
> As a result, the SD/VG ended up with extraneous "unknown device" PVs, which
> caused the SD to become inactive.

Why do you think that the result was extraneous "unknown device" PVs?

Comment 9 Allon Mureinik 2015-09-13 09:35:01 UTC

Fred, please add some explanation here as to what the root cause was, how the patch solves it, and our assessment on why this is a 3.6.0 issue.

Comment 10 Fred Rolland 2015-09-16 07:23:10 UTC

Before ExtendSANStorageDomainCommand sends an extend command to VDSM, it checks whether the additional LUN is already part of the storage domain.

In a scenario where the command is executed twice at the same time, the validation passes for both commands and the extend command is sent to VDSM twice.

As a result, VDSM creates a PV twice on the same device and extends the VG with both PVs. One of the PVs will have an "unknown device" state.
The integrity of the VG metadata is affected and the VG is left in a partial state.

The patch adds a lock to the execution of the command, so that the command cannot run again until the current execution has fully completed.
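
The race and the lock described above can be illustrated with a small sketch. This is not the engine code (the engine is written in Java and uses its own locking mechanism); Domain, validate, send_extend, racy_interleaving and extend_with_lock are hypothetical names made up for illustration. The first function replays the bad interleaving step by step; the second guards the check and the extend with a non-blocking exclusive lock, so a concurrent caller is rejected instead of passing validation.

```python
import threading


class Domain:
    """Hypothetical stand-in for a storage domain (not the engine's model)."""

    def __init__(self):
        self.luns = set()             # LUNs already in the domain
        self.extend_calls = []        # every extend "sent to VDSM"
        self.lock = threading.Lock()  # one exclusive lock per domain


def validate(domain, lun):
    # The pre-check: the LUN must not already be part of the domain.
    return lun not in domain.luns


def send_extend(domain, lun):
    # Stand-in for the extend request to VDSM.
    domain.extend_calls.append(lun)
    domain.luns.add(lun)


def racy_interleaving(domain, lun):
    # Both commands validate before either one extends, so both pass.
    ok_first = validate(domain, lun)
    ok_second = validate(domain, lun)
    if ok_first:
        send_extend(domain, lun)
    if ok_second:
        send_extend(domain, lun)
    return domain.extend_calls


def extend_with_lock(domain, lun):
    # Non-blocking acquire: a concurrent caller is rejected, not queued.
    if not domain.lock.acquire(blocking=False):
        return "Related operation is currently in progress"
    try:
        if not validate(domain, lun):
            return "LUN is already part of the domain"
        send_extend(domain, lun)
        return "OK"
    finally:
        domain.lock.release()
```

With the unguarded interleaving, the same LUN is extended twice. With the lock, a second call made while the first still holds the lock fails immediately, and a second call made after the first completes fails validation.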

Since this issue will cause the Storage Domain to become Inactive and cannot be reverted easily, it is recommended to introduce the fix in 3.6.0.


An additional fix is added on VDSM side to validate that the device is not already part of the VG before extending it. [1]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1261531

Comment 11 Allon Mureinik 2015-09-17 11:18:12 UTC

Yaniv/Aharon - following the discussion on bug 1261531, it seems there is a requirement for this in z-stream.

From dev's side the backport seems simple enough.
Please weigh in from PM/QA side.

Comment 14 Elad 2015-10-15 07:23:17 UTC

Hi Gordon, 
Can you provide the exact steps to reproduce please?

Comment 15 Nir Soffer 2015-10-15 07:35:05 UTC

(In reply to Elad from comment #14)
> Hi Gordon, 
> Can you provide the exact steps to reproduce please?

Note that even if you can reproduce the engine flow, it will not corrupt
the vg now, since vdsm does check now if a pv is already part of a vg
during extend, and will fail the request in this case.

If the engine fix is correct, you cannot reproduce this now. If the engine
side is incorrect and engine will try to extend a vg twice using the same
pv, the request will fail at the vdsm side with this error:

    "Cannot extend vg <vg uuid>: pvs already belong to vg '/dev/mapper/<guid>'"
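
The kind of check described here can be sketched as follows. This is only an illustration under stated assumptions, not vdsm's actual code: extend_vg, VolumeGroupExtendError and the set-based model of a VG are hypothetical, and the error text only imitates the message quoted above.

```python
class VolumeGroupExtendError(Exception):
    """Hypothetical error type for this sketch."""


def extend_vg(vg_name, vg_pvs, new_pvs):
    """Add new_pvs to the set vg_pvs, refusing devices that are already members."""
    duplicates = sorted(set(new_pvs) & vg_pvs)
    if duplicates:
        # Fail the whole request rather than create a second PV on the device.
        raise VolumeGroupExtendError(
            "Cannot extend vg %s: pvs already belong to vg %r" % (vg_name, duplicates)
        )
    vg_pvs.update(new_pvs)
    return vg_pvs
```

Because the request is rejected before anything is modified, a duplicate extend leaves the VG unchanged.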

Comment 16 Nir Soffer 2015-10-15 07:36:46 UTC

(In reply to Nir Soffer from comment #15)
Elad, see bug 1261531 about the vdsm side.

Comment 17 Fred Rolland 2015-10-15 09:28:15 UTC

(In reply to Elad from comment #14)
> Hi Gordon, 
> Can you provide the exact steps to reproduce please?

I reproduced this with the REST API, sending the update command twice at the same time from two different browsers.
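
The idea of firing the same request twice at the same moment can be sketched with threads and a barrier. This is an in-process illustration only: it does not call the REST API, and fire_concurrently and make_guarded_extend are hypothetical helpers. The guarded operation stands in for an extend that is protected by a non-blocking lock.

```python
import threading


def fire_concurrently(operation, n=2):
    """Call operation() from n threads released together; return the results."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(i):
        barrier.wait()            # line up all threads before they call
        results[i] = operation()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def make_guarded_extend(lun="lun-1"):
    """Build a hypothetical extend operation guarded by an exclusive lock."""
    lock = threading.Lock()
    luns = set()

    def extend():
        if not lock.acquire(blocking=False):
            return "locked"
        try:
            if lun in luns:
                return "already extended"
            luns.add(lun)
            return "ok"
        finally:
            lock.release()

    return extend
```

Calling fire_concurrently(make_guarded_extend()) returns two results, exactly one of which is "ok"; the other is "locked" or "already extended", depending on how the threads interleave.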

Comment 18 Elad 2015-10-15 14:45:10 UTC

An attempt to extend a block domain (FC and iSCSI) using RHEVM while a similar operation is in progress is now blocked at CanDoAction (CDA).



<fault>
<reason>Operation Failed</reason>
<detail>
[Cannot extend Storage. Related operation is currently in progress. Please try again later.]
</detail>
</fault>


2015-10-15 14:42:37,882 WARN  [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-2) [75aa25a5] CanDoAction of action 'ExtendSANStorageDomain' failed for user admin@internal. Reasons: VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__EXTEND,ACTION_TYPE_FAILED_OBJECT_LOCKED


Verified using RHEV-3.6.0-15
rhevm-3.6.0-0.18.el6.noarch

Comment 21 errata-xmlrpc 2016-03-09 21:12:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html
