Bug 1258632 - Two instances of UpdateStorageDomainCommand/ExtendSANStorageDomainCommand executed concurrently
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.3
Hardware: Unspecified OS: Unspecified
Priority: medium Severity: medium
Target Milestone: ovirt-3.6.0-rc3
Target Release: 3.6.0
Assigned To: Fred Rolland
QA Contact: Elad
Keywords: ZStream
Depends On:
Blocks: 1261531 1265906
 
Reported: 2015-08-31 16:37 EDT by Gordon Watson
Modified: 2016-03-09 16:12 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1261531 1265906
Environment:
Last Closed: 2016-03-09 16:12:37 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker       ID     Priority          Status  Summary                                Last Updated
oVirt gerrit  46082  master            MERGED  engine: Add lock on Extend SD command  Never
oVirt gerrit  46212  ovirt-engine-3.6  MERGED  engine: Add lock on Extend SD command  Never

Description Gordon Watson 2015-08-31 16:37:57 EDT
Description of problem:

The customer tried to extend a storage domain and the sequence appears to have been executed twice concurrently.

As a result, the SD/VG ended up with extraneous "unknown device" PVs, which caused the SD to become inactive.

At face value, this looks the same as the problem described in BZ 1133938, which was fixed in 3.5.0. However, the problem described here occurred on 3.5.3.



Version-Release number of selected component (if applicable):

RHEV 3.5.3
RHEV-H 6.6 (20150512.0) hosts


How reproducible:

N/A


Steps to Reproduce:
1.
2.
3.

Actual results:

The sequence UpdateStorageDomainCommand/ConnectAllHostsToLunCommand/ExtendSANStorageDomainCommand was executed twice concurrently for the same request to extend the storage domain.


Expected results:

The sequence should have been executed only once.


Additional info:
Comment 6 Nir Soffer 2015-08-31 17:27:38 EDT
This must be protected on the vdsm side by taking the appropriate locks.
I think we should try to reproduce this manually using vdsClient.
Comment 7 Nir Soffer 2015-08-31 17:30:08 EDT
(In reply to Gordon Watson from comment #0)
> As a result, the SD/VG ended up with extraneous "unknown device" PVs, which
> caused the SD to become inactive.

Why do you think that the result was extraneous "unknown device" PVs?
Comment 9 Allon Mureinik 2015-09-13 05:35:01 EDT
Fred, please add some explanation here as to what the root cause was, how the patch solves it, and our assessment on why this is a 3.6.0 issue.
Comment 10 Fred Rolland 2015-09-16 03:23:10 EDT
Before the ExtendSanStorageDomainCommand sends an extend command to VDSM, it checks whether the additional LUN is already part of the storage domain.

In a scenario where the command is executed twice at the same time, the validation passes for both commands and the extend command is sent to VDSM twice.

As a result, VDSM creates a PV twice on the same device and extends the VG with both PVs. One of the PVs will be in an "unknown device" state.
The integrity of the VG metadata is affected and the VG is left in a partial state.

The patch is adding a lock to the execution of the command, so that the command will not be able to run again until full completion of the current execution.
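
To illustrate the idea, here is a minimal Python sketch of a per-storage-domain lock that rejects a second extend while the first is still running. It is not the engine code (which is Java); the names StorageDomainExtender, ExtendInProgressError and send_to_vdsm are hypothetical and exist only for this example.

```python
import threading


class ExtendInProgressError(Exception):
    """Raised when an extend is already running for the same storage domain."""


class StorageDomainExtender:
    """Hypothetical stand-in for the engine-side extend flow."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the _busy set
        self._busy = set()              # storage domain ids being extended
        self.luns = {}                  # storage domain id -> set of LUN ids

    def extend(self, sd_id, lun_id, send_to_vdsm):
        # Take the per-domain lock without waiting: fail fast if it is held.
        with self._guard:
            if sd_id in self._busy:
                raise ExtendInProgressError(
                    "Related operation is currently in progress")
            self._busy.add(sd_id)
        try:
            # Validation and execution now happen under the lock, so two
            # requests can no longer both pass the check before either runs.
            luns = self.luns.setdefault(sd_id, set())
            if lun_id in luns:
                raise ValueError("LUN is already part of the storage domain")
            send_to_vdsm(sd_id, lun_id)
            luns.add(lun_id)
        finally:
            # Release the lock only after the whole command has completed.
            with self._guard:
                self._busy.discard(sd_id)
```

The lock is released in a finally block so that a failed extend does not leave the storage domain locked.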

Since this issue causes the storage domain to become inactive and cannot be reverted easily, it is recommended to introduce the fix in 3.6.0.


An additional fix was added on the VDSM side to validate that the device is not already part of the VG before extending it. [1]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1261531
Comment 11 Allon Mureinik 2015-09-17 07:18:12 EDT
Yaniv/Aharon - following the discussion on bug 1261531, it seems there is a requirement for this in z-stream.

From dev's side the backport seems simple enough.
Please weigh in from PM/QA side.
Comment 14 Elad 2015-10-15 03:23:17 EDT
Hi Gordon, 
Can you provide the exact steps to reproduce please?
Comment 15 Nir Soffer 2015-10-15 03:35:05 EDT
(In reply to Elad from comment #14)
> Hi Gordon, 
> Can you provide the exact steps to reproduce please?

Note that even if you can reproduce the engine flow, it will not corrupt
the vg now, since vdsm does check now if a pv is already part of a vg
during extend, and will fail the request in this case.

If the engine fix is correct, you cannot reproduce this now. If the engine
side is incorrect and engine will try to extend a vg twice using the same
pv, the request will fail at the vdsm side with this error:

    "Cannot extend vg <vg uuid>: pvs already belong to vg '/dev/mapper/<guid>'"
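
A rough Python sketch of that kind of check, for illustration only. This is not the vdsm implementation: extend_vg, VolumeGroupExtendError and the set-based model of a VG are assumptions made for the example, and the message format only approximates the one quoted above.

```python
class VolumeGroupExtendError(Exception):
    """Hypothetical error raised when an extend request is rejected."""


def extend_vg(vg_uuid, vg_pvs, new_devices):
    """Add new_devices to the set vg_pvs, refusing devices already in the VG.

    vg_pvs is a set of device paths standing in for the PVs of a real VG.
    """
    duplicates = sorted(dev for dev in new_devices if dev in vg_pvs)
    if duplicates:
        # Fail the whole request before touching the VG.
        raise VolumeGroupExtendError(
            "Cannot extend vg %s: pvs already belong to vg %s"
            % (vg_uuid, ", ".join(repr(dev) for dev in duplicates)))
    vg_pvs.update(new_devices)
    return vg_pvs
```

With a check like this, a second extend using the same device fails cleanly instead of creating a duplicate PV.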
Comment 16 Nir Soffer 2015-10-15 03:36:46 EDT
(In reply to Nir Soffer from comment #15)
Elad, see bug 1261531 about the vdsm side.
Comment 17 Fred Rolland 2015-10-15 05:28:15 EDT
(In reply to Elad from comment #14)
> Hi Gordon, 
> Can you provide the exact steps to reproduce please?

I reproduced it with the REST API, sending the update command twice at the same time from two different browsers.
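
For anyone who prefers a script to two browsers, one possible way to release two requests at the same moment is sketched below in Python. The helper fire_concurrently is hypothetical, and the request callable is a placeholder that the caller must supply; no actual REST endpoint or client call is shown here.

```python
import threading


def fire_concurrently(request, count=2):
    """Call request() from count threads released at the same moment.

    Returns a list of ("ok", result) or ("error", exception) tuples,
    one per thread.
    """
    barrier = threading.Barrier(count)
    results = [None] * count

    def worker(index):
        barrier.wait()  # all threads start the request together
        try:
            results[index] = ("ok", request())
        except Exception as exc:
            results[index] = ("error", exc)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With the fix in place, one of the two results would be expected to report a failure rather than both succeeding.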
Comment 18 Elad 2015-10-15 10:45:10 EDT
Using RHEVM, an attempt to extend a block domain (FC or iSCSI) while a similar operation is in progress is now blocked at CanDoAction (CDA).



<fault>
<reason>Operation Failed</reason>
<detail>
[Cannot extend Storage. Related operation is currently in progress. Please try again later.]
</detail>
</fault>


2015-10-15 14:42:37,882 WARN  [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-2) [75aa25a5] CanDoAction of action 'ExtendSANStorageDomain' failed for user admin@internal. Reasons: VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__EXTEND,ACTION_TYPE_FAILED_OBJECT_LOCKED


Verified using RHEV-3.6.0-15
rhevm-3.6.0-0.18.el6.noarch
Comment 21 errata-xmlrpc 2016-03-09 16:12:37 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html
