Description of problem:
The customer tried to extend a storage domain, and the sequence appears to have been executed twice concurrently. As a result, the SD/VG ended up with extraneous "unknown device" PVs, which caused the SD to become inactive.

On face value this looks the same as the problem described in BZ 1133938, which was fixed in 3.5.0. However, the problem described here occurred on 3.5.3.

Version-Release number of selected component (if applicable):
RHEV 3.5.3
RHEV-H 6.6 (20150512.0) hosts

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:
The sequence of UpdateStorageDomainCommand/ConnectAllHostsToLunCommand/ExtendSANStorageDomainCommand was executed twice concurrently for the same request to extend the storage domain.

Expected results:
Only one sequence should have been executed.

Additional info:
This must be protected on the vdsm side by taking the appropriate locks. I think we should try to reproduce this manually using vdsClient.
(In reply to Gordon Watson from comment #0)
> As a result, the SD/VG ended up with extraneous "unknown device" PVs, which
> caused the SD to become inactive.

Why do you think that the result was extraneous "unknown device" PVs?
Fred, please add some explanation here as to what the root cause was, how the patch solves it, and our assessment on why this is a 3.6.0 issue.
Before the ExtendSANStorageDomainCommand sends an extend command to VDSM, it checks whether the additional LUN is already part of the storage domain. In a scenario where the command is executed twice at the same time, the validation passes for both commands and the extend command is sent to VDSM twice. As a result, VDSM creates a PV on the same device twice and extends the VG with both PVs. One of the PVs will have an "unknown device" state. The integrity of the VG metadata is affected and the VG ends up in a partial state.

The patch adds a lock to the execution of the command, so that the command cannot run again until the current execution has fully completed.

Since this issue causes the Storage Domain to become Inactive and cannot be reverted easily, it is recommended to introduce the fix in 3.6.0.

An additional fix is added on the VDSM side to validate that the device is not already part of the VG before extending it. [1]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1261531
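To illustrate the race described above, here is a minimal Python sketch of the check-then-act pattern and of a lock that rejects a second concurrent request. It is not the engine code (the engine is Java); the class `StorageDomain`, its methods, and the helper `run_twice` are hypothetical names invented for the illustration, and the barrier exists only to make the race deterministic.

```python
import threading


class StorageDomain:
    """Toy stand-in for a storage domain; all names here are hypothetical."""

    def __init__(self):
        self.luns = []                  # devices already added to the domain
        self._lock = threading.Lock()   # stands in for the engine-side lock

    def extend_unlocked(self, lun, barrier):
        # Validation: is the LUN already part of the domain?
        if lun in self.luns:
            return False
        # The barrier holds each caller here until both have passed
        # validation, which makes the race deterministic for this demo.
        barrier.wait()
        self.luns.append(lun)           # both callers reach this line
        return True

    def extend_locked(self, lun):
        # Try-lock: a second request is rejected while the first is running.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            if lun in self.luns:
                return False
            self.luns.append(lun)
            return True
        finally:
            self._lock.release()


def run_twice(target, *args):
    """Run the same request from two threads and wait for both."""
    threads = [threading.Thread(target=target, args=args) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


racy = StorageDomain()
run_twice(racy.extend_unlocked, "lun-1", threading.Barrier(2))
print(racy.luns)    # the same device was added twice

safe = StorageDomain()
run_twice(safe.extend_locked, "lun-1")
print(safe.luns)    # the device was added once
```

In the locked variant the second request is either refused because the lock is held or fails validation because the first request has already finished; in both cases the device is added only once.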
Yaniv/Aharon - following the discussion on bug 1261531, it seems there is a requirement for this in z-stream. From dev's side the backport seems simple enough. Please weigh in from PM/QA side.
Hi Gordon, Can you provide the exact steps to reproduce please?
(In reply to Elad from comment #14)
> Hi Gordon,
> Can you provide the exact steps to reproduce please?

Note that even if you can reproduce the engine flow, it will not corrupt the vg now, since vdsm now checks during extend whether a pv is already part of a vg, and fails the request in that case.

If the engine fix is correct, you cannot reproduce this now.

If the engine side is incorrect and the engine tries to extend a vg twice using the same pv, the request will fail on the vdsm side with this error:

"Cannot extend vg <vg uuid>: pvs already belong to vg '/dev/mapper/<guid>'"
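The kind of check described in this comment can be sketched as follows. This is only an illustration of the idea, not the vdsm implementation: the function `extend_vg`, the exception `VolumeGroupExtendError`, and the device paths are hypothetical, and the message text only loosely mirrors the error quoted above.

```python
class VolumeGroupExtendError(Exception):
    """Hypothetical error type for this sketch."""


def extend_vg(vg_uuid, current_pvs, new_devices):
    """Return the PV list after extension, refusing devices already in the VG."""
    duplicates = [dev for dev in new_devices if dev in current_pvs]
    if duplicates:
        raise VolumeGroupExtendError(
            "Cannot extend vg %s: pvs already belong to vg %s"
            % (vg_uuid, duplicates)
        )
    return list(current_pvs) + list(new_devices)


# The first extend succeeds; repeating it with the same device is rejected.
pvs = extend_vg("vg-1", [], ["/dev/mapper/lun-1"])
try:
    extend_vg("vg-1", pvs, ["/dev/mapper/lun-1"])
except VolumeGroupExtendError as exc:
    print(exc)
```

With a check like this on the receiving side, a duplicate request fails cleanly instead of adding the same device to the VG a second time.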
(In reply to Nir Soffer from comment #15) Elad, see bug 1261531 about the vdsm side.
(In reply to Elad from comment #14)
> Hi Gordon,
> Can you provide the exact steps to reproduce please?

I reproduced with REST API, sending the update command twice at the same time from two different browsers.
An attempt to extend a block domain (FC and iSCSI) using RHEVM while a similar operation is in progress is now blocked on CDA.

<fault>
<reason>Operation Failed</reason>
<detail>[Cannot extend Storage. Related operation is currently in progress. Please try again later.]</detail>
</fault>

2015-10-15 14:42:37,882 WARN [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-2) [75aa25a5] CanDoAction of action 'ExtendSANStorageDomain' failed for user admin@internal. Reasons: VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__EXTEND,ACTION_TYPE_FAILED_OBJECT_LOCKED

Verified using RHEV-3.6.0-15
rhevm-3.6.0-0.18.el6.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0376.html