Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1438575

Summary: Live merge fails after preview of in-chain snapshot
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 4.1.1.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: unspecified
Reporter: Raz Tamir <ratamir>
Assignee: Adam Litke <alitke>
QA Contact: Raz Tamir <ratamir>
Docs Contact:
CC: alitke, amureini, bugs, tnisan
Keywords: Automation, Regression
Target Milestone: ovirt-4.1.3
Target Release: ---
Flags: rule-engine: ovirt-4.1+
       rule-engine: ovirt-4.2+
       rule-engine: blocker+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-16 10:59:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
engine and vdsm logs (Flags: none)

Description Raz Tamir 2017-04-03 19:52:21 UTC
Created attachment 1268480 [details]
engine and vdsm logs

There are probably simpler steps to reproduce this.

Description of problem:
I have a VM with 4 disks in the following format/interface permutations:
format: COW interface: virtio
format: COW interface: virtio-scsi
format: RAW interface: virtio
format: RAW interface: virtio-scsi

I created 3 snapshots with memory state, deleted the second (live merge), previewed the 3rd snapshot (without the memory), and undid the preview.  After these steps, trying to remove the 3rd snapshot fails with (ovirt-engine.log):

2017-04-03 19:50:10,076+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Unexpected return value: Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']
2017-04-03 19:50:10,076+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Failed in 'MergeVDS' method
2017-04-03 19:50:10,077+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Unexpected return value: Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']
2017-04-03 19:50:10,105+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-5) [598c338b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command MergeVDS failed: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'
2017-04-03 19:50:10,106+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']]'
2017-04-03 19:50:10,106+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] HostName = host_mixed_2
2017-04-03 19:50:10,107+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{runAsync='true', hostId='98fce30d-3423-43b6-89eb-5a3c73517026', vmId='3748922b-cfdb-4377-a07d-573ef3e6622b', storagePoolId='8be71a8d-4e55-4c18-a66b-c66c3027cb7b', storageDomainId='d20d4865-e368-4dbd-9307-0337d2b21af3', imageGroupId='40bb7148-b3f0-4a9b-a76e-bb36e8334f35', imageId='86147e87-d2d2-4560-b991-2c16fcf512d6', baseImageId='b6e37f7a-5fd9-464c-b10d-350b67c9c85e', topImageId='86147e87-d2d2-4560-b991-2c16fcf512d6', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}', code = -32603
2017-04-03 19:50:10,107+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] FINISH, MergeVDSCommand, log id: 1e8fb271
2017-04-03 19:50:10,108+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-5-thread-5) [598c338b] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}', code = -32603 (Failed with error unexpected and code 16)

In vdsm.log:

2017-04-03 19:50:08,986+0300 DEBUG (merge/3748922b) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a51600493|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --refresh d20d4865-e368-4dbd-9307-0337d2b21af3/leases (cwd None) (commands:69)
2017-04-03 19:50:08,953+0300 ERROR (jsonrpc/3) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 547, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 836, in merge
    return v.merge(drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "<string>", line 2, in merge
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 41, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 4824, in merge
    self.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4623, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4652, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3324, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'
2017-04-03 19:50:09,013+0300 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.merge failed (error -32603) in 5.69 seconds (__init__:515)

In the snapshot's disks sub-tab, only one disk (virtio, RAW) is left, and it is in Illegal state.
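
The LookupError in the traceback above comes from vdsm's drive lookup, which matches the stored block job's disk identifiers against the VM's currently attached drives. A minimal, hypothetical sketch of that matching logic (the attribute names are assumptions inferred from the error payload, not vdsm's exact internals; the real method lives in /usr/share/vdsm/virt/vm.py):

# Hypothetical, simplified sketch; not vdsm's actual implementation.
class Drive(object):
    def __init__(self, poolID, domainID, imageID, volumeID):
        self.poolID = poolID
        self.domainID = domainID
        self.imageID = imageID
        self.volumeID = volumeID

class Vm(object):
    def __init__(self, drives):
        self._drives = drives  # the VM's currently attached disk devices

    def _findDriveByUUIDs(self, drive):
        """Match a stored block job's disk spec against attached drives."""
        for d in self._drives:
            if (d.poolID == drive['poolID'] and
                    d.domainID == drive['domainID'] and
                    d.imageID == drive['imageID'] and
                    d.volumeID == drive['volumeID']):
                return d
        raise LookupError("No such drive: '%s'" % drive)

vm = Vm([Drive('pool-1', 'dom-1', 'img-1', 'vol-NEW')])
vm._findDriveByUUIDs({'poolID': 'pool-1', 'domainID': 'dom-1',
                      'imageID': 'img-1', 'volumeID': 'vol-OLD'})  # raises

If the preview/undo cycle changes a drive's active volumeID, any job recorded beforehand can no longer be matched, which would produce exactly this error.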


Version-Release number of selected component (if applicable):
rhevm-4.1.1.7-0.1.el7
vdsm-4.19.10.1-1.el7ev.x86_64

How reproducible:
100% in our automation

Steps to Reproduce:
1. Create a VM with 1 disk with OS
2. Add 4 more disks with the permutations described above
3. Start the VM and create 3 snapshots A, B and C in that order
4. Delete snapshot B
5. Preview snapshot C with memory
6. Undo the previewed snapshot
7. Remove snapshot C

Actual results:
Removal fails, and 1 of the snapshot's 5 disks is left in Illegal state

Expected results:
Snapshot C is removed successfully and none of the snapshot's disks are left in Illegal state

Additional info:
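For reference, the steps above could be scripted; here is a minimal sketch using the oVirt Python SDK (ovirtsdk4), covering steps 3-4 only (per comment 1 below, steps 1-4 alone reproduce the failure). The VM name, credentials, and polling interval are assumptions, and since a race is suspected, exact timing may matter.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connection details are placeholders for an existing environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=snapshot-race-vm')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()


def wait_until_ok(snap_id):
    """Poll until the snapshot leaves the LOCKED state."""
    service = snapshots_service.snapshot_service(snap_id)
    while service.get().snapshot_status != types.SnapshotStatus.OK:
        time.sleep(5)


# Step 3: with the VM running, create snapshots A, B and C with memory.
snaps = {}
for name in ('A', 'B', 'C'):
    snap = snapshots_service.add(
        types.Snapshot(description=name, persist_memorystate=True))
    wait_until_ok(snap.id)
    snaps[name] = snap

# Step 4: delete (live merge) snapshot B while the VM is still running.
snapshots_service.snapshot_service(snaps['B'].id).remove()

connection.close()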

Comment 1 Raz Tamir 2017-04-03 20:59:57 UTC
Just tested a few more scenarios, and it seems the operation fails even without previewing the snapshot.

So the valid steps to reproduce are 1-4

Comment 2 Red Hat Bugzilla Rules Engine 2017-04-04 08:05:26 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Adam Litke 2017-04-06 20:57:51 UTC
My preliminary analysis points to a race between the merge verb (VM.merge) and ongoing block job monitoring (VM.queryBlockJobs).  We see the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 547, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 836, in merge
    return v.merge(drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "<string>", line 2, in merge
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 41, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 4824, in merge
    self.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4623, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4652, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3324, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'

but earlier in the merge verb the same call was fine.  I need to look into this further to understand exactly what happened.
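
To make the suspected interleaving concrete, here is a minimal, hypothetical illustration (none of these names are vdsm's; it only shows how a drive that resolved at the start of the merge verb can fail to resolve again inside updateVmJobs):

# Hypothetical illustration of the suspected race; not vdsm code.
drives = {'disk0': {'volumeID': 'vol-OLD'}}               # attached drives
stored_jobs = {'job0': {'disk': {'volumeID': 'vol-OLD'}}}  # recorded jobs


def find_drive(spec):
    for drive in drives.values():
        if drive['volumeID'] == spec['volumeID']:
            return drive
    raise LookupError("No such drive: '%s'" % spec)


def merge():
    # 1. The merge verb resolves the drive successfully ("earlier in the
    #    merge verb the same call was fine").
    find_drive({'volumeID': 'vol-OLD'})
    # 2. Concurrently, another path (e.g. volume-chain sync after the
    #    preview/undo) updates the drive's active volume:
    drives['disk0']['volumeID'] = 'vol-NEW'
    # 3. updateVmJobs -> queryBlockJobs re-resolves the stored job's disk
    #    spec against the updated drive and now misses:
    for job in stored_jobs.values():
        find_drive(job['disk'])


try:
    merge()
except LookupError as e:
    print("monitoring failed mid-merge: %s" % e)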

Comment 4 Adam Litke 2017-04-12 15:29:19 UTC
Hi Raz,

I am trying to investigate and confirm we are hitting a race as described in comment 3.  Could you please rerun this scenario with the vdsm log level set to DEBUG?  This will help me trace the queryBlockJobs calls to see if they are interleaving somehow.

Thanks.
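
For reference, vdsm reads its logging configuration from /etc/vdsm/logger.conf, which uses Python's standard logging fileConfig format. A sketch of the relevant change, assuming the stock section names (restart vdsmd for it to take effect):

# /etc/vdsm/logger.conf (excerpt; section names assume the stock file)
[logger_root]
level=DEBUG
handlers=syslog,logfile

[logger_vds]
level=DEBUG
handlers=syslog,logfile
qualname=vds
propagate=0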

Comment 5 Raz Tamir 2017-04-13 09:22:04 UTC
Hi Adam,
I ran all our tiers (1, 2 and 3) and I don't see this failure anymore.
I'll update if it happens again; in any case, I will run with the DEBUG log level.

Comment 6 Adam Litke 2017-04-13 13:02:40 UTC
Thanks Raz.  This adds credence to the theory of a race condition triggered by the unique conditions present in the lago environment.

Comment 7 Adam Litke 2017-04-17 14:58:57 UTC
Waiting on more information...

Comment 8 Adam Litke 2017-04-25 19:52:10 UTC
Downgrading this bug since it is not reproducing in automation anymore.  I hesitate to close outright since a legitimate failure occurred.

Comment 9 Red Hat Bugzilla Rules Engine 2017-04-25 19:53:20 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 10 Allon Mureinik 2017-05-15 11:38:02 UTC
(In reply to Adam Litke from comment #8)
> Downgrading this bug since it is not reproducing in automation anymore.  I
> hesitate to close outright since a legitimate failure occurred.

Any news?
Do we have a reproducer for this?

Comment 11 Adam Litke 2017-05-15 15:04:34 UTC
No news, no reproducers.  I wonder if it's related to Bug 1438850 which was fixed by: https://gerrit.ovirt.org/#/c/75954 .  Should we close this?

Comment 12 Raz Tamir 2017-05-15 17:08:07 UTC
It wasn't reproduced on our side either.
As I see it, we can close it with resolution INSUFFICIENT_DATA so we will be able to track it in case it reproduces in the future.

Comment 13 Allon Mureinik 2017-05-16 10:59:22 UTC
(In reply to Adam Litke from comment #11)
> No news, no reproducers.  I wonder if it's related to Bug 1438850 which was
> fixed by: https://gerrit.ovirt.org/#/c/75954 .  Should we close this?

(In reply to Raz Tamir from comment #12)
> It wasn't reproduced on our side either.
> As I see it, we can close it with resolution INSUFFICIENT_DATA so we will
> be able to track it in case it reproduces in the future.

Agreed, closing.