Bug 1438575
| Summary: | Live merge fails after preview of in-chain snapshot | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Raz Tamir <ratamir> |
| Component: | BLL.Storage | Assignee: | Adam Litke <alitke> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Raz Tamir <ratamir> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.1.7 | CC: | alitke, amureini, bugs, tnisan |
| Target Milestone: | ovirt-4.1.3 | Keywords: | Automation, Regression |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.1+, rule-engine: ovirt-4.2+, rule-engine: blocker+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-16 10:59:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Just tested a few more scenarios, and it seems the operation fails even without previewing the snapshot. So the valid steps to reproduce are 1-4.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

My preliminary analysis points to a race between the merge verb (VM.merge) and ongoing block job monitoring (VM.queryBlockJobs). We see the following error:
```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 547, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 836, in merge
    return v.merge(drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "<string>", line 2, in merge
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 41, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 4824, in merge
    self.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4623, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4652, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3324, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'
```
But earlier in the merge verb the same call was fine. I need to look into this further to understand exactly what happened.

Hi Raz, I am trying to investigate and confirm we are hitting a race as described in comment 3. Could you rerun this scenario with the vdsm log level set to DEBUG, please? This will help me trace the queryBlockJobs calls to see whether they are interleaving somehow. Thanks.

Hi Adam, I ran all our tiers (1, 2 and 3) and I don't see this failure again. I'll update when it happens again. In any case, I will run with DEBUG level.

Thanks Raz. This adds credence to the theory of a race condition triggered by the unique conditions present in the lago environment. Waiting on more information...

Downgrading this bug since it is not reproducing in automation anymore. I hesitate to close outright since a legitimate failure occurred.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

(In reply to Adam Litke from comment #8)
> Downgrading this bug since it is not reproducing in automation anymore. I
> hesitate to close outright since a legitimate failure occurred.

Any news? Do we have a reproducer for this?

No news, no reproducers. I wonder if it's related to Bug 1438850, which was fixed by https://gerrit.ovirt.org/#/c/75954. Should we close this?

It wasn't reproduced on our side either. As I see it, we can close it with resolution INSUFFICIENT_DATA so we will be able to track it in case of a future reproduction.

(In reply to Adam Litke from comment #11)
> No news, no reproducers. I wonder if it's related to Bug 1438850 which was
> fixed by: https://gerrit.ovirt.org/#/c/75954 . Should we close this?

(In reply to Raz Tamir from comment #12)
> It wasn't reproduced on our side as well.
> As I see it we can close it with resolution insufficient data so we will be
> able to track it in case of future reproduction

Agreed, closing.
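The suspected failure mode can be sketched in a few lines. This is a hypothetical illustration, not vdsm's actual code: all names (`Vm`, `find_drive_by_uuids`, `preview_and_undo`) and the simplified single-key lookup are made up for this sketch. The idea, consistent with the traceback above, is that a stored block-job record still references the volume that was active when the job was recorded; if a snapshot preview/undo replaces the drive's active volume chain, the next `queryBlockJobs` lookup finds no matching drive and raises `LookupError`, failing the whole merge call.

```python
OLD_VOL = "bc9508be-3cd3-4fd9-873d-4928642f2bfe"  # volume from the report
NEW_VOL = "86147e87-d2d2-4560-b991-2c16fcf512d6"  # topImageId from the report


class Vm:
    """Toy model of the stale-job hypothesis; not vdsm's Vm class."""

    def __init__(self):
        # One drive whose active volume is OLD_VOL.
        self.drives = [{"volumeID": OLD_VOL}]
        # A job stored while OLD_VOL was still the active volume.
        self.stored_jobs = {"job-1": {"disk": {"volumeID": OLD_VOL}}}

    def find_drive_by_uuids(self, disk):
        # Simplified stand-in for _findDriveByUUIDs: match on volumeID only.
        for drive in self.drives:
            if drive["volumeID"] == disk["volumeID"]:
                return drive
        raise LookupError("No such drive: '%s'" % disk)

    def query_block_jobs(self):
        # Mirrors the traceback: every stored job must resolve to a drive.
        return [self.find_drive_by_uuids(job["disk"])
                for job in self.stored_jobs.values()]

    def preview_and_undo(self):
        # Snapshot preview/undo replaces the drive's active volume chain,
        # but the stored job record is not updated.
        self.drives = [{"volumeID": NEW_VOL}]


vm = Vm()
vm.query_block_jobs()      # fine while the stored job matches a drive
vm.preview_and_undo()
try:
    vm.query_block_jobs()  # now fails, like the reported VM.merge call
except LookupError as e:
    print(e)
```

Whether this is exactly what happened in the lago environment was never confirmed; the sketch only shows why the same lookup can succeed early in the merge verb and fail moments later.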
Created attachment 1268480 [details]
engine and vdsm logs

There are probably simpler steps to reproduce this.

Description of problem:

I have a VM with 4 disks with the following permutations:

- format: COW, interface: virtio
- format: COW, interface: virtio-scsi
- format: RAW, interface: virtio
- format: RAW, interface: virtio-scsi

I created 3 snapshots with memory state, deleted (live merge) the second, previewed the 3rd snapshot (without the memory) and undid the preview. After these steps, when trying to remove the 3rd snapshot, the operation fails with (ovirt-engine.log):

```
2017-04-03 19:50:10,076+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Unexpected return value: Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']
2017-04-03 19:50:10,076+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Failed in 'MergeVDS' method
2017-04-03 19:50:10,077+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Unexpected return value: Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']
2017-04-03 19:50:10,105+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-5) [598c338b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command MergeVDS failed: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'
2017-04-03 19:50:10,106+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=-32603, message=No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}']]'
2017-04-03 19:50:10,106+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] HostName = host_mixed_2
2017-04-03 19:50:10,107+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{runAsync='true', hostId='98fce30d-3423-43b6-89eb-5a3c73517026', vmId='3748922b-cfdb-4377-a07d-573ef3e6622b', storagePoolId='8be71a8d-4e55-4c18-a66b-c66c3027cb7b', storageDomainId='d20d4865-e368-4dbd-9307-0337d2b21af3', imageGroupId='40bb7148-b3f0-4a9b-a76e-bb36e8334f35', imageId='86147e87-d2d2-4560-b991-2c16fcf512d6', baseImageId='b6e37f7a-5fd9-464c-b10d-350b67c9c85e', topImageId='86147e87-d2d2-4560-b991-2c16fcf512d6', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}', code = -32603
2017-04-03 19:50:10,107+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [598c338b] FINISH, MergeVDSCommand, log id: 1e8fb271
2017-04-03 19:50:10,108+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-5-thread-5) [598c338b] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}', code = -32603 (Failed with error unexpected and code 16)
```

In vdsm.log:

```
2017-04-03 19:50:08,986+0300 DEBUG (merge/3748922b) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a51600493|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --refresh d20d4865-e368-4dbd-9307-0337d2b21af3/leases (cwd None) (commands:69)
2017-04-03 19:50:08,953+0300 ERROR (jsonrpc/3) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 547, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 836, in merge
    return v.merge(drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "<string>", line 2, in merge
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 41, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 4824, in merge
    self.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4623, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4652, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3324, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': 'd20d4865-e368-4dbd-9307-0337d2b21af3', 'imageID': '15f4c58e-e177-4095-9f8c-77c00581629f', 'volumeID': 'bc9508be-3cd3-4fd9-873d-4928642f2bfe', 'poolID': '8be71a8d-4e55-4c18-a66b-c66c3027cb7b'}'
2017-04-03 19:50:09,013+0300 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.merge failed (error -32603) in 5.69 seconds (__init__:515)
```

In the snapshot's sub-tab, only 1 disk (virtio RAW) is left, in Illegal state.

Version-Release number of selected component (if applicable):
rhevm-4.1.1.7-0.1.el7
vdsm-4.19.10.1-1.el7ev.x86_64

How reproducible:
100% in our automation

Steps to Reproduce:
1. Create a VM with 1 disk with OS
2. Add 4 more disks with the permutations described above
3. Start the VM and create 3 snapshots A, B and C, in that order
4. Delete snapshot B
5. Preview snapshot C with memory
6. Undo the previewed snapshot
7. Remove snapshot C

Actual results:
Removal fails, and only 1 of the snapshot's 5 disks remains, in Illegal state

Expected results:
Snapshot C is removed successfully

Additional info:
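Since the traceback shows one unresolvable stored job aborting the whole VM.merge RPC, a generic mitigation for this class of failure is to make job monitoring tolerant of a drive that can no longer be looked up. The sketch below is hypothetical: it is not the actual vdsm code or the fix from gerrit change 75954, and the function names and data shapes are invented for illustration.

```python
import logging

def query_block_jobs(stored_jobs, find_drive_by_uuids):
    """Resolve each tracked job to its drive, skipping stale records.

    stored_jobs: dict of job id -> {"disk": <lookup key>}
    find_drive_by_uuids: callable that returns a drive or raises LookupError
    """
    jobs = {}
    for job_id, job in stored_jobs.items():
        try:
            drive = find_drive_by_uuids(job["disk"])
        except LookupError:
            # The drive vanished (e.g. after a snapshot preview/undo):
            # log and continue instead of failing the entire RPC call.
            logging.warning("Skipping job %s for unknown drive %r",
                            job_id, job["disk"])
            continue
        jobs[job_id] = {"id": job_id, "drive": drive}
    return jobs
```

With this pattern, one stale job record degrades to a warning rather than an internal server error, while jobs whose drives still resolve are reported normally. Whether this matches the eventual upstream behavior was not established in this bug, which was closed INSUFFICIENT_DATA.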