Created attachment 1472152 [details]
engine and VDSM logs

Description of problem:
Automation tier4 TestCase6045 with live merge. Perform a live merge (VM + 4 disks + 3 snapshots s1, s2, s3), delete s2, then restart VDSM - the VM goes to an unresponsive state.

Event:
Aug 1, 2018, 3:35:01 PM  VM vm_TestCase6045_0115045030 is not responding.

Engine:
2018-08-01 15:27:12,337+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [snapshots_delete_8298f174-29e5-4bb7] Command 'RemoveSnapshotSingleDiskLive' (id: 'c2a675f4-7692-486b-a3ca-64a3907b9d06') waiting on child command id: 'cbdb8c98-8304-43dd-a7ee-2fc933677d48' type:'Merge' to complete
2018-08-01 15:27:14,413+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] VM 'ef167fbd-2d8c-44f3-920b-9c59e900af0f'(vm_TestCase6045_0115045030) moved from 'Up' --> 'NotResponding'
2018-08-01 15:27:14,485+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] EVENT_ID: VM_NOT_RESPONDING(126), VM vm_TestCase6045_0115045030 is not responding.
Merge started:
2018-08-01 15:26:17,023+0300 INFO  (jsonrpc/3) [api.virt] START merge(

The merge failed, and from that point it keeps failing in the VDSM log:

2018-08-01 15:27:16,430+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {'topVolUUID': '2d77b6b8-7bb8-494f-be54-46623d7cb7da', 'vmID': 'ef167fbd-2d8c-44f3-920b-9c59e900af0f', 'drive': {'imageID': '30ea2bd8-a605-49bb-adea-8b6f5325f4d0', 'volumeID': '26f25780-fa8d-4bf8-b654-d86fab629ab2', 'domainID': 'ca5552b7-305a-4667-9688-a42c180806ff', 'poolID': '289db27b-7067-4c88-8311-88e0092e692d'}, 'bandwidth': '0', 'jobUUID': 'e05c259c-adcb-4b37-a3ab-b4e7d5f82457', 'baseVolUUID': '62595d43-e15a-45ad-bc8e-9f69641439bc'}, 'jsonrpc': '2.0', 'method': 'VM.merge', 'id': 'cfb7738e-20fe-454c-ae78-308f48ba446f'} at 0x7fb8d16ba890> timeout=60, duration=60 at 0x7fb8d16baed0> task#=248 at 0x7fb8e0164750>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 738, in merge
  drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5999, in merge
  self.updateVmJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5776, in updateVmJobs
  self._vmJobs = self.queryBlockJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5790, in queryBlockJobs
  with self._jobsLock:
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 60, in __enter__
  self.acquire()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 68, in acquire
  rc = self.lock() if blocking else self.trylock()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 96, in lock
  return _libpthread.pthread_mutex_lock(self._mutex) (executor:363)

20 minutes later, still the same:

2018-08-01 15:56:17,473+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {'topVolUUID': 'a44100da-580b-4e58-b9e4-0cc0ab29c0f3', 'vmID': 'ef167fbd-2d8c-44f3-920b-9c59e900af0f', 'drive': {'imageID': '5fad9513-6f49-46a8-9790-c2e959e2d930', 'volumeID': 'd9ddd8a0-2985-46ae-972f-5a611ca29a5a', 'domainID': 'ca5552b7-305a-4667-9688-a42c180806ff', 'poolID': '289db27b-7067-4c88-8311-88e0092e692d'}, 'bandwidth': '0', 'jobUUID': '9f8d9c64-684d-4bf9-953f-28c337955316', 'baseVolUUID': '33079e30-6846-4b61-aa20-4651f4fa0de0'}, 'jsonrpc': '2.0', 'method': 'VM.merge', 'id': '7c83e2ab-5936-47d3-9822-85cb4e77fb59'} at 0x7fb8d136d490> timeout=60, duration=1800 at 0x7fb8d136d510> task#=246 at 0x7fb8e0170810>,
traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 738, in merge

Version-Release number of selected component (if applicable):
Engine 4.2.6_SNAPSHOT-84.gad3de30.0.scratch.master.el7ev
VDSM vdsm-4.20.35-1

How reproducible:
Happened once so far

Steps to Reproduce:
1. Create a VM with 4 disks and OS installed
2. Write file1 to all disks
3. Create a snapshot S1 of the VM with all disks
4. Write file2 to all disks
5. Create a snapshot S2 of the VM with all disks
6. Write file3 to all disks
7. Create a snapshot S3 of the VM with all disks
8. Delete snapshot S2 and, while the snapshot disks are locked, restart VDSM

Timeline when the issue occurred:
2018-08-01 15:26:12,019 - MainThread - art.ll_lib.vms - INFO - Snapshots found: ['Active VM', 'snapshot_6045_iscsi_0', 'snapshot_6045_iscsi_1', 'snapshot_6045_iscsi_2']
2018-08-01 15:26:12,019 - MainThread - art.ll_lib.vms - INFO - Removing snapshot snapshot_6045_iscsi_1
2018-08-01 15:26:14,013 - MainThread - art.ll_lib.vms - INFO - Waiting until snapshots: snapshot_6045_iscsi_1 of vm_TestCase6045_0115045030 vm are in one of following states: locked
2018-08-01 15:26:22,042 - MainThread - art.ll_lib.hosts - INFO - kill VDSM pid: 7143 on host 10.35.82.81

Merge started:
2018-08-01 15:26:17,023+0300 INFO  (jsonrpc/3) [api.virt] START merge(

Actual results:
The merge failed and the VM went unresponsive.
2018-08-01 15:27:14,413+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] VM 'ef167fbd-2d8c-44f3-920b-9c59e900af0f'(vm_TestCase6045_0115045030) moved from 'Up' --> 'NotResponding'
2018-08-01 15:27:14,485+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] EVENT_ID: VM_NOT_RESPONDING(126), VM vm_TestCase6045_0115045030 is not responding.

Expected results:
Even if the merge fails, the VM should come back up.

Additional info:
The VM stays non-responsive, and the deleted snapshot "snapshot_6045_iscsi_1" disks are all in "Illegal" status.

Details on snapshot_6045_iscsi_1 disks:

disk 1:
Status             Disk in status Illegal
Alias              disk_virtio_scsicow_0115081015
Virtual Size       1 GiB
Actual Size        1 GiB
Allocation Policy  Sparse
Interface          VirtIO_SCSI
Creation Date      Aug 1, 2018, 3:11:49 PM
Disk Snapshot ID   415b9663-d91f-4d69-92fa-e433e8194686
Type               IMAGE
Description        N/A

disk 2:
Status             Disk in status Illegal
Alias              disk_virtio_scsiraw_0115081016
Virtual Size       1 GiB
Actual Size        1 GiB
Allocation Policy  Sparse
Interface          VirtIO_SCSI
Creation Date      Aug 1, 2018, 3:11:40 PM
Disk Snapshot ID   33079e30-6846-4b61-aa20-4651f4fa0de0
Type               IMAGE
Description        N/A

disk 3:
Status             Disk in status Illegal
Alias              disk_virtiocow_0115081015
Virtual Size       1 GiB
Actual Size        1 GiB
Allocation Policy  Sparse
Interface          VirtIO
Creation Date      Aug 1, 2018, 3:11:45 PM
Disk Snapshot ID   62595d43-e15a-45ad-bc8e-9f69641439bc
Type               IMAGE
Description        N/A

disk 4:
Status             Disk in status Illegal
Alias              disk_virtioraw_0115081016
Virtual Size       1 GiB
Actual Size        1 GiB
Allocation Policy  Sparse
Interface          VirtIO
Creation Date      Aug 1, 2018, 3:11:54 PM
Disk Snapshot ID   2078a3b3-5073-44ed-8f53-eaf5461a6ac2
Type               IMAGE
Description        N/A

disk 5:
Status             Disk in status Illegal
Alias              vm_TestCase6045_0115045030_Disk_0
Virtual Size       10 GiB
Actual Size        1 GiB
Allocation Policy  Sparse
Interface          VirtIO
Creation Date      Aug 1, 2018, 3:11:52 PM
Disk Snapshot ID   e45b404d-e8dd-45af-9cb3-ebff3f865410
Type               IMAGE
Description        rhel7.4_rhv4.2_guest_disk (2954e57)
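For context on where the worker hangs: the tracebacks above end with the jsonrpc worker blocking in queryBlockJobs() on self._jobsLock, a pthreading mutex acquired with no timeout. A minimal standalone sketch of that pattern, assuming nothing about VDSM internals (the names jobs_lock and stuck_holder are invented for illustration, this is not VDSM code):

```python
import threading
import time

# One thread grabs the lock and never releases it (simulating the VDSM
# thread left holding the jobs lock), so a second "worker" blocks on
# acquire() -- just as the jsonrpc worker blocks on self._jobsLock.
jobs_lock = threading.Lock()

def stuck_holder():
    with jobs_lock:
        time.sleep(5)  # holds the lock far longer than the worker waits

holder = threading.Thread(target=stuck_holder, daemon=True)
holder.start()
time.sleep(0.1)  # make sure the holder has grabbed the lock first

# vdsm's pthreading lock blocks indefinitely at this point; using a
# timeout lets the sketch observe the contention instead of hanging.
acquired = jobs_lock.acquire(timeout=0.5)
print("worker acquired lock:", acquired)  # -> False
```

The real lock in the traceback has no timeout, which would explain why the worker never recovers and the executor keeps logging "Worker blocked" with a growing duration (60s, then 1800s).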
The issue was not seen when the same test ran the same scenario with cold merge + VDSM restart.
The issue reproduced 2/2 times running automation TestCase6045 (live merge + VDSM restart scenario).
This is the same flow as in bug 1601212.
Closing this bug; it seems to be a duplicate of bug 1601212. If this bug reproduces after bug 1601212 is fixed, please reopen.

*** This bug has been marked as a duplicate of bug 1601212 ***
Hi Eyal, why is this a dup of bug 1601212? The flow and the error are different.
(In reply to Elad from comment #6)
> Hi Eyal, why is this a dup of bug 1601212?
> The flow and the error are different.

The flow is the same; we just added the host restart in the middle of the snapshot deletion. The snapshot deletion already failed as described in bug 1601212, and I think that is the root cause. You can see in the VDSM log the same WARN and INFO messages indicating that QEMU ignores the block-commit operation sent from libvirt, which also occurs in bug 1601212:

2018-08-01 15:26:35,787+0300 WARN  (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/4 waiting task#=0 at 0x7fb8d161d810>, <Worker name=periodic/0 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x7fb8e01100d0> at 0x7fb8e0110110> timeout=7.5, duration=7 at 0x7fb8d15e7310> task#=1929 at 0x7fb8e017f410>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fb8e0110310> at 0x7fb8e01108d0> timeout=7.5, duration=7 at 0x7fb8d16fea10> discarded task#=1900 at 0x7fb8e017fad0>, <Worker name=periodic/2 waiting task#=1875 at 0x7fb8e017f850>, <Worker name=periodic/1 waiting task#=1903 at 0x7fb8e017f610>]) (executor:213)
2018-08-01 15:26:36,000+0300 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/0 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x7fb8e01100d0> at 0x7fb8e0110110> timeout=7.5, duration=7 at 0x7fb8d15e7310> discarded task#=1929 at 0x7fb8e017f410> (executor:355)

As I wrote, if this bug reproduces after bug 1601212 is fixed, please reopen.
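The "Worker discarded" message above is the executor giving up on a periodic operation that overran its deadline. A rough standalone illustration of that idea (run_with_timeout is a made-up helper for this sketch, not VDSM's executor API): the overdue task is reported as discarded rather than being waited on:

```python
import threading
import time

def run_with_timeout(func, timeout):
    """Run func in a worker thread; report whether it beat the deadline."""
    done = threading.Event()

    def wrapper():
        func()
        done.set()

    worker = threading.Thread(target=wrapper, daemon=True)
    worker.start()
    if not done.wait(timeout):
        # The worker is still running past its deadline; the scheduler
        # stops waiting for it, like vdsm's "Worker discarded" message.
        return "discarded"
    return "completed"

print(run_with_timeout(lambda: time.sleep(1.0), timeout=0.2))  # -> discarded
print(run_with_timeout(lambda: None, timeout=0.2))             # -> completed
```

In the log above, the BlockjobMonitor operation (timeout=7.5) is exactly such a discarded task, which is consistent with the block-job monitoring being stuck while the merge hangs.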