Bug 1610929 - Live merge + VDSM restart - VM goes to unresponsive state as the deleted snapshot disks all go to illegal status
Summary: Live merge + VDSM restart - VM goes to unresponsive state as the deleted snapshot disks all go to illegal status
Keywords:
Status: CLOSED DUPLICATE of bug 1601212
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1601212 1637976
Blocks:
 
Reported: 2018-08-01 16:20 UTC by Avihai
Modified: 2018-10-10 12:40 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-14 06:21:31 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.3+


Attachments
engine and VDSM logs (1.46 MB, application/x-gzip)
2018-08-01 16:20 UTC, Avihai

Description Avihai 2018-08-01 16:20:03 UTC
Created attachment 1472152 [details]
engine and VDSM logs

Description of problem:
Automation tier4 TestCase6045 with live merge.

Perform a live merge (VM + 4 disks + 3 snapshots s1, s2, s3): delete s2 and restart VDSM while the merge is running - the VM goes to an unresponsive state.

Event:
Aug 1, 2018, 3:35:01 PM VM vm_TestCase6045_0115045030 is not responding.

Engine:
2018-08-01 15:27:12,337+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [snapshots_delete_8298f174-29e5-4bb7] Command 'RemoveSnapshotSingleDiskLive' (id: 'c2a675f4-7692-486b-a3ca-64a3907b9d06') waiting on child command id: 'cbdb8c98-8304-43dd-a7ee-2fc933677d48' type:'Merge' to complete
2018-08-01 15:27:14,413+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] VM 'ef167fbd-2d8c-44f3-920b-9c59e900af0f'(vm_TestCase6045_0115045030) moved from 'Up' --> 'NotResponding'
2018-08-01 15:27:14,485+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] EVENT_ID: VM_NOT_RESPONDING(126), VM vm_TestCase6045_0115045030 is not responding.


merge started:
2018-08-01 15:26:17,023+0300 INFO  (jsonrpc/3) [api.virt] START merge(

The merge got stuck, and from that point the VDSM log keeps reporting the worker as blocked:
2018-08-01 15:27:16,430+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {'topVolUUID': '2d77b6b8-7bb8-494f-be54-46623d7cb7da', 'vmID': 'ef167fbd-2d8c-44f3-920b-9c59e900af0f', 'drive': {'imageID': '30ea2bd8-a605-49bb-adea-8b6f5325f4d0', 'volumeID': '26f25780-fa8d-4bf8-b654-d86fab629ab2', 'domainID': 'ca5552b7-305a-4667-9688-a42c180806ff', 'poolID': '289db27b-7067-4c88-8311-88e0092e692d'}, 'bandwidth': '0', 'jobUUID': 'e05c259c-adcb-4b37-a3ab-b4e7d5f82457', 'baseVolUUID': '62595d43-e15a-45ad-bc8e-9f69641439bc'}, 'jsonrpc': '2.0', 'method': 'VM.merge', 'id': 'cfb7738e-20fe-454c-ae78-308f48ba446f'} at 0x7fb8d16ba890> timeout=60, duration=60 at 0x7fb8d16baed0> task#=248 at 0x7fb8e0164750>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 738, in merge
  drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5999, in merge
  self.updateVmJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5776, in updateVmJobs
  self._vmJobs = self.queryBlockJobs()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5790, in queryBlockJobs
  with self._jobsLock:
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 60, in __enter__
  self.acquire()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 68, in acquire
  rc = self.lock() if blocking else self.trylock()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 96, in lock
  return _libpthread.pthread_mutex_lock(self._mutex) (executor:363)
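
The stack trace shows the VM.merge worker stuck in queryBlockJobs(), waiting to acquire self._jobsLock, i.e. the lock is presumably still held by another thread (for example a periodic block-job monitor stuck on an unresponsive libvirt/QEMU call). A minimal, stdlib-only sketch of that blocking pattern, using hypothetical names rather than actual VDSM code:

# Illustration of the blocking pattern above (hypothetical names, not VDSM code):
# a worker that needs jobs_lock stalls for as long as the lock holder is stuck
# inside a hung call.
import threading
import time

jobs_lock = threading.Lock()

def hung_qemu_call():
    # Stands in for a libvirt/QEMU call that never returns.
    time.sleep(3600)

def block_job_monitor():
    with jobs_lock:           # the holder takes the lock...
        hung_qemu_call()      # ...and hangs while still holding it

def merge_request():
    with jobs_lock:           # the VM.merge path (queryBlockJobs) needs the same lock
        print('jobs queried')  # never reached while the monitor is stuck

threading.Thread(target=block_job_monitor, daemon=True).start()
time.sleep(0.1)
worker = threading.Thread(target=merge_request, daemon=True)
worker.start()
worker.join(timeout=60)
print('merge worker blocked' if worker.is_alive() else 'merge worker finished')

Every other request that needs the same lock piles up the same way, which would match the 60 s executor timeout and the growing 'duration' values in the warnings.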

Twenty minutes later the worker is still blocked:
2018-08-01 15:56:17,473+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {'topVolUUID': 'a44100da-580b-4e58-b9e4-0cc0ab29c0f3', 'vmID': 'ef167fbd-2d8c-44f3-920b-9c59e900af0f', 'drive': {'imageID': '5fad9513-6f49-46a8-9790-c2e959e2d930', 'volumeID': 'd9ddd8a0-2985-46ae-972f-5a611ca29a5a', 'domainID': 'ca5552b7-305a-4667-9688-a42c180806ff', 'poolID': '289db27b-7067-4c88-8311-88e0092e692d'}, 'bandwidth': '0', 'jobUUID': '9f8d9c64-684d-4bf9-953f-28c337955316', 'baseVolUUID': '33079e30-6846-4b61-aa20-4651f4fa0de0'}, 'jsonrpc': '2.0', 'method': 'VM.merge', 'id': '7c83e2ab-5936-47d3-9822-85cb4e77fb59'} at 0x7fb8d136d490> timeout=60, duration=1800 at 0x7fb8d136d510> task#=246 at 0x7fb8e0170810>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in merge
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 738, in merge
:


Version-Release number of selected component (if applicable):
Engine 4.2.6_SNAPSHOT-84.gad3de30.0.scratch.master.el7ev
VDSM vdsm-4.20.35-1

How reproducible:
Happened once so far


Steps to Reproduce:
1. Create a VM with 4 disks and OS installed
2. Write file1 to all disks
3. Create a snapshot S1 of the VM with all disks
4. Write file2 to all disks
5. Create a snapshot S2 of the VM with all disks
6. Write file3 to all disks
7. Create snapshot S3 of the VM with all disks
8. Delete snapshot S2 and, while the snapshot disks are locked, restart VDSM (see the sketch after this list)
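
For reference, a minimal sketch of steps 3-8 using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, VM lookup, and the ssh-based VDSM restart are assumptions for illustration only, not the actual automation (art.ll_lib) code behind TestCase6045:

# Sketch of steps 3-8 with ovirtsdk4 (assumed environment details).
import subprocess
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # assumed engine URL
    username='admin@internal',
    password='password',                                # assumed credentials
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm_TestCase6045_0115045030')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Steps 3, 5, 7: create snapshots S1..S3. Writing file1/file2/file3 happens
# inside the guest and is omitted; real code must also wait for each snapshot
# to reach OK status before continuing.
for description in ('S1', 'S2', 'S3'):
    snapshots_service.add(types.Snapshot(description=description, persist_memorystate=False))

# Step 8: delete S2, then restart VDSM on the host while the snapshot disks
# are still locked (the test killed the VDSM pid; an ssh restart is shown here).
s2 = next(s for s in snapshots_service.list() if s.description == 'S2')
snapshots_service.snapshot_service(s2.id).remove()
subprocess.call(['ssh', 'root@10.35.82.81', 'systemctl restart vdsmd'])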

Timeline when the issue occurred:
2018-08-01 15:26:12,019 - MainThread - art.ll_lib.vms - INFO - Snapshots found: ['Active VM', 'snapshot_6045_iscsi_0', 'snapshot_6045_iscsi_1', 'snapshot_6045_iscsi_2']
2018-08-01 15:26:12,019 - MainThread - art.ll_lib.vms - INFO - Removing snapshot snapshot_6045_iscsi_1
2018-08-01 15:26:14,013 - MainThread - art.ll_lib.vms - INFO - Waiting until snapshots: snapshot_6045_iscsi_1 of vm_TestCase6045_0115045030 vm are in one of following states: locked
2018-08-01 15:26:22,042 - MainThread - art.ll_lib.hosts - INFO - kill VDSM pid: 7143 on host 10.35.82.81

merge started:
2018-08-01 15:26:17,023+0300 INFO  (jsonrpc/3) [api.virt] START merge(

Actual results:
The merge failed and the VM went unresponsive.

2018-08-01 15:27:14,413+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] VM 'ef167fbd-2d8c-44f3-920b-9c59e900af0f'(vm_TestCase6045_0115045030) moved from 'Up' --> 'NotResponding'
2018-08-01 15:27:14,485+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [] EVENT_ID: VM_NOT_RESPONDING(126), VM vm_TestCase6045_0115045030 is not responding.

Expected results:
Even if the merge fails, the VM should return to the Up state.


Additional info:

Comment 1 Avihai 2018-08-01 16:59:11 UTC
The VM stays non-responsive, as the disks of the deleted snapshot "snapshot_6045_iscsi_1" are all in "Illegal" status.

Details on snapshot_6045_iscsi_1 disks:

disk 1:
Status: Disk in status Illegal
Alias: disk_virtio_scsicow_0115081015
Virtual Size: 1 GiB
Actual Size: 1 GiB
Allocation Policy: Sparse
Interface: VirtIO_SCSI
Creation Date: Aug 1, 2018, 3:11:49 PM
Disk Snapshot ID: 415b9663-d91f-4d69-92fa-e433e8194686
Type: IMAGE
Description: N/A

disk 2:
Status: Disk in status Illegal
Alias: disk_virtio_scsiraw_0115081016
Virtual Size: 1 GiB
Actual Size: 1 GiB
Allocation Policy: Sparse
Interface: VirtIO_SCSI
Creation Date: Aug 1, 2018, 3:11:40 PM
Disk Snapshot ID: 33079e30-6846-4b61-aa20-4651f4fa0de0
Type: IMAGE
Description: N/A

disk 3:
Status: Disk in status Illegal
Alias: disk_virtiocow_0115081015
Virtual Size: 1 GiB
Actual Size: 1 GiB
Allocation Policy: Sparse
Interface: VirtIO
Creation Date: Aug 1, 2018, 3:11:45 PM
Disk Snapshot ID: 62595d43-e15a-45ad-bc8e-9f69641439bc
Type: IMAGE
Description: N/A

disk 4:
Status: Disk in status Illegal
Alias: disk_virtioraw_0115081016
Virtual Size: 1 GiB
Actual Size: 1 GiB
Allocation Policy: Sparse
Interface: VirtIO
Creation Date: Aug 1, 2018, 3:11:54 PM
Disk Snapshot ID: 2078a3b3-5073-44ed-8f53-eaf5461a6ac2
Type: IMAGE
Description: N/A

disk 5 (OS disk):
Status: Disk in status Illegal
Alias: vm_TestCase6045_0115045030_Disk_0
Virtual Size: 10 GiB
Actual Size: 1 GiB
Allocation Policy: Sparse
Interface: VirtIO
Creation Date: Aug 1, 2018, 3:11:52 PM
Disk Snapshot ID: e45b404d-e8dd-45af-9cb3-ebff3f865410
Type: IMAGE
Description: rhel7.4_rhv4.2_guest_disk (2954e57)
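
The Illegal statuses listed above can also be collected programmatically. A minimal sketch with the oVirt Python SDK (ovirtsdk4), where the engine URL, credentials, and storage domain name are assumptions for illustration:

# Sketch: report disk snapshots in ILLEGAL status on a storage domain.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # assumed engine URL
    username='admin@internal',
    password='password',                                # assumed credentials
    insecure=True,
)
sds_service = connection.system_service().storage_domains_service()
sd = sds_service.list(search='name=iscsi_0')[0]          # assumed domain name
snaps = sds_service.storage_domain_service(sd.id).disk_snapshots_service().list()
for snap in snaps:
    if snap.status == types.DiskStatus.ILLEGAL:
        print('Illegal disk snapshot: %s (%s)' % (snap.alias, snap.id))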

Comment 2 Avihai 2018-08-01 17:00:14 UTC
The issue was not seen when the same test ran the same scenario with cold merge + VDSM restart.

Comment 3 Avihai 2018-08-02 03:50:20 UTC
The issue reproduced 2/2 times when running automation TestCase6045 (live merge + VDSM restart scenario).

Comment 4 Eyal Shenitzky 2018-08-02 10:58:27 UTC
This flow is the same flow as bug 1601212.

Comment 5 Eyal Shenitzky 2018-08-14 06:21:31 UTC
Closing this bug.

This seems to be a duplicate of bug 1601212.

If this bug reproduces after bug 1601212 is fixed, please reopen.

*** This bug has been marked as a duplicate of bug 1601212 ***

Comment 6 Elad 2018-08-14 07:50:44 UTC
Hi Eyal, why is this a dup of bug 1601212? 
The flow and the error are different.

Comment 7 Eyal Shenitzky 2018-08-14 08:18:28 UTC
(In reply to Elad from comment #6)
> Hi Eyal, why is this a dup of bug 1601212? 
> The flow and the error are different.

The flow is the same; we just added the host restart in the middle of the snapshot deletion.

The snapshot deletion already failed as described in bug 1601212 and I think that this is the root cause.

You can see in the VDSM log the same WARN and INFO messages, which indicate that QEMU ignores the block-commit operation sent by libvirt; the same occurs in bug 1601212:

2018-08-01 15:26:35,787+0300 WARN  (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/4 waiting task#=0 at 0x7fb8d161d810>, <Worker name=periodic/0 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x7fb8e01100d0> at 0x7fb8e0110110> timeout=7.5, duration=7 at 0x7fb8d15e7310> task#=1929 at 0x7fb8e017f410>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fb8e0110310> at 0x7fb8e01108d0> timeout=7.5, duration=7 at 0x7fb8d16fea10> discarded task#=1900 at 0x7fb8e017fad0>, <Worker name=periodic/2 waiting task#=1875 at 0x7fb8e017f850>, <Worker name=periodic/1 waiting task#=1903 at 0x7fb8e017f610>]) (executor:213)
2018-08-01 15:26:36,000+0300 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/0 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x7fb8e01100d0> at 0x7fb8e0110110> timeout=7.5, duration=7 at 0x7fb8d15e7310> discarded task#=1929 at 0x7fb8e017f410> (executor:355)
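
For reference, the BlockjobMonitor task that gets discarded above is essentially a periodic poll of libvirt for the active block job on each drive. A minimal libvirt-python sketch of that query, with the VM name and drive target taken as assumptions for illustration:

# Sketch: ask libvirt for the active block job on one drive (what the
# periodic BlockjobMonitor effectively depends on).
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm_TestCase6045_0115045030')  # assumed VM name
info = dom.blockJobInfo('sda', 0)                      # empty dict = no active job
if info:
    print('block job: type=%s cur=%s end=%s' % (info.get('type'), info.get('cur'), info.get('end')))
else:
    print('no active block job on sda')

If QEMU stops answering, such a poll can exceed the 7.5 s periodic timeout and the worker gets discarded, which would match the executor messages above.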

As I wrote, if this bug reproduces after bug 1601212 is fixed, please reopen.

