Bug 1376580

Summary: Live Merge: Handle block copy still active error
Product: [oVirt] vdsm Reporter: Markus Stockhausen <mst>
Component: GeneralAssignee: Ala Hino <ahino>
Status: CLOSED CURRENTRELEASE QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.17.28CC: ahino, alitke, amureini, bugs, creatmbox, eedri, jspanko, klaas, lveyde, mkalinin, nsoffer, tnisan, ylavi
Target Milestone: ovirt-4.1.2Keywords: ZStream
Target Release: 4.19.11Flags: rule-engine: ovirt-4.1+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1427184 1438850 (view as bug list) Environment:
Last Closed: 2017-05-23 08:21:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1427184, 1438850    
Attachments:
Description Flags
Engine Log
none
VDSM Log
none
Task State
none
io during failing live merge
none
vdsm log 2nd error none

Description Markus Stockhausen 2016-09-15 19:21:20 UTC
Description of problem:

A live merge of a disk snapshot is initiated on a running VM. The task aborts but engine sees the job still running.

Version-Release number of selected component (if applicable):

Ovirt 3.6.7
VDSM 4.17.28

How reproducible:

Don't know

Steps to Reproduce:
1. Start Snapshot live merge on running VM

Actual results:

Job fails. Tasks stays active

Thread-15824378::ERROR::2016-09-15 11:17:36,599::utils::739::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 736, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5266, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 5235, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 733, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: Block Kopieren ist immer noch aktiv: Disk 'vda' noch nicht bereit für Pivot

Expected results:

Snapshot should finish successfully. If the error is correct engine should correctly determine the error state and cancel the running task.

Additional info:

Log files and screenshots attached

Comment 1 Markus Stockhausen 2016-09-15 19:22:43 UTC
Created attachment 1201372 [details]
Engine Log

Comment 2 Markus Stockhausen 2016-09-15 19:23:03 UTC
Created attachment 1201373 [details]
VDSM Log

Comment 3 Markus Stockhausen 2016-09-15 19:24:45 UTC
Created attachment 1201374 [details]
Task State

Comment 4 Markus Stockhausen 2016-10-13 19:07:41 UTC
Yesterday the error occured once again. This time a VM with 6 disk with a disk snapshot that contained 4 disks.

Result: task is running but vdsm shows abort traceback.

Very interesting for us, after the livemerge init qemu started some kind of high disk IO for more than 6 hours. So we assume, that something was doing merge operations but we do not know was really going on. See screenshot attached.

Comment 5 Markus Stockhausen 2016-10-13 19:08:27 UTC
Created attachment 1210227 [details]
io during failing live merge

Comment 6 Markus Stockhausen 2016-10-13 19:41:25 UTC
VDSM log of second error attached

Comment 7 Markus Stockhausen 2016-10-13 19:43:03 UTC
Created attachment 1210247 [details]
vdsm log 2nd error

Comment 8 Jaroslav Spanko 2016-12-21 09:07:37 UTC
I count with customer the same

  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 372, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 4924, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 4893, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 733, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
ckJobAbort
libvirtError: block copy still active: disk 'vda' not ready for pivot yet

VM with high disk I/O (hosting Mysql,Nagios) is not able to live migrate disks, job fails.

Comment 10 Ala Hino 2017-01-22 11:03:33 UTC
*** Bug 1406851 has been marked as a duplicate of this bug. ***

Comment 11 Sandro Bonazzola 2017-01-25 07:56:33 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 12 Tal Nisan 2017-02-07 09:40:21 UTC
*** Bug 1419767 has been marked as a duplicate of this bug. ***

Comment 18 Allon Mureinik 2017-03-15 15:53:52 UTC
Ala, the attached patch is merged, but not backported to any stable branch. Is this intentional? What is the next step here?

Comment 19 Ala Hino 2017-03-15 17:35:08 UTC
Actually, this is intentional - this area of the code isn't tested and any change is considered risky. This patch handles the (expected) exception that is thrown by not logging it.

The next step here is to handle libvirt event that is fired when disk is ready for pivot.

Comment 20 Allon Mureinik 2017-03-30 11:33:43 UTC
(In reply to Ala Hino from comment #19)
> Actually, this is intentional - this area of the code isn't tested and any
> change is considered risky. This patch handles the (expected) exception that
> is thrown by not logging it.
> 
> The next step here is to handle libvirt event that is fired when disk is
> ready for pivot.

Please clone this bug so we have a different BZ to track the event handling and rename this one to reflect what was actually done there.

Comment 21 Ala Hino 2017-04-04 14:47:31 UTC
Renamed and cloned to BZ 1438850.

Comment 22 Allon Mureinik 2017-04-06 12:07:57 UTC
Ala, is there anything we need to document here?

Comment 23 Ala Hino 2017-04-06 12:30:17 UTC
Not really, we now just print the traceback of the exception. The behavior and user experience didn't change

Comment 24 rhev-integ 2017-04-26 10:51:26 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[NO RELEVANT PATCHES FOUND]

For more info please contact: infra

Comment 25 Eyal Edri 2017-04-26 13:03:49 UTC
This bug has an open patch attached it (POST), can you provide info on it is pending another patch backported to 4.1 or it can move to ON_QA?
If not, it might be best to move it back to POST status.

Comment 26 Sandro Bonazzola 2017-04-27 11:29:12 UTC
Moving back to POST while waiting for needed info

Comment 27 Red Hat Bugzilla Rules Engine 2017-04-27 11:29:21 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 28 Ala Hino 2017-05-01 13:37:28 UTC
(In reply to Eyal Edri from comment #25)
> This bug has an open patch attached it (POST), can you provide info on it is
> pending another patch backported to 4.1 or it can move to ON_QA?
> If not, it might be best to move it back to POST status.

The attached patch should attached to a different bug. I remove it and changed bug status back to MODIFIED.

Comment 29 Allon Mureinik 2017-05-08 09:14:18 UTC
Nir - you just posted another patch for a bug that's already ON_QA.
Does this mean the ON_QA status was wrong and the bug needs to be moved back to POST?

Comment 30 Kevin Alon Goldblatt 2017-05-08 15:58:30 UTC
Verified with the following code:
------------------------------------
ovirt-engine-4.1.2.1-0.1.el7.noarch
rhevm-4.1.2.1-0.1.el7.noarch
vdsm-4.19.12-1.el7ev.x86_64


Verified with the following scenario:
--------------------------------------
1. Create a VM with 2 disks of 12g and 6g
2. Install OS on 12g disk and copy 1g of data to 2nd disk
3. Create 3 snapshots sna1, sna2, sna3 adding at least 500MB of new data between each snapshot
4. Start copying 2GB of new data to the disk and delete sna2 which performs a live merge operation >>>> The live merge operation is successful and is not aborted.

Tier 1, 2 and 3 of automation passed and no live merge failures reported

Will not move to verify until Nir replies as per Comment 29

Comment 31 Nir Soffer 2017-05-09 10:00:55 UTC
(In reply to Allon Mureinik from comment #29)
> Nir - you just posted another patch for a bug that's already ON_QA.
> Does this mean the ON_QA status was wrong and the bug needs to be moved back
> to POST?

The posted patch fixes the root cause, avoiding calling blockCommit before the 
block job is ready. However this bug is about failing nicely when a blockCommit
is not ready, and we do want to keep the new behavior.

I remove this patch from this bug and added it to bug 1438850.

Kevin, I think we can mark this as verified now.

Comment 32 Kevin Alon Goldblatt 2017-05-10 13:18:42 UTC
Based on comment 31 moving to VERIFIED!