Bug 1442266 - VM paused with Improbable extension request messages during active layer block commit operation.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.0.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Ala Hino
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On: 1443493
Blocks: 1449626 1449634
 
Reported: 2017-04-13 21:06 UTC by Bimal Chollera
Modified: 2020-12-14 09:05 UTC
CC: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1449626 1449634 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:51:25 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1470634 0 urgent CLOSED Wrong allocation value after virDomainBlockCopy() (alloc=capacity) 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2018:1489 0 None None None 2018-05-15 17:53:40 UTC
oVirt gerrit 75954 0 'None' MERGED vm: Detect when a block job is ready 2021-02-18 22:29:53 UTC
oVirt gerrit 76645 0 'None' MERGED vm: Detect when a block job is ready 2021-02-18 22:29:53 UTC

Internal Links: 1470634

Description Bimal Chollera 2017-04-13 21:06:34 UTC
Description of problem:

VM paused with Improbable extension request messages during active layer block commit operation.


The VM has 2 disks, each with 2 images.
During the live merge operation (snapshot removal), the merge and image removal succeed on one disk but fail on the other with the following:

~~~
periodic/6::INFO::2017-04-13 14:40:09,367::vm::4560::virt.vm::(queryBlockJobs) vmId=`52a39fd6-7415-4391-a7bf-1a4fb297813d`::Libvirt job 03de863d-e680-407c-84fe-4b2f0e753635 was terminated

periodic/6::DEBUG::2017-04-13 14:40:09,367::vm::4573::virt.vm::(queryBlockJobs) vmId=`52a39fd6-7415-4391-a7bf-1a4fb297813d`::Still waiting for block job 03de863d-e680-407c-84fe-4b2f0e753635 to be synchronized

periodic/0::INFO::2017-04-13 14:40:09,857::logUtils::49::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID=u'bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb', spUUID=u'72ab8679-5f51-4673-a61d-2f3b768a598a', imgUUID=u'837801dc-773a-40c7-a0d0-2bc4a74d1dbe', volUUID=u'0ebb52d2-abdc-45e0-ac13-605d9f63f9df', options=None)

periodic/7::ERROR::2017-04-13 14:40:11,764::vm::1002::virt.vm::(_shouldExtendVolume) vmId=`52a39fd6-7415-4391-a7bf-1a4fb297813d`::Improbable extension request for volume 0ebb52d2-abdc-45e0-ac13-605d9f63f9df on domain bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb, pausing the VM to avoid corruptions (capacity: 53687091200, allocated: 20354334720, physical: 1073741824, next physical size: 2147483648)
~~~


Per the code, the error is logged when alloc > nextPhysSize:

~~~
    if alloc > nextPhysSize:
        msg = ("Improbable extension request for volume %s on domain "
               "%s, pausing the VM to avoid corruptions (capacity: %s, "
               "allocated: %s, physical: %s, next physical size: %s)" %
               (volumeID, drive.domainID, capacity, alloc, physical,
                nextPhysSize))
~~~

In this case alloc (8G) > nextPhysSize (2G):

~~~
(capacity: 53687091200,          - 50G
 allocated: 9415688192,          - 8G
 physical: 1073741824,           -  1 G
next physical size: 2147483648)  - 2G
~~~

Why is the allocated value showing as 8G? qemu-img check also shows only 0.05% usage for the image.
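
For reference, a minimal sketch of the check that fires above, assuming a
simplified version of vdsm's _shouldExtendVolume logic (the 1G chunk size and
the helper name are illustrative assumptions, not vdsm's exact code):

~~~
GiB = 1024 ** 3

def improbable_extension(capacity, alloc, physical, chunk=1 * GiB):
    """Return True if the reported allocation does not fit even in the next
    extension step - the state vdsm refuses to trust."""
    next_phys_size = min(physical + chunk, capacity)
    return alloc > next_phys_size

# Values from the log above: the reported allocation (~8.8G) is far beyond the
# next physical size (2G), so vdsm pauses the VM instead of extending.
print(improbable_extension(capacity=53687091200,
                           alloc=9415688192,
                           physical=1073741824))   # -> True
~~~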

Version-Release number of selected component (if applicable):

rhevm-4.0.7.4-0.1.el7ev.noarch
vdsm-4.18.24-3.el7ev.x86_64

How reproducible:

Unknown

Steps to Reproduce:

The issue was observed in an end-user environment when deleting, from RHEV-M, a snapshot that contains 2 disks.

Actual results:

VM pauses during live merge operation.

Expected results:

Merge should complete and VM shouldn't be paused.

Additional info:
To recover, the VM had to be shut down and the snapshot removed.

Comment 6 Nir Soffer 2017-04-17 13:34:56 UTC
Bimal, would you provide the output of:
rpm -qa | egrep 'qemu|libvirt'

Comment 7 Nir Soffer 2017-04-17 14:20:15 UTC
The issue is that vdsm is trying to extend the old active image:

periodic/7::ERROR::2017-04-06 15:02:37,039::vm::1002::virt.vm::(_shouldExtendVolume) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Improbable extension request for volume 15f4e150-82ef-4592-8d20-4e122ee64f47 on domain bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb, pausing the VM to avoid corruptions (capacity: 53687091200, allocated: 9415688192, physical: 1073741824, next physical size: 2147483648)

While the allocation size (9415688192) belongs to the base image:

ImageID: 766cdd3b-e950-430a-a3ac-140947c68b74
--------------------------------------
5fc515a1-ce7d-45b9-a347-428c53493a5c  -  base image
15f4e150-82ef-4592-8d20-4e122ee64f47  -  active image

Old active image:

[root@rheva2bla11 ~][130]# qemu-img check /rhev/data-center/72ab8679-5f51-4673-a61d-2f3b768a598a/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47
No errors were found on the image.
372/819200 = 0.05% allocated, 68.01% fragmented, 0.00% compressed clusters
Image end offset: 26083328

Base image:

[root@rheva2bla11 ~][130]# qemu-img check /rhev/data-center/72ab8679-5f51-4673-a61d-2f3b768a598a/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/5fc515a1-ce7d-45b9-a347-428c53493a5c
No errors were found on the image.
143615/819200 = 17.53% allocated, 21.33% fragmented, 0.00% compressed clusters
Image end offset: 9415688192

Following this merge operation:

1. Starting job

jsonrpc.Executor/2::DEBUG::2017-04-06 15:02:21,072::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.merge' in bridge with {u'topVolUUID': u'15f4e150-82ef-4592-8d20-4e122ee64f47', u'vmID': u'94afe3a1-2f67-497d-8b34-3cb8f441107e', u'drive': {u'poolID': u'72ab8679-5f51-4673-a61d-2f3b768a598a', u'volumeID': u'15f4e150-82ef-4592-8d20-4e122ee64f47', u'domainID': u'bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb', u'imageID': u'766cdd3b-e950-430a-a3ac-140947c68b74'}, u'bandwidth': u'0', u'jobUUID': u'62977c02-2622-4197-9105-13549db8d3ad', u'baseVolUUID': u'5fc515a1-ce7d-45b9-a347-428c53493a5c'}

2. Checking the job, it seems that the job is finished:

jsonrpc.Executor/2::DEBUG::2017-04-06 15:02:21,180::vm::4521::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Checking job 62977c02-2622-4197-9105-13549db8d3ad
jsonrpc.Executor/2::DEBUG::2017-04-06 15:02:21,191::vm::4551::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Job 62977c02-2622-4197-9105-13549db8d3ad live info: {'end': 0L, 'bandwidth': 0L, 'type': 4, 'cur': 0L}

3. cur == end means the job is finished, so we start
   the cleanup thread:

jsonrpc.Executor/2::INFO::2017-04-06 15:02:21,192::vm::4568::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Starting cleanup thread for job: 62977c02-2622-4197-9105-13549db8d3ad

4. 200 milliseconds later, we check the job again

jsonrpc.Executor/4::DEBUG::2017-04-06 15:02:21,417::vm::4521::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Checking job 62977c02-2622-4197-9105-13549db8d3ad
jsonrpc.Executor/4::DEBUG::2017-04-06 15:02:21,418::vm::4551::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Job 62977c02-2622-4197-9105-13549db8d3ad live info: {'end': 24379392L, 'bandwidth': 0L, 'type': 4, 'cur': 22544384L}

The job is not finished (24379392 - 22544384 = 1835008)
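
As a side note, a minimal sketch of the readiness check vdsm is effectively
doing here (simplified and with invented names; the real logic lives in
queryBlockJobs and uses the dict returned by libvirt's blockJobInfo()):

~~~
def looks_finished(job_info):
    # job_info is the dict returned by virDomain.blockJobInfo(), e.g.
    # {'type': 4, 'bandwidth': 0, 'cur': 0, 'end': 0}
    return job_info["cur"] == job_info["end"]

# The first check above returns True even though the job has not started yet
# (cur == end == 0); the second one correctly reports the job as still running.
print(looks_finished({'end': 0, 'cur': 0, 'type': 4, 'bandwidth': 0}))                # True
print(looks_finished({'end': 24379392, 'cur': 22544384, 'type': 4, 'bandwidth': 0}))  # False
~~~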



----

This smells like a drive monitoring issue (monitoring the old active volume
after a pivot), a libvirt issue (reporting incorrect allocation for the
old active volume after a pivot), or both.

After a successful pivot, the drive object must be updated to reflect the
new chain before drive monitoring is enabled for this drive again.

Peter, is it possible that libvirt reports the wrong info for a drive
after a pivot?

The flow is this (a sketch of steps 3-4 follows below):

0. Have a chain with volume A (base) and B (top)
1. Start active block commit of volume B into A
2. Finish the block job, pivoting back to volume A
3. Invoke blockInfo() with the path of volume B (the old top)
4. Libvirt returns the allocation of volume A
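
A minimal sketch of steps 3-4, assuming a libvirt-python connection is
available; the function is illustrative, not the vdsm code that produced the
logs above:

~~~
import libvirt

def query_block_info(dom: libvirt.virDomain, path: str):
    # virDomainGetBlockInfo returns (capacity, allocation, physical).
    # If libvirt resolves the stale path to the new active image after the
    # pivot, the allocation returned here belongs to the wrong volume.
    capacity, allocation, physical = dom.blockInfo(path)
    return capacity, allocation, physical
~~~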

Comment 8 Nir Soffer 2017-04-17 14:34:44 UTC
Posted too early by mistake, still investigating the log. Removing the need info
for Peter for now.

Comment 10 Nir Soffer 2017-04-17 16:59:02 UTC
(Continued from comment 7)


5. In the meantime, the cleanup thread (Thread-192) is trying
   to pivot, assuming that the job was completed.

Thread-192::INFO::2017-04-06 15:02:21,481::vm::4970::virt.vm::(tryPivot) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Requesting pivot to complete active layer commit (job 62977c02-2622-4197-9105-13549db8d3ad)

The thread is doing:

    flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)

According to the libvirt docs:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockJobAbort

Adding @flags of VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT causes this
call to fail with VIR_ERR_BLOCK_COPY_ACTIVE if the copy or commit is not
yet ready; otherwise it will swap the disk over to the new active image to end
the mirroring or active commit.

The libvirt call succeeds, although the block job was not completed yet
(see step 4 in comment 7).


6. Vdsm received events about this block job

Event 1:

libvirtEventLoop::WARNING::2017-04-06 15:02:21,483::clientIF::538::virt.vm::(dispatchLibvirtEvents) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::unknown eventid 8 args ('/rhev/data-center/72
ab8679-5f51-4673-a61d-2f3b768a598a/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47', 4, 3, 8)

Event id 8 is VIR_DOMAIN_EVENT_ID_BLOCK_JOB - vdsm does not handle this 
event currently.
The event arguments are:
- /rhev/data-center/.../15f4e150-82ef-4592-8d20-4e122ee64f47: path of the committed image
- 4: VIR_DOMAIN_BLOCK_JOB_TYPE_ACTIVE_COMMIT
- 3: VIR_DOMAIN_BLOCK_JOB_READY
- 8: application data

This event means the job is ready now for pivot, but we already requested
a pivot.

Event 2:

libvirtEventLoop::WARNING::2017-04-06 15:02:21,487::clientIF::538::virt.vm::(dispatchLibvirtEvents) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::unknown eventid 8 args ('/rhev/data-center/72ab8679-5f51-4673-a61d-2f3b768a598a/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47', 4, 0, 8)

Same as event 1, but now the job status is 0 - VIR_DOMAIN_BLOCK_JOB_COMPLETED.

Vdsm does not handle these events.
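
For illustration, a minimal sketch of how these events could be consumed with
libvirt-python (the connection URI and the bare event loop are assumptions for
a self-contained example; this is not vdsm's handler):

~~~
import libvirt

def block_job_cb(conn, dom, disk, job_type, status, opaque):
    # For VIR_DOMAIN_EVENT_ID_BLOCK_JOB, 'disk' is the image path (as in the
    # logs above); VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 reports the target dev name.
    if status == libvirt.VIR_DOMAIN_BLOCK_JOB_READY:
        print("block job on %s is ready for pivot" % disk)
    elif status == libvirt.VIR_DOMAIN_BLOCK_JOB_COMPLETED:
        print("block job on %s completed" % disk)

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB,
                            block_job_cb, None)
while True:
    libvirt.virEventRunDefaultImpl()
~~~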


7. The cleanup thread is assuming that the pivot was completed, and is 
   waiting until the xml is updated:

Thread-192::INFO::2017-04-06 15:02:21,488::vm::5044::virt.vm::(_waitForXMLUpdate) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Waiting for libvirt to update the XML after pivot of drive virtio-disk0 completed

This is the last message we see from this thread. Looking at the code, it is
doing:

        while True:                                                                                                                                                                           
            # This operation should complete in either one or two iterations of
            # this loop.  Until libvirt updates the XML there is nothing to do
            # but wait.  While we wait we continue to tell engine that the job
            # is ongoing.  If we are still in this loop when the VM is powered
            # off, the merge will be resolved manually by engine using the
            # reconcileVolumeChain verb.
            chains = self.vm._driveGetActualVolumeChain([self.drive])
            if alias not in chains.keys():
                raise RuntimeError("Failed to retrieve volume chain for "
                                   "drive %s.  Pivot failed.", alias)
            curVols = sorted([entry.uuid for entry in chains[alias]])

            if curVols == origVols:
                time.sleep(1)
            elif curVols == expectedVols:
                self.vm.log.info("The XML update has been completed")
                break
            else:
                self.vm.log.error("Bad volume chain found for drive %s. "
                                  "Previous chain: %s, Expected chain: %s, "
                                  "Actual chain: %s", alias, origVols,
                                  expectedVols, curVols)
                raise RuntimeError("Bad volume chain found")

We don't log anything while polling the xml, so seeing no further message from
this thread means it is busy polling the xml and never finds the
expected updated xml.


8. Meanwhile, the job monitor checks the job status and finds that the
   cleanup thread is still busy:

periodic/6::DEBUG::2017-04-06 15:02:36,380::vm::4521::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Checking job 62977c02-2622-4197-9105-13549db8d3ad
periodic/6::INFO::2017-04-06 15:02:36,381::vm::4560::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Libvirt job 62977c02-2622-4197-9105-13549db8d3ad was terminated
periodic/6::DEBUG::2017-04-06 15:02:36,382::vm::4573::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Still waiting for block job 62977c02-2622-4197-9105-13549db8d3ad to be synchronized

This log repeats every 15 seconds for 41 minutes:

periodic/0::DEBUG::2017-04-06 15:43:21,505::vm::4521::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Checking job 62977c02-2622-4197-9105-13549db8d3ad
periodic/0::DEBUG::2017-04-06 15:43:21,505::vm::4573::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Still waiting for block job 62977c02-2622-4197-9105-13549db8d3ad to be synchronized


9. Shortly after the pivot attempt, we see this log with the vm information:

jsonrpc.Executor/3::DEBUG::2017-04-06 15:02:27,452::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.getVMFullList' in bridge with ...

Formatting the log shows that vdsm thinks the volume chain still contains
the old active volume:

            {
                "poolID": "72ab8679-5f51-4673-a61d-2f3b768a598a", 
                "reqsize": "0", 
                "index": 0, 
                "iface": "virtio", 
                "apparentsize": "15032385536", 
                "specParams": {}, 
                "imageID": "766cdd3b-e950-430a-a3ac-140947c68b74", 
                "readonly": "False", 
                "shared": "false", 
                "truesize": "15032385536", 
                "type": "disk", 
                "domainID": "bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb", 
                "volumeInfo": {
                    "domainID": "bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb", 
                    "volType": "path", 
                    "leaseOffset": 176160768, 
                    "volumeID": "15f4e150-82ef-4592-8d20-4e122ee64f47", 
                    "leasePath": "/dev/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/leases", 
                    "imageID": "766cdd3b-e950-430a-a3ac-140947c68b74", 
                    "path": "/rhev/data-center/mnt/blockSD/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47"
                }, 
                "format": "cow", 
                "deviceId": "766cdd3b-e950-430a-a3ac-140947c68b74", 
                "address": {
                    "slot": "0x05", 
                    "bus": "0x00", 
                    "domain": "0x0000", 
                    "type": "pci", 
                    "function": "0x0"
                }, 
                "device": "disk", 
                "path": "/rhev/data-center/72ab8679-5f51-4673-a61d-2f3b768a598a/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47", 
                "propagateErrors": "off", 
                "optional": "false", 
                "name": "vda", 
                "bootOrder": "1", 
                "volumeID": "15f4e150-82ef-4592-8d20-4e122ee64f47", 
                "alias": "virtio-disk0", 
                "volumeChain": [
                    {
                        "domainID": "bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb", 
                        "volType": "path", 
                        "leaseOffset": 126877696, 
                        "volumeID": "5fc515a1-ce7d-45b9-a347-428c53493a5c", 
                        "leasePath": "/dev/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/leases", 
                        "imageID": "766cdd3b-e950-430a-a3ac-140947c68b74", 
                        "path": "/rhev/data-center/mnt/blockSD/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/5fc515a1-ce7d-45b9-a347-428c53493a5c"
                    }, 
                    {
                        "domainID": "bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb", 
                        "volType": "path", 
                        "leaseOffset": 176160768, 
                        "volumeID": "15f4e150-82ef-4592-8d20-4e122ee64f47", 
                        "leasePath": "/dev/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/leases", 
                        "imageID": "766cdd3b-e950-430a-a3ac-140947c68b74", 
                        "path": "/rhev/data-center/mnt/blockSD/bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb/images/766cdd3b-e950-430a-a3ac-140947c68b74/15f4e150-82ef-4592-8d20-4e122ee64f47"
                    }
                ]
            }, 

This may not reflect the actual libvirt chain; it may be cached state from
the vdsm device object.


10. Finally the VM is paused due to impossible allocation info on the active layer:

periodic/7::ERROR::2017-04-06 15:43:21,705::vm::1002::virt.vm::(_shouldExtendVolume) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Improbable extension request for volume 15f4e150-82ef-4592-8d20-4e122ee64f47 on domain bf46dbaa-4ce1-4db3-8b94-2facfb63dcbb, pausing the VM to avoid corruptions (capacity: 53687091200, allocated: 9415688192, physical: 1073741824, next physical size: 2147483648)


So we have multiple issues:

1. Vdsm uses block job "cur" and "end" to determine that a block job is ready
   instead of the block job ready event, so it tries to pivot a block job
   before it is ready.
   (See step 1 in comment 7 and step 6 in this comment.)

2. Vdsm does not handle block job completed events, so it does not know
   the status of the block job.
   (See step 6 in this comment.)

3. blockJobAbort succeeded, but the libvirt xml is never updated.
   (See step 5 in this comment.)

4. The cleanup thread is waiting forever for the xml update, so the system
   cannot recover without stopping vdsm. Not sure if we can improve this; we
   must trust libvirt blockJobAbort.
   (See step 7 in this comment.)

5. We get the allocation of the base volume when we query the top volume
   after a pivot.

6. The volume chain returned from Host.getVMFullList may not reflect the
   actual libvirt chain.


It is not clear if the commit block job:
- succeeded without updating the xml
- succeeded and updated the xml, but vdsm failed to detect the update
- was canceled

Peter, can you tell by the info in this comment and comment 7 if the block job
was canceled or completed successfully?

Comment 11 Adam Litke 2017-04-17 20:05:55 UTC
I wonder if under certain conditions libvirt/qemu will report cur==end==0 if the block job hasn't fully initialized yet for some reason.  Maybe we just need to ignore this special case since we only work with drives that have a minimum size of 1G.

Comment 12 Nir Soffer 2017-04-18 17:56:21 UTC
(In reply to Adam Litke from comment #11)

cur and end report the allocation, so I don't think we can ignore them; zero
allocation is possible if nothing has been written to the disk yet.

See here for example:

jsonrpc.Executor/4::DEBUG::2017-04-06 15:02:21,417::vm::4521::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Checking job 62977c02-2622-4197-9105-13549db8d3ad
jsonrpc.Executor/4::DEBUG::2017-04-06 15:02:21,418::vm::4551::virt.vm::(queryBlockJobs) vmId=`94afe3a1-2f67-497d-8b34-3cb8f441107e`::Job 62977c02-2622-4197-9105-13549db8d3ad live info: {'end': 24379392L, 'bandwidth': 0L, 'type': 4, 'cur': 22544384L}

Comment 13 Peter Krempa 2017-04-18 18:59:14 UTC
Yes, that was the issue some time ago. With recent qemus, very early on it would report cur == end == 0. This means that the job has not started yet, though.

Libvirt already fixed that issue by reporting "false" data until the job starts. See https://bugzilla.redhat.com/show_bug.cgi?id=1372613 .

There are a few issues with the approach oVirt is using. Since you don't listen for events there is no way you can deterministically say when the block commit or block pivot operation finishes. Even if qemu reports cur == end == actual disk size the block job still may not be in the synchronized phase. There isn't much we can do about that. Even with the pivot job it may take some time until the overlay image is fully merged (and the block job fully vanishes).

Libvirt updates the XML only when we receive events. Please use events. The job size markers are unreliable.

Comment 14 Peter Krempa 2017-04-18 19:02:08 UTC
(In reply to Nir Soffer from comment #12)
> (In reply to Adam Litke from comment #11)
> 
> cur and end report the allocation, so I don't think we can ignore them, zero
> allocation is possible if nothing was written yet to the disk.

If the image does not have zero size but end == 0, then the copy did not start yet in qemu. But the bugzilla I mentioned in comment 13 changed this, so that libvirt returns cur == 0 and end == 1 if the job did not start yet (previously cur == end == 0).
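
A small sketch of how the (cur, end) pair could be classified once the fix from
bug 1372613 is available; this is illustrative only, not vdsm code:

~~~
def job_state(cur, end):
    if cur == 0 and end == 1:
        return "not started"    # sentinel reported by fixed libvirt
    if cur == end == 0:
        return "unknown"        # older libvirt: the job may not have started yet
    if cur == end:
        return "progress 100%"  # still no guarantee the job is pivot-ready
    return "copying"

print(job_state(0, 1))                 # not started
print(job_state(22544384, 24379392))   # copying
~~~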

Comment 15 Adam Litke 2017-04-18 19:06:01 UTC
I agree with Peter that we should consider using events at some point.  We originally chose not to use them because we need to handle missed events anyway (vdsm restarting at the time a merge completes).  In the short term, it seems like we should handle cur==end==0 as a special case and assume the job is just starting.  We are already retrying pivots that fail due to the job not ready errors.

Nir, I don't understand what you are trying to demonstrate in your example.  Your cur and end values are typical.

Comment 16 Allon Mureinik 2017-04-19 05:56:06 UTC
Tentatively targeting to 4.2 for the events solution.
If there's anything simpler/safer we can provide in 4.1.z, we should.

Comment 17 Nir Soffer 2017-04-19 06:04:19 UTC
(In reply to Adam Litke from comment #15)
> I agree with Peter that we should consider using events at some point.

Some point looks like now to me.

> We originally chose not to use them because we need to handle missed events
> anyway (vdsm restarting at the time a merge completes).

Handling of missed events should be a fallback when we don't have a better way.

> In the short term,
> it seems like we should handle cur==end==0 as a special case and assume the
> job is just starting.

Is it possible that a real job will report end=0? For example, merging a
secondary disk that has no data yet. In that case we will never end the job
if we treat end=0 as a special case.

> Nir, I don't understand what you are trying to demonstrate in your example. 
> Your cur and end values are typical.

end (24379392L) is less than 1G, so we cannot assume that a small value has
special meaning.

Comment 18 Nir Soffer 2017-04-19 06:24:22 UTC
Peter, I agree that oVirt should use events, but there is a major issue here:
pivot succeeded (no exception was raised), but the xml did not update.

This is the flow:

2017-04-06 15:02:21,481 calling blockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT

2017-04-06 15:02:21,483 getting VIR_DOMAIN_BLOCK_JOB_READY event

2017-04-06 15:02:21,487 getting VIR_DOMAIN_BLOCK_JOB_COMPLETED event

2017-04-06 15:02:21,488 starting the loop polling changes in the xml

The polling loop never ends; maybe the xml was never updated, or maybe there is
a bug in this code, but we believe this polling loop works.

It looks like calling blockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT *before*
the job is ready is not reliable, but we cannot call blockJobAbort only after 
receiving the event since the event may have been missed (e.g. vdsm was restarted
in the middle of a block job).

We must have a reliable way to detect if a block job is ready, without depending
on events. Events are only an optimization so we can end the block job quicker.

Comment 19 Peter Krempa 2017-04-19 08:59:40 UTC
(In reply to Nir Soffer from comment #17)
> (In reply to Adam Litke from comment #15)

[...]

> Is it possible that real job will report end=0? for example, merging of a
> secondary
> disk that had no data yet. In this case we will never end the job if we treat
> end=0 as special case.

I think I said that in comment 14. If the image size is 0 then end will be 0. If you have images with 0 size, then you have a problem and have to wait for the event.

> > Nir, I don't understand what you are trying to demonstrate in your example. 
> > Your cur and end values are typical.
> 
> end (24379392L) is less then 1g, so we cannot assume that small value have
> special
> meaning.

It certainly does not have any special meaning in libvirt.

The only special meaning we have is cur == 0, end == 1 if the job did not start yet (after the fix mentioned above). Be careful that if you use that data to report the progress to the user then you get instant 50% completion which will turn back to 0% once the job actually starts. Any job reporting code will need a detection mechanism for this.
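
A tiny illustrative guard for progress reporting under the above sentinel
(assumed helper, not libvirt or vdsm code):

~~~
def progress_percent(cur, end):
    # Treat the "not started yet" sentinel (cur == 0, end == 1) and end == 0
    # as zero progress instead of reporting a misleading percentage.
    if end == 0 or (cur == 0 and end == 1):
        return 0.0
    return 100.0 * cur / end

print(progress_percent(0, 1))                # 0.0
print(progress_percent(22544384, 24379392))  # ~92.5
~~~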


(In reply to Nir Soffer from comment #18)
> Peter, I agree that ovirt should use events, but there is major issue here, 
> pivot
> succeeded (no exception was raised), but the xml did not update.
> 
> This is the flow:
> 
> 2017-04-06 15:02:21,481 calling blockJobAbort with
> VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT

This should have failed ...

> 2017-04-06 15:02:21,483 getting VIR_DOMAIN_BLOCK_JOB_READY event

... if you called it before qemu was ready. But apparently it was serialized to happen in time, since ...

> 
> 2017-04-06 15:02:21,487 getting VIR_DOMAIN_BLOCK_JOB_COMPLETED event

... this event denotes that the PIVOT was finished.
 
> 2017-04-06 15:02:21,488 starting the loop polling changes in the xml
> 
> The polling loop never ends, maybe the xml was never updated, or maybe there
> is 
> a bug in this code, but we believe this polling loop works.

How do you detect that something has changed? Do you store the XML and then poll and compare it?

From the log entries above it looks like you managed to hit a very precise moment: your pivot call for the next disk finished just before you started your "xml polling loop". If you then watch for changes, they won't happen.

This is yet another reason to use events. You don't need to watch for changes.

If you think that something is wrong with libvirt I'll need a reliable reproducer or debug logs from libvirtd.

> It looks like calling blockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
> *before*
> the job is ready is not reliable, but we cannot call blockJobAbort only

Well, to a degree. Libvirt should reject it before qemu tells libvirt that the job is READY to be pivoted. In your case it looks like it worked as it should have, due to lucky timing. According to the second event received, the job was successfully pivoted.

> after 
> receiving the event since the event may have been missed (e.g. vdsm was
> restarted
> in the middle of a block job).
> We must have a reliable way to detect if a block job is ready, without
> depending on

For doing it once (e.g. after a restart) it is possible to get the completeness state from the XML, which is reliable and based on the event from qemu. You should not poll the XML repeatedly just to get this since you have the event.

> events. Events are only optimization so we can end the block job quicker.

Comment 21 Nir Soffer 2017-04-19 10:53:03 UTC
The root cause here is calling blockJobAbort before the job started, when
"cur" and "end" are both zero. This issue was fixed upstream (bug 1372613) and
we hope to get a backport in 7.3. Once we depend on that version we can
eliminate the root cause.

Other than that, we have several other issues to consider (see comment 10).

Returning to 4.1.3 for now.

Comment 22 Nir Soffer 2017-04-19 11:00:26 UTC
Comment 19 has interesting info on how we can improve the polling mechanism using
the vm xml instead of blockJobInfo().

Comment 25 Adam Litke 2017-04-20 19:23:25 UTC
In the interim (before the libvirt fix is available) I think we can update vdsm to ignore (and log) the case where cur==end==0.

Comment 26 Peter Krempa 2017-04-21 06:57:07 UTC
In that case you don't really need anything to be changed in libvirt, since all libvirt does is change cur==end==0 to cur == 0, end == 1.

Comment 27 Nir Soffer 2017-04-24 13:24:18 UTC
(In reply to Bimal Chollera from comment #23)

Looking in vdsm.log.18 from attachment 1272830 [details]:

1. We start 2 concurrent merge operations

jsonrpc.Executor/5::DEBUG::2017-04-17 06:13:11,886::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.merge' in bridge with {u'topVolUUID': u'aa6cee7f-d7a0-4996-92be-0bb6eeae5ef2', u'vmID': u'961ea670-0724-45ee-9266-a1a29f84d1bc', u'drive': {u'poolID': u'02692f68-7638-4c1b-a590-1e31130035c1', u'volumeID': u'aa6cee7f-d7a0-4996-92be-0bb6eeae5ef2', u'domainID': u'8b0983b1-b844-42f5-bdf4-5e6c9da2b34a', u'imageID': u'c39f3c4c-a01e-42cb-957c-db9671d5543e'}, u'bandwidth': u'0', u'jobUUID': u'69c7b807-686b-4b4a-b27d-bdad6ec06fd8', u'baseVolUUID': u'e7bb3242-4eeb-4aba-ab8f-2c5239fcb996'}

  job: 69c7b807-686b-4b4a-b27d-bdad6ec06fd8
  top: aa6cee7f-d7a0-4996-92be-0bb6eeae5ef2
  base: e7bb3242-4eeb-4aba-ab8f-2c5239fcb996

jsonrpc.Executor/4::DEBUG::2017-04-17 06:13:12,900::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.merge' in bridge with {u'topVolUUID': u'5916d7ee-5739-42ed-9815-2218d8a5c2c2', u'vmID': u'961ea670-0724-45ee-9266-a1a29f84d1bc', u'drive': {u'poolID': u'02692f68-7638-4c1b-a590-1e31130035c1', u'volumeID': u'5916d7ee-5739-42ed-9815-2218d8a5c2c2', u'domainID': u'8b0983b1-b844-42f5-bdf4-5e6c9da2b34a', u'imageID': u'd5c76066-2a17-437e-af44-bba764a8a84c'}, u'bandwidth': u'0', u'jobUUID': u'04b511ab-6721-41a7-84ea-39bd41ab70a6', u'baseVolUUID': u'0e33b1bc-5135-4f4f-9f29-6da12a853103'}

  job: 04b511ab-6721-41a7-84ea-39bd41ab70a6
  top: 5916d7ee-5739-42ed-9815-2218d8a5c2c2
  base: 0e33b1bc-5135-4f4f-9f29-6da12a853103


2. Job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8 finishes and starts its cleanup thread

periodic/11::DEBUG::2017-04-17 06:13:12,924::vm::4551::virt.vm::(queryBlockJobs) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8 live info: {'end': 52625408L, 'bandwidth': 0L, 'type': 4, 'cur': 52625408L}

periodic/11::INFO::2017-04-17 06:13:12,924::vm::4568::virt.vm::(queryBlockJobs) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Starting cleanup thread for job: 69c7b807-686b-4b4a-b27d-bdad6ec06fd8


3. Job 04b511ab-6721-41a7-84ea-39bd41ab70a6 starts its cleanup thread too early
   when finding cur=0 end=0

jsonrpc.Executor/4::DEBUG::2017-04-17 06:13:13,015::vm::4551::virt.vm::(queryBlockJobs) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Job 04b511ab-6721-41a7-84ea-39bd41ab70a6 live info: {'end': 0L, 'bandwidth': 0L, 'type': 4, 'cur': 0L}

jsonrpc.Executor/4::INFO::2017-04-17 06:13:13,015::vm::4568::virt.vm::(queryBlockJobs) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Starting cleanup thread for job: 04b511ab-6721-41a7-84ea-39bd41ab70a6


4. Job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8 cleanup thread requests pivot

Thread-28646::INFO::2017-04-17 06:13:13,261::vm::4970::virt.vm::(tryPivot) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Requesting pivot to complete active layer commit (job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8)


5. Job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8 completed pivot

Thread-28646::INFO::2017-04-17 06:13:13,321::vm::4983::virt.vm::(tryPivot) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Pivot completed (job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8)


6. Job 04b511ab-6721-41a7-84ea-39bd41ab70a6 cleanup thread requests pivot

Thread-28647::INFO::2017-04-17 06:13:13,499::vm::4970::virt.vm::(tryPivot) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Requesting pivot to complete active layer commit (job 04b511ab-6721-41a7-84ea-39bd41ab70a6)


7. Getting ready and completed events for job 04b511ab-6721-41a7-84ea-39bd41ab70a6

libvirtEventLoop::WARNING::2017-04-17 06:13:13,505::clientIF::538::virt.vm::(dispatchLibvirtEvents) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::unknown eventid 8 args ('/rhev/data-center/02692f68-7638-4c1b-a590-1e31130035c1/8b0983b1-b844-42f5-bdf4-5e6c9da2b34a/images/d5c76066-2a17-437e-af44-bba764a8a84c/5916d7ee-5739-42ed-9815-2218d8a5c2c2', 4, 3, 8)

libvirtEventLoop::WARNING::2017-04-17 06:13:13,506::clientIF::538::virt.vm::(dispatchLibvirtEvents) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::unknown eventid 8 args ('/rhev/data-center/02692f68-7638-4c1b-a590-1e31130035c1/8b0983b1-b844-42f5-bdf4-5e6c9da2b34a/images/d5c76066-2a17-437e-af44-bba764a8a84c/5916d7ee-5739-42ed-9815-2218d8a5c2c2', 4, 0, 8)


8. Job 04b511ab-6721-41a7-84ea-39bd41ab70a6 cleanup thread waits for an xml
   update which never happens.

Thread-28647::INFO::2017-04-17 06:13:13,506::vm::5044::virt.vm::(_waitForXMLUpdate) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Waiting for libvirt to update the XML after pivot of drive virtio-disk0 completed

Up to this point, this is the same as the previous case.


9. Job 69c7b807-686b-4b4a-b27d-bdad6ec06fd8 finished synchronizing the drive
   volume chain

Thread-28646::INFO::2017-04-17 06:13:13,558::logUtils::52::dispatcher::(wrapper) Run and protect: imageSyncVolumeChain, Return response: None

At this point the thread is enabling drive monitoring on the vm:

        self.vm._syncVolumeChain(self.drive)
        if self.doPivot:
            self.vm.enableDriveMonitor()


10. Getting an improbable extension request for the top volume with the
    allocation of the base volume:

periodic/18::ERROR::2017-04-17 06:13:28,083::vm::1002::virt.vm::(_shouldExtendVolume) vmId=`961ea670-0724-45ee-9266-a1a29f84d1bc`::Improbable extension request for volume 5916d7ee-5739-42ed-9815-2218d8a5c2c2 on domain 8b0983b1-b844-42f5-bdf4-5e6c9da2b34a, pausing the VM to avoid corruptions (capacity: 214748364800, allocated: 20336738304, physical: 1073741824, next physical size: 2147483648)

allocated: 20336738304 (18.9g)
physical: 1073741824 (1.0g)
next physical size: 2147483648 (2.0g)


Summary: the root cause is again starting the cleanup thread too early, and no
update of the vm xml after a successful pivot.

This case reveals why we get an improbable extension request for a drive which
is in the middle of a stuck pivot operation.

The code is doing:

1. disable drive monitoring
2. try to pivot
3. enable drive monitoring

But this works only for a single block job per vm. When running multiple block
jobs per vm, each job enables and disables monitoring regardless of the other
block jobs. This mechanism cannot work for concurrent block jobs, and should be
replaced with a finer-grained mechanism that disables monitoring for a specific
drive (see the sketch below).
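
A sketch of the finer-grained approach suggested above, assuming a per-drive
flag instead of a VM-wide toggle; the class and method names are invented for
illustration and do not match vdsm's code:

~~~
import threading

class Drive:
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.monitoring_enabled = True

    def disable_monitoring(self):
        with self._lock:
            self.monitoring_enabled = False

    def enable_monitoring(self):
        with self._lock:
            self.monitoring_enabled = True

def drives_to_monitor(drives):
    # The periodic extension check would only consider drives whose monitoring
    # was not disabled by an in-flight block job on that specific drive.
    return [d for d in drives if d.monitoring_enabled]
~~~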

We need to reproduce this in the lab with libvirt debug logs, to understand:
- why pivot succeeds but vm xml does not update
- why we get the allocation of the base volume when calling blockInfo on the
  top volume after a successful pivot (that get stuck).

Comment 28 Nir Soffer 2017-04-24 14:49:29 UTC
I think the correct way to solve this issue is to check the drive xml when we
think that the block job is ready:

    <mirror type='file' job='active-commit' ready='yes'>
      <format type='raw'/>
      <source file='/rhev/data-center/.../168134ea-a173-448c-b1b9-087abac87ed8'/>
    </mirror>

If we don't get ready='yes', we should wait for the next check interval
before trying to pivot.

Even when we support events, we can use this check after vdsm is restarted or
crashes in the middle of a block job.
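
A minimal sketch of the proposed XML check, based on the <mirror> element shown
above; this is illustrative and not the actual vdsm patch:

~~~
import xml.etree.ElementTree as ET

def drive_ready_for_pivot(domain_xml, source_file):
    """Return True if the disk whose <source file=...> matches source_file
    carries an active-commit <mirror> marked ready='yes'."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        if source is None or source.get("file") != source_file:
            continue
        mirror = disk.find("mirror")
        return (mirror is not None and
                mirror.get("job") == "active-commit" and
                mirror.get("ready") == "yes")
    return False

# Usage sketch: xml_desc = dom.XMLDesc(0)
# if drive_ready_for_pivot(xml_desc, top_volume_path):
#     dom.blockJobAbort(drive_name, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
~~~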

Peter, will this work with current libvirt and qemu in rhel/centos 7.3?

Comment 29 Peter Krempa 2017-04-25 11:02:01 UTC
(In reply to Nir Soffer from comment #28)
> I think correct way to solve this issue is to check the drive xml when we
> think
> that the block job is ready:
> 
>     <mirror type='file' job='active-commit' ready='yes'>
>       <format type='raw'/>
>       <source
> file='/rhev/data-center/.../168134ea-a173-448c-b1b9-087abac87ed8'/>
>     </mirror>
> 
> If we we don't get ready='yes', we should wait for the next check interval
> before trying to pivot.

Yes. The ready='yes' attribute is in the XML only if qemu is ready to pivot.

> Even when we support events we can use this after vdsm is restarted/crash in 
> the middle of a block job.
> 
> Peter, will this work with current libvirt and qemu in rhel/centos 7.3?

The ready marker was fixed by commit:

commit eae59247c59aa02147b2b4a50177e8e877fdb218
Author: Peter Krempa <pkrempa>
Date:   Wed Jul 15 15:11:02 2015 +0200

    qemu: Update state of block job to READY only if it actually is ready
    
    Few parts of the code looked at the current progress of and assumed that
    a two phase blockjob is in the _READY state as soon as the progress
    reached 100% (info.cur == info.end). In current versions of qemu this
    assumption is invalid and qemu exposes a new flag 'ready' in the
    query-block-jobs output that is set to true if the job is actually
    finished.
[...]
http://libvirt.org/git/?p=libvirt.git;a=commit;h=eae59247c59aa02147b2b4a50177e8e877fdb218

This commit is contained in libvirt-1.2.18, so centos 7.3 is certainly okay.

Comment 30 Nir Soffer 2017-04-26 06:01:02 UTC
(In reply to Peter Krempa from comment #29)
Thanks Peter!

Comment 31 Nir Soffer 2017-05-07 15:09:02 UTC
I could reproduce this while verifying the patch:


Using iSCSI storage:

1. Starting merge with snapshot with 2 disks, this is the second disk

2017-05-07 17:46:40,756+0300 INFO  (jsonrpc/6) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Starting merge with jobUUID='dfb39159-21af-47c2-9a5b-48e9278b0e85' original chain=5f435bad-cd3a-4203-99aa-3ee1b90eec9d < 505e00da-e2a8-464b-be8f-97dabadb0086 (top) (vm:4956)

2. First check detects end=0, cur=0

2017-05-07 17:46:40,790+0300 DEBUG (jsonrpc/6) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Job dfb39159-21af-47c2-9a5b-48e9278b0e85 live info: {'end': 0L, 'bandwidth': 0L, 'type': 4, 'cur': 0L} (vm:4831)

3. Checking xml for ready=yes

2017-05-07 17:46:40,790+0300 DEBUG (jsonrpc/6) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Checking xml for drive 'sdb' (vm:4764)

4. Checking job again after 12 seconds, looks ready...

2017-05-07 17:46:52,044+0300 DEBUG (periodic/1) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Job dfb39159-21af-47c2-9a5b-48e9278b0e85 live info: {'end': 0L, 'bandwidth': 0L, 'type': 4, 'cur': 0L} (vm:4831)

5. Checking xml for ready=yes

2017-05-07 17:46:52,044+0300 DEBUG (periodic/1) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Checking xml for drive 'sdb' (vm:4764)

6. Since the job is actually ready, starting cleanup thread

2017-05-07 17:46:52,050+0300 INFO  (periodic/1) [virt.vm] (vmId='c961c7fb-3f18-4e7a-b441-56b2fc6973e0') Starting cleanup thread for job: dfb39159-21af-47c2-9a5b-48e9278b0e85 (vm:4848)


Using NFS:

1. Starting the second disk merge

2017-05-07 18:03:58,179+0300 INFO  (jsonrpc/1) [virt.vm] (vmId='e0a1bc82-d685-4538-8372-f1c6e6022bd9') Starting merge with jobUUID='2a611b46-f33b-4c16-87ed-481a4ad0377d' original chain=f0da1b05-3c91-4b79-b15c-938852482ebe < 2afc4144-9362-4046-a83e-855c8af5cb68 (top) (vm:4956)

2. Checking job...

2017-05-07 18:03:58,305+0300 DEBUG (jsonrpc/1) [virt.vm] (vmId='e0a1bc82-d685-4538-8372-f1c6e6022bd9') Job 2a611b46-f33b-4c16-87ed-481a4ad0377d live info: {'end': 0L, 'bandwidth': 0L, 'type': 4, 'cur': 0L} (vm:4831)

3. Job looks ready, so checking xml...

2017-05-07 18:03:58,305+0300 DEBUG (jsonrpc/1) [virt.vm] (vmId='e0a1bc82-d685-4538-8372-f1c6e6022bd9') Checking xml for drive 'vda' (vm:4764)

4. Job is actually ready, starting cleanup thread

2017-05-07 18:03:58,310+0300 INFO  (jsonrpc/1) [virt.vm] (vmId='e0a1bc82-d685-4538-8372-f1c6e6022bd9') Starting cleanup thread for job: 2a611b46-f33b-4c16-87ed-481a4ad0377d (vm:4848)


This also shows why we cannot treat end=0, cur=0 as not ready - these merges
would never complete if we did.

Comment 35 Kevin Alon Goldblatt 2017-06-06 15:22:41 UTC
Moving to VERIFIED based on comment 34.

Comment 36 Nir Soffer 2017-07-15 15:19:48 UTC
We fixed the root cause in vdsm, but this bug revealed an issue in libvirt
returning a wrong allocation value in domblockinfo; see details in bug 1470634.

Comment 45 errata-xmlrpc 2018-05-15 17:51:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 46 Franta Kust 2019-05-16 13:06:07 UTC
BZ<2>Jira Resync

