Bug 1038069

Summary: Repeated migration of a VM causes its disk to grow indefinitely
Product: Red Hat Enterprise Virtualization Manager
Reporter: David Jaša <djasa>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Raz Tamir <ratamir>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: acanan, amureini, bazulay, djasa, fsimonce, iheim, lpeer, michal.skrivanek, scohen, sgotliv, yeylon, zdover
Target Milestone: ---
Flags: scohen: Triaged+
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: vdsm-4.14.7-0.1.beta3.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, repeatedly migrating virtual machines caused the disks associated with the virtual machines to grow indefinitely. This was due to a race between live migration and highWrite monitoring. libvirt-0.10.2-29.el6_5.4 repaired this issue, and is now included in Red Hat Enterprise Virtualization. Repeated migration of virtual machines no longer causes their associated disks to grow indefinitely.
Story Points: ---
Clone Of:
Cloned To: 1040507
Environment:
Last Closed: 2014-06-09 13:26:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1040507, 1059904, 1065531    
Bug Blocks:    
Attachments: Storage/Virtual Machines subtab

Description David Jaša 2013-12-04 11:06:21 UTC
Created attachment 832576 [details]
Storage/Virtual Machines subtab

Description of problem:
Repeated migration of a VM causes its disk to grow indefinitely. The setup uses a thin-provisioned VM created from a template, with a QCOW2 disk on an iSCSI data domain.

Version-Release number of selected component (if applicable):
engine (is25):
rhevm-backend-3.3.0-0.37.beta1.el6ev.noarch
hosts (is23):
vdsm-4.13.0-0.8.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-img-rhev-0.12.1.2-2.414.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Note the Actual Size of a VM disk (its "Active VM" snapshot).
2. Create and periodically refresh an authentication cookie:
while true ; do curl -H "prefer: persistent-auth" -c COOKIE_FILE --cacert CA_FILE.pem https://rhevm33.example.com/api -u admin@internal:PASSWORD ; sleep 1200 ; done
3. Migrate the VM repeatedly:
while true ; do curl -H "prefer: persistent-auth" -H "content-type: application/xml" -b COOKIE_FILE --cacert CA_FILE.pem https://rhevm33.example.com/api/VM_UUID/migrate -X POST -d '<action/>' ; sleep 20 ; done
4. Watch the Actual Size of the "Active VM" snapshot in the VM's Disks subtab or in the Storage > Virtual Machines subtab. (A combined script for steps 2-3 is sketched below.)
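
For reference, steps 2-3 can be combined into a single script. A minimal sketch using python-requests, assuming the standard /api/vms/{id}/migrate REST endpoint; the engine URL, credentials and VM id below are placeholders from the steps above:

import time
import requests

API = "https://rhevm33.example.com/api"   # engine from the steps above
AUTH = ("admin@internal", "PASSWORD")     # placeholder credentials
CA_FILE = "CA_FILE.pem"
VM_UUID = "VM_UUID"                       # placeholder VM id

session = requests.Session()              # keeps the persistent-auth cookie alive
session.auth = AUTH
session.verify = CA_FILE
session.headers["Prefer"] = "persistent-auth"

while True:
    # POST an empty <action/> body to trigger a migration, as in step 3
    resp = session.post("%s/vms/%s/migrate" % (API, VM_UUID),
                        data="<action/>",
                        headers={"Content-Type": "application/xml"})
    print(resp.status_code)
    time.sleep(20)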

Actual results:
After several hours, the VM's disk grows far beyond its expected size limit.

Expected results:
  * If there is (almost) no write load, the Actual Size of "Active VM" should remain as it was.
  * If there is a heavy write load, the Actual Size of the VM snapshot must not exceed the Virtual Size plus some space for QCOW2 metadata (a rough estimate of that allowance follows).
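
For the second bullet, a rough order-of-magnitude estimate of the QCOW2 metadata allowance, assuming the default 64 KiB cluster size (illustrative arithmetic, not an exact bound):

# Rough QCOW2 metadata estimate: one 8-byte L2 entry and one 2-byte refcount
# entry per 64 KiB cluster; the L1 table and header are negligible.
CLUSTER = 64 * 1024

def qcow2_metadata_bound(virtual_size):
    clusters = virtual_size // CLUSTER
    return clusters * 8 + clusters * 2

# For the 15 GiB disk in this report (capacity 16106127360 bytes):
print(qcow2_metadata_bound(16106127360) / 1024.0 ** 2)   # ~2.3 MiB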

Additional info:

Comment 6 Ayal Baron 2013-12-04 19:29:35 UTC
Fede, qemu cluster leak?

Comment 7 Federico Simoncelli 2013-12-10 17:20:08 UTC
This looks like a race between live migration and highWrite monitoring.
I'll take one of the iterations as example (even though the pattern repeats for each migration):

Thread-547544::DEBUG::2013-12-04 00:01:13,683::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.34.73.74', 'dst': '10.34.73.80:54321', 'vmId': '05563fbc-6b2c-4f4e-9096-8daeb9754aa2', 'abortOnError': 'true', 'method': 'online'}
Thread-547544::DEBUG::2013-12-04 00:01:13,683::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
...
Thread-547545::DEBUG::2013-12-04 00:01:43,670::vm::743::vm.Vm::(cancel) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::canceling migration downtime thread
Thread-547545::DEBUG::2013-12-04 00:01:43,671::vm::813::vm.Vm::(stop) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::stopping migration monitor thread
Thread-547547::DEBUG::2013-12-04 00:01:43,671::vm::740::vm.Vm::(run) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::migration downtime thread exiting
Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547548::DEBUG::2013-12-04 00:01:43,675::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}
Thread-547545::DEBUG::2013-12-04 00:01:43,684::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547545::DEBUG::2013-12-04 00:01:43,685::vm::2557::vm.Vm::(setDownStatus) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Changed state to Down: Migration succeeded


It seems that for some reason when the migration is finalized, blockInfo (libvirt/qemu) returns alloc == apparent == phys (e.g. in this case 42949672960):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960

triggering an lv extension:

Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}

In fact it seems very unlikely that at the end of each migration we always hit the very end of the lv (without any enospc):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547614::INFO::2013-12-04 00:05:19,606::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 44023414784 capacity: 16106127360, alloc: 44023414784, phys: 44023414784
Thread-547802::INFO::2013-12-04 00:08:52,475::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 45097156608 capacity: 16106127360, alloc: 45097156608, phys: 45097156608
Thread-548505::INFO::2013-12-04 00:27:03,603::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 46170898432 capacity: 16106127360, alloc: 46170898432, phys: 46170898432
Thread-548705::INFO::2013-12-04 00:29:04,005::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 47244640256 capacity: 16106127360, alloc: 47244640256, phys: 47244640256
Thread-549010::INFO::2013-12-04 00:37:16,096::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 48318382080 capacity: 16106127360, alloc: 48318382080, phys: 48318382080
Thread-549184::INFO::2013-12-04 00:41:13,268::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 49392123904 capacity: 16106127360, alloc: 49392123904, phys: 49392123904
Thread-549272::INFO::2013-12-04 00:43:22,801::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 50465865728 capacity: 16106127360, alloc: 50465865728, phys: 50465865728
Thread-549360::INFO::2013-12-04 00:47:26,494::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 51539607552 capacity: 16106127360, alloc: 51539607552, phys: 51539607552

...

I suppose this should be taken care of in libvirt/qemu. Maybe we could protect against it by pausing the statistics collection at the end of the migration. I don't think we should address this in _highWrite with an upper bound.
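
To make the race concrete, here is a simplified model of the watermark check (illustrative only, not the actual _highWrite code; the threshold is made up): any sample where alloc == phys reports zero free space, so an extension request fires at the end of every migration.

# Simplified model of the high-write watermark check (not the real vdsm code).
WATERMARK = 512 * 1024 * 1024   # illustrative free-space threshold, in bytes

def should_extend(alloc, phys):
    # alloc/phys correspond to the blockInfo values in the logs above
    return (phys - alloc) < WATERMARK

# Normal sample: plenty of headroom, no extension requested
print(should_extend(alloc=1 * 1024**3, phys=42949672960))        # False

# Bogus end-of-migration sample: alloc == phys == 42949672960,
# zero free space reported, so an extension is requested every time
print(should_extend(alloc=42949672960, phys=42949672960))        # True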

Comment 8 Ayal Baron 2013-12-11 07:56:04 UTC
(In reply to Federico Simoncelli from comment #7)
> This looks like a race between live migration and highWrite monitoring.
> I'll take one of the iterations as example (even though the pattern repeats
> for each migration):
> 
> Thread-547544::DEBUG::2013-12-04 00:01:13,683::API::469::vds::(migrate)
> {'tunneled': 'false', 'src': '10.34.73.74', 'dst': '10.34.73.80:54321',
> 'vmId': '05563fbc-6b2c-4f4e-9096-8daeb9754aa2', 'abortOnError': 'true',
> 'method': 'online'}
> Thread-547544::DEBUG::2013-12-04
> 00:01:13,683::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with
> {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
> ...
> Thread-547545::DEBUG::2013-12-04 00:01:43,670::vm::743::vm.Vm::(cancel)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::canceling migration downtime
> thread
> Thread-547545::DEBUG::2013-12-04 00:01:43,671::vm::813::vm.Vm::(stop)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::stopping migration monitor
> thread
> Thread-547547::DEBUG::2013-12-04 00:01:43,671::vm::740::vm.Vm::(run)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::migration downtime thread
> exiting
> Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960
> capacity: 16106127360, alloc: 42949672960, phys: 42949672960
> Thread-547548::DEBUG::2013-12-04
> 00:01:43,675::libvirtconnection::108::libvirtconnection::(wrapper) Unknown
> libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no
> domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
> Thread-547529::DEBUG::2013-12-04
> 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the
> volume: {'newSize': 41984, 'domainID':
> 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID':
> '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID':
> '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID':
> '5849b030-626e-47cb-ad90-3ce782d831b3'}
> Thread-547545::DEBUG::2013-12-04
> 00:01:43,684::libvirtconnection::108::libvirtconnection::(wrapper) Unknown
> libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no
> domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
> Thread-547545::DEBUG::2013-12-04
> 00:01:43,685::vm::2557::vm.Vm::(setDownStatus)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Changed state to Down:
> Migration succeeded
> 
> 
> It seems that for some reason when the migration is finalized, blockInfo
> (libvirt/qemu) returns alloc == apparent == phys (e.g. in this case
> 42949672960):
> 
> Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960
> capacity: 16106127360, alloc: 42949672960, phys: 42949672960
> 
> triggering an lv extension:
> 
> Thread-547529::DEBUG::2013-12-04
> 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the
> volume: {'newSize': 41984, 'domainID':
> 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID':
> '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID':
> '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID':
> '5849b030-626e-47cb-ad90-3ce782d831b3'}
> 
> In fact it seems very unlikely that at the end of each migration we always
> hit the very end of the lv (without any enospc):
> 
> Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960
> capacity: 16106127360, alloc: 42949672960, phys: 42949672960
> Thread-547614::INFO::2013-12-04 00:05:19,606::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 44023414784
> capacity: 16106127360, alloc: 44023414784, phys: 44023414784
> Thread-547802::INFO::2013-12-04 00:08:52,475::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 45097156608
> capacity: 16106127360, alloc: 45097156608, phys: 45097156608
> Thread-548505::INFO::2013-12-04 00:27:03,603::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 46170898432
> capacity: 16106127360, alloc: 46170898432, phys: 46170898432
> Thread-548705::INFO::2013-12-04 00:29:04,005::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 47244640256
> capacity: 16106127360, alloc: 47244640256, phys: 47244640256
> Thread-549010::INFO::2013-12-04 00:37:16,096::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 48318382080
> capacity: 16106127360, alloc: 48318382080, phys: 48318382080
> Thread-549184::INFO::2013-12-04 00:41:13,268::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 49392123904
> capacity: 16106127360, alloc: 49392123904, phys: 49392123904
> Thread-549272::INFO::2013-12-04 00:43:22,801::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 50465865728
> capacity: 16106127360, alloc: 50465865728, phys: 50465865728
> Thread-549360::INFO::2013-12-04 00:47:26,494::vm::521::vm.Vm::(_highWrite)
> vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-
> 1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 51539607552
> capacity: 16106127360, alloc: 51539607552, phys: 51539607552
> 
> ...
> 
> I suppose this should be taken care of in libvirt/qemu. Maybe we could

Please file a bug on libvirt then?

> protect against it by pausing the statistics collection at the end of the
> migration. I don't think we should address this in _highWrite with an upper bound.

I think we should have an upper bound in highWrite regardless (i.e. not let the disk grow beyond a certain limit).
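
Something along these lines (a sketch of the suggested guard, not an actual implementation; the metadata allowance is an assumption):

# Illustrative upper-bound guard for extension requests.
METADATA_ALLOWANCE = 512 * 1024 * 1024   # assumed headroom for qcow2 metadata

def clamp_new_size(requested, capacity):
    """Never grow the volume beyond virtual size + metadata allowance."""
    return min(requested, capacity + METADATA_ALLOWANCE)

# With capacity 16106127360 (15 GiB), a runaway 40+ GiB request would be
# clamped to ~15.5 GiB instead of letting the LV grow indefinitely.
print(clamp_new_size(42949672960, 16106127360))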

Comment 10 Federico Simoncelli 2013-12-11 23:44:36 UTC
Based on bug 1040507 comment 2, I have reasons to believe that this is not a regression.

David could you verify if indeed this happens on rhev 3.2 as well? (Same conditions: vm with a cow disk on block domain). Thanks!

Comment 11 David Jaša 2013-12-12 14:06:35 UTC
(In reply to Federico Simoncelli from comment #10)
> Based on bug 1040507 comment 2, I have reasons to believe that this is not a
> regression.
> 
> David could you verify if indeed this happens on rhev 3.2 as well? (Same
> conditions: vm with a cow disk on block domain). Thanks!

I'm fairly sure we hit VMs with insanely big disks in 3.2 and before and the repeated migration could have been the cause.

Comment 12 David Jaša 2013-12-13 12:47:53 UTC
(In reply to David Jaša from comment #11)
> (In reply to Federico Simoncelli from comment #10)
> > Based on bug 1040507 comment 2, I have reasons to believe that this is not a
> > regression.
> > 
> > David could you verify if indeed this happens on rhev 3.2 as well? (Same
> > conditions: vm with a cow disk on block domain). Thanks!
> 
> I'm fairly sure we hit VMs with insanely big disks in 3.2 and before and the
> repeated migration could have been the cause.

Confirmed: on RHEV 3.2 with RHEL 6.4 hosts, a 30 GB disk grew to 267 GB overnight.

Comment 13 Sergey Gotliv 2013-12-14 19:45:15 UTC
Moving to 3.4 because this bug depends on libvirt BZ#1040507.

Comment 16 Allon Mureinik 2014-04-22 12:13:59 UTC
libvirt-0.10.2-29.el6_5.4 contains this fix. Need to backport the patch that requires it.

Comment 17 Raz Tamir 2014-05-01 05:02:14 UTC
Verified on:
3.4.0-0.15.beta3.el6ev
vdsm-4.14.7-0.1.beta3.el6ev.x86_64

Ran VM migrations (one every 60 seconds) for several hours (~11); the disk's actual size did not change.
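
The volume's physical size can also be watched directly on the host during such a run; a sketch (assumes the usual <storage-domain-uuid>/<volume-uuid> LV naming for block domains; the UUIDs below are taken from the logs above and are illustrative):

import subprocess
import time

SD_UUID = "d5a781d6-a319-4d9f-9d1a-1f776f709911"    # storage domain (from the logs)
VOL_UUID = "8cc85f7a-a7df-4566-9817-0f61b227660a"   # volume (from the logs)

while True:
    size = subprocess.check_output(
        ["lvs", "--noheadings", "--units", "b", "-o", "lv_size",
         "%s/%s" % (SD_UUID, VOL_UUID)])
    print(time.strftime("%H:%M:%S"), size.strip().decode())
    time.sleep(300)   # sample every 5 minutes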

Comment 18 errata-xmlrpc 2014-06-09 13:26:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html