Created attachment 832576 [details]
Storage/Virtual Machines subtab

Description of problem:
Repeated migration of a VM causes its disk to grow indefinitely. The setup uses a thin-provisioned VM created from a template, on a qcow disk on an iSCSI data domain.

Version-Release number of selected component (if applicable):
engine: is25: rhevm-backend-3.3.0-0.37.beta1.el6ev.noarch
hosts: is23
vdsm-4.13.0-0.8.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-img-rhev-0.12.1.2-2.414.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. note the Actual Size of some VM disk (its "Active VM" snapshot)
2. create & refresh an authentication cookie:
   while true ; do curl -H "prefer: persistent-auth" -c COOKIE_FILE --cacert CA_FILE.pem https://rhevm33.example.com/api -u admin@internal:PASSWORD ; sleep 1200 ; done
3. migrate the VM repeatedly:
   while true ; do curl -H "prefer: persistent-auth" -H "content-type: application/xml" -b COOKIE_FILE --cacert CA_FILE.pem https://rhevm33.example.com/api/VM_UUID/migrate -X POST -d '<action/>' ; sleep 20 ; done
4. watch the Actual Size of the "Active VM" snapshot in the VM/Disks subtab or in the Storage/Virtual Machines subtab

Actual results:
After several hours, the VM disk grows far beyond the expected size limit.

Expected results:
* if there is (almost) no write load, the Actual Size of "Active VM" should remain as it was
* if there is heavy write load, the Actual Size of the VM snapshot must not exceed the Virtual Size plus some space for qcow metadata

Additional info:
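As a side note on step 4: instead of watching the subtab, the growth can be tracked over the REST API. Below is a minimal polling sketch, assuming the 3.3 API exposes the VM's disks under /api/vms/{id}/disks with an actual_size element; the path, element names and credentials are assumptions/placeholders, not taken from this setup.

#!/usr/bin/env python
# Hedged sketch only: the /vms/{id}/disks path and the <actual_size> element
# are assumptions about the RHEV 3.3 REST API, not verified on this build.
import time
import xml.etree.ElementTree as ET
import requests

API = "https://rhevm33.example.com/api"
VM_UUID = "VM_UUID"                      # replace with the VM's UUID
AUTH = ("admin@internal", "PASSWORD")
CA = "CA_FILE.pem"

while True:
    resp = requests.get("%s/vms/%s/disks" % (API, VM_UUID), auth=AUTH, verify=CA)
    resp.raise_for_status()
    for disk in ET.fromstring(resp.content).findall("disk"):
        # print one line per disk so growth over time is easy to grep
        print("%s %s actual_size=%s" % (time.ctime(),
                                        disk.findtext("name"),
                                        disk.findtext("actual_size")))
    time.sleep(60)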
Fede, qemu cluster leak?
This looks like a race between live migration and highWrite monitoring. I'll take one of the iterations as an example (even though the pattern repeats for each migration):

Thread-547544::DEBUG::2013-12-04 00:01:13,683::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.34.73.74', 'dst': '10.34.73.80:54321', 'vmId': '05563fbc-6b2c-4f4e-9096-8daeb9754aa2', 'abortOnError': 'true', 'method': 'online'}
Thread-547544::DEBUG::2013-12-04 00:01:13,683::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
...
Thread-547545::DEBUG::2013-12-04 00:01:43,670::vm::743::vm.Vm::(cancel) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::canceling migration downtime thread
Thread-547545::DEBUG::2013-12-04 00:01:43,671::vm::813::vm.Vm::(stop) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::stopping migration monitor thread
Thread-547547::DEBUG::2013-12-04 00:01:43,671::vm::740::vm.Vm::(run) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::migration downtime thread exiting
Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547548::DEBUG::2013-12-04 00:01:43,675::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}
Thread-547545::DEBUG::2013-12-04 00:01:43,684::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547545::DEBUG::2013-12-04 00:01:43,685::vm::2557::vm.Vm::(setDownStatus) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Changed state to Down: Migration succeeded

It seems that for some reason, when the migration is finalized, blockInfo (libvirt/qemu) returns alloc == apparent == phys (e.g. in this case 42949672960):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960

triggering an lv extension:

Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}

In fact it seems very unlikely that at the end of each migration we always hit the very end of the lv (without any enospc):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547614::INFO::2013-12-04 00:05:19,606::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 44023414784 capacity: 16106127360, alloc: 44023414784, phys: 44023414784
Thread-547802::INFO::2013-12-04 00:08:52,475::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 45097156608 capacity: 16106127360, alloc: 45097156608, phys: 45097156608
Thread-548505::INFO::2013-12-04 00:27:03,603::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 46170898432 capacity: 16106127360, alloc: 46170898432, phys: 46170898432
Thread-548705::INFO::2013-12-04 00:29:04,005::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 47244640256 capacity: 16106127360, alloc: 47244640256, phys: 47244640256
Thread-549010::INFO::2013-12-04 00:37:16,096::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 48318382080 capacity: 16106127360, alloc: 48318382080, phys: 48318382080
Thread-549184::INFO::2013-12-04 00:41:13,268::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 49392123904 capacity: 16106127360, alloc: 49392123904, phys: 49392123904
Thread-549272::INFO::2013-12-04 00:43:22,801::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 50465865728 capacity: 16106127360, alloc: 50465865728, phys: 50465865728
Thread-549360::INFO::2013-12-04 00:47:26,494::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 51539607552 capacity: 16106127360, alloc: 51539607552, phys: 51539607552
...

I suppose this should be taken care of in libvirt/qemu. Maybe we could try to protect against this by pausing the statistics at the end of the migration. I don't think we should address this in _highWrite using an upper bound.
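To make the failure mode concrete, here is a simplified illustration of the watermark logic (not the actual vdsm _highWrite code; the names, the 512 MiB watermark and the 1024 MiB chunk are assumptions) showing why a blockInfo reading with alloc == apparent == phys always requests another extension:

# Simplified illustration only, NOT the actual vdsm _highWrite code.
CHUNK_MB = 1024          # assumed amount the LV grows by on each extension
WATERMARK_MB = 512       # assumed: extend when free space in the LV drops below this

def should_extend(alloc, phys):
    """blockInfo reports alloc (highest written offset) and phys (current LV size)."""
    free = phys - alloc
    return free < WATERMARK_MB * 2 ** 20

# Normal case: plenty of room left in the LV, no extension needed.
print(should_extend(alloc=2 * 2 ** 30, phys=42949672960))      # False

# Values reported right after migration finalization (alloc == phys),
# exactly as in the log above:
print(should_extend(alloc=42949672960, phys=42949672960))      # True -> extend
# Every migration repeats this, so the LV grows by one chunk each time.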
(In reply to Federico Simoncelli from comment #7)
> I suppose this should be taken care of in libvirt/qemu.

Please file a bug on libvirt then?

> Maybe we could try to protect against this by pausing the statistics at the
> end of the migration. I don't think we should address this in _highWrite
> using an upper bound.

I think we should have an upper bound in _highWrite regardless (i.e. not let the disk grow beyond a certain limit).
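For illustration, a hedged sketch of what such a cap could look like; the 1% qcow2 metadata margin and the extra chunk of slack are assumptions, not existing vdsm values:

# Sketch of the suggested upper bound: never let the volume grow past the
# virtual size plus some slack for qcow2 metadata. Margin values are assumptions.
MiB = 2 ** 20

def capped_new_size(capacity, requested_new_size_mb, chunk_mb=1024):
    """Clamp an extension request to capacity (virtual size) + metadata slack."""
    max_size_mb = (capacity // MiB) * 1.01 + chunk_mb
    return int(min(requested_new_size_mb, max_size_mb))

# Values from the log above: capacity 16106127360 bytes (15 GiB virtual size),
# requested newSize 41984 MiB.
print(capped_new_size(capacity=16106127360, requested_new_size_mb=41984))
# -> 16537 MiB (just over the 15 GiB virtual size) instead of 41984 MiB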
Based on bug 1040507 comment 2, I have reason to believe that this is not a regression.

David, could you verify whether this happens on RHEV 3.2 as well? (Same conditions: VM with a COW disk on a block domain.) Thanks!
(In reply to Federico Simoncelli from comment #10)
> Based on bug 1040507 comment 2, I have reason to believe that this is not a
> regression.
>
> David, could you verify whether this happens on RHEV 3.2 as well? (Same
> conditions: VM with a COW disk on a block domain.) Thanks!

I'm fairly sure we hit VMs with insanely big disks in 3.2 and before, and the repeated migration could have been the cause.
(In reply to David Jaša from comment #11)
> (In reply to Federico Simoncelli from comment #10)
> > Based on bug 1040507 comment 2, I have reason to believe that this is not
> > a regression.
> >
> > David, could you verify whether this happens on RHEV 3.2 as well? (Same
> > conditions: VM with a COW disk on a block domain.) Thanks!
>
> I'm fairly sure we hit VMs with insanely big disks in 3.2 and before, and
> the repeated migration could have been the cause.

Confirmed: a 30 GB disk on RHEV 3.2 with 6.4 hosts can get to 267 GB overnight.
Moving to 3.4 because this bug depends on libvirt BZ#1040507.
libvirt-0.10.2-29.el6_5.4 contains this fix; we still need to backport the patch that requires it.
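For context, the mitigation discussed in comment #7 amounts to something like the sketch below: skip the watermark check while a migration is in flight, so a spurious blockInfo reading taken during finalization cannot trigger an extension. All names here are illustrative assumptions, not the actual patch.

# Hedged sketch only; attribute and method names are illustrative.
WATERMARK = 512 * 2 ** 20   # assumed free-space threshold, in bytes

def check_high_write(vm, drive):
    """Periodic watermark check for one drive of a running VM."""
    if vm.is_migrating():
        # blockInfo may briefly report alloc == phys while the source domain
        # is being torn down; extending now would only grow the LV for nothing.
        return
    info = vm.query_block_info(drive)          # capacity / allocation / physical
    if info.physical - info.allocation < WATERMARK:
        vm.request_extension(drive)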
Verified on:
3.4.0-0.15.beta3.el6ev
vdsm-4.14.7-0.1.beta3.el6ev.x86_64

Ran VM migrations (every 60 seconds) for several hours (~11) and the disk's actual size didn't change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html