Bug 1038069
Summary: Repeated migration of a VM causes its disk to grow indefinitely

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.3.0
Target Release: 3.4.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
oVirt Team: Storage
Type: Bug
Reporter: David Jaša <djasa>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Raz Tamir <ratamir>
CC: acanan, amureini, bazulay, djasa, fsimonce, iheim, lpeer, michal.skrivanek, scohen, sgotliv, yeylon, zdover
Flags: scohen: Triaged+
Fixed In Version: vdsm-4.14.7-0.1.beta3.el6ev
Bug Depends On: 1040507, 1059904, 1065531
Clones: 1040507 (view as bug list)
Last Closed: 2014-06-09 13:26:53 UTC

Doc Type: Bug Fix
Doc Text: Previously, repeatedly migrating virtual machines caused the disks associated with the virtual machines to grow indefinitely. This was due to a race between live migration and highWrite monitoring. libvirt-0.10.2-29.el6_5.4 repaired this issue, and is now included in Red Hat Enterprise Virtualization. Repeated migration of virtual machines no longer causes their associated disks to grow indefinitely.
Description (David Jaša, 2013-12-04 11:06:21 UTC)
Fede, qemu cluster leak?

(Federico Simoncelli, comment #7)

This looks like a race between live migration and highWrite monitoring. I'll take one of the iterations as an example (even though the pattern repeats for each migration):

Thread-547544::DEBUG::2013-12-04 00:01:13,683::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.34.73.74', 'dst': '10.34.73.80:54321', 'vmId': '05563fbc-6b2c-4f4e-9096-8daeb9754aa2', 'abortOnError': 'true', 'method': 'online'}
Thread-547544::DEBUG::2013-12-04 00:01:13,683::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
...
Thread-547545::DEBUG::2013-12-04 00:01:43,670::vm::743::vm.Vm::(cancel) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::canceling migration downtime thread
Thread-547545::DEBUG::2013-12-04 00:01:43,671::vm::813::vm.Vm::(stop) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::stopping migration monitor thread
Thread-547547::DEBUG::2013-12-04 00:01:43,671::vm::740::vm.Vm::(run) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::migration downtime thread exiting
Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547548::DEBUG::2013-12-04 00:01:43,675::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}
Thread-547545::DEBUG::2013-12-04 00:01:43,684::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '05563fbc-6b2c-4f4e-9096-8daeb9754aa2'
Thread-547545::DEBUG::2013-12-04 00:01:43,685::vm::2557::vm.Vm::(setDownStatus) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Changed state to Down: Migration succeeded

It seems that for some reason, when the migration is finalized, blockInfo (libvirt/qemu) returns alloc == apparent == phys (in this case 42949672960):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960

triggering an LV extension:

Thread-547529::DEBUG::2013-12-04 00:01:43,683::vm::2382::vm.Vm::(__extendDriveVolume) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::Requesting an extension for the volume: {'newSize': 41984, 'domainID': 'd5a781d6-a319-4d9f-9d1a-1f776f709911', 'name': u'vda', 'volumeID': '8cc85f7a-a7df-4566-9817-0f61b227660a', 'imageID': '924daf17-9767-49cf-a248-3b49f3d9ee8e', 'poolID': '5849b030-626e-47cb-ad90-3ce782d831b3'}

In fact, it seems very unlikely that at the end of each migration we always hit the very end of the LV (without any ENOSPC):

Thread-547529::INFO::2013-12-04 00:01:43,671::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 42949672960 capacity: 16106127360, alloc: 42949672960, phys: 42949672960
Thread-547614::INFO::2013-12-04 00:05:19,606::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 44023414784 capacity: 16106127360, alloc: 44023414784, phys: 44023414784
Thread-547802::INFO::2013-12-04 00:08:52,475::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 45097156608 capacity: 16106127360, alloc: 45097156608, phys: 45097156608
Thread-548505::INFO::2013-12-04 00:27:03,603::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 46170898432 capacity: 16106127360, alloc: 46170898432, phys: 46170898432
Thread-548705::INFO::2013-12-04 00:29:04,005::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 47244640256 capacity: 16106127360, alloc: 47244640256, phys: 47244640256
Thread-549010::INFO::2013-12-04 00:37:16,096::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 48318382080 capacity: 16106127360, alloc: 48318382080, phys: 48318382080
Thread-549184::INFO::2013-12-04 00:41:13,268::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 49392123904 capacity: 16106127360, alloc: 49392123904, phys: 49392123904
Thread-549272::INFO::2013-12-04 00:43:22,801::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 50465865728 capacity: 16106127360, alloc: 50465865728, phys: 50465865728
Thread-549360::INFO::2013-12-04 00:47:26,494::vm::521::vm.Vm::(_highWrite) vmId=`05563fbc-6b2c-4f4e-9096-8daeb9754aa2`::d5a781d6-a319-4d9f-9d1a-1f776f709911/8cc85f7a-a7df-4566-9817-0f61b227660a apparent: 51539607552 capacity: 16106127360, alloc: 51539607552, phys: 51539607552
...

I suppose this should be taken care of in libvirt/qemu.
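The monitoring logic described above can be sketched as follows. This is a minimal illustration, not vdsm's actual code: the threshold and chunk size are assumptions, although a 1 GiB chunk is consistent with the log, where phys = 42949672960 (40 GiB) produces a 'newSize' request of 41984 MiB.

```python
# Minimal sketch of a high-watermark check like vdsm's _highWrite.
# Illustrative only: EXTEND_THRESHOLD and CHUNK_MIB are assumed values,
# not vdsm's actual configuration.

EXTEND_THRESHOLD = 0.9   # request an extension at 90% LV utilization
CHUNK_MIB = 1024         # grow the LV in 1 GiB chunks

def needs_extension(alloc, phys):
    """True when the guest's highest write offset (alloc) is close to
    the end of the backing logical volume (phys)."""
    return alloc >= phys * EXTEND_THRESHOLD

def new_size_mib(phys):
    """Next LV size in MiB: one chunk past the current physical size."""
    return phys // (1024 * 1024) + CHUNK_MIB

# Normal case: 8 GiB written into a 10 GiB LV -- below the watermark.
assert not needs_extension(alloc=8 << 30, phys=10 << 30)

# The buggy stats at the end of migration: libvirt reports alloc == phys,
# so the watermark check fires on every migration and the LV grows by
# one chunk each time.  With phys = 42949672960 (40 GiB), the next size
# is 41984 MiB -- exactly the 'newSize' in the log above.
assert needs_extension(alloc=42949672960, phys=42949672960)
assert new_size_mib(42949672960) == 41984
```

Under any threshold below 100%, alloc == phys always trips the check, which is why every migration adds another chunk regardless of what the guest actually wrote.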
Maybe we could try to protect against pausing the statistics at the end of the migration. I don't think we should address this in _highWrite using an upper bound.

(In reply to Federico Simoncelli from comment #7)
> I suppose this should be taken care of in libvirt/qemu.

Please file a bug on libvirt then?

> Maybe we could try to protect against pausing the statistics at the end of
> the migration. I don't think we should address this in _highWrite using an
> upper bound.

I think we should have an upper bound in highWrite regardless (i.e. not let the disk grow beyond a certain limit).

(Federico Simoncelli, comment #10)

Based on bug 1040507 comment 2, I have reason to believe that this is not a regression. David, could you verify whether this happens on RHEV 3.2 as well? (Same conditions: a VM with a COW disk on a block domain.) Thanks!

(In reply to Federico Simoncelli from comment #10)

I'm fairly sure we hit VMs with insanely big disks in 3.2 and before, and repeated migration could have been the cause.

(In reply to David Jaša from comment #11)

Confirmed: a 30 GB disk on RHEV 3.2 with 6.4 hosts can grow to 267 GB overnight.

Moving to 3.4 because this bug depends on libvirt BZ#1040507. libvirt-0.10.2-29.el6_5.4 contains this fix.
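The upper bound proposed in the discussion above could look roughly like this. It is a sketch, not vdsm's implementation; the 10% qcow2 metadata allowance and the function names are assumptions.

```python
# Sketch of the suggested upper bound for highWrite extensions: cap any
# extension request so a COW volume can never grow far past its virtual
# capacity.  QCOW2_OVERHEAD is an illustrative assumption, not vdsm's
# actual policy.

QCOW2_OVERHEAD = 1.1   # assumed ~10% allowance for qcow2 metadata

def clamp_new_size(requested_bytes, capacity_bytes):
    """Cap an LV extension request at capacity plus metadata overhead."""
    max_bytes = int(capacity_bytes * QCOW2_OVERHEAD)
    return min(requested_bytes, max_bytes)

# The VM in the log has a 15 GiB virtual disk (capacity 16106127360),
# yet its LV was repeatedly extended past 40 GiB.  A cap like this would
# have bounded the damage even with the bogus migration-time statistics.
capacity = 16106127360
assert clamp_new_size(42949672960, capacity) == int(capacity * QCOW2_OVERHEAD)
assert clamp_new_size(10 << 30, capacity) == 10 << 30  # normal request passes
```

The cap does not fix the underlying libvirt race; it only limits how much space a runaway extension loop can consume.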
We still need to backport the patch that requires it.

Verified on:
3.4.0-0.15.beta3.el6ev
vdsm-4.14.7-0.1.beta3.el6ev.x86_64

Ran VM migration (every 60 seconds) for several hours (~11); the disk's actual size did not change.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html