Bug 923964

Summary: vdsm: within a few seconds after a live snapshot, the volume extension requests might be too large

Field | Value
---|---
Product | Red Hat Enterprise Virtualization Manager
Component | vdsm
Version | 3.2.0
Hardware | x86_64
OS | Linux
Status | CLOSED ERRATA
Severity | high
Priority | unspecified
Reporter | Haim <hateya>
Assignee | Federico Simoncelli <fsimonce>
QA Contact | Dafna Ron <dron>
Docs Contact |
CC | abaron, amureini, bazulay, hateya, iheim, lpeer, scohen, sgrinber, yeylon, ykaul, zdover
Flags | abaron: Triaged+
Target Milestone | ---
Target Release | 3.2.0
Whiteboard | storage
oVirt Team | Storage
Fixed In Version | vdsm-4.10.2-18.0.el6ev
Doc Type | Bug Fix
Story Points | ---
Clone Of |
Environment |
Last Closed | 2013-06-10 20:46:09 UTC
Type | Bug
Regression | ---
Mount Type | ---
Documentation | ---
CRM |
Verified Versions |
Category | ---
RHEL 7.3 requirements from Atomic Host |
Cloudforms Team | ---
Target Upstream Version |
Embargoed |
Bug Depends On |
Bug Blocks | 948448
Attachments |

Doc Text:

Previously, during live storage migration of a 100 GB disk, the disk (a logical volume) was duplicated and the original was not removed, so a 100 GB logical volume could end up replaced by a 200 GB logical volume. This was caused by a race with the volume statistics: if a volume extension request arrived before the VmStatsThread had updated the apparentsize, VDSM requested an extension on top of the original raw volume size. This is fixed in vdsm-4.10.2-18.0.el6ev.
Description
Haim, 2013-03-20 19:43:30 UTC

Created attachment 713446 [details]
engine, vdsm and lvm
(In reply to comment #0)

> Description of problem:
>
> flow: live storage migration of a 100g disk (migration back to the original domain)
>
> effect on user: the disk is multiplied because the snapshot is not removed
>
> repro steps:
>
> 1) run a VM
> 2) install an OS
> 3) migrate the disk from source to dest
> 4) migrate the disk from dest back to source
> 5) check disk size - 202g
>
> since the disk is basically an LV, I checked the LV sizes and noticed 2 LVs:
>
> [root@puma31 tmp]# lvs
>   LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
>   00d2102e-fd5e-4dd1-adad-0a6b2bf3949f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 100.00g
>   2bfb1a73-4546-40f6-a5cc-17763e4c159f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 102.00g
>
> I followed the live migration thread in the vdsm log and noticed an lvremove command issued for the specific LVs:
>
> Thread-1592::DEBUG::2013-03-20 21:09:24,396::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%3514f0c542a000602|3514f0c542a000603|3514f0c542a000604|3514f0c542a000605|3514f0c542a000606|3514f0c542a000607|3514f0c542a000608|3514f0c542a000609|3514f0c542a00060a|3514f0c542a00060b%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " -f --autobackup n 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/2bfb1a73-4546-40f6-a5cc-17763e4c159f 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/00d2102e-fd5e-4dd1-adad-0a6b2bf3949f'

1. This is for the 2 LVs on the source VG (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7), while the LVs you list above are on the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0), so I'm not sure what you're saying didn't work here.

2. From the 2 LVs listed I'm guessing the first one is the raw disk and the second one is the COW layer created for the migration? Need to check why the COW layer is so large. Fede?

> (cwd None)
> Thread-1592::DEBUG::2013-03-20 21:09:24,632::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '  WARNING: This metadata update is NOT backed up\n  WARNING: This metadata update is NOT backed up\n'; <rc> = 0
>
> for some unknown reason, the LV was not removed.
>
> note that during the storage live migration, the VM was hibernated and restored.
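For readers tallying the numbers, a minimal sketch (illustrative Python only, not part of VDSM or the attached logs) of where the reported 202g comes from: the two LVs listed above on the destination VG simply add up to it.

```python
# Illustrative only: the 202g disk size seen in the repro is the sum of the two
# LVs left on the destination VG. The role of each LV ("raw base" vs "COW layer")
# follows the guess in the comment above, not a verified fact.
lv_sizes_gib = {
    "00d2102e-fd5e-4dd1-adad-0a6b2bf3949f": 100.0,  # presumed raw base volume
    "2bfb1a73-4546-40f6-a5cc-17763e4c159f": 102.0,  # presumed COW layer from the migration
}
print(sum(lv_sizes_gib.values()))  # 202.0 GiB, matching the "check disk size - 202g" step
```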
(In reply to comment #2)
> (In reply to comment #0)
> [...]
>
> 1. This is for the 2 LVs on the source VG (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7), while the LVs you list above are on the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0), so I'm not sure what you're saying didn't work here.
>
> 2. From the 2 LVs listed I'm guessing the first one is the raw disk and the second one is the COW layer created for the migration? Need to check why the COW layer is so large. Fede?

It looks like there's a race with the volume statistics. Basically, if a volume extension request arrives before the VmStatsThread has updated the apparentsize, we request an extension on top of the original raw volume size.

Note the apparentsize reported (107374182400 bytes == 100 GiB) and the extension request (104448 MiB == 102 GiB):

Thread-13987::DEBUG::2013-03-20 18:56:29,992::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.6]::call vmSnapshot with ('9cd58a26-9188-4894-a7ae-0c8ea767e043', [{'baseVolumeID': '00d2102e-fd5e-4dd1-adad-0a6b2bf3949f', 'domainID': '4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063'}]) {}
...
Thread-13978::DEBUG::2013-03-20 18:56:06,271::BindingXMLRPC::920::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '14274', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '-43200', 'hash': '-3545512217363422887', 'balloonInfo': {'balloon_max': 4194304, 'balloon_cur': 4194304}, 'pauseCode': 'NOERR', 'clientIp': '10.35.200.31', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:23:a1:1e', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': '9cd58a26-9188-4894-a7ae-0c8ea767e043', 'displayType': 'qxl', 'cpuUser': '43.22', 'disks': {u'vda': {'readLatency': '243008', 'apparentsize': '107374182400', 'writeLatency': '73893423', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'flushLatency': '100700', 'readRate': '176523.01', 'truesize': '107374182400', 'writeRate': '13688918.23'}, u'hdc': {'readLatency': '6835', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '7.24', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.54', 'elapsedTime': '61', 'vmType': 'kvm', 'cpuSys': '6.82', 'appsList': [], 'guestIPs': ''}]}
VM Channels Listener::DEBUG::2013-03-20 18:56:06,431::vmChannels::60::vds::(_handle_timeouts) Timeout on fileno 17.
...
Thread-13987::DEBUG::2013-03-20 18:56:40,273::BindingXMLRPC::920::vds::(wrapper) return vmSnapshot with {'status': {'message': 'Done', 'code': 0}, 'quiesce': False}
...
Thread-13998::DEBUG::2013-03-20 18:56:50,528::vm::830::vm.Vm::(extendDriveVolume) vmId=`9cd58a26-9188-4894-a7ae-0c8ea767e043`::Requesting an extension for the volume replication: {'newSize': 104448, 'domainID': '08221c4d-f68b-4085-ab47-9c1d4b1b70d0', 'name': u'vda', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'poolID': 'a31dbbd1-d491-4420-9747-85bbbe15cc25'}

verified on sf16 with vdsm-4.10.2-18.0.el6ev.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html
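To make the stats race analysed above concrete, here is a minimal sketch. It assumes a chunk-based extension policy in the spirit of VDSM's; the function name and the 2 GiB chunk are illustrative assumptions chosen only to reproduce the 104448 MiB value seen in the extendDriveVolume log, not VDSM's actual API or configuration.

```python
# Minimal sketch of the stats race -- illustrative only, not VDSM code.
# Assumption: the extension target is "current apparent size plus one chunk";
# the 2 GiB chunk below is chosen purely to reproduce the logged 104448 MiB
# value and is not necessarily VDSM's configured chunk size.

MIB = 1024 * 1024
GIB = 1024 * MIB
CHUNK_MIB = 2 * 1024  # hypothetical chunk size (assumption)


def requested_new_size_mib(apparentsize_bytes: int) -> int:
    """Return the extension target in MiB: one chunk above the apparent size."""
    return apparentsize_bytes // MIB + CHUNK_MIB


# Stale statistics: right after the live snapshot, VmStatsThread still reports
# the apparentsize of the 100 GiB raw base volume (see the vmGetStats log).
stale_apparentsize = 107374182400  # 100 GiB
print(requested_new_size_mib(stale_apparentsize))  # 104448 MiB == 102 GiB, as logged

# Fresh statistics: once the stats reflect the newly created, nearly empty COW
# layer, the request would be a small extension instead of a 102 GiB LV.
fresh_apparentsize = 1 * GIB  # hypothetical initial COW allocation
print(requested_new_size_mib(fresh_apparentsize))  # 3072 MiB == 3 GiB
```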