Bug 923964 - vdsm: within few seconds after a live snapshot the volume extension requests might be too large
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Federico Simoncelli
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On:
Blocks: 948448
 
Reported: 2013-03-20 19:43 UTC by Haim
Modified: 2022-07-09 07:05 UTC
CC List: 11 users

Fixed In Version: vdsm-4.10.2-18.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, during live storage migration of 100GB disks, the disk (a logical volume) was duplicated and the original disk was not removed. This led to situations in which a 100GB logical volume would be replaced with a 200GB logical volume. This was due to a race with the volume statistics: if a volume extension request arrived before the VmStatsThread had updated the apparentsize, VDSM requested an extension on top of the original raw volume size. This was fixed in vdsm-4.10.2-18.0.el6ev.
Clone Of:
Environment:
Last Closed: 2013-06-10 20:46:09 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
abaron: Triaged+


Attachments
engine, vdsm and lvm (8.80 MB, application/x-tar)
2013-03-20 19:51 UTC, Haim


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47297 0 None None None 2022-07-09 07:05:07 UTC
Red Hat Product Errata RHSA-2013:0886 0 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-11 00:25:02 UTC
oVirt gerrit 13346 0 None None None Never

Description Haim 2013-03-20 19:43:30 UTC
Description of problem:

flow: live storage migration of a 100G disk (migrating back to the original domain)

effect on user: the disk is duplicated because the snapshot is not removed

repro steps:

1) run a VM
2) install an OS
3) migrate the disk from the source domain to the destination domain
4) migrate the disk from the destination domain back to the source domain
5) check the disk size - it is now 202G

since the disk is basically an LV, I checked the LV sizes and noticed 2 LVs:

 [root@puma31 tmp]# lvs
  LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  00d2102e-fd5e-4dd1-adad-0a6b2bf3949f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 100.00g                                             
  2bfb1a73-4546-40f6-a5cc-17763e4c159f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 102.00g   

I followed the live migration thread in the vdsm log and noticed an lvremove command issued for the specific LV:

Thread-1592::DEBUG::2013-03-20 21:09:24,396::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 w
rite_cache_state=0 disable_after_error_count=3 filter = [ \'a%3514f0c542a000602|3514f0c542a000603|3514f0c542a000604|3514f0c542a000605|3514f0c542a000606|3514f0c542a000607|3514f0c542a000608|3514f0c542a000609|3514f0c
542a00060a|3514f0c542a00060b%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 4b14d3b3-7da4-4ed5-88d0-bfb3
1f3b8df7/2bfb1a73-4546-40f6-a5cc-17763e4c159f 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/00d2102e-fd5e-4dd1-adad-0a6b2bf3949f' (cwd None)
Thread-1592::DEBUG::2013-03-20 21:09:24,632::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '  WARNING: This metadata update is NOT backed up\n  WARNING: This metadata update is NOT backed up\n'; <rc> 
= 0

for some unknown reason, the LV was not removed.

note that during the live storage migration, the VM was hibernated and restored.
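
As a reproduction aid, the manual lvs check above can also be scripted. This is only a sketch under stated assumptions: the list_lvs helper is hypothetical, the VG UUID is the destination VG taken from the lvs output above, and it must run as root on the host.

  # Sketch only: list the LVs of the storage-domain VG to spot the leftover
  # 102G COW layer sitting next to the 100G raw volume.
  import subprocess

  def list_lvs(vg):
      """Return [(lv_name, size_gib), ...] for the given VG via lvm's lvs."""
      out = subprocess.check_output(
          ["lvs", "--noheadings", "--nosuffix", "--units", "g",
           "-o", "lv_name,lv_size", vg],
          universal_newlines=True)
      return [(fields[0], float(fields[1]))
              for fields in (line.split() for line in out.splitlines())
              if fields]

  if __name__ == "__main__":
      # Destination VG from the lvs output above.
      for name, size in list_lvs("08221c4d-f68b-4085-ab47-9c1d4b1b70d0"):
          print("%s %.2fG" % (name, size))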

Comment 1 Haim 2013-03-20 19:51:52 UTC
Created attachment 713446 [details]
engine, vdsm and lvm

Comment 2 Ayal Baron 2013-03-22 21:39:28 UTC
(In reply to comment #0)
> Description of problem:
> 
> flow: live storage migration of 100g disk (migration back to original domain)
> 
> affect on user: disk is multiplied as snapshot is not removed
> 
> repro steps:
> 
> 1) run VM
> 2) install O.S
> 3) migrate disk from source to dest
> 4) migrate disk from dest to source 
> 5) check disk size - 202g 
> 
> since disk is basically LV, I checked LV size and noticed 2 LVs:
> 
>  [root@puma31 tmp]# lvs
>   LV                                   VG                                  
> Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
>   00d2102e-fd5e-4dd1-adad-0a6b2bf3949f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0
> -wi------ 100.00g                                             
>   2bfb1a73-4546-40f6-a5cc-17763e4c159f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0
> -wi------ 102.00g   
> 
> followed live migration thread on vdsm and noticed an lvremove command
> issues for the specific lv:
> 
> Thread-1592::DEBUG::2013-03-20
> 21:09:24,396::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/lvm lvremove --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 w
> rite_cache_state=0 disable_after_error_count=3 filter = [
> \'a%3514f0c542a000602|3514f0c542a000603|3514f0c542a000604|3514f0c542a000605|3
> 514f0c542a000606|3514f0c542a000607|3514f0c542a000608|3514f0c542a000609|3514f0
> c
> 542a00060a|3514f0c542a00060b%\', \'r%.*%\' ] }  global {  locking_type=1 
> prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50 
> retain_days = 0 } " -f --autobackup n 4b14d3b3-7da4-4ed5-88d0-bfb3
> 1f3b8df7/2bfb1a73-4546-40f6-a5cc-17763e4c159f
> 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/00d2102e-fd5e-4dd1-adad-0a6b2bf3949f'

1. This is for the 2 LVs on the source VG (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7) while the lvs you list above are on the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0) so I'm not sure what you're saying didn't work here.
2. From the 2 lvs listed I'm guessing the first one is the raw disk and the second one is the cow layer created for the migration?  Need to check why the cow layer is so large. Fede?


> (cwd None)
> Thread-1592::DEBUG::2013-03-20
> 21:09:24,632::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ' 
> WARNING: This metadata update is NOT backed up\n  WARNING: This metadata
> update is NOT backed up\n'; <rc> 
> = 0
> 
> from some unknown reason, lv was not removed.
> 
> note that during storage live migration, VM was hibernated and restored.

Comment 3 Federico Simoncelli 2013-03-26 16:03:02 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > [description and log excerpts quoted from comment 0 snipped; see comment 0 above]
> 
> 1. This is for the 2 LVs on the source VG
> (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7) while the lvs you list above are on
> the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0) so I'm not sure
> what you're saying didn't work here.
> 2. From the 2 lvs listed I'm guessing the first one is the raw disk and the
> second one is the cow layer created for the migration?  Need to check why
> the cow layer is so large. Fede?

It looks like there's a race with the volume statistics. Basically, if a volume extension request arrives before the VmStatsThread has updated the apparentsize, then we request an extension on top of the original raw volume size. Note the apparentsize reported (107374182400 bytes == 100 GiB) and the extension request (newSize 104448 MiB == 102 GiB); a sketch of this arithmetic follows the log excerpts below:

Thread-13987::DEBUG::2013-03-20 18:56:29,992::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.6]::call vmSnapshot with ('9cd58a26-9188-4894-a7ae-0c8ea767e043', [{'baseVolumeID': '00d2102e-fd5e-4dd1-adad-0a6b2bf3949f', 'domainID': '4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063'}]) {}
...
Thread-13978::DEBUG::2013-03-20 18:56:06,271::BindingXMLRPC::920::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '14274', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '-43200', 'hash': '-3545512217363422887', 'balloonInfo': {'balloon_max': 4194304, 'balloon_cur': 4194304}, 'pauseCode': 'NOERR', 'clientIp': '10.35.200.31', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:23:a1:1e', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': '9cd58a26-9188-4894-a7ae-0c8ea767e043', 'displayType': 'qxl', 'cpuUser': '43.22', 'disks': {u'vda': {'readLatency': '243008', 'apparentsize': '107374182400', 'writeLatency': '73893423', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'flushLatency': '100700', 'readRate': '176523.01', 'truesize': '107374182400', 'writeRate': '13688918.23'}, u'hdc': {'readLatency': '6835', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '7.24', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.54', 'elapsedTime': '61', 'vmType': 'kvm', 'cpuSys': '6.82', 'appsList': [], 'guestIPs': ''}]}
VM Channels Listener::DEBUG::2013-03-20 18:56:06,431::vmChannels::60::vds::(_handle_timeouts) Timeout on fileno 17.
...
Thread-13987::DEBUG::2013-03-20 18:56:40,273::BindingXMLRPC::920::vds::(wrapper) return vmSnapshot with {'status': {'message': 'Done', 'code': 0}, 'quiesce': False}
...
Thread-13998::DEBUG::2013-03-20 18:56:50,528::vm::830::vm.Vm::(extendDriveVolume) vmId=`9cd58a26-9188-4894-a7ae-0c8ea767e043`::Requesting an extension for the volume replication: {'newSize': 104448, 'domainID': '08221c4d-f68b-4085-ab47-9c1d4b1b70d0', 'name': u'vda', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'poolID': 'a31dbbd1-d491-4420-9747-85bbbe15cc25'}
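
To make the arithmetic explicit, here is a minimal sketch of the race described above. It is illustrative only and not the actual vdsm code: Drive, compute_extension_size and chunk_mb are hypothetical names, and the 2048 MiB chunk is an assumption chosen so the numbers line up with the log (100 GiB base + 2 GiB chunk = 104448 MiB).

  # Minimal sketch of the stale-stats race; hypothetical names, not vdsm code.
  MiB = 1024 * 1024
  GiB = 1024 * MiB

  class Drive(object):
      """Cached per-drive stats, refreshed periodically by a stats thread."""
      def __init__(self, apparentsize):
          self.apparentsize = apparentsize  # bytes, as reported in vmGetStats

  def compute_extension_size(drive, chunk_mb=2048):
      """New size (in MiB) to request for the replica, from the cached stats."""
      cur_mb = (drive.apparentsize + MiB - 1) // MiB  # round up to whole MiB
      return cur_mb + chunk_mb

  # Stale stats: apparentsize still holds the 100 GiB raw base volume size,
  # because the stats thread has not polled since the live snapshot was taken.
  stale = Drive(apparentsize=100 * GiB)       # 107374182400 bytes
  print(compute_extension_size(stale))        # 104448 MiB == 102 GiB

  # Fresh stats: apparentsize reflects the small, newly created COW layer,
  # so only a modest extension would be requested.
  fresh = Drive(apparentsize=1 * GiB)
  print(compute_extension_size(fresh))        # 3072 MiB == 3 GiB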

Comment 5 Dafna Ron 2013-05-12 13:20:08 UTC
Verified on sf16 with vdsm-4.10.2-18.0.el6ev.x86_64.

Comment 7 errata-xmlrpc 2013-06-10 20:46:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html

