Bug 923964

Summary: vdsm: within a few seconds after a live snapshot the volume extension requests might be too large
Product: Red Hat Enterprise Virtualization Manager
Reporter: Haim <hateya>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Dafna Ron <dron>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: abaron, amureini, bazulay, hateya, iheim, lpeer, scohen, sgrinber, yeylon, ykaul, zdover
Target Milestone: ---
Flags: abaron: Triaged+
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: vdsm-4.10.2-18.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, during live storage migrations of 100GB disks, the disk (a logical volume) was duplicated and the original disk was not removed. This led to situations in which a 100GB logical volume would be replaced with a 200GB logical volume. The cause was a race with the volume statistics: if a volume extension request arrived before the VmStatsThread had updated the apparentsize, VDSM requested an extension on top of the original raw volume size. This was fixed in vdsm-4.10.2-18.0.el6ev.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 20:46:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 948448
Attachments: engine, vdsm and lvm

Description Haim 2013-03-20 19:43:30 UTC
Description of problem:

Flow: live storage migration of a 100G disk (migration back to the original domain)

Impact on the user: the disk is duplicated because the snapshot is not removed

Reproduction steps:

1) run a VM
2) install an OS
3) migrate the disk from the source domain to the destination
4) migrate the disk from the destination back to the source
5) check the disk size - it is 202G

Since the disk is basically an LV, I checked the LV sizes and noticed 2 LVs:

 [root@puma31 tmp]# lvs
  LV                                   VG                                   Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  00d2102e-fd5e-4dd1-adad-0a6b2bf3949f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 100.00g                                             
  2bfb1a73-4546-40f6-a5cc-17763e4c159f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0 -wi------ 102.00g   

I followed the live migration thread in the vdsm log and noticed an lvremove command issued for the relevant LVs:

Thread-1592::DEBUG::2013-03-20 21:09:24,396::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 w
rite_cache_state=0 disable_after_error_count=3 filter = [ \'a%3514f0c542a000602|3514f0c542a000603|3514f0c542a000604|3514f0c542a000605|3514f0c542a000606|3514f0c542a000607|3514f0c542a000608|3514f0c542a000609|3514f0c
542a00060a|3514f0c542a00060b%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 4b14d3b3-7da4-4ed5-88d0-bfb3
1f3b8df7/2bfb1a73-4546-40f6-a5cc-17763e4c159f 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/00d2102e-fd5e-4dd1-adad-0a6b2bf3949f' (cwd None)
Thread-1592::DEBUG::2013-03-20 21:09:24,632::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '  WARNING: This metadata update is NOT backed up\n  WARNING: This metadata update is NOT backed up\n'; <rc> 
= 0

For some unknown reason, the LV was not removed.

Note that during the live storage migration, the VM was hibernated and restored.

Comment 1 Haim 2013-03-20 19:51:52 UTC
Created attachment 713446 [details]
engine, vdsm and lvm

Comment 2 Ayal Baron 2013-03-22 21:39:28 UTC
(In reply to comment #0)
> Description of problem:
> 
> flow: live storage migration of 100g disk (migration back to original domain)
> 
> affect on user: disk is multiplied as snapshot is not removed
> 
> repro steps:
> 
> 1) run VM
> 2) install O.S
> 3) migrate disk from source to dest
> 4) migrate disk from dest to source 
> 5) check disk size - 202g 
> 
> since disk is basically LV, I checked LV size and noticed 2 LVs:
> 
>  [root@puma31 tmp]# lvs
>   LV                                   VG                                  
> Attr      LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
>   00d2102e-fd5e-4dd1-adad-0a6b2bf3949f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0
> -wi------ 100.00g                                             
>   2bfb1a73-4546-40f6-a5cc-17763e4c159f 08221c4d-f68b-4085-ab47-9c1d4b1b70d0
> -wi------ 102.00g   
> 
> followed live migration thread on vdsm and noticed an lvremove command
> issues for the specific lv:
> 
> Thread-1592::DEBUG::2013-03-20
> 21:09:24,396::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/lvm lvremove --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 w
> rite_cache_state=0 disable_after_error_count=3 filter = [
> \'a%3514f0c542a000602|3514f0c542a000603|3514f0c542a000604|3514f0c542a000605|3
> 514f0c542a000606|3514f0c542a000607|3514f0c542a000608|3514f0c542a000609|3514f0
> c
> 542a00060a|3514f0c542a00060b%\', \'r%.*%\' ] }  global {  locking_type=1 
> prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50 
> retain_days = 0 } " -f --autobackup n 4b14d3b3-7da4-4ed5-88d0-bfb3
> 1f3b8df7/2bfb1a73-4546-40f6-a5cc-17763e4c159f
> 4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7/00d2102e-fd5e-4dd1-adad-0a6b2bf3949f'

1. This is for the 2 LVs on the source VG (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7) while the lvs you list above are on the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0) so I'm not sure what you're saying didn't work here.
2. From the 2 lvs listed I'm guessing the first one is the raw disk and the second one is the cow layer created for the migration?  Need to check why the cow layer is so large. Fede?


> (cwd None)
> Thread-1592::DEBUG::2013-03-20
> 21:09:24,632::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ' 
> WARNING: This metadata update is NOT backed up\n  WARNING: This metadata
> update is NOT backed up\n'; <rc> 
> = 0
> 
> from some unknown reason, lv was not removed.
> 
> note that during storage live migration, VM was hibernated and restored.

Comment 3 Federico Simoncelli 2013-03-26 16:03:02 UTC
(In reply to comment #2)
> 1. This is for the 2 LVs on the source VG
> (4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7) while the lvs you list above are on
> the destination VG (08221c4d-f68b-4085-ab47-9c1d4b1b70d0) so I'm not sure
> what you're saying didn't work here.
> 2. From the 2 lvs listed I'm guessing the first one is the raw disk and the
> second one is the cow layer created for the migration?  Need to check why
> the cow layer is so large. Fede?

It looks like there's a race with the volume statistics. Basically, if a volume extension request arrives before the VmStatsThread has updated the apparentsize, then we request an extension on top of the original raw volume size. Note the apparentsize reported (107374182400 bytes == 100GiB) and the extension request (104448 MiB == 102GiB):

Thread-13987::DEBUG::2013-03-20 18:56:29,992::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.6]::call vmSnapshot with ('9cd58a26-9188-4894-a7ae-0c8ea767e043', [{'baseVolumeID': '00d2102e-fd5e-4dd1-adad-0a6b2bf3949f', 'domainID': '4b14d3b3-7da4-4ed5-88d0-bfb31f3b8df7', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063'}]) {}
...
Thread-13978::DEBUG::2013-03-20 18:56:06,271::BindingXMLRPC::920::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Up', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '14274', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '-43200', 'hash': '-3545512217363422887', 'balloonInfo': {'balloon_max': 4194304, 'balloon_cur': 4194304}, 'pauseCode': 'NOERR', 'clientIp': '10.35.200.31', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:23:a1:1e', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': '9cd58a26-9188-4894-a7ae-0c8ea767e043', 'displayType': 'qxl', 'cpuUser': '43.22', 'disks': {u'vda': {'readLatency': '243008', 'apparentsize': '107374182400', 'writeLatency': '73893423', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'flushLatency': '100700', 'readRate': '176523.01', 'truesize': '107374182400', 'writeRate': '13688918.23'}, u'hdc': {'readLatency': '6835', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '7.24', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.54', 'elapsedTime': '61', 'vmType': 'kvm', 'cpuSys': '6.82', 'appsList': [], 'guestIPs': ''}]}
VM Channels Listener::DEBUG::2013-03-20 18:56:06,431::vmChannels::60::vds::(_handle_timeouts) Timeout on fileno 17.
...
Thread-13987::DEBUG::2013-03-20 18:56:40,273::BindingXMLRPC::920::vds::(wrapper) return vmSnapshot with {'status': {'message': 'Done', 'code': 0}, 'quiesce': False}
...
Thread-13998::DEBUG::2013-03-20 18:56:50,528::vm::830::vm.Vm::(extendDriveVolume) vmId=`9cd58a26-9188-4894-a7ae-0c8ea767e043`::Requesting an extension for the volume replication: {'newSize': 104448, 'domainID': '08221c4d-f68b-4085-ab47-9c1d4b1b70d0', 'name': u'vda', 'volumeID': '2bfb1a73-4546-40f6-a5cc-17763e4c159f', 'imageID': '195c2ec0-b5bb-4278-9978-f66de0ba2063', 'poolID': 'a31dbbd1-d491-4420-9747-85bbbe15cc25'}
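
To illustrate the arithmetic behind the oversized request, here is a minimal Python sketch of the race described above. It is not VDSM code; the class and function names and the 2048 MiB extension chunk are assumptions chosen only to reproduce the numbers from the log.

 # Minimal sketch of the race, not VDSM code. The names and the 2048 MiB
 # chunk are assumptions chosen only to reproduce the numbers in the log.

 MIB = 1024 * 1024
 CHUNK_MB = 2048  # assumed extension chunk (104448 - 102400 in the log)

 class DriveStats:
     """Stands in for the per-drive stats that the VmStatsThread refreshes."""
     def __init__(self, apparentsize):
         self.apparentsize = apparentsize  # bytes, refreshed periodically

 def request_replication_extension(stats):
     # Buggy path: the new size is derived from whatever apparentsize the
     # stats thread last stored. Right after a live snapshot the active layer
     # is a small COW volume, but if the stats thread has not run yet,
     # apparentsize still reflects the 100 GiB raw base volume.
     current_mb = stats.apparentsize // MIB
     return current_mb + CHUNK_MB  # requested newSize, in MiB

 stale = DriveStats(apparentsize=107374182400)  # value from vmGetStats above
 print(request_replication_extension(stale))    # 104448 MiB == 102 GiB

With a refreshed apparentsize for the newly created COW layer (at most a few GiB), the same computation would produce a much smaller request, which is why the oversized extension only happens when the request lands within a few seconds of the live snapshot.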

Comment 5 Dafna Ron 2013-05-12 13:20:08 UTC
verified on sf16 with vdsm-4.10.2-18.0.el6ev.x86_64

Comment 7 errata-xmlrpc 2013-06-10 20:46:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html