Bug 1415849

Summary: Using sparse_logical_volumes=true for Nova ephemeral leads to disk access errors
Product: Red Hat OpenStack
Component: openstack-nova
Version: 9.0 (Mitaka)
Status: CLOSED WONTFIX
Severity: low
Priority: low
Hardware: Unspecified
OS: Unspecified
Keywords: ZStream
Type: Bug
Reporter: David Hill <dhill>
Assignee: Eoghan Glynn <eglynn>
QA Contact: Prasanth Anbalagan <panbalag>
CC: berrange, dasmith, dhill, eglynn, justin.rackliffe, kchamart, mbooth, sbauza, sferdjao, sgordon, srevivo, vromanso
Last Closed: 2018-01-19 15:37:45 UTC

Description David Hill 2017-01-23 23:26:01 UTC
Description of problem:
Attempting to thin provision when images_type=lvm results in a broken volume chain and disk read errors in the instances.

Applicable configs from nova.conf:
images_type=lvm
images_volume_group=<vgname>
volume_clear=none             # this should bypass the heavy dd on an instance delete
sparse_logical_volumes=true   # causes the disk errors
force_raw_images=false        # avoids the heavy qemu-img convert
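
For reference, a minimal sketch of where these options live in nova.conf on Mitaka (section placement assumed; <vgname> is a placeholder):

  [DEFAULT]
  force_raw_images=false

  [libvirt]
  images_type=lvm
  images_volume_group=<vgname>
  sparse_logical_volumes=true
  volume_clear=none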

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Configure images_type=lvm and sparse_logical_volumes=true in nova.conf
2. Boot a VM using a flavor with 10 GB and the RHEL 7.3 KVM guest image (see the command sketch below)
3. Observe that the instance fails to boot with disk read errors
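
A minimal reproduction sketch (flavor, image and server names are illustrative placeholders):

  # flavor with 10 GB, assumed here to be the root disk size
  openstack flavor create --vcpus 1 --ram 2048 --disk 10 m1.repro
  # boot from the RHEL 7.3 KVM guest image
  openstack server create --flavor m1.repro --image rhel-guest-image-7.3 repro-vm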

Actual results:
Fails to boot 

Expected results:
Boots

Additional info:
This is easily reproducible, and if this feature is not supported, perhaps it should be removed from OpenStack altogether.

Comment 1 Justin Rackliffe 2017-01-24 13:47:15 UTC
There are some additional configuration checks that could be done when sparse_logical_volumes=true.

One key issue I found: in the OSP9 RHEV install, lvm.conf had thin_pool_autoextend_threshold set to 100, so the initially-sized thin pool LV would max out, but the qemu-img convert call would still return 0.
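
For illustration, the effective threshold can be checked with lvmconfig on RHEL 7:

  lvmconfig activation/thin_pool_autoextend_threshold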

I also found that on delete the thin LV would be removed (e.g. uuid_disk), but the autogenerated thin pool (data LV and metadata LV) would be left behind, so the lvol### volumes accumulate within LVM; I found no garbage-collection routine.
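
The leftovers can be seen with something like the following (VG name is a placeholder); hidden internal LVs such as the pool's _tdata/_tmeta show up in brackets:

  lvs -a <vgname>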

The autoextend (I tested with a threshold of 70 and a 20% increase) is seemingly not quick enough to keep up with either qemu-img or a dd-sourced data copy. I would see a variety of LSize values for the thin pool after a copy completed. It seems like the writes are being ACK'd but not written back correctly.
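
For reference, a sketch of the lvm.conf settings involved (values are the ones tested above; standard lvm.conf placement assumed):

  # /etc/lvm/lvm.conf
  activation {
      thin_pool_autoextend_threshold = 70
      thin_pool_autoextend_percent = 20
  }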

As a note, with LVM sparse or thick volumes I don't see an alternative to the raw conversion, so that is an expected overhead of using LVM. I do find it a bit curious that LVM snapshots of a thick- or thin-hosted _base image don't seem to be used when cow is true.

Comment 2 Matthew Booth 2017-01-27 13:58:33 UTC
For reference: not reproducible on devstack. Will try with packstack RHOS 9 and LVM.

In case I can't reproduce, could you please share a sosreport from an affected machine with a failure in the logs?
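
For example (plugin selection is only a suggestion; a full default sosreport is also fine):

  sosreport -o openstack_nova,libvirt,lvm2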