Bug 1413763 - LVM versus File Backed Performance for Guest Instances [NEEDINFO]
Summary: LVM versus File Backed Performance for Guest Instances
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Assignee: Eoghan Glynn
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-16 22:13 UTC by Jeremy
Modified: 2020-03-11 15:36 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-08 08:40:23 UTC
Target Upstream Version:
eglynn: needinfo? (jmelvin)



Description Jeremy 2017-01-16 22:13:01 UTC
Description of problem: I was using vdbench for testing at scale, but recently went back to basic FIO tests. I'm trying to work down through the layers to isolate as much as possible.

I'm now testing with a compute node that uses a single NVMe SSD plugged directly into a PCIe slot, to remove any RAID/array controller variables.

The FIO tests I'm using are:

Read:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=1 --size=512M --numjobs=3 --runtime=60 --group_reporting
Write:
fio --name=randwrite --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --direct=1 --size=512M --numjobs=3 --runtime=60 --group_reporting

When I run these on the underlying hypervisor, I get the following results:

WRITE: io=1536.0MB, aggrb=670445KB/s, minb=670445KB/s, maxb=670445KB/s, mint=2346msec, maxt=2346msec
READ: io=1536.0MB, aggrb=1326.5MB/s, minb=1326.5MB/s, maxb=1326.5MB/s, mint=1158msec, maxt=1158msec

But when I run them in a QCOW-backed VM on the same NVMe device on OpenStack, this is what I get:

WRITE: io=1536.0MB, aggrb=111828KB/s, minb=111828KB/s, maxb=111828KB/s, mint=14065msec, maxt=14065msec
READ: io=1536.0MB, aggrb=41774KB/s, minb=41774KB/s, maxb=41774KB/s, mint=37651msec, maxt=37651msec

This seems like a huge penalty that I didn't expect. I've found a few articles online suggesting there should be some penalty for file-backed (raw/qcow) disks versus LVM, but it shouldn't be substantial.

I did find that OpenStack sets the "cache" value to "none", which carries a performance hit, but according to the documentation that's the only caching mode that supports live migration, which makes sense.
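
For reference, that setting ends up in the instance's libvirt domain XML roughly as below (a sketch: the qcow2 driver type and the INSTANCE_UUID path component are illustrative assumptions, and io='threads' is the OpenStack default noted later in this bug):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' io='threads'/>
  <source file='/var/lib/nova/instances/INSTANCE_UUID/disk'/>
  <target dev='vda' bus='virtio'/>
</disk>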


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run the FIO tests above on the hypervisor and then inside a QCOW-backed guest on the same device.

Actual results:
Slow ephemeral disk performance with file-backed (qcow/raw) instance disks compared to LVM.

Expected results:
No substantial performance hit when using file-backed instance disks relative to LVM-backed disks.

Additional info:


Is it expected to see such a performance loss with file-backed instance disks compared to LVM?

Comment 2 Jeremy 2017-01-20 14:51:24 UTC
As an update: the customer configured a host to use raw instead of qcow today, launched a RHEL 7.2 guest instance, and was still seeing 25 MB/s for both read and write. Something is definitely not right here. The underlying disk is capable of 320 MB/s write and 700 MB/s read.
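
For reference, switching Nova's file-backed disks from qcow2 to raw is a nova.conf setting (a sketch of the Juno-era option; nova-compute needs a restart after changing it):

[DEFAULT]
# Back new instances with raw files instead of qcow2
use_cow_images = False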

Comment 3 Jeremy 2017-01-23 17:46:57 UTC
I found that if I change the IO mode to "native" instead of "threads", I see an increase to 100 MB/s and 300 MB/s with preallocation of the space. There is still a hit if I don't preallocate, but writing to space that has previously been written is fine.
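
For illustration, the manual change amounts to editing the driver element in the guest's domain XML (a sketch; INSTANCE_DOMAIN is a placeholder and the qcow2 type is an assumption):

virsh edit INSTANCE_DOMAIN

# then change io='threads' to io='native' on the disk driver line:
<driver name='qemu' type='qcow2' cache='none' io='native'/>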

I did this based on Sébastien Han's article:

https://www.sebastien-han.fr/blog/2013/08/12/openstack-unexplained-high-cpu-load-on-compute-nodes/

The last warning in that article scares me a bit, though, about sparse versus non-sparse files and possible corruption. Could we get some clarification on that?

Comment 4 Jeremy 2017-01-24 22:27:07 UTC
Just as an update from the customer:

qcow/fully preallocated
qcow/sparse
raw/fully preallocated
raw/sparse

All result in 25 MB/s for both read and write. As soon as I switch to LVM, the results go up to ~160 MB/s and ~400 MB/s. Something is definitely not right here.
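
For context, the LVM ephemeral backend is selected through nova.conf (a sketch; the volume group name nova-vg is an assumption):

[libvirt]
# Store instance disks as logical volumes instead of files
images_type = lvm
images_volume_group = nova-vg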

I found that if I switch the IO mode to "native" manually, I can get performance up to 100 MB/s and 300 MB/s. But Red Hat cautions against using io=native with sparse images, and OpenStack's default is io=threads, so I would have to hack the code per Sébastien's suggestion to accomplish this.
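
Worth noting: Nova can also preallocate file-backed disks itself, which avoids the sparse-image case the io=native warning is about (a sketch of the nova.conf option):

[DEFAULT]
# Fully allocate instance disk files up front instead of leaving them sparse
preallocate_images = space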

One of the Red Hat consultants we've worked with in the past (Jon Jozwiak) did a quick test in his Kilo environment and saw 100 MB/s with qcow and sparse. I just can't seem to find the problem.

Comment 5 Jeremy 2017-01-25 16:06:32 UTC
Update from customer:


I believe I have resolved the problem on my own. It appears that our /dev/sdb1 partition was not properly aligned to the device's block size.

I fixed this with the following parted command:

parted -a optimal /dev/sdb mkpart primary 0% 100%

Once I did that, our performance on qcow jumped to 100 MB/s write and 450 MB/s read at 4k block sizes, which is much closer to what I would expect. Read was the same across the board regardless of LVM/raw/qcow. Write was 100/130/180 MB/s for qcow/raw/LVM, respectively. Given this, we'll take the performance trade-off for the ability to migrate and thin-provision.
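
For anyone checking their own systems, parted can verify alignment directly after the fact (partition number 1 is an assumption here):

parted /dev/sdb align-check optimal 1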

Comment 8 Red Hat Bugzilla Rules Engine 2017-06-04 02:40:53 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 9 Kashyap Chamarthy 2017-09-08 08:40:23 UTC
From comment #5, the customer resolved the issue by realigning the partition with GNU `parted`.

So I'm setting the state of the bug to CLOSED NOTABUG.

Please feel free to re-open (with more data) if this recurs.

