Bug 1116558

Summary: Allow qemu-img to bypass the host cache (check, compare, convert, rebase, amend)
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.6
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Markus Stockhausen <mst>
Assignee: Hanna Czenczek <hreitz>
QA Contact: Virtualization Bugs <virt-bugs>
CC: amureini, bazulay, bsarathy, bugs, chayang, fsimonce, gklein, hreitz, iheim, juzhang, mgoldboi, michen, mkenneth, mst, qzhang, rbalakri, s.kieske, sluo, stefanha, tlavigne, virt-maint, yeylon
Target Milestone: rc
Keywords: Rebase
Whiteboard: storage
Fixed In Version: qemu-kvm-0.12.1.2-2.442.el6
Doc Type: Rebase: Bug Fixes and Enhancements
Type: Bug
Last Closed: 2014-10-14 07:02:16 UTC
Clones: 1138690, 1138691, 1157264 (view as bug list)
Bug Blocks: 1138690, 1138691, 1157264, 1166481
Attachments:
- mem usage
- vm swapped
- nfs I/O
- swap I/O
- second test
- vdsm.log

Description Markus Stockhausen 2014-07-06 10:19:28 UTC
Description of problem:

During snapshot deletion (on an NFS data center share), vdsm on the SPM issues a qemu-img command to merge the disk images. This process reads data through the hypervisor's page cache. Because the kernel may treat this freshly read data as hotter than the memory of the running VMs, the system starts swapping out their memory pages.

Since each VM disk is accessed only once during this process, the system should avoid the page cache (both approaches are sketched below):

- either by accessing the files with direct I/O
- or by clearing the page cache at regular intervals during that operation
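For illustration, a minimal demonstration of the difference, assuming a scratch file disk.img (a placeholder) and using dd's iflag=direct to stand in for direct I/O.

First drop all caches and note the "cached" column of free:

# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m

A buffered read pulls the whole file into the page cache:

# dd if=disk.img bs=1M of=/dev/null
# free -m

A direct read (O_DIRECT) leaves the cache essentially untouched:

# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=disk.img bs=1M iflag=direct of=/dev/null
# free -m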

Version-Release number of selected component (if applicable):

oVirt 3.4.2
Fedora 20 hypervisor node (kernel 3.14.8-200.fc20.x86_64)
vdsm-4.14.9-0.fc20

How reproducible:

100%


Steps to Reproduce:
1. Choose an oVirt NFS based datacenter
2. Choose a VM with a disk that is larger than the free memory on the SPM node
3. Stop the VM
4. Create a snapshot of the large disk (takes a few seconds)
5. Delete the snapshot (takes long)
6. Watch swapping on the SPM node

Actual results:

- page cache usage on SPM increases
- memory fills up to 100%
- paging swaps out memory of VMs

Expected results:

- No paging should occur

Additional info:

- This behaviour can be seen even with swappiness set to 0
- Graphs attached

Comment 1 Markus Stockhausen 2014-07-06 10:26:50 UTC
From my observations, vdsm does issue a page cache flush, but only at the end of the deletion process.

This is far too late when the VM's disk size exceeds the free memory on the SPM node.

Comment 2 Markus Stockhausen 2014-07-06 10:32:29 UTC
graph descriptions:

spm1.png: mem usage on spm node. 
- purple area: file cache usage
- blue area: VM memory usage

spm2.png: swap file usage
- divided into the swapped memory per running VM

spm3.png: infiniband throughput
- we are using NFS over IPoIB so this shows the NFS throughput

spm4.png: swap I/O

Comment 3 Markus Stockhausen 2014-07-06 10:32:54 UTC
Created attachment 914941 [details]
mem usage

Comment 4 Markus Stockhausen 2014-07-06 10:33:15 UTC
Created attachment 914942 [details]
vm swapped

Comment 5 Markus Stockhausen 2014-07-06 10:33:39 UTC
Created attachment 914943 [details]
nfs I/O

Comment 6 Markus Stockhausen 2014-07-06 10:34:05 UTC
Created attachment 914944 [details]
swap I/O

Comment 7 Markus Stockhausen 2014-07-06 10:39:16 UTC
For complete documentation the timestamps of the delete operation:

2014-Jul-06, 12:23
Snapshot 'MST_TEST' deletion for VM 'colvm02' has been completed.

2014-Jul-06, 11:27
Snapshot 'MST_TEST' deletion for VM 'colvm02' was initiated by MST.

2014-Jul-06, 11:24
Snapshot 'MST_TEST' creation for VM 'colvm02' has been completed.

2014-Jul-06, 11:24
Snapshot 'MST_TEST' creation for VM 'colvm02' was initiated by MST.

Comment 8 Markus Stockhausen 2014-07-06 11:30:31 UTC
Additional info: mount options are oVirt defaults:

root@hypervisor# mount
...
10.10.x.x:/var/nas1/OVirtIB on /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB 
type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,
          proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.10.30.251,mountvers=3,
          mountport=46406,mountproto=udp,local_lock=none,addr=10.10.30.251)

Comment 9 Allon Mureinik 2014-07-08 13:40:39 UTC
Fede, please take a look.
Worth considering whether there's any low-hanging fruit here, or whether we're just redoing this in 3.6 as part of the SPM removal anyway.

Comment 10 Federico Simoncelli 2014-07-14 10:47:53 UTC
There are no logs attached, but we already run qemu-img with -t none (cache mode none, i.e. O_DIRECT).

Please attach the vdsm logs so we can determine whether we failed to add the parameter or whether qemu-img is not honoring it.

Comment 11 Markus Stockhausen 2014-07-14 14:46:06 UTC
The qemu-img command seems right, but the cache is used nevertheless:

[root@SPM] # ps -ef | grep qemu-img
/usr/bin/qemu-img convert -t none -f qcow2 /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/401de48b-867c-4447-939f-17e5aeca6943/cb6ffbcc-603b-41a3-8977-faafe746ef92 -O raw /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/401de48b-867c-4447-939f-17e5aeca6943/cb6ffbcc-603b-41a3-8977-faafe746ef92_MERGE

Attached you will find the vdsm log and another rrd graph of the memory usage on our SPM. Before starting the test we removed all VMs from the SPM (memory usage going down). After some minutes of silence (during the dd from /dev/zero), page cache usage rises to the top.

The vdsm.log is attached too. It does not include the dd command, but the qemu-img command should have been logged.

Comment 12 Markus Stockhausen 2014-07-14 14:46:28 UTC
Created attachment 917883 [details]
second test

Comment 13 Markus Stockhausen 2014-07-14 14:47:46 UTC
Created attachment 917884 [details]
vdsm.log

Comment 14 Markus Stockhausen 2014-07-14 16:41:10 UTC
Maybe not directly related, but at least a similar discussion about working with qemu-img and the page cache: http://lists.gnu.org/archive/html/qemu-devel/2014-02/msg04770.html

Comment 15 Markus Stockhausen 2014-07-14 16:54:13 UTC
This might be of interest too: http://patchwork.ozlabs.org/patch/354349/

But I'm unsure whether it would fix the behaviour I observed.

Comment 16 Federico Simoncelli 2014-07-14 21:49:31 UTC
When qemu-img convert is used with the option -t none (cache mode none), it still uses the host page cache for the source image:

$ qemu-img create -f raw image1.raw 20G
$ strace -e trace=open qemu-img convert -t none -f raw -O qcow2 image1.raw image1.qcow2

open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
open("image1.raw", O_RDONLY|O_CLOEXEC)  = 6
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_CLOEXEC)  = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_CLOEXEC)  = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_DIRECT|O_CLOEXEC) = 7

As we can see, O_DIRECT is used only for image1.qcow2 (and not for image1.raw).

Other commands could be affected by this issue as well, most notably qemu-img rebase.
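As a side note, the flags of an already-open file descriptor can also be inspected through procfs while qemu-img is running (a hypothetical check; fd 7 corresponds to the destination fd in the strace output above):

# pid=$(pgrep -f 'qemu-img convert')
# ls -l /proc/$pid/fd
# cat /proc/$pid/fdinfo/7

The ls output maps fd numbers to the image paths; the "flags:" line in fdinfo is octal, and O_DIRECT is 040000 on x86_64, so a source fd opened without it will lack that bit.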

Comment 18 Ademar Reis 2014-07-14 21:57:41 UTC
Markus, thank you for taking the time to enter a bug report with us. We're not able to guarantee the timeliness or suitability of a resolution for issues entered here because this is not a mechanism for requesting support.

If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization that will result in a timely resolution.

For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto

Comment 19 Hanna Czenczek 2014-07-18 19:39:10 UTC
Well, qemu-img --help gives the following information about the cache mode:

'cache' is the cache mode used to write the output disk image

And the man page states the following about -t <cache>:

specifies the cache mode that should be used with the (destination) file.


Therefore, the current behavior is documented and appears to be intended. Applying the cache mode to all images involved would consequently not be a bug fix.

We can now either extend -t to cover all images involved (at least those opened explicitly by qemu-img) or introduce a new switch that sets the cache mode for the source files. The first idea seems intuitive enough, but it breaks the current (documented!) behavior. The second idea is less elegant, but it does not deviate from the current behavior, and it covers use cases such as this one, which really have to switch off all caches.

If nobody disagrees, I'll see to implementing the second option (another switch for source file caching).

Comment 20 Markus Stockhausen 2014-07-20 18:19:49 UTC
From the "does not break expected behaviour" point of view I agree that implementing another switch is the way to go. The current possible workarounds for an  RHEV/OVirt setups include:

- trigger linux page cache flushes on the SPM if free memory falls below a threshold

- Ensure that SPM runs on an empty host.

So no trouble to implement a sustainable solution. 
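A minimal sketch of the first workaround, assuming it runs as root on the SPM; the threshold and polling interval are arbitrary placeholders:

#!/bin/sh
# Drop the page cache whenever free memory falls below a threshold.
# THRESHOLD_MB and the 30-second polling interval are placeholders.
THRESHOLD_MB=1024
while true; do
    free_mb=$(awk '/^MemFree:/ {print int($2 / 1024)}' /proc/meminfo)
    if [ "$free_mb" -lt "$THRESHOLD_MB" ]; then
        sync
        echo 3 > /proc/sys/vm/drop_caches
    fi
    sleep 30
done

Note that dropping caches is a blunt instrument: it evicts clean pages for every process on the host, so it is only a stopgap until qemu-img can bypass the cache itself.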

Hopefully the vdsm guys can give some feedback on whether they require additional features to detect the new option. From my understanding, the new option will become standard for any qemu-img operation in a RHEV/oVirt cluster.

Comment 21 Allon Mureinik 2014-07-21 06:52:46 UTC
(In reply to Markus Stockhausen from comment #20)
> Hopefully the VDSM guys can give some feedback if they require additional
> features to detect the new option. From my understanding the new option will
> be standard for any qemu-img operation in a RHEV/OVirt cluster.
Fede, can you address this please?

Comment 23 Qunfang Zhang 2014-07-31 05:40:35 UTC
Some test instructions from Max (thanks, Max!):

One way is described in the BZ: If you call "strace -e trace=open qemu-img convert -t none -T none source.img destination.img", the last time each file (source.img and destination.img) is opened should be with the O_DIRECT flag. Currently, "strace -e trace=open qemu-img convert -t none source.img destination.img" will reveal that source.img is indeed not opened with O_DIRECT.

Another way is to keep track of the system memory spent on caches: First, you execute "sync; echo 3 > /proc/sys/vm/drop_caches" to drop all caches. Then, you execute the command like "free -m; qemu-img convert -t none source.img destination.img; free -m". If you compare the output of free before and after qemu-img convert has run, you'll see that the numbers under "used" and "cached" have increased by approximately the source image size (if you have enough RAM).

When it is possible to set the source image cache mode, those numbers in the output of "free -m; qemu-img convert -t none -T none source.img destination.img; free -m" should not have significantly changed (which is the reason why we want to be able to specify the source image cache mode, so the host cache is not "polluted" by qemu).

(Note that I'm using "-T" here as the switch for setting the source image cache mode; the name of the switch is still subject to discussion, however, so it might change.)
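Both checks condensed into prompt form (image names are placeholders, and -T is the proposed switch name from above, not a final interface).

Check 1 - the last open() of each image should carry O_DIRECT:

# strace -e trace=open qemu-img convert -t none -T none source.img destination.img 2>&1 | grep O_DIRECT

Check 2 - the "cached" column of free should stay essentially flat:

# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m; qemu-img convert -t none -T none source.img destination.img; free -m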

Currently on qemu-kvm-rhev-0.12.1.2-2.431.el6.x86_64:

[root@localhost test]# strace -e trace=open qemu-img convert -t none  source.img destination.img
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/librt.so.1", O_RDONLY)     = 3
open("/lib64/libpthread.so.0", O_RDONLY) = 3
open("/lib64/libgmodule-2.0.so.0", O_RDONLY) = 3
open("/lib64/libglib-2.0.so.0", O_RDONLY) = 3
open("/lib64/libaio.so.1", O_RDONLY)    = 3
open("/usr/lib64/libusbredirparser.so.1", O_RDONLY) = 3
open("/lib64/libz.so.1", O_RDONLY)      = 3
open("/usr/lib64/libgfapi.so.0", O_RDONLY) = 3
open("/usr/lib64/libgfrpc.so.0", O_RDONLY) = 3
open("/usr/lib64/libgfxdr.so.0", O_RDONLY) = 3
open("/lib64/libuuid.so.1", O_RDONLY)   = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
open("/lib64/libdl.so.2", O_RDONLY)     = 3
open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
open("source.img", O_RDONLY|O_NONBLOCK) = 3
open("source.img", O_RDONLY|O_DSYNC|O_CLOEXEC) = 3
open("source.img", O_RDONLY|O_NONBLOCK) = 3
open("source.img", O_RDONLY|O_CLOEXEC)  = 3
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_RDWR|O_DIRECT|O_CLOEXEC) = 5

O_DIRECT is used only for the destination.img image. 

We will verify the bug according to Max's comment once it's fixed.

Comment 26 Stefan Hajnoczi 2014-08-26 15:06:17 UTC
RHEL 6.6 backport posted. The patch is merged upstream.

Comment 27 Jeff Nelson 2014-08-29 03:28:27 UTC
Fix included in qemu-kvm-0.12.1.2-2.442.el6

Comment 29 Chao Yang 2014-09-02 06:58:14 UTC
Verified pass on qemu-kvm-0.12.1.2-2.443.el6.x86_64. 

Details are as follows:

Scenario 1:
----------

# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m;qemu-img convert -t none -f raw -O qcow2 image-1.raw image-test.qcow2; free -m
             total       used       free     shared    buffers     cached
Mem:          7746        815       6930         61          5        139
                                                                      ^^^
-/+ buffers/cache:        670       7075
Swap:         7887         10       7877
             total       used       free     shared    buffers     cached
Mem:          7746       6971        775         61          5       6284
                                                                      ^^^
-/+ buffers/cache:        681       7064
Swap:         7887         10       7877

# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m;qemu-img convert -t none -f raw -O qcow2 image-1.raw image-test.qcow2 -T none; free -m
             total       used       free     shared    buffers     cached
Mem:          7746        801       6945         61          0        130
-/+ buffers/cache:        670       7076
Swap:         7887         10       7877
             total       used       free     shared    buffers     cached
Mem:          7746        804       6941         61          2        132
-/+ buffers/cache:        670       7076
Swap:         7887         10       7877


As you can see above, the cached column differs significantly between the run without '-T none' and the run with '-T none'.

Scenario 2:
----------

2.1 qemu-img check

# strace -e trace=open qemu-img check image-1.qcow2 

--- check-without-T.log	2014-09-02 14:03:35.553982498 +0800
+++ check-with-T.log	2014-09-02 14:03:24.762982631 +0800
@@ -17,6 +17,6 @@ open("/usr/lib64/libcrypto.so.10", O_RDO
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 3
 open("image-1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
 No errors were found on the image.
 Image end offset: 262144


2.2 qemu-img compare

# strace -e trace=open qemu-img compare -f raw -F qcow2 image-1.raw image-1.qcow2
--- compare-without-T.log	2014-09-02 14:05:17.366982406 +0800
+++ compare-with-T.log	2014-09-02 14:05:30.818982404 +0800
@@ -15,7 +15,7 @@ open("/lib64/libdl.so.2", O_RDONLY)     
 open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
 open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
 open("image-1.raw", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.raw", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 5
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 5
 Images are identical.


2.3 qemu-img convert

# strace -e trace=open qemu-img convert -t none -f raw -O qcow2 image-1.raw image-1.qcow2
--- convert-without-T.log	2014-09-02 14:00:14.143982449 +0800
+++ convert-with-T.log	2014-09-02 13:59:53.472982484 +0800
@@ -15,7 +15,7 @@ open("/lib64/libdl.so.2", O_RDONLY)     
 open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
 open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
 open("image-1.raw", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.raw", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
 open("image-1.qcow2", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5


2.4 qemu-img rebase

# strace -e trace=open qemu-img rebase -f qcow2 -t none -b image-1.qcow2 -F qcow2 image-sn2.qcow2
--- rebase-without-T.log	2014-09-02 14:32:58.803982118 +0800
+++ rebase-with-T.log	2014-09-02 14:36:59.260982079 +0800
@@ -27,10 +27,10 @@ open("image-1.qcow2", O_RDONLY|O_DIRECT|
 open("image-sn1.qcow2", O_RDONLY|O_NONBLOCK) = 7
 open("image-sn1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 7
 open("image-sn1.qcow2", O_RDONLY|O_NONBLOCK) = 7
-open("image-sn1.qcow2", O_RDONLY|O_CLOEXEC) = 7
+open("image-sn1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 7
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 8
 open("image-1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 8
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 8
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 8
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 8
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 9
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 9
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9


2.5 qemu-img amend is not supported on RHEL 6, so there is nothing to verify for it.



As shown above, this issue has been fixed.

Comment 31 Markus Stockhausen 2014-09-02 07:54:23 UTC
Now that the bug is fixed in the lower qemu layers, is there anything that has to be fixed in the upper vdsm/oVirt layers? I guess vdsm must pass the -T flag to run smoothly in an integrated environment.

Remember that this bug was initially targeted at oVirt.

Comment 34 Ademar Reis 2014-09-05 13:07:03 UTC
(In reply to Markus Stockhausen from comment #31)
> Now that the bug is fixed in the qemu lower layers, is there anything that
> has to be fixed for the upper vdsm/Ovirt layers? I guess we must provide the
> -T flag to run smoothly in an integrated environment. 
> 
> Remember that this bug was initially targeted at OVirt.

Thanks for the reminder. The clone for vdsm is Bug 1138690.

Comment 35 errata-xmlrpc 2014-10-14 07:02:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1490.html