Bug 1116558
Summary: Allow qemu-img to bypass the host cache (check, compare, convert, rebase, amend)

Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Markus Stockhausen <mst>
Assignee: Hanna Czenczek <hreitz>
QA Contact: Virtualization Bugs <virt-bugs>
CC: amureini, bazulay, bsarathy, bugs, chayang, fsimonce, gklein, hreitz, iheim, juzhang, mgoldboi, michen, mkenneth, mst, qzhang, rbalakri, s.kieske, sluo, stefanha, tlavigne, virt-maint, yeylon
Target Milestone: rc
Target Release: ---
Keywords: Rebase
Whiteboard: storage
Fixed In Version: qemu-kvm-0.12.1.2-2.442.el6
Doc Type: Rebase: Bug Fixes and Enhancements
Story Points: ---
Clones: 1138690 1138691 1157264 (view as bug list)
Bug Blocks: 1138690, 1138691, 1157264, 1166481
Last Closed: 2014-10-14 07:02:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Description
Markus Stockhausen
2014-07-06 10:19:28 UTC
From my observations, vdsm seems to issue a page cache clearing, but only at the end of the deletion process. This is far too late when the VM's disk size exceeds the free memory on the SPM.

Graph descriptions:
- spm1.png: memory usage on the SPM node (purple area: file cache usage; blue area: VM memory usage)
- spm2.png: swap file usage, divided into the swapped memory per running VM
- spm3.png: InfiniBand throughput (we are using NFS over IPoIB, so this shows the NFS throughput)
- spm4.png: swap I/O

Created attachment 914941 [details]
mem usage
Created attachment 914942 [details]
vm swapped
Created attachment 914943 [details]
nfs I/O
Created attachment 914944 [details]
swap I/O
For complete documentation, the timestamps of the delete operation:

2014-Jul-06, 12:23  Snapshot 'MST_TEST' deletion for VM 'colvm02' has been completed.
2014-Jul-06, 11:27  Snapshot 'MST_TEST' deletion for VM 'colvm02' was initiated by MST.
2014-Jul-06, 11:24  Snapshot 'MST_TEST' creation for VM 'colvm02' has been completed.
2014-Jul-06, 11:24  Snapshot 'MST_TEST' creation for VM 'colvm02' was initiated by MST.

Additional info: the mount options are oVirt defaults:

root@hypervisor# mount
...
10.10.x.x:/var/nas1/OVirtIB on /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.10.30.251,mountvers=3,mountport=46406,mountproto=udp,local_lock=none,addr=10.10.30.251)

Fede, please take a look. Worth considering whether there's any low-hanging fruit here, or whether we're just redoing this in 3.6 as part of the SPM removal anyway.

There are no logs attached, but we already run qemu-img using -t (cache) none (O_DIRECT). Please attach vdsm logs so we can determine whether we missed adding the parameter or whether qemu-img is not honoring it.

The qemu-img command seems right, but the cache is used nevertheless:

[root@SPM]# ps -ef | grep qemu-img
/usr/bin/qemu-img convert -t none -f qcow2 /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/401de48b-867c-4447-939f-17e5aeca6943/cb6ffbcc-603b-41a3-8977-faafe746ef92 -O raw /rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/401de48b-867c-4447-939f-17e5aeca6943/cb6ffbcc-603b-41a3-8977-faafe746ef92_MERGE

Attached you will find the vdsm log and another rrdgraph of the memory usage on our SPM. Before starting the test we removed all VMs from the SPM (memory usage going down). After some minutes of silence (during the dd /dev/zero command), page cache usage rises to the top. vdsm.log attached too.
The log misses the dd command, but the qemu-img command should have been logged.

Created attachment 917883 [details]
second test
Created attachment 917884 [details]
vdsm.log
Maybe not directly related, but at least a similar discussion about working with qemu-img and the page cache: http://lists.gnu.org/archive/html/qemu-devel/2014-02/msg04770.html

This might be of interest too: http://patchwork.ozlabs.org/patch/354349/ But I'm unsure whether it could fix my observed behaviour.

When qemu-img convert is used with the option -t (cache) none, it still uses the host cache for the source image:

$ qemu-img create -f raw image1.raw 20G
$ strace -e trace=open qemu-img convert -t none -f raw -O qcow2 image1.raw image1.qcow2
open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
open("image1.raw", O_RDONLY|O_CLOEXEC) = 6
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("image1.qcow2", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
open("image1.qcow2", O_RDWR|O_DIRECT|O_CLOEXEC) = 7

As we can see, O_DIRECT is used only for image1.qcow2 (and not for image1.raw). Other commands could also be affected by this issue, most notably qemu-img rebase.

Markus, thank you for taking the time to enter a bug report with us. We're not able to guarantee the timeliness or suitability of a resolution for issues entered here because this is not a mechanism for requesting support.
If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization that will result in a timely resolution. For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto

Well, qemu-img --help gives the following information about the cache mode:

    'cache' is the cache mode used to write the output disk image

And the man page states the following about -t <cache>:

    specifies the cache mode that should be used with the (destination) file.

Therefore, the current behavior is documented and seems to be desired; changing it so the cache mode applies to all images involved would therefore not be a bug fix. We can now either extend -t to mean all images involved (at least those opened explicitly by qemu-img) or introduce a new switch which sets the cache mode for the source files. The first idea seems intuitive enough, but breaks with current (documented!) behavior. The second idea doesn't seem as nice, but it would not deviate from the current behavior, and use cases such as this one, which really have to switch off all caches, would be covered. If nobody disagrees, I'll see to implementing the second option (another switch for source file caching).

From the "does not break expected behaviour" point of view, I agree that implementing another switch is the way to go. The current possible workarounds for an RHEV/oVirt setup include:

- trigger Linux page cache flushes on the SPM if free memory falls below a threshold
- ensure that the SPM runs on an empty host

So there is no trouble in implementing a sustainable solution. Hopefully the VDSM guys can give some feedback on whether they require additional features to detect the new option. From my understanding, the new option will be standard for any qemu-img operation in an RHEV/oVirt cluster.
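To illustrate how an upper layer such as vdsm might adopt the proposed switch, here is a minimal sketch of a command-line builder. This is my own illustration, not vdsm's actual code; the function and parameter names are hypothetical. It shows the design point discussed above: the existing -t (destination cache) flag stays mandatory, while the proposed -T (source cache) switch is appended only when the caller requests it, so nothing changes for callers unaware of the new option.

```python
def qemu_img_convert_cmd(src, dst, src_fmt="qcow2", dst_fmt="raw",
                         dst_cache="none", src_cache=None):
    """Build a qemu-img convert command line (illustrative helper).

    dst_cache maps to the existing -t (destination cache) flag;
    src_cache maps to the proposed -T (source cache) switch and is
    emitted only when the caller asks for it, preserving the current
    documented behavior by default.
    """
    cmd = ["/usr/bin/qemu-img", "convert", "-t", dst_cache]
    if src_cache is not None:
        cmd += ["-T", src_cache]
    cmd += ["-f", src_fmt, "-O", dst_fmt, src, dst]
    return cmd

# With -T none, both images would be opened with O_DIRECT,
# keeping the SPM's page cache clean during merges.
merge_cmd = qemu_img_convert_cmd("base.qcow2", "merged.raw", src_cache="none")
```

The resulting list could be handed to subprocess.Popen as-is; the default call (no src_cache) produces exactly the command shape vdsm already issues today.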
(In reply to Markus Stockhausen from comment #20)
> Hopefully the VDSM guys can give some feedback if they require additional
> features to detect the new option. From my understanding the new option will
> be standard for any qemu-img operation in a RHEV/OVirt cluster.

Fede, can you address this please?

Some test instructions from Max (thanks Max!):

One way is described in the BZ: if you call "strace -e trace=open qemu-img convert -t none -T none source.img destination.img", the last time each file (source.img and destination.img) is opened should be with the O_DIRECT flag. Currently, "strace -e trace=open qemu-img convert -t none source.img destination.img" will reveal that source.img is indeed not opened with O_DIRECT.

Another way is to keep track of the system memory spent on caches. First, execute "sync; echo 3 > /proc/sys/vm/drop_caches" to drop all caches. Then, execute a command like "free -m; qemu-img convert -t none source.img destination.img; free -m". If you compare the output of free before and after qemu-img convert has run, you'll see that the numbers under "used" and "cached" have increased by approximately the source image size (if you have enough RAM). Once it is possible to set the source image cache mode, those numbers in the output of "free -m; qemu-img convert -t none -T none source.img destination.img; free -m" should not have changed significantly (which is the reason why we want to be able to specify the source image cache mode: so the host cache is not "polluted" by qemu).
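The free-based bookkeeping described above can be expressed as a small helper. This is a sketch of my own (the helper names are illustrative, not part of qemu or the test plan): it extracts the page-cache figure from /proc/meminfo-style text, so the delta across a qemu-img run approximates how much of the source image ended up in the host cache. The snapshot strings below are illustrative numbers, not measured data.

```python
def cached_kib(meminfo_text):
    """Extract the 'Cached:' value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            return int(line.split()[1])
    raise ValueError("no Cached: line found")

def cache_growth_mib(before, after):
    """Page-cache growth across a run, in MiB (as 'free -m' would report)."""
    return (cached_kib(after) - cached_kib(before)) // 1024

# Illustrative snapshots: without -T none, a ~6 GiB source image
# ends up in the page cache; with -T none, the cache barely moves.
before = "MemTotal: 7932928 kB\nCached: 142336 kB\n"
after_without_T = "MemTotal: 7932928 kB\nCached: 6434816 kB\n"
after_with_T = "MemTotal: 7932928 kB\nCached: 135168 kB\n"
```

On a live host, one would read open("/proc/meminfo").read() before and after the qemu-img invocation instead of using canned strings.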
(Note that I'm using "-T" here as the switch for setting the source image cache mode; the name of the switch is however still subject to discussion, so it might change.)

Currently on qemu-kvm-rhev-0.12.1.2-2.431.el6.x86_64:

[root@localhost test]# strace -e trace=open qemu-img convert -t none source.img destination.img
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/librt.so.1", O_RDONLY) = 3
open("/lib64/libpthread.so.0", O_RDONLY) = 3
open("/lib64/libgmodule-2.0.so.0", O_RDONLY) = 3
open("/lib64/libglib-2.0.so.0", O_RDONLY) = 3
open("/lib64/libaio.so.1", O_RDONLY) = 3
open("/usr/lib64/libusbredirparser.so.1", O_RDONLY) = 3
open("/lib64/libz.so.1", O_RDONLY) = 3
open("/usr/lib64/libgfapi.so.0", O_RDONLY) = 3
open("/usr/lib64/libgfrpc.so.0", O_RDONLY) = 3
open("/usr/lib64/libgfxdr.so.0", O_RDONLY) = 3
open("/lib64/libuuid.so.1", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
open("/lib64/libdl.so.2", O_RDONLY) = 3
open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
open("source.img", O_RDONLY|O_NONBLOCK) = 3
open("source.img", O_RDONLY|O_DSYNC|O_CLOEXEC) = 3
open("source.img", O_RDONLY|O_NONBLOCK) = 3
open("source.img", O_RDONLY|O_CLOEXEC) = 3
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5
open("destination.img", O_RDONLY|O_NONBLOCK) = 5
open("destination.img", O_RDWR|O_DIRECT|O_CLOEXEC) = 5

O_DIRECT is used only for the destination.img image. We will verify the bug according to Max's comment once it is fixed.

RHEL6.6 backport posted.

Patch is merged upstream.

Fix included in qemu-kvm-0.12.1.2-2.442.el6

Verified pass on qemu-kvm-0.12.1.2-2.443.el6.x86_64.
Details are as follows:

Scenario 1:
----------
# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m; qemu-img convert -t none -f raw -O qcow2 image-1.raw image-test.qcow2; free -m
             total       used       free     shared    buffers     cached
Mem:          7746        815       6930         61          5        139
-/+ buffers/cache:        670       7075
Swap:         7887         10       7877
             total       used       free     shared    buffers     cached
Mem:          7746       6971        775         61          5       6284
-/+ buffers/cache:        681       7064
Swap:         7887         10       7877

# sync; echo 3 > /proc/sys/vm/drop_caches
# free -m; qemu-img convert -t none -f raw -O qcow2 image-1.raw image-test.qcow2 -T none; free -m
             total       used       free     shared    buffers     cached
Mem:          7746        801       6945         61          0        130
-/+ buffers/cache:        670       7076
Swap:         7887         10       7877
             total       used       free     shared    buffers     cached
Mem:          7746        804       6941         61          2        132
-/+ buffers/cache:        670       7076
Swap:         7887         10       7877

As you can see above, the "cached" column differs a lot between the runs without '-T none' and with '-T none'.

Scenario 2:
----------
2.1 qemu-img check

# strace -e trace=open qemu-img check image-1.qcow2
--- check-without-T.log 2014-09-02 14:03:35.553982498 +0800
+++ check-with-T.log 2014-09-02 14:03:24.762982631 +0800
@@ -17,6 +17,6 @@ open("/usr/lib64/libcrypto.so.10", O_RDO
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 3
 open("image-1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
No errors were found on the image.
Image end offset: 262144

2.2 qemu-img compare

# strace -e trace=open qemu-img compare -f raw -F qcow2 image-1.raw image-1.qcow2
--- compare-without-T.log 2014-09-02 14:05:17.366982406 +0800
+++ compare-with-T.log 2014-09-02 14:05:30.818982404 +0800
@@ -15,7 +15,7 @@ open("/lib64/libdl.so.2", O_RDONLY)
 open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
 open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
 open("image-1.raw", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.raw", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 5
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 5
Images are identical.

2.3 qemu-img convert

# strace -e trace=open qemu-img convert -t none -f raw -O qcow2 image-1.raw image-1.qcow2
--- convert-without-T.log 2014-09-02 14:00:14.143982449 +0800
+++ convert-with-T.log 2014-09-02 13:59:53.472982484 +0800
@@ -15,7 +15,7 @@ open("/lib64/libdl.so.2", O_RDONLY)
 open("/usr/lib64/libglusterfs.so.0", O_RDONLY) = 3
 open("/usr/lib64/libcrypto.so.10", O_RDONLY) = 3
 open("image-1.raw", O_RDONLY|O_NONBLOCK) = 3
-open("image-1.raw", O_RDONLY|O_CLOEXEC) = 3
+open("image-1.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 3
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 5
 open("image-1.qcow2", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 5

2.4 qemu-img rebase

# strace -e trace=open qemu-img rebase -f qcow2 -t none -b image-1.qcow2 -F qcow2 image-sn2.qcow2
--- rebase-without-T.log 2014-09-02 14:32:58.803982118 +0800
+++ rebase-with-T.log 2014-09-02 14:36:59.260982079 +0800
@@ -27,10 +27,10 @@ open("image-1.qcow2", O_RDONLY|O_DIRECT|
 open("image-sn1.qcow2", O_RDONLY|O_NONBLOCK) = 7
 open("image-sn1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 7
 open("image-sn1.qcow2", O_RDONLY|O_NONBLOCK) = 7
-open("image-sn1.qcow2", O_RDONLY|O_CLOEXEC) = 7
+open("image-sn1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 7
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 8
 open("image-1.qcow2", O_RDONLY|O_DSYNC|O_CLOEXEC) = 8
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 8
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 8
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 8
 open("image-1.qcow2", O_RDONLY|O_NONBLOCK) = 9
-open("image-1.qcow2", O_RDONLY|O_CLOEXEC) = 9
+open("image-1.qcow2", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9

2.5 qemu-img amend is not supported on RHEL 6.

As per the above, this issue has been fixed.

Now that the bug is fixed in the qemu lower layers, is there anything that has to be fixed for the upper vdsm/oVirt layers? I guess we must provide the -T flag to run smoothly in an integrated environment.

Remember that this bug was initially targeted at oVirt.

(In reply to Markus Stockhausen from comment #31)
> Now that the bug is fixed in the qemu lower layers, is there anything that
> has to be fixed for the upper vdsm/Ovirt layers? I guess we must provide the
> -T flag to run smoothly in an integrated environment.
>
> Remember that this bug was initially targeted at OVirt.

Thanks for the reminder. The clone for vdsm is Bug 1138690.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1490.html
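As a closing note, the O_DIRECT check used throughout the verification above lends itself to automation. The following is a minimal sketch of my own (the helper names are not from qemu or the test plan; the sample lines are taken from the strace output quoted earlier in this report). It records the flags of the last open() attempt seen per file and checks whether that final open carried O_DIRECT, which is exactly the pass criterion Max described.

```python
import re

# Matches strace lines such as:
#   open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
OPEN_RE = re.compile(r'open\("([^"]+)",\s*([A-Z_|]+)')

def last_open_flags(strace_output):
    """Return {path: flag set of the last open() attempt seen for that path}.

    Sketch only: failed opens (ENOENT probes) also overwrite the entry,
    which matches the "last time each file is opened" criterion above.
    """
    flags = {}
    for line in strace_output.splitlines():
        m = OPEN_RE.search(line)
        if m:
            flags[m.group(1)] = set(m.group(2).split("|"))
    return flags

def bypasses_cache(strace_output, path):
    """True if the final open() of `path` carried O_DIRECT."""
    return "O_DIRECT" in last_open_flags(strace_output).get(path, set())

# Sample lines from the strace run in this report (-t none, no -T):
sample = '''\
open("image1.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 6
open("image1.raw", O_RDONLY|O_CLOEXEC) = 6
open("image1.qcow2", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 7
open("image1.qcow2", O_RDWR|O_DIRECT|O_CLOEXEC) = 7
'''
```

Feeding the captured stderr of "strace -e trace=open qemu-img convert -t none -T none src dst" through bypasses_cache() for both images would mechanize the strace-based half of the verification.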