+++ This bug was initially created as a clone of Bug #1127460 +++

Description of problem:

After an automatic extend of a disk, the VM stops with the error "abnormal vm stop".

When using thin provisioning on block storage, when a disk becomes too full, vdsm asks the SPM to extend the disk. After the disk is extended, vdsm refreshes the lv. Soon after refreshing the lv, the vm is paused.

It is possible to reproduce this issue without extending a disk, simply by refreshing the lv used by the vm while the vm is writing to it.

Version-Release number of selected component (if applicable):
- vdsm master
- vdsm from ovirt-3.5
- vdsm from ovirt-3.4

Platforms where the issue can be reproduced:
- Fedora 19
- Fedora 20

Platform where the issue cannot be reproduced:
- RHEL 6.5

Tested disk interfaces:
- virtio
- virtio-scsi
- ide

How reproducible:
Always

Steps to Reproduce:
1. Start a vm
2. Run code writing to disk on the vm:
   while true; do date > guest.log 2>&1; sync; sleep 1; done
3. Run lvchange --refresh vgname/lvname (see refresh.out for the full command line used)

Actual results:
After a few milliseconds or seconds, the vm stops.

Expected results:
The vm should continue to run normally.

Additional info:

In /var/log/libvirt/qemu/vmname.log we don't see any error.

In /var/log/libvirt/libvirt.log we see this error:

2014-08-06 21:15:39.793+0000: 821: debug : qemuMonitorIOProcess:393 : QEMU_MONITOR_IO_PROCESS: mon=0x7f89a800c480 buf={"timestamp": {"seconds": 1407359739, "microseconds": 793742}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "operation": "write", "action": "stop"}} len=173

In /var/log/vdsm/vdsm.log we see this error:

Thu Aug  7 00:15:39 IDT 2014 -------- refreshing lv --------
libvirtEventLoop::INFO::2014-08-07 00:15:39,795::vm::4681::vm.Vm::(_onIOError) vmId=`acf8b4d1-4218-4dad-a665-d9000fbe20dc`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::DEBUG::2014-08-07 00:15:39,796::vm::5350::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`acf8b4d1-4218-4dad-a665-d9000fbe20dc`::event
Suspended detail 2 opaque None

(The "refreshing lv" line is written by the refresh script.)

In /var/log/messages we don't see any error after the refresh.

Attached files:
- after-extend-1/ - pause triggered by automatic extend
- after-extend-2/ - another instance triggered by automatic extend
- refresh-virtio - pause triggered by refreshing the lv when the disk used the virtio interface
- refresh-virtio-scsi - pause triggered by refreshing the lv when the disk used the virtio-scsi interface
- refresh-ide - pause triggered by refreshing the lv when the disk used the ide interface
- refresh.sh - script refreshing the vm disk, using the same command line used by vdsm
- udevadm-monitor.sh - script for logging kernel and udev events while refreshing
- rpm-qa.out - output of rpm -qa on the host

In each directory, you can find these files:
- messages - from /var/log/messages
- vdsm.log
- supervdsm.log
- libvirt.log - from /var/log/libvirt/libvirt.log
- ovirt-3.4-fc-vm01.log - qemu vm log from /var/log/libvirt/qemu
- refresh.out - output of refresh.sh
- udevadm-monitor.out - output of udevadm-monitor.sh

--- Additional comment from Nir Soffer on 2014-08-06 18:34:54 EDT ---

Zdenek, can you take a look at this? Is it possible that lvm is doing something wrong that causes an error in qemu?

Please look in refresh.out - it contains the output of
lvchange --refresh -vvvv vg/lv

--- Additional comment from Nir Soffer on 2014-08-06 18:39:32 EDT ---

Kevin, can you take a look at this? If this is not lvm, this must be qemu :-)

--- Additional comment from Nir Soffer on 2014-08-06 18:55:42 EDT ---

Additional info: I failed to reproduce this without running a vm using vdsm.

1. Running a vm using qemu

I used a command line similar to the command line used by libvirt when running a vm using vdsm (see start-vm.sh).

I created a pv, vg and lv of 1G, and on top of it a qemu image of 10G, using the same parameters used by vdsm. I started the vm from PXE and installed Fedora 20.
This flow will cause the vm to pause when using vdsm.

While the installer was running, I extended the lv on another host and refreshed the lv on the host where qemu was running. This is a good simulation of the automatic extend done by vdsm.

I could not reproduce the issue on Fedora 19, Fedora 20 and RHEL 7.0.

2. Using dd to write to an lv

I created a pv, vg and lv using the same parameters used by vdsm. Then I ran dd, copying a few gigabytes to the lv. While dd was running, I extended the lv from another host and refreshed the lv on the host where dd was running.

I could not reproduce any error; dd always completed successfully.

You can find the scripts used to test in reproduce.tar.gz

--- Additional comment from Nir Soffer on 2014-08-06 19:02:03 EDT ---

Files:
- conf.sh - configuration used by other scripts (simulates the vdsm configuration)
- extend.sh - script for extending the lv on another host
- refresh.sh - script for refreshing the lv on the machine running qemu
- write.sh - script for running dd and refreshing the lv while dd is running
- start-vm.sh - script for starting the vm, using most vdsm parameters
- start-vm-vdsm.sh - copied from /var/log/libvirt/qemu/vm.log - this is how vdsm is running qemu through libvirt

--- Additional comment from Francesco Romani on 2014-08-08 03:15:59 EDT ---

Investigating.

May be relevant: https://bugs.launchpad.net/qemu/+bug/1284090

--- Additional comment from Francesco Romani on 2014-08-08 03:34:56 EDT ---

There is little I can add so far to the analysis Nir did. VDSM is reacting correctly to the events libvirt (on behalf of QEMU) is sending.

Maybe on RHEL (where the issue cannot be reproduced) a 'resume' event is sent after the lvchange? Sharing the RHEL libvirtd/qemu logs could be helpful.

We need to make sure first that the lower levels of the stack (qemu, lvm) are behaving correctly.
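For reference, the refresh command exercised by refresh.sh above can be assembled as in the following sketch. This is not vdsm source; the helper name is hypothetical, but the --config string is the one shown verbatim in the refresh.sh transcripts later in this bug. The key detail is use_lvmetad=0, which forces lvm to bypass the lvmetad cache and reread metadata from storage.

```python
# Sketch (hypothetical helper, not vdsm source): build the lvchange
# invocation refresh.sh uses to mimic vdsm, per the refresh.out output.
VDSM_LVM_CONFIG = (
    "devices { ignore_suspended_devices=1 write_cache_state=0 "
    "disable_after_error_count=3 obtain_device_list_from_udev=0 } "
    "global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 "
    "use_lvmetad=0 } "
    "backup { retain_min = 50 retain_days = 0 }"
)

def refresh_command(vg_name, lv_name):
    """Return the argv for refreshing one LV, vdsm-style."""
    return [
        "lvchange",
        "--config", VDSM_LVM_CONFIG,
        "--refresh",
        "%s/%s" % (vg_name, lv_name),
    ]
```

Without the --config override, on hosts where lvmetad is active, a plain `lvchange --refresh` may not touch the storage at all (see comment 22 below regarding this pitfall).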
--- Additional comment from Nir Soffer on 2014-08-08 08:35:24 EDT ---

(In reply to Francesco Romani from comment #6)
> Maybe on RHEL (where the issue cannot be reproduced) a 'resume' event is
> sent after the lvchange? Sharing the RHEL libvirtd/qemu logs could be
> helpful.

Note that I can trigger this outside of vdsm - it is not related to what vdsm does after extending a disk.

--- Additional comment from Francesco Romani on 2014-08-13 10:27:49 EDT ---

Removed from blockers for 3.5. Reasons:
- it happens also on oVirt 3.4, so it is not a 3.5 regression
- all the evidence so far points to lower levels of the stack (qemu, lvm)
- there is little (if any) room for workarounds in VDSM

That said, this bug is still very high priority and not by any means less serious.

--- Additional comment from Kevin Wolf on 2014-08-13 11:09:11 EDT ---

A difference between RHEL and upstream is that RHEL includes more information in the BLOCK_IO_ERROR event. Specifically, it includes a field that informs the management tool about the error code (__com.redhat_reason). As far as I know, this is what is checked by VDSM in order to determine whether we have an ordinary I/O error or got an -ENOSPC and need to resize the LV. No idea whether it also affects how the restart of the VM is done.

The missing error code fields also mean that it's hard to see from the logs what the exact error is that happened (this is what comment 5 refers to).

--- Additional comment from Nir Soffer on 2014-08-13 11:35:44 EDT ---

(In reply to Kevin Wolf from comment #9)

Kevin, what is the next step? How can we get more info from qemu to understand this failure?

--- Additional comment from Nir Soffer on 2014-08-13 11:37:38 EDT ---

Adding back the needinfo for Zdenek.

--- Additional comment from Francesco Romani on 2014-08-13 12:08:07 EDT ---

(In reply to Kevin Wolf from comment #9)
> A difference between RHEL and upstream is that RHEL includes more information
> in the BLOCK_IO_ERROR event.
> Specifically, it includes a field that informs the
> management tool about the error code (__com.redhat_reason). As far as I know,
> this is what is checked by VDSM in order to determine whether we have an
> ordinary I/O error or got an -ENOSPC and need to resize the LV. No idea
> whether it also affects how the restart of the VM is done.

That's probably it.

The error code is what VDSM uses to trigger the automatic resize of the volume (if it is 'ENOSPC', it does the resize), and this explains why the issue couldn't be reproduced on RHEL.

VDSM *does* resume the VM ('continue'), but only *after* a successful disk extension (in vdsm/virt/vm.py:__afterVolumeExtension).

The lack of an error reason doesn't put this chain of actions in motion.

--- Additional comment from Nir Soffer on 2014-08-13 13:14:03 EDT ---

(In reply to Francesco Romani from comment #12)
> (In reply to Kevin Wolf from comment #9)
> > A difference between RHEL and upstream is that RHEL includes more
> > information in the BLOCK_IO_ERROR event. Specifically, it includes a field
> > that informs the management tool about the error code (__com.redhat_reason).
> > As far as I know, this is what is checked by VDSM in order to determine
> > whether we have an ordinary I/O error or got an -ENOSPC and need to resize
> > the LV. No idea whether it also affects how the restart of the VM is done.
>
> That's probably it.
>
> The error code is what VDSM uses to trigger the automatic resize of the
> volume (if it is 'ENOSPC', it does the resize), and this explains why the
> issue couldn't be reproduced on RHEL.
>
> VDSM *does* resume the VM ('continue'), but only *after* a successful
> disk extension (in vdsm/virt/vm.py:__afterVolumeExtension).
>
> The lack of an error reason doesn't put this chain of actions in motion.

I don't think this is related. Refreshing the lv causes a BLOCK_IO_ERROR in qemu, *after* a successful extend or without any extend.
This error causes an abnormal stop that cannot be recovered without restarting the vm. We should understand why we get this error.

--- Additional comment from Francesco Romani on 2014-08-19 11:07:04 EDT ---

Eric, what is puzzling is that apparently this issue could not be reproduced outside VDSM, even using the same QEMU command line. The biggest missing part is libvirt. Is it possible that the interaction between QEMU and libvirt has a role here?

--- Additional comment from Eric Blake on 2014-08-19 13:08:50 EDT ---

What event API is VDSM using to track the IO error cause? If the code is using virConnectDomainEventRegisterAny() with VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON, then on RHEL-based qemu you will get one of four strings ("enospc", "eio", "eperm", "eother"), and on Fedora-based qemu you will always get one string ("").

Upstream qemu is considering adding a reason field (to match what downstream RHEL has already had for a couple of years), but right now the debate is on whether it has to be a full-featured string or whether a simple boolean for nospace is sufficient. Libvirt is just acting as a passthrough for the reason field.

--- Additional comment from Eric Blake on 2014-08-19 13:16:30 EDT ---

(In reply to Francesco Romani from comment #12)
> The error code is what VDSM uses to trigger the automatic resize of the
> volume (if it is 'ENOSPC', it does the resize), and this explains why the
> issue couldn't be reproduced on RHEL.
>
> VDSM *does* resume the VM ('continue'), but only *after* a successful
> disk extension (in vdsm/virt/vm.py:__afterVolumeExtension).
>
> The lack of an error reason doesn't put this chain of actions in motion.

So if I understand correctly, when libvirt gives a reason of "" (because you are using a qemu that doesn't provide a reason), you treat it as a fatal error, regardless of whether the guest was paused due to ENOSPC vs. paused due to some other reason?
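The decision described in comments 9 and 12 can be sketched as follows. The handler is the decision logic only (hypothetical helper, not vdsm source); the commented wiring shows the event registration Eric mentions, which needs a running libvirtd, the libvirt-python bindings, and a running libvirt event loop, so it is left as a comment.

```python
# Sketch of how a management app consumes libvirt's IO error reason:
# on RHEL qemu the reason is one of "enospc"/"eio"/"eperm"/"eother";
# on upstream qemu it is always "" (per comment 15).
def handle_io_error_reason(reason):
    """Decide what to do with a BLOCK_IO_ERROR, given libvirt's reason."""
    if reason == "enospc":
        return "extend"   # thin-provisioned LV is full: extend, then resume
    return "fatal"        # "" (no reason), "eio", "eperm", "eother": stay paused

# Wiring sketch (needs libvirtd and an event loop; callback signature per
# virConnectDomainEventIOErrorReasonCallback):
#
# import libvirt
# libvirt.virEventRegisterDefaultImpl()
# def cb(conn, dom, src_path, dev_alias, action, reason, opaque):
#     print(dom.name(), dev_alias, handle_io_error_reason(reason))
# conn = libvirt.openReadOnly("qemu:///system")
# conn.domainEventRegisterAny(
#     None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON, cb, None)
```

With an empty reason string, the handler above can only take the "fatal" branch, which is exactly the behavior Francesco confirms in the next comment.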
--- Additional comment from Francesco Romani on 2014-08-21 06:59:34 EDT ---

(In reply to Eric Blake from comment #16)
> (In reply to Francesco Romani from comment #12)
> > The error code is what VDSM uses to trigger the automatic resize of the
> > volume (if it is 'ENOSPC', it does the resize), and this explains why the
> > issue couldn't be reproduced on RHEL.
> >
> > VDSM *does* resume the VM ('continue'), but only *after* a successful
> > disk extension (in vdsm/virt/vm.py:__afterVolumeExtension).
> >
> > The lack of an error reason doesn't put this chain of actions in motion.
>
> So if I understand correctly, when libvirt gives a reason of "" (because you
> are using a qemu that doesn't provide a reason), you treat it as a fatal
> error, regardless of whether the guest was paused due to ENOSPC vs. paused
> due to some other reason?

This is correct: VDSM has no means to distinguish what happened.

We cannot just use werror=enospc in QEMU (of course passing through libvirt), because VDSM needs to be aware of the other errors.

--- Additional comment from Francesco Romani on 2014-08-21 07:17:25 EDT ---

I spent quite some time reproducing this issue locally, and I believe I reached a stable point.

Quick summary:
- VDSM using upstream QEMU cannot transparently extend the volume because of the lack of the 'reason' field. This is not a VDSM bug; the QEMU bug was filed upstream as stated in https://bugzilla.redhat.com/show_bug.cgi?id=1127460#c5. We need a fix in QEMU.
- the LV refresh issue seems unrelated and not an issue

Now I'm going to explain the above points in more detail.

+++ This is how the flow is supposed to work:

flow#1. VDSM runs a VM normally on a thin-provisioned, qcow2-formatted block device using LVM. Through libvirt, qemu is configured with werror=stop.
flow#2. the drive runs out of space
flow#2.a. QEMU stops the VM (instructed by VDSM at point #1)
flow#2.b. QEMU reports a BLOCK_IO_ERROR with reason='enospc' to signal the space exhausted
flow#3.
libvirt just translates the monitor event into its own event format
flow#4. VDSM detects the event
flow#4.a. runs lvextend $options on the affected LV
flow#4.b. runs lvchange $options on the affected LV
flow#4.c. un-pauses the VM using the 'continue' command
flow#5. the VM restarts

Please note: if no one sends, through the QEMU monitor, the 'continue' command (flow#4.c), the VM will remain paused! And this is what I believe happened in the original report.

+++ What I believe happened in the original report:

We certainly saw this:
"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "operation": "write", "action": "stop"}}

In the reproduce scripts/logs I see werror=stop in the QEMU command line.

AFAIK the only source of BLOCK_IO_ERROR is a failed write (I looked at the QEMU sources), so this confirms that QEMU ran out of space and triggered the (broken) flow above.

Please note that timing is critical here: if the LV is extended before QEMU hits the limit, then AFAIK it will happily run without issue.

This fully explains what happened in the original report.

+++ Impact of LV refresh

There is a pending question, re-stated in https://bugzilla.redhat.com/show_bug.cgi?id=1127460#c3: can an LV refresh *alone* harm QEMU and make it stop? I don't see how it could be possible, but it is worth checking anyway. To verify this I did the following:
- create an LV, both on a physical disk and then through iSCSI
- create a qcow2 image on it
- make sure the image is big enough, so we *do not* run out of space and we do not hit the broken flow
- run QEMU using slightly amended reproduce scripts, using the relevant options used by VDSM
- install Fedora 20
- while QEMU is accessing the disk, continuously refresh the affected LV every 5s

This test passed cleanly on F20 and RHEL7.

+++ Conclusion

I believe the LV refresh lead is a red herring.
The issue here is the lack of a reason on BLOCK_IO_ERROR, which completely breaks the automatic extend flow. But that bug was reported, and it is way out of the control of VDSM. VDSM has no issue here.

Because of all the above, I'm going to decrease the priority of this bug.

--- Additional comment from Francesco Romani on 2014-08-21 07:26:43 EDT ---

Scripts and logs to verify whether the LV refresh is harming QEMU.

scripts/chainrefresh.sh - runs the refresh every 5s and logs the output
scripts/start-vm-vdsm2.sh - runs QEMU using (almost) the same parameters as VDSM. Amended the SPICE parameters and the paths, not relevant to this BZ
scripts/start-vm2.sh - simpler QEMU invocation, used only once on F20
scripts/conf.sh - parameters: LVM config, LV and VG paths

logs:
f20_simple - run of start-vm2.sh on an F20 host, virt-preview repo enabled, LV on a physical disk
f20_vdsm - like above, but using start-vm-vdsm2.sh
rhel7 - like above, on RHEL7, using stock packages plus the RHEV repo

In each log dir:
refresh.log - output of scripts/chainrefresh.sh
qemu_mon.log - transcript of the QEMU monitor messages, either direct connection (f20_simple) or through qmp-shell

--- Additional comment from Francesco Romani on 2014-08-21 07:41:24 EDT ---

(In reply to Francesco Romani from comment #18)
> AFAIK the only source of BLOCK_IO_ERROR is a failed write (I looked at the
> QEMU sources), so this confirms that QEMU ran out of space and triggered
> the (broken) flow above.

I need to correct myself. What I meant here is that only a failed write can trigger a BLOCK_IO_ERROR as it was reported. There are indeed other possible sources of errors, like failed reads. However, this is a minor point; the core point is that there is no evidence that an LV refresh alone can cause an I/O error.

--- Additional comment from Francesco Romani on 2014-08-21 07:47:26 EDT ---

A workaround does exist for non-RHEL hosts. For CentOS, there is a qemu-kvm-rhev package which does report the 'reason' field, thus the automatic extend can take place.
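The recovery steps in flow#4 above can be sketched as follows. This is an illustrative sketch, not vdsm source: the helper name and the extend size are assumptions, and in vdsm the lvextend runs on the SPM host while the resume goes through libvirt rather than raw QMP. The QMP command that un-pauses a guest is "cont".

```python
import json

# Sketch of flow#4.a-c: after a BLOCK_IO_ERROR with reason='enospc',
# extend the LV, refresh it on the host running the VM, then resume
# the guest through the monitor.
def recovery_steps(vg, lv, extend_size="1g"):
    """Return the host commands plus the QMP message for flow#4.a-c."""
    lv_path = "%s/%s" % (vg, lv)
    return {
        "flow#4.a": ["lvextend", "--size", "+" + extend_size, lv_path],
        "flow#4.b": ["lvchange", "--refresh", lv_path],
        "flow#4.c": json.dumps({"execute": "cont"}),
    }
```

If step flow#4.c never runs (as happens when the reason field is missing and the extend flow is never triggered), the VM stays paused, which is exactly the behavior in the original report.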
--- Additional comment from Nir Soffer on 2014-08-21 10:28:36 EDT ---

(In reply to Francesco Romani from comment #20)
> However, this is a minor point; the core point is that there is no evidence
> that an LV refresh alone can cause an I/O error.

Francesco, I think you are running the lvm refresh incorrectly:

lvchange -vvvvvv --refresh $vg_name/$lv_name

On Fedora and EL7, this command does nothing, because the lvmetad daemon is caching metadata. vdsm uses the use_lvmetad=0 option to actually go to the storage. You must run lvm commands with the --config options used by vdsm; this is why I'm using conf.sh. And you should also create the pvs, vgs, and lvs using the same parameters that vdsm is using.

Please see attachment 924640 [details] for the details.

--- Additional comment from Nir Soffer on 2014-08-21 12:20:37 EDT ---

Francesco, can you confirm that you tested the refresh incorrectly? See comment 22.

--- Additional comment from Francesco Romani on 2014-08-21 19:30:38 EDT ---

(In reply to Nir Soffer from comment #23)
> Francesco, can you confirm that you tested the refresh incorrectly? See
> comment 22.

Yes. I hadn't used the --config option and the options you originally used. I'll do some testing again.

--- Additional comment from Michal Skrivanek on 2014-08-22 04:17:23 EDT ---

- the reported reason is not relevant anymore, as we ship qemu-kvm-rhev for all platforms
- there's no report about this issue in any known environment - hence decreasing the urgency
- there's no data corruption; the "worst" case is that the VM gets paused - hence decreasing the severity
- possible issue in the resize and/or lvm refresh flows - hence moving to storage

Not a blocker for 3.5 at this point.

--- Additional comment from Kevin Wolf on 2014-08-22 05:41:06 EDT ---

(In reply to Francesco Romani from comment #18)
> - VDSM using upstream QEMU cannot transparently extend the volume because of the lack
Not VDSM bug, QEMU bug upstream filed as stated into > https://bugzilla.redhat.com/show_bug.cgi?id=1127460#c5 ; We need fix on QEMU. > > [...] > > +++ This is how the flow is supposed to work: > flow#1. VDSM runs a VM normally on a thin-provisioned, qcow2 formatted, > block device using LVM. Through libvirt, qemu is configured with werror=stop. > flow#2. the drive runs out of space > flow#2.a. QEMU stops the VM (instructed by VDSM at point #1) > flow#2.b. QEMU reports a BLOCK_IO_ERROR with reason='enospc' to signal the > space exausted Please note that this is already not the regular flow, but the backup solution. Generally management should try to extend the LVs early enough that qemu never runs into an ENOSPC condition. In order to achieve this, it uses query-block results, specifically the high watermark. If VDSM wants to be able to cope with qemu versions that don't include an error reason (which are all upstream versions up to now), it could still simply check the high watermark after each I/O error and if it's close to the LV size, it could resize and give it a try if the VM can resume. --- Additional comment from Nir Soffer on 2014-08-24 06:32:14 EDT --- (In reply to Michal Skrivanek from comment #25) > the reported reason is not relevant anymore as we ship qemu-kvm-rhev for all > platforms > there's no report about this issue in any known environment > - hence decreasing the urgency This is a report from my (known?) environment. Do you suggest to wait until users complain about it? > possible issue in resize and/or lvm refresh flows > - hence moving to storage It is not related to resize, this bug show how you can trigger a pause by refreshing an lv. There is no problem with storage - we do extend the disk, but qemu is pausing the vm after the extend completed successfully. - hence returning to virt --- Additional comment from Francesco Romani on 2014-08-25 02:08:31 EDT --- (In reply to Nir Soffer from comment #1) > Zdenek, can you take a look at this? 
Is it possible that lvm is doing something wrong that causes an error in qemu?
>
> Please look in refresh.out - it contains the output of
> lvchange --refresh -vvvv vg/lv

Ping?

--- Additional comment from Francesco Romani on 2014-08-25 05:34:00 EDT ---

Looks like SELinux could be the culprit here. I did the following test with SELinux *disabled* (see the last line) on RHEL7:

GENji> 11:28:54 root [/home/fromani/bz1127460/reproduce]$ ./refresh.sh
refreshing lv...
lvchange --config devices { ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } --refresh 8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
lv size:
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
  9.00g
GENji> 11:28:58 root [/home/fromani/bz1127460/reproduce]$ date
Mon Aug 25 11:29:27 CEST 2014
GENji> 11:29:27 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573
lrwxrwxrwx. 1 root root 8 Aug 25 11:28 /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573 -> ../dm-10
GENji> 11:30:14 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/8ceb838a-4eaudit2why < /var/log/audit/audit.log > grep dm-10 | tail
GENji> 11:30:34 root [/home/fromani/bz1127460/reproduce]$ rm grep
rm: remove regular file ‘grep’?
y
GENji> 11:30:40 root [/home/fromani/bz1127460/reproduce]$ audit2why < /var/log/audit/audit.log | grep dm-10 | tail
type=AVC msg=audit(1408957266.925:3526): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958068.190:3612): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958366.925:3635): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958671.665:3671): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958679.231:3672): avc: denied { read } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958702.908:3680): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958749.529:3684): avc: denied { read } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958947.313:3713): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408958953.521:3735): avc: denied { read } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=AVC msg=audit(1408959007.604:3746): avc: denied { write } for pid=18698 comm="qemu-kvm" path="/dev/dm-10" dev="devtmpfs" ino=291267 scontext=system_u:system_r:svirt_t:s0:c301,c701 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
GENji> 11:30:47 root [/home/fromani/bz1127460/reproduce]$ date --date='@1408959007'
Mon Aug 25 11:30:07 CEST 2014
GENji> 11:31:01 root [/home/fromani/bz1127460/reproduce]$ getenforce
Permissive
GENji> 11:31:19 root [/home/fromani/bz1127460/reproduce]$ virsh list
 Id    Name                           State
----------------------------------------------------
 9     F20_C1                         running

GENji> 11:32:49 root [/home/fromani/bz1127460/reproduce]$ vdsClient localhost list table
56d1c657-dd76-4609-a207-c050699be5be  18698  F20_C1               Up

IIRC libvirt does some SELinux setup for the VM, and this could explain why the issue couldn't be reproduced running QEMU alone.

Nir, can you please confirm that disabling SELinux fixes the issue on your setup as well?

--- Additional comment from Zdenek Kabelac on 2014-08-25 05:50:50 EDT ---

Since you mention selinux here - we have been noticing for some time that running our lvm2 test suite on a selinux-enabled system slows the whole runtime of the test suite by a factor of 4 or even more - i.e. instead of 15 minutes it could be more than an hour - so there are issues to be resolved. We have had a discussion with D. Walsh on how to audit this, since lvm2 does some operations based on RHEL5 selinux usage - but with RHEL6, some things are done differently.

--- Additional comment from Michal Skrivanek on 2014-08-25 08:20:19 EDT ---

Can you please check from libvirt's point of view?
Seems that's our next lead...

--- Additional comment from Francesco Romani on 2014-08-26 03:41:58 EDT ---

I'd like to add a few more details. The following applies to a stock RHEL7; I'd like to reiterate that https://bugzilla.redhat.com/show_bug.cgi?id=1127460#c29 was also referring to a stock RHEL7.

So, on RHEL7 with SELinux *enabled* (everything as default), we have:

GENji> 09:36:03 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573
lrwxrwxrwx. 1 root root 8 Aug 26 09:34 /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573 -> ../dm-27
GENji> 09:36:13 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/dm-
ls: cannot access /dev/dm-: No such file or directory
GENji> 09:36:19 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/dm-2
brw-rw----. 1 root disk 253, 2 Aug 26 09:34 /dev/dm-2
GENji> 09:36:20 root [/home/fromani/bz1127460/reproduce]$ ls -lh /dev/dm-27
brw-rw----. 1 vdsm qemu 253, 27 Aug 26 09:36 /dev/dm-27
GENji> 09:36:22 root [/home/fromani/bz1127460/reproduce]$ ls -lhZ /dev/dm-27
brw-rw----. vdsm qemu system_u:object_r:svirt_image_t:s0:c575,c891 /dev/dm-27
GENji> 09:36:26 root [/home/fromani/bz1127460/reproduce]$ virsh list
 Id    Name                           State
----------------------------------------------------
 2     F20_C1                         running

GENji> 09:36:35 root [/home/fromani/bz1127460/reproduce]$ ./refresh.sh
refreshing lv...
lvchange --config devices { ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } --refresh 8ceb838a-4e74-420d-b1e2-817c0e9f8eea/aa39c714-3095-4b86-af56-a2666849a573
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
lv size:
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
  11.00g
GENji> 09:36:40 root [/home/fromani/bz1127460/reproduce]$ ls -lhZ /dev/dm-27
brw-rw----. vdsm qemu system_u:object_r:fixed_disk_device_t:s0 /dev/dm-27
GENji> 09:36:41 root [/home/fromani/bz1127460/reproduce]$ virsh list
 Id    Name                           State
----------------------------------------------------
 2     F20_C1                         paused

As this excerpt shows, the LV refresh makes the device lose its SELinux label.

Proper, required labeling is ensured by libvirt and documented here: http://libvirt.org/drvqemu.html

I'd like to quote in particular: "Likewise physical block devices must be labelled system_u:object_r:virt_image_t."

The incorrect labeling after the LV refresh will prevent any further I/O operation from QEMU.

I have every reason to believe that the above also applies to F20 and F19, as the original bug report documents. Will verify on Fedora ASAP.

--- Additional comment from Francesco Romani on 2014-08-26 04:29:44 EDT ---

Same on stock Fedora 20:

[root@benji reproduce]# ls -lhZ /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/94856712-7503-466e-8586-6b66981b7b23
lrwxrwxrwx. root root system_u:object_r:device_t:s0 /dev/8ceb838a-4e74-420d-b1e2-817c0e9f8eea/94856712-7503-466e-8586-6b66981b7b23 -> ../dm-8
[root@benji reproduce]# ls -lhZ /dev/dm-8
brw-rw----. vdsm qemu system_u:object_r:svirt_image_t:s0:c516,c990 /dev/dm-8
[root@benji reproduce]# chmod 0755 refresh.sh
[root@benji reproduce]# ./refresh.sh
refreshing lv...
lvchange --config devices { ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } --refresh 8ceb838a-4e74-420d-b1e2-817c0e9f8eea/94856712-7503-466e-8586-6b66981b7b23
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
lv size:
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
  2.00g
[root@benji reproduce]# ls -lhZ /dev/dm-8
brw-rw----.
vdsm qemu system_u:object_r:fixed_disk_device_t:s0 /dev/dm-8
[root@benji reproduce]# cat /etc/redhat-release
Fedora release 20 (Heisenbug)
[root@benji reproduce]# getenforce
Enforcing
[root@benji reproduce]#

--- Additional comment from Sven Kieske on 2014-08-28 04:02:12 EDT ---

(In reply to Michal Skrivanek from comment #25)
> the reported reason is not relevant anymore as we ship qemu-kvm-rhev for all
> platforms

I'm sorry, but this seems not to be true for EL6: there is a jenkins job running which builds these packages, but they don't get installed by default on CentOS 6.5. Instead, the original qemu-kvm package provided by the CentOS repo is used, which lacks some features. So there are still some steps missing in providing the mentioned package via the official ovirt repo.

--- Additional comment from Allon Mureinik on 2014-08-28 04:31:19 EDT ---

(In reply to Sven Kieske from comment #34)
> (In reply to Michal Skrivanek from comment #25)
> > the reported reason is not relevant anymore as we ship qemu-kvm-rhev for
> > all platforms
>
> I'm sorry, but this seems not to be true for EL6:

The fix for bug 1127763 makes vdsm depend on qemu-kvm-rhev, which outdates qemu-kvm, so yum installing VDSM should pull it in.

I verified this behavior with vdsm-14.6.2 on CentOS 6.5 - if you have seen different behavior, that's a bug - could you please file one with all the details about the OS and the vdsm you're installing?

--- Additional comment from Sven Kieske on 2014-08-28 05:13:59 EDT ---

(In reply to Allon Mureinik from comment #35)
> (In reply to Sven Kieske from comment #34)
> > (In reply to Michal Skrivanek from comment #25)
> > > the reported reason is not relevant anymore as we ship qemu-kvm-rhev for
> > > all platforms
> >
> > I'm sorry, but this seems not to be true for EL6:
>
> The fix for bug 1127763 makes vdsm depend on qemu-kvm-rhev, which outdates
> qemu-kvm, so yum installing VDSM should pull it in.
>
> I verified this behavior with vdsm-14.6.2 on Centos 6.5 - if you have seen
> different behavior, that's a bug - could you please file one with all the
> details about the OS and the vdsm you're installing?

Okay, sorry, then my information was just outdated, as I'm not running bleeding-edge RC software on my production environment ;)

When I look at this bug, it is only fixed for the 3.5 RC.

Will this get backported to 3.4.4/5? I'm currently still running stuff from the 3.3.z repos (locally mirrored) and just plan the upgrade to 3.4.z.

So I guess for most real-world deployments your statement is still not true / makes people think it should work. But for the upcoming 3.5 release it might be correct; I can't test this atm.

--- Additional comment from Nir Soffer on 2014-08-28 09:18:38 EDT ---

(In reply to Francesco Romani from comment #33)

Francesco, about the libvirt selinux setup:

1. Is this new behavior? Is it used on RHEL 6.5?
2. Do we enable this feature, or is it used by default?
3. Can we disable this "feature"?

I think that libvirt setting a selinux label on *our* device is wrong. We should control our devices' permissions and selinux context, and libvirt must use what we provided.

We set udev rules to set the permissions and ownership of volumes before they are used. So a possible fix may be to modify these rules to add the selinux context.
    def appropriateDevice(self, guid, thiefId):
        ruleFile = _UDEV_RULE_FILE_NAME % (guid, thiefId)
        rule = 'SYMLINK=="mapper/%s", OWNER="%s", GROUP="%s"\n' % (
            guid, DISKIMAGE_USER, DISKIMAGE_GROUP)
        with open(ruleFile, "w") as rf:
            self.log.debug("Creating rule %s: %r", ruleFile, rule)
            rf.write(rule)

--- Additional comment from Allon Mureinik on 2014-08-28 14:42:38 EDT ---

(In reply to Sven Kieske from comment #36)
> (In reply to Allon Mureinik from comment #35)
> > (In reply to Sven Kieske from comment #34)
> > > (In reply to Michal Skrivanek from comment #25)
> > > > the reported reason is not relevant anymore as we ship qemu-kvm-rhev for all
> > > > platforms
> > >
> > > I'm sorry but this seems not to be true for EL6:
> > The fix for bug 1127763 makes vdsm depend on qemu-kvm-rhev, which outdates
> > qemu-kvm, so yum installing VDSM should pull.
> >
> > I verified this behavior with vdsm-14.6.2 on Centos 6.5 - if you have seen a
> > different behavior, that a bug - could you please file one with all the
> > details about the OS and the vdsm you're installing?
>
> Okay sorry, than my information was just outdated
> as I'm not running bleeding edge RC software on my production environment ;)
>
> When I look at this bug it is just fixed for 3.5 RC.
>
> Will this get backported to 3.4.4/5 ?

I've cloned bug 1127763 to bug 1135061 to track the backporting of this issue. It may require some work, but I don't see any reason why we can't do it.

--- Additional comment from Jaroslav Suchanek on 2014-08-29 09:33:55 EDT ---

Jiri, can you please comment on it? Thanks.

--- Additional comment from Jiri Denemark on 2014-08-29 11:52:45 EDT ---

So from comment 33, it looks like the SELinux label is lost when the LV is refreshed. If it's not possible to fix that, I think the only solution is to create a udev rule that would restore the label, to make sure it is restored as soon as possible.
However, unless vdsm uses static SELinux labels, the udev rule will have to be changed every time a domain is started/stopped.

I guess the main question here is why does the label disappear? Is it because the device is removed and recreated during refresh, or does something call restorecon? And can whatever causes it be avoided?

--- Additional comment from Nir Soffer on 2014-09-01 13:55:44 EDT ---

Zdenek, can you explain why a device loses the selinux label after refresh? Is this expected? A bug?

--- Additional comment from Nir Soffer on 2014-09-01 14:14:03 EDT ---

Assuming that we do need to keep a selinux label on the device, we can use the new SECLABEL{module} key. The feature is available in systemd git, and hopefully will be backported to RHEL 7.

The vdsm rule should look like this:

SYMLINK=="mapper/xyz", OWNER="vdsm", GROUP="kvm", SECLABEL{selinux}="virt_image_t"

Until this feature is available, this should also work:

SYMLINK=="mapper/xyz", OWNER="vdsm", GROUP="kvm", RUN+="/bin/chcon -t virt_image_t $env{DEVNAME} $env{DEVLINKS}"

Although it seems that this did not work for the Oracle folks having a similar issue, they were running the chcon command from a script instead of directly in the udev rule. See bug 1015300 for more info.

--- Additional comment from Zdenek Kabelac on 2014-09-02 07:25:48 EDT ---

LVM2 is not setting any selinux labels - it's all now happening in udev - so everything goes through udev rules. There is a template-like rule file, 12-dm-permissions.rules, where you can see how the se-labels can be handled.

Also, using 'SYMLINK==' is no way to go - check ENV{DM_VG_NAME} and the other already-set vars.

Another thing which might be worth checking/testing here is the lvm.conf option:

devices {
    disable_after_error_count = 1
}

This should help to eliminate a failing device noticeably faster.
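The two rule variants above could be emitted from rule-writing code in the same style as the appropriateDevice snippet quoted earlier in this thread. A minimal sketch - the function name build_udev_rule and the use_seclabel switch are illustrative assumptions, not vdsm's actual code:

```python
# Hypothetical sketch: build a udev rule that restores the SELinux label
# after an LV refresh. Whether SECLABEL{selinux} is honored depends on the
# udev/systemd version, so a RUN+=/bin/chcon fallback is also shown.
def build_udev_rule(guid, owner="vdsm", group="kvm", use_seclabel=True):
    base = 'SYMLINK=="mapper/%s", OWNER="%s", GROUP="%s"' % (guid, owner, group)
    if use_seclabel:
        # Requires udev with SECLABEL support (systemd git at the time of the bug)
        return base + ', SECLABEL{selinux}="virt_image_t"\n'
    # Fallback: relabel explicitly once the device node appears
    return base + ', RUN+="/bin/chcon -t virt_image_t $env{DEVNAME}"\n'

print(build_udev_rule("xyz"))
print(build_udev_rule("xyz", use_seclabel=False))
```

As Zdenek notes in the next comment, matching on ENV{DM_VG_NAME}/ENV{DM_LV_NAME} would be more robust than SYMLINK==; the SYMLINK form is kept here only to mirror the existing vdsm rule shown above.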
--- Additional comment from Nir Soffer on 2014-09-02 13:20:01 EDT ---

(In reply to Zdenek Kabelac from comment #43)
> LVM2 is not setting any selinux labels - it's all now happening in udev

Sure, but is it expected that after a refresh (lvchange --refresh), a label set on the device will be lost?

We see this error only on Fedora 19, Fedora 20 and RHEL 7, but not on RHEL 6. Is lvm refresh different on these versions?

--- Additional comment from Nir Soffer on 2014-09-14 14:48:26 EDT ---

--- Additional comment from Nir Soffer on 2014-09-15 02:22:15 EDT ---

The udev rule mentioned in comment 37 is not relevant; we use those only for direct LUNs. vdsm images get their permissions from 12-vdsm-lvm.rules.

--- Additional comment from Allon Mureinik on 2014-09-16 08:42:50 EDT ---

So what's the verdict here? Do we need a new SELinux policy?

--- Additional comment from Nir Soffer on 2014-09-16 09:22:05 EDT ---

(In reply to Allon Mureinik from comment #47)
> So what's the verdict here?
> Do we need a new SELinux policy?

No. We are blocked on:
- getting an answer to comment 44
- understanding why it works on RHEL 6.5

We can fix this corner by adding the libvirt selinux label to our images. I tested it, and it does prevent the pausing on refresh. However, this may break other flows, like reading or writing data to an image using qemu or dd.

--- Additional comment from Zdenek Kabelac on 2014-09-16 10:47:08 EDT ---

(In reply to Nir Soffer from comment #44)
> (In reply to Zdenek Kabelac from comment #43)
> > LVM2 is not setting any selinux labels - it's all now happening in udev
> Sure, but is it expected that after a refresh (lvchange --refresh), a label
> set on the device will be lost?
>
> We see this error only on Fedora 19, Fedora 20 and RHEL 7, but not on RHEL
> 6. Is lvm refresh different on these versions?

Nope - lvrefresh hasn't been changed for a long time - it's just a suspend/resume per active LV. All the selinux label magic is hidden in the udev rules processing.
--- Additional comment from Francesco Romani on 2014-09-23 05:49:26 EDT ---

(In reply to Nir Soffer from comment #48)
> We can fix this corner by adding libvirt selinux label on our images. I
> tested and it does prevent the pausing on refresh. However this may break
> other flows like reading or writing data to an image using qemu or dd.

I agree. The root cause is not yet sorted out. I believe the most likely cause is a udev policy change - maybe due to systemd? - but it hasn't been pinpointed yet.

The real fix is probably against the udev rules. If so, I'm not convinced VDSM is the right place to deliver this fix. However, I will work in this direction.

--- Additional comment from Sandro Bonazzola on 2014-09-24 04:11:24 EDT ---

Re-targeting, since 3.4.4 has been released and only security/critical fixes will be allowed for 3.4.z. 3.5.0 is also in the blockers-only phase, so re-targeting to 3.5.1.

--- Additional comment from Nir Soffer on 2014-09-28 15:35:46 EDT ---

Testing on RHEL 6.5 shows:
1. Libvirt sets the same selinux label (svirt_image_t) on the block device backing the lv (e.g. /dev/dm-40)
2. The selinux label is *not* lost after refreshing the lv
3. There is no udev rule setting this label, so it is probably libvirt

So the question is why the selinux label is lost on RHEL 7 (and Fedora), and not on RHEL 6.5.

Zdenek, do you have any idea why this happens on RHEL 7 (and Fedora) and not on RHEL 6? Can you get someone from device mapper to look into this?

It seems that the only thing we (vdsm) can do is to update our lvm rules to add this selinux label to all standard vdsm images.

--- Additional comment from Zdenek Kabelac on 2014-09-29 03:40:19 EDT ---

There was no change for selinux in the lvm2 code base between RHEL 6 & 7 - in fact the code base is mostly equal, depending on which version the release is based on.
Looking at the vdsm package's udev rule file, it seems that even on RHEL 6 there was nothing selinux-related - the rules only set OWNER and GROUP - so I'd have guessed it's something selinux-policy based. I think a selinux expert is needed to answer your question. Maybe version 6 allowed setting the context from OWNER:GROUP, while version 7 takes it from the process (which is udev in this case); the fix should go along the lines of comment 42. (wow, we have number 42 in the answer :))

Note - looking at those huge udev rule matching options - I think it would be far easier to use some simple common LV name prefix (or VG name, if all vdsm volumes have their own VG) like "VDSM_" and match LV devices with the proper prefix (DM_VG_NAME, DM_LV_NAME).

--- Additional comment from Nir Soffer on 2014-09-29 08:14:56 EDT ---

--- Additional comment from Federico Simoncelli on 2014-09-30 05:50:40 EDT ---

For reference, the underlying bug is bug 1147910.

--- Additional comment from Nir Soffer on 2014-10-01 08:22:00 EDT ---

We have a temporary solution merged - we are waiting for a real fix from systemd, but do not depend on it any more.

--- Additional comment from Allon Mureinik on 2014-10-01 11:24:52 EDT ---

(In reply to Nir Soffer from comment #56)
> We have temporary solution merged - we are waiting for a real fix from
> systemd, but do not depend on it any more.

Let's please open a bug on VDSM to remind us to consume the relevant RPM when it's fixed.
Verified to be working on:
Red Hat Enterprise Virtualization Manager Version: 3.5.0-0.14.beta.el6ev
VDSM: vdsm-4.16.6-1.el7.x86_64
On host with OS: Red Hat Enterprise Linux Server release 7.0 (Maipo)
Please backport this patch, test it, and have a look:
http://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg05346.html

[Qemu-devel] [PATCH] block: extend BLOCK_IO_ERROR event with nospace ind

From: Luiz Capitulino
Subject: [Qemu-devel] [PATCH] block: extend BLOCK_IO_ERROR event with nospace indicator
Date: Fri, 29 Aug 2014 16:07:27 -0400

Management software, such as RHEV's vdsm, wants to be able to allocate disk space on demand. The basic use case is to start a VM with a small disk and then enlarge the disk when QEMU hits an ENOSPC condition. To this end, the management software has to be notified when QEMU encounters ENOSPC.

The solution implemented by this commit is simple: it extends BLOCK_IO_ERROR with a 'nospace' key, which is true when QEMU is stopped due to ENOSPC.

Note that support for querying this event is already present in query-block by means of the 'io-status' key. Also, the new 'nospace' BLOCK_IO_ERROR field shares the same semantics with 'io-status', which basically means that werror= has to be set to either 'stop' or 'enospc' to enable 'nospace'.

Finally, this commit also updates the 'io-status' key doc in the schema with a list of supported device models.

Signed-off-by: Luiz Capitulino <address@hidden>
---

Three important observations:

1. We've talked with oVirt and OpenStack folks. oVirt folks say that this implementation is enough for their use-case. OpenStack doesn't need this feature.

2. While testing this with a raw image on a (smaller) ext2 file mounted via the loopback device, I get half "Invalid argument" I/O errors and half "No space" errors. This means that half of the BLOCK_IO_ERROR events that are emitted for this test-case will have nospace=false and the other half nospace=true. I don't know why I'm getting those "Invalid argument" errors; can anyone from the block layer comment on this? I don't get them with a qcow2 image (I get nospace=true for all events).

3.
I think this should go via the block tree.

 block.c              | 22 ++++++++++++++--------
 qapi/block-core.json |  8 +++++++-
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 1df13ac..b334e35 100644
--- a/block.c
+++ b/block.c
@@ -3632,6 +3632,18 @@ BlockErrorAction bdrv_get_error_action(BlockDriverState *bs, bool is_read, int e
     }
 }

+static void send_qmp_error_event(BlockDriverState *bs,
+                                 BlockErrorAction action,
+                                 bool is_read, int error)
+{
+    BlockErrorAction ac;
+
+    ac = is_read ? IO_OPERATION_TYPE_READ : IO_OPERATION_TYPE_WRITE;
+    qapi_event_send_block_io_error(bdrv_get_device_name(bs), ac, action,
+                                   bdrv_iostatus_is_enabled(bs),
+                                   error == ENOSPC, &error_abort);
+}
+
 /* This is done by device models because, while the block layer knows
  * about the error, it does not know whether an operation comes from
  * the device or the block layer (from a job, for example).
@@ -3657,16 +3669,10 @@ void bdrv_error_action(BlockDriverState *bs, BlockErrorAction action,
          * also ensures that the STOP/RESUME pair of events is emitted.
          */
         qemu_system_vmstop_request_prepare();
-        qapi_event_send_block_io_error(bdrv_get_device_name(bs),
-                                       is_read ? IO_OPERATION_TYPE_READ :
-                                       IO_OPERATION_TYPE_WRITE,
-                                       action, &error_abort);
+        send_qmp_error_event(bs, action, is_read, error);
         qemu_system_vmstop_request(RUN_STATE_IO_ERROR);
     } else {
-        qapi_event_send_block_io_error(bdrv_get_device_name(bs),
-                                       is_read ? IO_OPERATION_TYPE_READ :
-                                       IO_OPERATION_TYPE_WRITE,
-                                       action, &error_abort);
+        send_qmp_error_event(bs, action, is_read, error);
     }
 }

diff --git a/qapi/block-core.json b/qapi/block-core.json
index fb74c56..567e0a6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -336,6 +336,7 @@
 #
 # @io-status: #optional @BlockDeviceIoStatus. Only present if the device
 #             supports it and the VM is configured to stop on errors
+#             (supported device models: virtio-blk, ide, scsi-disk)
 #
 # @inserted: #optional @BlockDeviceInfo describing the device if media is
 #            present
@@ -1569,6 +1570,11 @@
 #
 # @action: action that has been taken
 #
+# @nospace: #optional true if I/O error was caused due to a no-space
+#           condition. This key is only present if query-block's
+#           io-status is present, please see query-block documentation
+#           for more information (since: 2.2)
+#
 # Note: If action is "stop", a STOP event will eventually follow the
 # BLOCK_IO_ERROR event
 #
@@ -1576,7 +1582,7 @@
 ##
 { 'event': 'BLOCK_IO_ERROR',
   'data': { 'device': 'str', 'operation': 'IoOperationType',
-            'action': 'BlockErrorAction' } }
+            'action': 'BlockErrorAction', '*nospace': 'bool' } }

 ##
 # @BLOCK_JOB_COMPLETED
--
1.9.3
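[Editorial note] With the patch above applied, a management layer can tell an out-of-space pause apart from other I/O errors directly from the QMP event. A hedged sketch of how a consumer might use the new key - the function needs_extend is illustrative, not vdsm's actual event handler; only the event shape follows the patch:

```python
# Illustrative sketch (not vdsm code): decide whether a BLOCK_IO_ERROR
# QMP event indicates an out-of-space pause that warrants extending the LV.
# The 'nospace' key only exists with this patch applied (QEMU >= 2.2);
# older QEMUs omit it, so we default it to False.
def needs_extend(qmp_event):
    if qmp_event.get("event") != "BLOCK_IO_ERROR":
        return False
    data = qmp_event.get("data", {})
    return data.get("action") == "stop" and data.get("nospace", False)

# Event shaped like the one in the original report, plus the new key
event = {"event": "BLOCK_IO_ERROR",
         "data": {"device": "drive-virtio-disk0", "operation": "write",
                  "action": "stop", "nospace": True}}
print(needs_extend(event))  # True
```

Without the patch, the consumer cannot distinguish ENOSPC from, say, the SELinux-label failure this bug is actually about - which is why backporting it alone would not have fixed this report.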
This workaround can also reduce pauses and stops during lvextend under heavy concurrent guest I/O:

[irs]
volume_utilization_percent = 50
volume_utilization_chunk_mb = 2048
vol_size_sample_interval = 60

As you can see, by default we only check once per minute if extension is required. You could specify a smaller interval in /etc/vdsm/vdsm.conf to check more frequently. Also, you could increase the chunk_mb value to 2048 or 4096 so that extensions are bigger each time.
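[Editorial note] A toy illustration of how the two [irs] knobs above interact - the function plan_extension and its integer-percent check are illustrative assumptions, not vdsm's actual extension algorithm:

```python
# Toy illustration (not vdsm's actual code): with
# volume_utilization_percent = 50, an extension is requested once the
# allocated space is at least half used, and each extension adds
# volume_utilization_chunk_mb megabytes to the target size.
def plan_extension(used_mb, allocated_mb,
                   utilization_percent=50, chunk_mb=2048):
    if allocated_mb == 0 or used_mb * 100 // allocated_mb >= utilization_percent:
        return allocated_mb + chunk_mb  # new target size in MB
    return allocated_mb                 # no extension needed

print(plan_extension(400, 1024))  # 1024 (39% used, below threshold)
print(plan_extension(600, 1024))  # 3072 (58% used, extend by 2048)
```

A larger chunk_mb means fewer, bigger extensions, so a guest writing fast has more headroom between extend requests - at the cost of over-allocating thin volumes.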
Paolo, I assume you're the right person to address comments 3 and 4?
VM abnormal stop after LV refresh when using thin provisioning on block storage.

Regarding comment 4: in the vdsm source, vdsm/vdsm/config.py.in:

('volume_utilization_chunk_mb', '4096', None)

...

I have not yet found a better solution in my testing, but the workaround from comment 4, which I have already tested, works very well. I plan to work on a balancing patch across the three projects - libvirt, qemu-kvm and VDSM. Better suggestions are welcome. My testing is mainly on CentOS 6.x.
(In reply to sky from comment #4)
> This workaround can also reduce pauses and stops during lvextend under
> heavy concurrent guest I/O:
>
> [irs]
> volume_utilization_percent = 50
> volume_utilization_chunk_mb = 2048
> vol_size_sample_interval = 60
>
> As you can see, by default we only check once per minute if extension is
> required.

This is *not* the configuration we use (the default interval is 2 seconds), and changing it is not supported. Your VMs *will* pause if you use this configuration.
(In reply to sky from comment #3)
> Please backport this patch, test it, and have a look

This bug is not related to qemu; it was caused by an undocumented and backward-incompatible behavior change in udev. It was solved by modifying the vdsm udev rules.

Looks like you commented on the wrong bug.