+++ This bug was initially created as a clone of Bug #1404952 +++ +++ This bug was initially created as a clone of Bug #1354251 +++ Description of problem: A VM is unable to migrate from Host A to B due to EACCESS on a block device. 1. Destination (B) qemu process is created, opens block device DEV 2. In final stages of migration, qemu process closes device DEV 3. udev event is triggered rewriting the permissions of DEV 4. qemu process opens again device DEV before actually running the VM 5. qemu process fails to open DEV, receiving EACCESS as the permissions changed. Version-Release number of selected component (if applicable): udev-147-2.51.el6.x86_64 libvirt-0.10.2-29.el6_5.12.x86_64 qemu-kvm-0.12.1.2-2.415.el6_5.15.x86_64 (in function drives_reopen) How reproducible: 100% - thank you Roman Hodain/Gordon Watson Steps to Reproduce: # cat /etc/udev/rules.d/51-qemu.rules KERNEL=="sd*", GROUP="qemu" # chown root:root /dev/sd*; ls -l /dev/sd*; ./test.out /dev/sda; sleep 1; ls -l /dev/sd*; brw-rw----. 1 root root 8, 0 Jul 10 03:12 /dev/sda brw-rw----. 1 root root 8, 16 Jul 10 03:14 /dev/sdb Opening device /dev/sda Closing device /dev/sda brw-rw----. 1 root qemu 8, 0 Jul 10 03:29 /dev/sda brw-rw----. 1 root root 8, 16 Jul 10 03:14 /dev/sdb ------------------------------------------------------------------------------ #include <stdio.h> #include <fcntl.h> int main ( int argc, char **argv ) { printf("Opening device %s\n", argv[1]); int fd = open(argv[1], O_RDWR); if (fd<0) { printf("Unable to open %s", argv[1]); return(1); } printf("Closing device %s\n", argv[1]); close(fd); return 0; } Actual results: Permissions changed Expected results: Permissions not changed Additional info: Are RHEL6 and RHEL7 missing this rule in 50-udev-default.rules? > ACTION!="add", GOTO="default_end" https://cgit.freedesktop.org/systemd/systemd/commit/?id=22582bb2cbe85b40de5f561589e0468dac769515 The rule which is causing trouble in this case is: SUBSYSTEM=="block", GROUP="disk" From our tests until now, this can be easily worked around by adding the qemu process to the disk group. However, what is the correct behavior here? Should udev rewrite it or not? I don't see how libvirt can prevent it from happening. Perhaps there is, but I cannot see any libudev mechanism to prevent this. And since this happens within qemu (it's really a close followed by an open) I am not sure how libvirt could be modified to reapply these permissions mid-flight. Similar issues and related information: https://www.redhat.com/archives/libvir-list/2011-April/msg00113.html https://bugzilla.redhat.com/show_bug.cgi?id=588348 --- Additional comment from Michal Sekletar on 2016-07-11 14:11:49 CEST --- There is an earlier commit that actually introduced this behaviour, i.e. not changing permissions on change events [1]. Commit you linked in bug description is only a rework of that. Conceptually, it made possible to change permissions of a device node on other event types as well, but introduced check in the rules file that prevents resetting permissions on change events. This was previously handled directly in the C code. That being said, I think we can also introduce such change in rules. Basically we would only add 'ACTION!="add", GOTO="default_end"' at appropriate place and jump label LABEL="default_end" at the end of file. We don't need part of the patch that touches C code because we don't have that code in RHEL6 version of udev. [1] https://github.com/systemd/systemd/commit/48a849ee17fb25e0001bfcc0f28a4aa633d016a1 --- Additional comment from Michal Privoznik on 2016-10-26 14:38:12 CEST --- First round of the patches proposed on the upstream list: https://www.redhat.com/archives/libvir-list/2016-October/msg01151.html --- Additional comment from Michal Privoznik on 2016-11-24 15:50:17 CET --- Patches proposed upstream: https://www.redhat.com/archives/libvir-list/2016-November/msg01208.html However, the change is fairly invasive, therefore I don't think we can guarantee stable package if they are backported. --- Additional comment from Michal Privoznik on 2016-12-07 09:40:33 CET --- v2 proposed upstream: https://www.redhat.com/archives/libvir-list/2016-December/msg00231.html --- Additional comment from Michal Privoznik on 2016-12-15 09:34:00 CET --- Moving to POST: commit f444faa94a0e30f7dfdd47dce18b526abb0aaa9f Author: Michal Privoznik <mprivozn> AuthorDate: Tue Dec 6 17:35:12 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Enable mount namespace https://bugzilla.redhat.com/show_bug.cgi?id=1404952 Signed-off-by: Michal Privoznik <mprivozn> commit 661887f558208074169b0d3340c457108b6a023d Author: Michal Privoznik <mprivozn> AuthorDate: Fri Nov 18 16:34:45 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Let users opt-out from containerization Given how intrusive previous patches are, it might happen that there's a bug or imperfection. Lets give users a way out: if they set 'namespaces' to an empty array in qemu.conf the feature is suppressed. Signed-off-by: Michal Privoznik <mprivozn> commit f95c5c48d416dcdab2f3ee7718f12a02833c9339 Author: Michal Privoznik <mprivozn> AuthorDate: Fri Nov 18 15:19:12 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Manage /dev entry on RNG hotplug When attaching a device to a domain that's using separate mount namespace we must maintain /dev entries in order for qemu process to see them. Signed-off-by: Michal Privoznik <mprivozn> commit f5fdf23a68d5c9838890451c1c50b4ae1062d8d2 Author: Michal Privoznik <mprivozn> AuthorDate: Fri Nov 18 14:53:27 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Manage /dev entry on chardev hotplug When attaching a device to a domain that's using separate mount namespace we must maintain /dev entries in order for qemu process to see them. Signed-off-by: Michal Privoznik <mprivozn> commit 6e57492839c0f644ade6b4174993f33f72b66ba3 Author: Michal Privoznik <mprivozn> AuthorDate: Wed Nov 16 15:27:47 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Manage /dev entry on hostdev hotplug When attaching a device to a domain that's using separate mount namespace we must maintain /dev entries in order for qemu process to see them. Signed-off-by: Michal Privoznik <mprivozn> commit 81df21507bef94ae53a056156e4aa6661f29237a Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 16:53:04 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Manage /dev entry on disk hotplug When attaching a device to a domain that's using separate mount namespace we must maintain /dev entries in order for qemu process to see them. Signed-off-by: Michal Privoznik <mprivozn> commit eadaa975480d10eb057eb72bf833888e88e948e8 Author: Michal Privoznik <mprivozn> AuthorDate: Wed Nov 23 11:52:57 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Enter the namespace on relabelling Instead of trying to fix our security drivers, we can use a simple trick to relabel paths in both namespace and the host. I mean, if we enter the namespace some paths are still shared with the host so any change done to them is visible from the host too. Therefore, we can just enter the namespace and call SetAllLabel()/RestoreAllLabel() from there. Yes, it has slight overhead because we have to fork in order to enter the namespace. But on the other hand, no complexity is added to our code. Signed-off-by: Michal Privoznik <mprivozn> commit 2160f338a74543634e26aeddef1e4c63184660da Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 16:10:23 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare RNGs when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit 8ec8a8c5ffa0a4b662013313116b4f166bfe989e Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 16:03:02 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare inputs when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit 2c654490f355c8d8e7ad4748952008391299b411 Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 15:25:15 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare TPM when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit 4e4451019cb2e6dea355e93e946e0169023753c6 Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 15:17:05 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare chardevs when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit 73267cec46e08e74f6297c44b8f47c68180b3712 Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 14:37:52 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare hostdevs when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit 054202d02062c313e01e6c8b0084b91a738d13aa Author: Michal Privoznik <mprivozn> AuthorDate: Mon Nov 14 17:36:45 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Prepare disks when starting a domain When starting a domain and separate mount namespace is used, we have to create all the /dev entries that are configured for the domain. Signed-off-by: Michal Privoznik <mprivozn> commit bb4e529664a6e1ef08030aefc96f21f14eba2aea Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 11:30:18 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu: Spawn qemu under mount namespace Prime time. When it comes to spawning qemu process and relabelling all the devices it's going to touch, there's inherent race with other applications in the system (e.g. udev). Instead of trying convincing udev to not touch libvirt managed devices, we can create a separate mount namespace for the qemu, and mount our own /dev there. Of course this puts more work onto us as we have to maintain /dev files on each domain start and device hot(un-)plug. On the other hand, this enhances security also. From technical POV, on domain startup process the parent (libvirtd) creates: /var/lib/libvirt/qemu/$domain.dev /var/lib/libvirt/qemu/$domain.devpts The child (which is going to be qemu eventually) calls unshare() to create new mount namespace. From now on anything that child does is invisible to the parent. Child then mounts tmpfs on $domain.dev (so that it still sees original /dev from the host) and creates some devices (as explained in one of the previous patches). The devices have to be created exactly as they are in the host (including perms, seclabels, ACLs, ...). After that it moves $domain.dev mount to /dev. What's the $domain.devpts mount there for then you ask? QEMU can create PTYs for some chardevs. And historically we exposed the host ends in our domain XML allowing users to connect to them. Therefore we must preserve devpts mount to be shared with the host's one. To make this patch as small as possible, creating of devices configured for domain in question is implemented in next patches. Signed-off-by: Michal Privoznik <mprivozn> commit a5896e8ca404a2e975808728328e44efd49a7960 Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 15 11:28:51 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 qemu_cgroup: Expose defaultDeviceACL This is a list of devices that qemu needs for its run (apart from what's configured for domain). The devices on the list are enabled in the CGroups by default so they will be good candidates for initial /dev for new qemu. Signed-off-by: Michal Privoznik <mprivozn> commit 5ac52bd0fe80d1741071250f485ae54375508e48 Author: Michal Privoznik <mprivozn> AuthorDate: Tue Dec 6 16:06:02 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virscsivhost: Introduce virSCSIVHostDeviceGetPath We will need this function in near future so that we know what /dev device corresponds to the SCSI device. Signed-off-by: Michal Privoznik <mprivozn> commit 6bcacd55e537c0fc3b793949637197e82a9dffcb Author: Michal Privoznik <mprivozn> AuthorDate: Wed Nov 16 15:27:20 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virscsi: Introduce virSCSIDeviceGetPath We will need this function in near future so that we know what /dev device corresponds to the SCSI device. Signed-off-by: Michal Privoznik <mprivozn> commit c4237d8e0ca090c7db76e5a226efa0ed2305835d Author: Michal Privoznik <mprivozn> AuthorDate: Wed Nov 16 15:26:59 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virusb: Introduce virUSBDeviceGetPath We will need this function in near future so that we know what /dev device corresponds to the USB device. Signed-off-by: Michal Privoznik <mprivozn> commit 654b4d48bcdeeaf31df131644544bb1277f0f8bb Author: Michal Privoznik <mprivozn> AuthorDate: Tue Nov 22 11:14:08 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virfile: Introduce ACL helpers Namely, virFileGetACLs, virFileSetACLs, virFileFreeACLs and virFileCopyACLs. These functions are going to be required when we are creating /dev for qemu. We have copy anything that's in host's /dev exactly as is. Including ACLs. Signed-off-by: Michal Privoznik <mprivozn> commit 1a7c9a5d5087d562fabfd7b7ff3cd1b9b19e9419 Author: Michal Privoznik <mprivozn> AuthorDate: Thu Nov 10 16:17:48 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virfile: Introduce virFileSetupDev This part of code that LXC currently uses will be reused so move to a generic function. Signed-off-by: Michal Privoznik <mprivozn> commit 48a12d3b2554cc7a4255ef9ff8564c0a3ef7c1b3 Author: Michal Privoznik <mprivozn> AuthorDate: Thu Nov 10 14:55:48 2016 +0100 Commit: Michal Privoznik <mprivozn> CommitDate: Thu Dec 15 09:25:16 2016 +0100 virprocess: Introduce virProcessSetupPrivateMountNS This part of code that LXC currently uses will be reused so move to a generic function. Signed-off-by: Michal Privoznik <mprivozn> v2.5.0-121-gf444faa94a --- Additional comment from Jaroslav Suchanek on 2016-12-15 09:37:59 CET ---
I'm not sure it is relevant to RHV, Michal, can you please get back to me/us to see how does it fit with what RHV is doing
(In reply to Michal Skrivanek from comment #1) > I'm not sure it is relevant to RHV, Michal, can you please get back to me/us > to see how does it fit with what RHV is doing Sure. The thing is: when libvirt starts a domain / attaches a device to an already running one, it also sets up security labels (uid:gid + SELinux label). However, there's a small window between libvirt setting it all up and qemu opening the device. Problem was that during this small window udev came and decided to process events. Which ended up in mangled security labels rendering qemu unable to open the device. The only way we could solve this problem was to take udev out of the picture and thus spawn qemu processes under their own mount namespace. Here /dev is a tmpfs managed by libvirt (entries are created on domain startup/hotplug) so udev no longer has ability to screw up things. Ideally, this change is transparent to upper layers. However, it may happen that in my implementation I've forgotten about some silly corner case. Therefore, what I'd advise you to do is to test device passthrough, various disk backends for your VMs, etc. Anything that has something to do with /dev/* files. Also, IIRC you guys have your own udev rules to address the issue I'm describing above, right? Well, you will not need them any longer :-)
There we go: https://www.redhat.com/archives/libvir-list/2017-January/msg00106.html
(In reply to Michal Privoznik from comment #2) > Also, IIRC you guys have your own udev rules to address the issue I'm > describing above, right? Well, you will not need them any longer :-) We do need our udev rules, this his how ovirt lvs get the vdsm:kvm owner. Without the udev rules we will not be able to manage these devices, regardless of what libvirt is doing. Note that we do not know anything about the libvirt private devices, so the paths that we are using will still be the same paths to /rhev/data-center/... or /run/vdms/storage/..., and the paths libvirt report back to vdsm must be the same. We are also accessing file based images while the vm is running, for example for checking current file size on NFS storage, so the real file and libvirt version of the file must be in sync. For block storage we extend and refresh the lvs while the vm is running, so the libvirt private device must work for this use case. After live merge, we delete the device that was removed from the chain, this again must work with libvirt private devices.
(In reply to Nir Soffer from comment #4) > (In reply to Michal Privoznik from comment #2) > > Also, IIRC you guys have your own udev rules to address the issue I'm > > describing above, right? Well, you will not need them any longer :-) > > > Note that we do not know anything about the libvirt private devices And that's how it is supposed to be. We don't want anybody to mess with the namespace. It's libvirt internals. > so the > paths > that we are using will still be the same paths to /rhev/data-center/... or > /run/vdms/storage/..., and the paths libvirt report back to vdsm must be the > same. Sure. The reported paths (and also those which qemu sees) are not going to change. > > We are also accessing file based images while the vm is running, for example > for > checking current file size on NFS storage, so the real file and libvirt > version > of the file must be in sync. They are. It is the same device after all. And for the file based storage there is no change from namespaces POV. It's just /dev that is not shared with the parent namespace. Everything else is shared and thus the same. > > For block storage we extend and refresh the lvs while the vm is running, so > the > libvirt private device must work for this use case. It should. But you should also check if there no funny bug around. > > After live merge, we delete the device that was removed from the chain, this > again > must work with libvirt private devices. This is a bit tricky. Libvirt does not remove the device after a block job. However, if you lvremove it, the one that is in the namespace stops working. I mean, /dev/something entry is still going to be there, but upon any access kernel shall return an error as it is a non-existent device.
(In reply to Michal Privoznik from comment #5) > > After live merge, we delete the device that was removed from the chain, this > > again > > must work with libvirt private devices. > > This is a bit tricky. Libvirt does not remove the device after a block job. > However, if you lvremove it, the one that is in the namespace stops working. > I mean, /dev/something entry is still going to be there, but upon any access > kernel shall return an error as it is a non-existent device. We typically only deactivate the lv, but from libvirt point of view, this is the same as removing the lv, the device does not exists on the host. After live merge the device is not part of the qcow chain, so libvirt or qemu are not expected to access is. Maybe libvirt should remove the private device after it was removed from the vm, in the same way it treat devices detached from the vm? Can we test this changes now? on Fedora? RHEL?
(In reply to Nir Soffer from comment #6) > (In reply to Michal Privoznik from comment #5) > > > After live merge, we delete the device that was removed from the chain, this > > > again > > > must work with libvirt private devices. > > > > This is a bit tricky. Libvirt does not remove the device after a block job. > > However, if you lvremove it, the one that is in the namespace stops working. > > I mean, /dev/something entry is still going to be there, but upon any access > > kernel shall return an error as it is a non-existent device. > > We typically only deactivate the lv, but from libvirt point of view, this > is the same as removing the lv, the device does not exists on the host. > > After live merge the device is not part of the qcow chain, so libvirt or qemu > are not expected to access is. True. So no problem there. > > Maybe libvirt should remove the private device after it was removed from the > vm, in the same way it treat devices detached from the vm? It can't. The device might still be in use - as weirdly as it sounds - for instance as a second disk of the domain (in RO mode). Or it is in a backing chain of a different disk or something. That's why when it comes to disks, libvirt does not remove the device from the namespace. Had it been any other type of device, libvirt would remove it. > > Can we test this changes now? on Fedora? RHEL? Sure. There are still some bugs we are fixing upstream. Therefore, fedora build has this feature disabled, and RHEL build is waiting for the fixes. Stay tuned.
Any updates?
Yes, libvirt with namespaces enabled (and a tons of fixes) was build: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=542044 libvirt-3.1.0-1.el7 This is the build that is ready to be tested.
This bug is in modified status but no patches are attached to it. Can you please check if the fix is included in latest build and move to QE if so?
it depends on libvirt-3.1.0-1.el7 so it is safe to move it to on_qa
Verify with: Software Version:4.2.0.2-0.1.el7 Migrate VM with block device. Migration pass
Moving back to verify bot problem
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489
BZ<2>Jira Resync