Bug 1752978
Summary: unable to execute QEMU command '__com.redhat_drive_add': could not open disk image "Operation not permitted"

Product: Red Hat Enterprise Linux 7 | Reporter: amashah
Component: libvirt | Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA | QA Contact: Han Han <hhan>
Severity: urgent | Docs Contact:
Priority: urgent
Version: 7.6 | CC: chhudson, dyuan, hhan, hreitz, jdenemar, jinzhao, jsuchane, juzhang, kwolf, lmen, mprivozn, mtessun, qinwang, virt-maint, xuzhang
Target Milestone: rc | Keywords: Upstream, ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: libvirt-4.5.0-29.el7 | Doc Type: Bug Fix
Doc Text:
Cause:
To enhance security and to work around some races with udev (or any other software that mangles SELinux labels on /dev/* nodes while a domain is running), libvirt spawns each domain in its own private namespace with a private /dev. This puts an additional burden on libvirt, which then has to update /dev/* nodes on some APIs such as virDomainAttachDevice (aka virsh attach-device) or remove them on virDomainDetachDevice (aka virsh detach-device). For generic devices this works well: libvirt looks into the host's /dev and creates the corresponding /dev/* nodes in the domain's private namespace exactly as they are on the host.

However, there is one exception - disks. On disk hot unplug, libvirt does not remove any /dev/* nodes from the private namespace, because they might still be in use by the backing chain of some other disk. This is what causes the problem. Imagine a /dev/nvme0n1 disk with some MAJOR:MINOR number (these identify a device uniquely at the kernel level). Hotplug the disk into a domain => libvirt creates an exact copy of the node in the domain's namespace. Hot unplug the disk from the domain => libvirt keeps /dev/nvme0n1 in the domain's namespace. Now hot unplug the NVMe disk from the host and plug it back again => the MINOR number is likely to change. At this point there is a discrepancy between the MINOR number on the host and the one in the domain's namespace.
Consequence:
QEMU tries to open a different device than it thinks, and because of the devices cgroup it is denied access.
Fix:
The fix consists of forcibly creating /dev/nvme0n1 (or, in general, any other device node) even if it already exists in the domain's private namespace. This way we can be sure it is an exact copy of the node in the host's namespace.
Result:
Hotplugging disks multiple times works again (a hand-run illustration of the mismatch and the fix is sketched below).
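A hand-run illustration of the mismatch and of what the fix effectively does. This is not libvirt's actual code; the guest name, device path and MAJOR:MINOR numbers are examples only:

# Hypothetical example - guest name, device and numbers are illustrative.
QEMU_PID=$(pgrep -f 'guest=mydomain')

# Host view: after re-plugging, the NVMe node has a new MAJOR:MINOR (hex output)
stat -c '%t:%T' /dev/nvme0n1                              # e.g. 103:1 (259:1 decimal)

# The domain's private /dev still holds the node created at the first hotplug
nsenter -t "$QEMU_PID" -m stat -c '%t:%T' /dev/nvme0n1    # e.g. 103:0 - stale copy

# The fix makes libvirt recreate the node unconditionally so it mirrors the host;
# done by hand it would look roughly like this (the real fix also restores owner,
# mode and SELinux label):
nsenter -t "$QEMU_PID" -m sh -c 'rm -f /dev/nvme0n1 && mknod /dev/nvme0n1 b 259 1'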
Story Points: ---
Clone Of:
: 1775679 1775680 (view as bug list) | Environment:
Last Closed: 2020-03-31 19:59:02 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1775679, 1775680
Attachments:
Comment 7
Michal Privoznik
2019-09-25 13:16:43 UTC
The qemu namespace is enabled by default in the sos report. A simple attach/detach test for an NVMe disk on a nested virt VM with an emulated NVMe device did not reproduce the issue.

Configure: no changes in libvirtd.conf and qemu.conf

Version:
libvirt-4.5.0-10.el7_6.12.x86_64
qemu-kvm-1.5.3-160.el7.x86_64

Steps:
1. Start a vm
2. Prepare nvme disk xml:
# cat /tmp/disk.xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234'/>
  <target dev='vde' bus='virtio'/>
  <serial>29A0A008T4YF</serial>
</disk>
3. Attach/detach the nvme disk
# virsh attach-device e1n2-1-control /tmp/disk.xml
Device attached successfully

# virsh detach-device e1n2-1-control /tmp/disk.xml
Device detached successfully

Created attachment 1623384 [details]
The script to test attach detach
Hello, I tried to reproduce it as follows, but could not reproduce:
Environment:
Nested virt environment with an emulated NVMe device ( :-< no physical NVMe card)
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 16
Model: 2
Model name: AMD Opteron 23xx (Gen 3 Class Opteron)
Stepping: 3
CPU MHz: 2293.736
BogoMIPS: 4587.47
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow art rep_good nopl extd_apicid pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibp_disable vmmcall
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 10G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 9G 0 part
├─rhel-root 253:0 0 8G 0 lvm /
└─rhel-swap 253:1 0 1G 0 lvm [SWAP]
nvme0n1 259:0 0 100M 0 disk
Version:
kernel-3.10.0-957.el7.x86_64
libvirt-4.5.0-10.el7_6.12.x86_64
qemu-kvm-1.5.3-160.el7.x86_64
Steps:
The script attaches and detaches the NVMe disk in loops with random intervals and logs the error messages from attach/detach (a rough sketch of such a loop is included after the usage examples below).
Do live attach/detach:
# ./test.sh --live /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test1
Do inactive attach/detach:
# ./test.sh --config /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test2
Do persistent attach/detach:
# ./test.sh --persistent /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test3
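The attached script itself is not reproduced in this comment; a minimal sketch of what such a loop might look like, assuming the interface shown above (mode flag, disk path, log directory) and an example guest name, could be:

#!/bin/bash
# Hypothetical sketch of an attach/detach stress loop - not the actual attached script.
MODE=$1              # --live | --config | --persistent
DISK=$2              # e.g. /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234
LOGDIR=$3
DOM=e1n2-1-control   # guest name is an example

mkdir -p "$LOGDIR"
cat > /tmp/disk.xml <<EOF
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='$DISK'/>
  <target dev='vde' bus='virtio'/>
</disk>
EOF

while true; do
    virsh attach-device "$DOM" /tmp/disk.xml "$MODE" 2>> "$LOGDIR/attach.err"
    sleep $((RANDOM % 10 + 1))    # random interval between operations
    virsh detach-device "$DOM" /tmp/disk.xml "$MODE" 2>> "$LOGDIR/detach.err"
    sleep $((RANDOM % 10 + 1))
done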
I ran them for several days, but messages like 'could not open disk image "Operation not permitted"' were not found in the error logs.
However, they triggered a qemu-kvm OOM:
# dmesg|tail
[1134700.830395] Out of memory: Kill process 13990 (qemu-kvm) score 907 or sacrifice child
[1134700.983891] Killed process 13990 (qemu-kvm) total-vm:6914780kB, anon-rss:3335492kB, file-rss:0kB, shmem-rss:0kB
Reset needinfo

Thanks for the logs. Here's my analysis (it's going to be long). I started at libvirtd.log and looked for the error. This is one of the failed attempts:

Attached libvirtd.log:

2019-11-04 21:40:42.290+0000: 21944: debug : virThreadJobSet:96 : Thread 21944 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
2019-11-04 21:40:42.290+0000: 21944: debug : virDomainAttachDeviceFlags:8137 : dom=0x7f151011d2e0, (VM: name=e1n3-1-control, uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de), xml=<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4'/>
  <target dev='vde' bus='virtio'/>
  <serial>19J0A02YT4YF</serial>
</disk>
, flags=0x3
....
2019-11-04 21:40:42.419+0000: 21944: debug : qemuSetupImagePathCgroup:77 : Allow path /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4, perms: rw
2019-11-04 21:40:42.419+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.allow' to 'b 259:17 rw'
2019-11-04 21:40:42.424+0000: 21944: debug : qemuDomainObjEnterMonitorInternal:7051 : Entering monitor (mon=0x7f14f4001a50 vm=0x7f14f4002880 name=e1n3-1-control)
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorAddDrive:3164 : drive=file=/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4,format=raw,if=none,id=drive-virtio-disk4,serial=19J0A02YT4YF,cache=none
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorAddDrive:3166 : mon:0x7f14f4001a50 vm:0x7f14f4002880 json:1 fd:30
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorJSONCommandWithFd:305 : Send command '{"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}' for write with FD -1
2019-11-04 21:40:42.424+0000: 21944: info : qemuMonitorSend:1083 : QEMU_MONITOR_SEND_MSG: mon=0x7f14f4001a50 msg={"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"} fd=-1
2019-11-04 21:40:42.425+0000: 21830: info : qemuMonitorIOWrite:551 : QEMU_MONITOR_IO_WRITE: mon=0x7f14f4001a50 buf={"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"} len=212 ret=212 errno=0
2019-11-04 21:40:42.431+0000: 21830: debug : qemuMonitorJSONIOProcessLine:197 : Line [{"id": "libvirt-39", "error": {"class": "GenericError", "desc": "could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}]
2019-11-04 21:40:42.431+0000: 21830: info : qemuMonitorJSONIOProcessLine:217 : QEMU_MONITOR_RECV_REPLY: mon=0x7f14f4001a50 reply={"id": "libvirt-39", "error": {"class": "GenericError", "desc": "could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}
2019-11-04 21:40:42.432+0000: 21944: debug : qemuMonitorJSONCommandWithFd:310 : Receive command reply ret=0 rxObject=0x55b3e9e63dd0
2019-11-04 21:40:42.432+0000: 21944: debug : qemuMonitorJSONCheckError:385 : unable to execute QEMU command {"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}: {"id":"libvirt-39","error":{"class":"GenericError","desc":"could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}
2019-11-04 21:40:42.432+0000: 21944: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command '__com.redhat_drive_add': could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted
2019-11-04 21:40:42.432+0000: 21944: debug : qemuDomainObjExitMonitorInternal:7074 : Exited monitor (mon=0x7f14f4001a50 vm=0x7f14f4002880 name=e1n3-1-control)
2019-11-04 21:40:42.433+0000: 21944: debug : qemuTeardownImageCgroup:177 : Disabling device mapper control
2019-11-04 21:40:42.433+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.deny' to 'c 10:236 rwm'
2019-11-04 21:40:42.434+0000: 21944: debug : qemuTeardownImageCgroup:188 : Deny path /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4
2019-11-04 21:40:42.434+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.deny' to 'b 259:17 rwm'

We can see libvirtd allowing qemu access to the disk in cgroups at time 21:40:42.419+0000 TID 21944 (please note that libvirt's log timestamps are in UTC). We can see this also in audit.log:

var/log/audit/audit.log from sosreport:

node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.421:1512471): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=allow vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" rdev=103:11 acl=rw exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.431:1512472): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=disk reason=attach vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de old-disk="?" new-disk="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=failed'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.432:1512473): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/mapper/control" rdev=0A:EC acl=rwm exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.433:1512474): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" rdev=103:11 acl=rwm exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'

Moreover, the same can be seen in strace output:

libvirtd_str2.21944 from attached strace:

16:40:42.419552 write(26, "2019-11-04 21:40:42.419+0000: 21"..., 150) = 150 <0.000026>
16:40:42.419614 stat("/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4", {st_dev=makedev(0, 5), st_ino=61652115, st_mode=S_IFBLK|0660, st_nlink=1, st_uid=0, st_gid=6, st_blksize=4096, st_blocks=0, st_rdev=makedev(259, 17), st_atime=2019/11/04-16:40:23.169856511, st_mtime=2019/11/04-11:24:44.089302277, st_ctime=2019/11/04-11:24:44.089302277}) = 0 <0.000029>
16:40:42.419706 gettid() = 21944 <0.000016>
16:40:42.419757 write(26, "2019-11-04 21:40:42.419+0000: 21"..., 200) = 200 <0.000021>
16:40:42.419814 open("/sys/fs/cgroup/devices/machine.slice/machine-qemu\\x2d5\\x2de1n3\\x2d1\\x2dcontrol.scope/devices.allow", O_WRONLY|O_TRUNC) = 28 <0.000618>
16:40:42.420466 write(28, "b 259:17 rw", 11) = 11 <0.000024>
16:40:42.420520 close(28) = 0 <0.000021>

Indeed, libvirt puts the correct MAJ:MIN numbers into devices.allow. The path provided by the user is a symlink and resolves to /dev/nvme3n1p4. From the attached sosreport:

/dev/disk/by-id:
lrwxrwxrwx. 1 0 0 13 Nov 4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF -> ../../nvme3n1
lrwxrwxrwx. 1 0 0 15 Nov 4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF-part1 -> ../../nvme3n1p1
lrwxrwxrwx. 1 0 0 15 Nov 4 11:25 nvme-KCM5DRUG3T84_19J0A02YT4YF-part2 -> ../../nvme3n1p2
lrwxrwxrwx. 1 0 0 15 Nov 4 11:25 nvme-KCM5DRUG3T84_19J0A02YT4YF-part3 -> ../../nvme3n1p3
lrwxrwxrwx. 1 0 0 15 Nov 4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF-part4 -> ../../nvme3n1p4

/dev:
crw-------. 1 0 0 241,  3 Nov 4 11:21 nvme3
brw-rw----. 1 0 6 259,  7 Nov 4 11:24 nvme3n1
brw-rw----. 1 0 6 259,  9 Nov 4 11:24 nvme3n1p1
brw-rw----. 1 0 6 259, 10 Nov 4 11:25 nvme3n1p2
brw-rw----. 1 0 6 259, 16 Nov 4 11:25 nvme3n1p3
brw-rw----. 1 0 6 259, 17 Nov 4 11:24 nvme3n1p4

Therefore, "b 259:17 rw" is the correct string to write to devices.allow. It would be nice to see an strace of qemu to pinpoint the syscall it's getting EPERM from. Note that qemu is getting EPERM and not EACCES - seclabels are set properly. The error comes from cgroups. Could this be a case where qemu succeeded in open()-ing the file but then failed to do an ioctl(), for instance, and claims the whole disk-opening procedure failed? But from my analysis it looks like the kernel is mistakenly denying access. Switching over to qemu to confirm it's not the former (unlikely). Feel free to switch to kernel afterwards.

Hi Qing, could you please have a try on the QEMU side? Thanks.

Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2019-November/msg00911.html

I've pushed the patch upstream:

cdd8a6690e qemu: Forcibly mknod() even if it exists
v5.9.0-290-gcdd8a6690e

Verified on libvirt-4.5.0-31.el7.x86_64 qemu-kvm-rhev-2.12.0-38.el7.x86_64:
Follow the steps of https://bugzilla.redhat.com/show_bug.cgi?id=1775680#c4

Created attachment 1652930 [details]
The scripts for bug verify
Test results:
Domain pc started
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Domain pc destroyed
All attach/detach operations work after the mknod in the namespace.
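For completeness, one way to cross-check that the devices cgroup entry, the host node and the node in the domain's private namespace all agree. The paths and guest name below follow the logs earlier in this bug and are only examples:

# Hypothetical cross-check - adjust device path and guest name per host.
DEV=/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4

readlink -f "$DEV"                          # -> /dev/nvme3n1p4
stat -c '%t:%T' "$(readlink -f "$DEV")"     # -> 103:11 (hex) = 259:17 decimal

# What the devices cgroup currently allows for the guest:
grep '^b 259:' /sys/fs/cgroup/devices/machine.slice/machine-qemu*control.scope/devices.list

# What QEMU sees in its private /dev (should report the same 259:17);
# assumes a single matching qemu process:
nsenter -t "$(pgrep -of 'guest=e1n3-1-control')" -m stat -c '%t:%T' /dev/nvme3n1p4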
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1094