Bug 1752978

Summary: unable to execute QEMU command '__com.redhat_drive_add': could not open disk image "Operation not permitted"
Product: Red Hat Enterprise Linux 7
Reporter: amashah
Component: libvirt
Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA
QA Contact: Han Han <hhan>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 7.6
CC: chhudson, dyuan, hhan, hreitz, jdenemar, jinzhao, jsuchane, juzhang, kwolf, lmen, mprivozn, mtessun, qinwang, virt-maint, xuzhang
Target Milestone: rc
Keywords: Upstream, ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: libvirt-4.5.0-29.el7
Doc Type: Bug Fix
Doc Text:
Cause: To enhance security and work around some races with udev (or any other software that mangles SELinux labels on /dev/* nodes while a domain is running), libvirt spawns each domain in its own private namespace with a private /dev. This puts an additional burden on libvirt, which then has to create /dev/* nodes for some APIs such as virDomainAttachDevice (aka virsh attach-device) and remove them for virDomainDetachDevice (aka virsh detach-device). For generic devices this works perfectly: libvirt looks at the host's /dev and creates corresponding /dev/* nodes in the domain's private namespace exactly as they are on the host. However, there is one exception: disks. On disk hot unplug libvirt does not remove any /dev/* nodes from the private namespace, because they might still be in use by the backing chain of some other disk. And this is what causes the problem. Imagine a /dev/nvme0n1 disk with some MAJOR:MINOR number (these numbers identify a device uniquely at the kernel level). Hotplug the disk into a domain => libvirt creates an exact copy of the node in the domain's namespace. Hot unplug the disk from the domain => libvirt keeps /dev/nvme0n1 in the domain's namespace. Now hot unplug the NVMe disk from the host and plug it back in => the MINOR number is likely to change. At this point there is a discrepancy between the MINOR number on the host and the one in the domain's namespace.
Consequence: QEMU tries to open a different device than it thinks it is opening, and because of the devices cgroup it is denied access.
Fix: Forcibly create /dev/nvme0n1 (or in general any other device node) even if it already exists in the domain's private namespace. This way it is guaranteed to be an exact copy of the node in the host's namespace.
Result: Hotplugging disks multiple times works again.
Story Points: ---
Clone Of:
Cloned to: 1775679, 1775680
Environment:
Last Closed: 2020-03-31 19:59:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1775679, 1775680
Attachments:
The script to test attach detach
The scripts for bug verify

Comment 7 Michal Privoznik 2019-09-25 13:16:43 UTC
A couple of notes:

1) qemu is getting EPERM, which means something is preventing it from opening the file. It could be file permissions or cgroups. Libvirt debug logs would help here so that I can rule one of the two out.

2) Are they running libvirt with namespaces enabled? That is, 'namespaces' is either commented out or set to namespaces = [ "mount" ] in qemu.conf. From the sosreport it looks like they have namespaces enabled, but again, debug logs would help here.

Here is how to get debug logs:
https://wiki.libvirt.org/page/DebugLogs
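
For convenience, a minimal sketch of what that page describes: append log settings to /etc/libvirt/libvirtd.conf and restart the daemon. The exact filter list below is an assumption and can be tuned per the wiki.

# cat >> /etc/libvirt/libvirtd.conf <<'EOF'
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
# systemctl restart libvirtd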

Comment 8 Han Han 2019-09-25 14:52:26 UTC
The QEMU namespace is enabled by default according to the sosreport.

A simple attach/detach test of an NVMe disk on a nested-virt VM with an emulated NVMe device did not reproduce the issue.

Configuration:
No changes to libvirtd.conf or qemu.conf

Version:
libvirt-4.5.0-10.el7_6.12.x86_64
qemu-kvm-1.5.3-160.el7.x86_64

Steps:

1. Start a vm

2. Prepare nvme disk xml:

# cat /tmp/disk.xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234'/>
  <target dev='vde' bus='virtio'/>
  <serial>29A0A008T4YF</serial>
</disk>

3. Attach/detach the nvme disk
# virsh attach-device e1n2-1-control /tmp/disk.xml 
Device attached successfully

# virsh detach-device e1n2-1-control /tmp/disk.xml
Device detached successfully

Comment 13 Han Han 2019-10-08 03:02:33 UTC
Created attachment 1623384 [details]
The script to test attach detach

Hello, I tried to reproduce as follows, but the issue was not reproduced:
Environment:
Nested virt environment with an emulated NVMe device (:-< no physical NVMe card available)
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 2
Model name:            AMD Opteron 23xx (Gen 3 Class Opteron)
Stepping:              3
CPU MHz:               2293.736
BogoMIPS:              4587.47
Virtualization:        AMD-V
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow art rep_good nopl extd_apicid pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibp_disable vmmcall

# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda           252:0    0   10G  0 disk 
├─vda1        252:1    0    1G  0 part /boot
└─vda2        252:2    0    9G  0 part 
  ├─rhel-root 253:0    0    8G  0 lvm  /
  └─rhel-swap 253:1    0    1G  0 lvm  [SWAP]
nvme0n1       259:0    0  100M  0 disk


Version:
kernel-3.10.0-957.el7.x86_64
libvirt-4.5.0-10.el7_6.12.x86_64
qemu-kvm-1.5.3-160.el7.x86_64


Steps:
The script attaches and detaches the NVMe disk in a loop with random intervals and logs any attach/detach error messages (a rough sketch of such a loop is shown at the end of this comment).
Do live attach/detach:
# ./test.sh --live /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test1

Do inactive attach/detach:
# ./test.sh --config /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test2

Do persistent attach/detach:
# ./test.sh --persistent /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_1234 /tmp/test3

I ran them for several days, but messages like 'could not open disk image "Operation not permitted"' were not found in the error logs.

However, they triggered a qemu-kvm OOM:
# dmesg|tail
[1134700.830395] Out of memory: Kill process 13990 (qemu-kvm) score 907 or sacrifice child
[1134700.983891] Killed process 13990 (qemu-kvm) total-vm:6914780kB, anon-rss:3335492kB, file-rss:0kB, shmem-rss:0kB
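
For reference, the attached test script itself is not quoted in this comment. The following is only a rough, hypothetical sketch of an equivalent attach/detach loop with random intervals; the option handling, domain name, and log format of the real attachment may differ.

# cat test-sketch.sh
#!/bin/bash
# Hypothetical sketch: attach/detach a disk XML in a loop with random sleeps,
# appending any virsh error output to a log file.
dom=e1n2-1-control      # assumption: domain name taken from comment 8
xml=/tmp/disk.xml       # assumption: disk XML path taken from comment 8
log=$1
while true; do
    virsh attach-device "$dom" "$xml" --live 2>> "$log"
    sleep $((RANDOM % 10 + 1))   # random interval before detaching
    virsh detach-device "$dom" "$xml" --live 2>> "$log"
    sleep $((RANDOM % 10 + 1))   # random interval before the next attach
done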

Comment 14 Han Han 2019-10-08 03:03:20 UTC
Reset needinfo

Comment 18 Michal Privoznik 2019-11-05 12:30:33 UTC
Thanks for the logs. Here's my analysis (it's going to be long). I started with libvirtd.log and looked for errors. This is one of the failed attempts:

attached libvirtd.log:
2019-11-04 21:40:42.290+0000: 21944: debug : virThreadJobSet:96 : Thread 21944 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
2019-11-04 21:40:42.290+0000: 21944: debug : virDomainAttachDeviceFlags:8137 : dom=0x7f151011d2e0, (VM: name=e1n3-1-control, uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de), xml=<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4'/>
  <target dev='vde' bus='virtio'/>
  <serial>19J0A02YT4YF</serial>
</disk>
, flags=0x3
....
2019-11-04 21:40:42.419+0000: 21944: debug : qemuSetupImagePathCgroup:77 : Allow path /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4, perms: rw
2019-11-04 21:40:42.419+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.allow' to 'b 259:17 rw'
2019-11-04 21:40:42.424+0000: 21944: debug : qemuDomainObjEnterMonitorInternal:7051 : Entering monitor (mon=0x7f14f4001a50 vm=0x7f14f4002880 name=e1n3-1-control)
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorAddDrive:3164 : drive=file=/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4,format=raw,if=none,id=drive-virtio-disk4,serial=19J0A02YT4YF,cache=none
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorAddDrive:3166 : mon:0x7f14f4001a50 vm:0x7f14f4002880 json:1 fd:30
2019-11-04 21:40:42.424+0000: 21944: debug : qemuMonitorJSONCommandWithFd:305 : Send command '{"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}' for write with FD -1
2019-11-04 21:40:42.424+0000: 21944: info : qemuMonitorSend:1083 : QEMU_MONITOR_SEND_MSG: mon=0x7f14f4001a50 msg={"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}
 fd=-1
2019-11-04 21:40:42.425+0000: 21830: info : qemuMonitorIOWrite:551 : QEMU_MONITOR_IO_WRITE: mon=0x7f14f4001a50 buf={"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}
 len=212 ret=212 errno=0
2019-11-04 21:40:42.431+0000: 21830: debug : qemuMonitorJSONIOProcessLine:197 : Line [{"id": "libvirt-39", "error": {"class": "GenericError", "desc": "could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}]
2019-11-04 21:40:42.431+0000: 21830: info : qemuMonitorJSONIOProcessLine:217 : QEMU_MONITOR_RECV_REPLY: mon=0x7f14f4001a50 reply={"id": "libvirt-39", "error": {"class": "GenericError", "desc": "could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}
2019-11-04 21:40:42.432+0000: 21944: debug : qemuMonitorJSONCommandWithFd:310 : Receive command reply ret=0 rxObject=0x55b3e9e63dd0
2019-11-04 21:40:42.432+0000: 21944: debug : qemuMonitorJSONCheckError:385 : unable to execute QEMU command {"execute":"__com.redhat_drive_add","arguments":{"file":"/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4","format":"raw","id":"drive-virtio-disk4","serial":"19J0A02YT4YF","cache":"none"},"id":"libvirt-39"}: {"id":"libvirt-39","error":{"class":"GenericError","desc":"could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted"}}
2019-11-04 21:40:42.432+0000: 21944: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command '__com.redhat_drive_add': could not open disk image /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4: Could not open '/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4': Operation not permitted
2019-11-04 21:40:42.432+0000: 21944: debug : qemuDomainObjExitMonitorInternal:7074 : Exited monitor (mon=0x7f14f4001a50 vm=0x7f14f4002880 name=e1n3-1-control)
2019-11-04 21:40:42.433+0000: 21944: debug : qemuTeardownImageCgroup:177 : Disabling device mapper control
2019-11-04 21:40:42.433+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.deny' to 'c 10:236 rwm'
2019-11-04 21:40:42.434+0000: 21944: debug : qemuTeardownImageCgroup:188 : Deny path /dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4
2019-11-04 21:40:42.434+0000: 21944: debug : virCgroupSetValueStr:796 : Set value '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/devices.deny' to 'b 259:17 rwm'



We can see libvirtd allowing qemu access to the disk in cgroups at 21:40:42.419+0000 in TID 21944 (note that libvirt's log timestamps are in UTC). The same can also be seen in audit.log:

var/log/audit/audit.log from sosreport:
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.421:1512471): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=allow vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" rdev=103:11 acl=rw exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.431:1512472): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=disk reason=attach vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de old-disk="?" new-disk="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=failed'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.432:1512473): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/mapper/control" rdev=0A:EC acl=rwm exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'
node=e1n3 type=VIRT_RESOURCE msg=audit(1572903642.433:1512474): pid=21830 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="e1n3-1-control" uuid=d28f0939-7e71-4f68-b4e6-2217a68e89de cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d5\x2de1n3\x2d1\x2dcontrol.scope/" class=path path="/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4" rdev=103:11 acl=rwm exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'


Moreover, the same can be seen in strace output:

libvirtd_str2.21944 from attached strace:
16:40:42.419552 write(26, "2019-11-04 21:40:42.419+0000: 21"..., 150) = 150 <0.000026>
16:40:42.419614 stat("/dev/disk/by-id/nvme-KCM5DRUG3T84_19J0A02YT4YF-part4", {st_dev=makedev(0, 5), st_ino=61652115, st_mode=S_IFBLK|0660, st_nlink=1, st_uid=0, st_gid=6, st_blksize=4096, st_blocks=0, st_rdev=makedev(259, 17), st_atime=2019/11/04-16:40:23.169856511, st_mtime=2019/11/04-11:24:44.089302277, st_ctime=2019/11/04-11:24:44.089302277}) = 0 <0.000029>
16:40:42.419706 gettid()                = 21944 <0.000016>
16:40:42.419757 write(26, "2019-11-04 21:40:42.419+0000: 21"..., 200) = 200 <0.000021>
16:40:42.419814 open("/sys/fs/cgroup/devices/machine.slice/machine-qemu\\x2d5\\x2de1n3\\x2d1\\x2dcontrol.scope/devices.allow", O_WRONLY|O_TRUNC) = 28 <0.000618>
16:40:42.420466 write(28, "b 259:17 rw", 11) = 11 <0.000024>
16:40:42.420520 close(28)               = 0 <0.000021>


Indeed, libvirt writes the correct MAJ:MIN numbers into devices.allow. The path provided by the user is a symlink that resolves to /dev/nvme3n1p4. From the attached sosreport:

/dev/disk/by-id:
lrwxrwxrwx. 1 0 0   13 Nov  4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF -> ../../nvme3n1
lrwxrwxrwx. 1 0 0   15 Nov  4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF-part1 -> ../../nvme3n1p1
lrwxrwxrwx. 1 0 0   15 Nov  4 11:25 nvme-KCM5DRUG3T84_19J0A02YT4YF-part2 -> ../../nvme3n1p2
lrwxrwxrwx. 1 0 0   15 Nov  4 11:25 nvme-KCM5DRUG3T84_19J0A02YT4YF-part3 -> ../../nvme3n1p3
lrwxrwxrwx. 1 0 0   15 Nov  4 11:24 nvme-KCM5DRUG3T84_19J0A02YT4YF-part4 -> ../../nvme3n1p4

/dev:
crw-------.  1 0  0 241,   3 Nov  4 11:21 nvme3
brw-rw----.  1 0  6 259,   7 Nov  4 11:24 nvme3n1
brw-rw----.  1 0  6 259,   9 Nov  4 11:24 nvme3n1p1
brw-rw----.  1 0  6 259,  10 Nov  4 11:25 nvme3n1p2
brw-rw----.  1 0  6 259,  16 Nov  4 11:25 nvme3n1p3
brw-rw----.  1 0  6 259,  17 Nov  4 11:24 nvme3n1p4


Therefore, "b 259:17 rw" is the correct string to write to devices.allow. Would be nice to see strace of qemu to pinpoint syscall it's getting EPERM to. Note, that qemu is getting EPERM and no EACCESS - seclabels are set properly. The error comes from CGroups. Could this be case where qemu succeeded with open()-ing the file but then failed to do an ioctl() for instance and claims the whole disk opening procedure failed? But, from my analysis it looks like kernel is mistakenly denying access. Switching over to qemu to confirm it's not the former (unlikely). Feel free to switch to kernel afterwards.

Comment 20 CongLi 2019-11-06 01:40:41 UTC
Hi Qing,

Could you please have a try in QEMU side?

Thanks.

Comment 37 Michal Privoznik 2019-11-20 09:52:18 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2019-November/msg00911.html

Comment 38 Michal Privoznik 2019-11-20 13:19:44 UTC
I've pushed the patch upstream:

cdd8a6690e qemu: Forcibly mknod() even if it exists

v5.9.0-290-gcdd8a6690e
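
For illustration only: the idea of the fix, per the commit subject and the Doc Text, is to recreate the device node in the domain's private namespace even when a node of that name already exists, so it always matches the host. A rough shell analogy (not libvirt's actual implementation, which does this in C), using the domain name and 259:17 numbers from the logs above:

# nsenter -t $(cat /var/run/libvirt/qemu/e1n3-1-control.pid) -m sh -c \
    'rm -f /dev/nvme3n1p4 && mknod -m 0660 /dev/nvme3n1p4 b 259 17'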

Comment 49 Han Han 2020-01-17 01:53:44 UTC
Verified on libvirt-4.5.0-31.el7.x86_64 qemu-kvm-rhev-2.12.0-38.el7.x86_64:
Follow the steps of https://bugzilla.redhat.com/show_bug.cgi?id=1775680#c4

Comment 50 Han Han 2020-01-17 01:55:27 UTC
Created attachment 1652930 [details]
The scripts for bug verify

Test results:
Domain pc started
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Device attached successfully
Device detached successfully
Domain pc destroyed

All attach/detach operations work after the mknod in the namespace.

Comment 52 errata-xmlrpc 2020-03-31 19:59:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1094