Description of problem:

In cgroups v2 the devices controller was dropped in favor of eBPF programs. We need to implement support for eBPF cgroup programs in order to be able to filter access to devices. This is not a critical feature, as we already create a namespace for the QEMU process to isolate it from the host.

https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://www.kernel.org/doc/Documentation/networking/filter.txt
https://cilium.readthedocs.io/en/v1.3/bpf/
Upstream commits:

8addef2bef vircgroupmock: mock virCgroupV2DevicesAvailable
c359cb9aee vircgroup: workaround devices in hybrid mode
884479b42b vircgroup: introduce virCgroupV2DenyAllDevices
285aefb31c vircgroup: introduce virCgroupV2AllowAllDevices
d5b09ce5d9 vircgroup: introduce virCgroupV2DenyDevice
5d49651912 vircgroup: introduce virCgroupV2AllowDevice
b18b0ce609 vircgroup: introduce virCgroupV2DevicesGetKey
63cfe7b84d vircgroup: introduce virCgroupV2DeviceGetPerms
6a24bd75ed vircgroup: introduce virCgroupV2DevicesRemoveProg
ef747499a5 vircgroup: introduce virCgroupV2DevicesPrepareProg
afa2788662 vircgroup: introduce virCgroupV2DevicesCreateProg
ce11a5c59f vircgroup: introduce virCgroupV2DevicesDetectProg
48423a0b5d vircgroup: introduce virCgroupV2DevicesAttachProg
30b6ddc44c vircgroup: introduce virCgroupV2DevicesAvailable
07946d6e39 util: introduce virbpf helpers

Released in libvirt-5.10.0
(In reply to Pavel Hrdina from comment #1)
> [list of upstream commits quoted above]

Hi Pavel,

Could you please give us a sample of how to use eBPF in libvirt when cgroups v2 is enabled? For example, how to set the device ACL and where to check it. I have 3 questions about this; could you please help to confirm them before the bug is ON_QA, thx.

1. Will the cgroup_device_acl setting still take effect?
In /etc/libvirt/qemu.conf there are some default device ACL settings, as follows. Will they still take effect, and can they be used to set eBPF?
# cat /etc/libvirt/qemu.conf | grep cgroup_device_acl -A 9
# cgroup_device_acl = [
#     "/dev/null", "/dev/full", "/dev/zero",
#     "/dev/random", "/dev/urandom",
#     "/dev/ptmx", "/dev/kvm",
#     "/dev/rtc","/dev/hpet", "/dev/vfio/vfio"
#]

2. It seems no new virsh command was introduced, so does that mean the whole ACL is set automatically? And where can I check that the setting takes effect?
For example, in cgroups v1, if a vm has a virtual disk pointing to the host's /dev/sdb, whose major:minor is 8:16, then when the vm starts an entry "b 8:16 rw" is added to /sys/fs/cgroup/devices/machine.slice/machine-qemu\\x2dguest.scope/devices.list
Is eBPF doing something similar?

3. What else needs to be tested besides the above 2 areas? (1. qemu.conf ACL setting. 2. when a vm uses host devices, the ACL is set automatically)
(In reply to yisun from comment #2)

Hi Yi,

Sure. First of all, it is used automatically for every VM. When you start a VM, some default devices are added to the eBPF map based on cgroup_device_acl, together with all other devices directly used by that VM.

> 1. Will the cgroup_device_acl setting still take effect?

These are still used with cgroups v2 as well; they are enabled for each VM as the set of default devices.

> 2. It seems no new virsh command was introduced, so does that mean the
> whole ACL is set automatically? And where can I check that the setting
> takes effect?

The usage of eBPF in libvirt for cgroups v2 is similar: when the VM is started, rules are added to the eBPF map for each device. It is possible to check the rules, but it's not as simple as for cgroups v1.

> 3. What else needs to be tested besides the above 2 areas?

I would say we should test everything as for cgroups v1:

1) whether eBPF is actually used (an eBPF program and an eBPF map are created for the VM and the map has some entries)
2) whether changes in qemu.conf are properly reflected in eBPF
3) whether devices used by the VM are properly enabled in eBPF
4) whether the eBPF program and eBPF map are removed once the VM is shut off
5) the default eBPF map can hold 64 different devices, so it would be nice to test that if you add more than 64 devices, a new larger eBPF map is created that contains all devices correctly, and the old map is removed

Now, how can you actually check all this eBPF complexity?
Unfortunately it's not as simple as for cgroups v1. For eBPF there is a tool developed by the kernel project called "bpftool" which can be used to list eBPF programs and eBPF maps. You will have to set selinux to permissive mode (setenforce 0) in order to access the eBPF program and map created by libvirt.

To get the program attached to a specific cgroup you can run:

$ bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dgeneric.scope/
ID       AttachType      AttachFlags     Name
94       device

With the ID you can get info about the program, including its maps:

$ bpftool prog show id 94
94: cgroup_device  tag b95e404d31962705  gpl
        loaded_at 2020-02-27T09:59:24+0100  uid 0
        xlated 616B  jited 347B  memlock 4096B  map_ids 2

With the map id you can dump the content of the map:

$ bpftool map dump id 2
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: e4 00 00 00 0a 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 81 00 00 00 e2 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
Found 11 elements

All of this can be automated because bpftool can print its output as JSON using the -j option:

$ bpftool -j cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dgeneric.scope/
[{"id":94,"attach_type":"device","attach_flags":"","name":""}]

$ bpftool -j prog show id 94
{"id":94,"type":"cgroup_device","tag":"b95e404d31962705","gpl_compatible":true,"loaded_at":1582793964,"uid":0,"bytes_xlated":616,"jited":true,"bytes_jited":347,"bytes_memlock":4096,"map_ids":[2]}

$ bpftool -j map dump id 2
[{"key":["0x07","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x05","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xe8","0x00","0x00","0x00","0x0a","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xff","0xff","0xff","0xff","0x88","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xe4","0x00","0x00","0x00","0x0a","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x03","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x00","0x00","0x00","0x00","0xfb","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x81","0x00","0x00","0x00","0xe2","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x08","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x09","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x02","0x00","0x00","0x00","0x05","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]}]

Now if you look at the dump of the map, each entry has a "key" and a "value", and here comes the tricky part:

- the key is a 64bit unsigned int containing major:minor (the minor in the low 32 bits, the major in the high 32 bits)
- the value is a 32bit unsigned int holding the rules for each device: rwm:bc

Parsing the printed values directly is not possible because the bytes are in reverse (little-endian) order.
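For example, taking the first entry from the dump above, a few lines of Python show how one entry decodes. This is only an illustration of the byte order; it uses the same type/permission mapping as the script below:

key = bytes([0x07, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00])
minor = int.from_bytes(key[0:4], 'little')   # 7
major = int.from_bytes(key[4:8], 'little')   # 1

value = bytes([0x02, 0x00, 0x06, 0x00])
dev_type = {1: 'b', 2: 'c', 3: 'a'}[value[0]]   # 0x02 -> 'c' (char device)
access = value[2]                               # bitmask: 1 = mknod, 2 = read, 4 = write
perms = (('r' if access & 2 else '') +
         ('w' if access & 4 else '') +
         ('m' if access & 1 else ''))

print(dev_type, '{0}:{1}'.format(major, minor), perms)   # c 1:7 rw, i.e. /dev/full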
Here is a simple python script that can print the rules for a VM in a readable format, the same as cgroups v1:

import json
import subprocess
import sys

if len(sys.argv) != 2:
    print('Only one argument is valid')
    exit(-1)

# Find the device program attached to the given cgroup, then dump its map.
progs = json.loads(subprocess.run(['bpftool', '-j', 'cgroup', 'list', sys.argv[1]],
                                  capture_output=True).stdout)
prog = json.loads(subprocess.run(['bpftool', '-j', 'prog', 'show', 'id', str(progs[0]['id'])],
                                 capture_output=True).stdout)
rules = json.loads(subprocess.run(['bpftool', '-j', 'map', 'dump', 'id', str(prog['map_ids'][0])],
                                  capture_output=True).stdout)

def list_to_num(key):
    # Assemble a little-endian list of hex-string bytes into a number;
    # the 0xffffffff wildcard is printed as '*'.
    ret = 0
    for i, n in enumerate(key):
        ret += int(n, 16) << i * 8
    return '*' if ret >= 2 ** 31 else ret

def get_types(val):
    if val == '0x01':
        return 'b'
    elif val == '0x02':
        return 'c'
    elif val == '0x03':
        return 'a'

def get_perms(val):
    val = int(val, 16)
    return '{0}{1}{2}'.format(
        'r' if val & 2 else '',
        'w' if val & 4 else '',
        'm' if val & 1 else ''
    )

for rule in rules:
    minor = list_to_num(rule['key'][0:4])
    major = list_to_num(rule['key'][4:])
    types = get_types(rule['value'][0])
    perms = get_perms(rule['value'][2])

    print('{0} {1}:{2} {3}'.format(types, major, minor, perms))
After testing it on rhel8, the script is not working, but it's easy to fix: you need to replace 'capture_output=True' with 'stdout=subprocess.PIPE', because capture_output was introduced in Python 3.7 and rhel8 has 3.6.8.
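For example, the first call in the script becomes (same behavior, just the 3.6-compatible keyword):

import json
import subprocess
import sys

# Python 3.6 equivalent of capture_output=True:
progs = json.loads(subprocess.run(
    ['bpftool', '-j', 'cgroup', 'list', sys.argv[1]],
    stdout=subprocess.PIPE).stdout)
print(progs)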
(In reply to Pavel Hrdina from comment #5)
> [full walkthrough and script quoted above]

Thanks a million for such a clear introduction and a lovely script to get the bpf map! It will save me a lot of time.
Test the 64-entry limit first; all python files in this comment can be found in the attachments.

Scenario 1: When the bpf map exceeds 64 entries, it is removed and a new map is generated containing the newly attached device

1. Generate 60 scsi disks on the host
[root@hp-dl320eg8-05 bz1717396]# python creaet_blocks.py create 40
[root@hp-dl320eg8-05 bz1717396]# lsscsi
[0:0:0:0]    disk    ATA      MM0500GBKAK      HPGE  /dev/sda
[6:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdb
[7:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdc
[8:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdd
[9:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sde
[....
[68:0:0:0]   disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdbl
[69:0:0:0]   disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdbm
<==== the new scsi block devices are /dev/sdb to /dev/sdbm

2. Prepare the virtual disk xml
[root@hp-dl320eg8-05 bz1717396]# python generate_disk_xml.py sdc vdb 60
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdc'/>
  <target dev='vdb' bus='virtio'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdd'/>
  <target dev='vdc' bus='virtio'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sde'/>
  <target dev='vdd' bus='virtio'/>
</disk>
...

3. Edit the vm's xml and add the virtual disks to the vm; make sure there are 63 entries in the vm's bpf map
[root@hp-dl320eg8-05 bz1717396]# virsh start vm1
[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
9        device
[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 9
9: cgroup_device  tag b95e404d31962705  gpl
        loaded_at 2020-03-03T10:11:30-0500  uid 0
        xlated 616B  jited 350B  memlock 4096B  map_ids 9
[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 9
key: 00 00 00 00 42 00 00 00  value: 01 00 06 00
key: e0 00 00 00 08 00 00 00  value: 01 00 06 00
…
key: 50 00 00 00 42 00 00 00  value: 01 00 06 00
key: d0 00 00 00 41 00 00 00  value: 01 00 06 00
key: 40 00 00 00 43 00 00 00  value: 01 00 06 00
Found 63 elements

4. Prepare an unused disk xml and attach it to the vm
[root@hp-dl320eg8-05 bz1717396]# cat disk1.xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdbc'/>
  <target dev='vddd' bus='virtio'/>
</disk>
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk1.xml
Device attached successfully

5. Now the vm's bpf map will have 64 entries.
Since the default bpf map size is 64, the original bpf map is removed and a new, larger one is generated, containing the original entries plus the newly attached entry.

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 9
Error: get map by id (9): No such file or directory
<==== original bpf map gone

[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
10       device
[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 10
10: cgroup_device  tag b95e404d31962705  gpl
        loaded_at 2020-03-03T10:13:25-0500  uid 0
        xlated 616B  jited 350B  memlock 4096B  map_ids 10
[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 10
key: d0 00 00 00 42 00 00 00  value: 01 00 06 00
key: 60 00 00 00 41 00 00 00  value: 01 00 06 00
…
key: 80 00 00 00 41 00 00 00  value: 01 00 06 00
key: 30 00 00 00 41 00 00 00  value: 01 00 06 00
key: 90 00 00 00 42 00 00 00  value: 01 00 06 00
key: 20 00 00 00 42 00 00 00  value: 01 00 06 00
Found 64 elements
<==== new bpf map with ID=10 created; it now has 64 elements

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/ | grep sdbc
b 67:96 rw sdbc
<==== the new host device exists in the vm's bpf map

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope
b 66:208 rw sdat
b 65:96 rw sdw
b 65:160 rw sdaa
b 65:16 rw sdr
c 10:232 rw kvm
b 66:192 rw sdas
...
<==== the original entries still exist (I used vimdiff to check the output for convenience)
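This before/after check can also be scripted. Here is a rough sketch (the helper names are mine; it just chains the bpftool JSON calls shown earlier) that returns the map id and entry count for a VM's cgroup, so an automated test can assert that the id changed and the count grew past 64 after attaching the 64th device:

import json
import subprocess
import sys

def bpftool_json(*args):
    # bpftool -j prints machine-readable JSON.
    out = subprocess.run(['bpftool', '-j'] + list(args),
                         stdout=subprocess.PIPE).stdout
    return json.loads(out)

def map_state(cgroup_path):
    # (map_id, entry_count) of the device program attached to the cgroup.
    prog_id = bpftool_json('cgroup', 'list', cgroup_path)[0]['id']
    prog = bpftool_json('prog', 'show', 'id', str(prog_id))
    map_id = prog['map_ids'][0]
    entries = bpftool_json('map', 'dump', 'id', str(map_id))
    return map_id, len(entries)

if __name__ == '__main__':
    print(map_state(sys.argv[1]))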
(In reply to yisun from comment #8)
> 1. generate 60 scsi disk on hosts
> [root@hp-dl320eg8-05 bz1717396]# python creaet_blocks.py create 40

This should be:
# python creaet_blocks.py create 60
Created attachment 1667219 [details] generate disk xml
Created attachment 1667220 [details] get_bpf_map
Created attachment 1667221 [details] create_blocks
Scenario 2: Check that the /etc/libvirt/qemu.conf related settings take effect

1. Check the qemu.conf default settings
Edit qemu.conf as:
#cgroup_device_acl = [
#    "/dev/null", "/dev/full", "/dev/zero",
#    "/dev/random", "/dev/urandom",
#    "/dev/ptmx", "/dev/kvm",
#    "/dev/rtc","/dev/hpet"
#]
Restart libvirtd; destroy and start the vm.
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/
c 10:228 rw hpet
c 251:0 rw rtc0
c 1:9 rw urandom
c 10:232 rw kvm
c 1:7 rw full
c 1:8 rw random
c 136:* rw
c 5:2 rw ptmx
c 1:5 rw zero
c 1:3 rw null
<==== correct per qemu.conf's default setting

2. Remove a device (/dev/hpet) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc"
]
Restart libvirtd; destroy and start the vm.
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/
c 5:2 rw ptmx
c 136:* rw
c 1:7 rw full
c 1:8 rw random
c 1:3 rw null
c 10:232 rw kvm
c 1:9 rw urandom
c 251:0 rw rtc0
c 1:5 rw zero
<==== hpet not present, as expected

3. Add a device (/dev/sdz) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc", "/dev/sdz"
]
Restart libvirtd; destroy and start the vm.
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/
c 5:2 rw ptmx
b 65:144 rw sdz
<==== sdz entry exists as expected
c 136:* rw
c 1:9 rw urandom
c 1:5 rw zero
c 1:8 rw random
c 251:0 rw rtc0
c 10:232 rw kvm
c 1:3 rw null
c 1:7 rw full

4. Add a non-existing device (/dev/non_existing) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc", "/dev/non_existing"
]
Restart libvirtd; destroy and start the vm.
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/
c 251:0 rw rtc0
c 1:5 rw zero
c 1:9 rw urandom
c 10:232 rw kvm
c 1:7 rw full
c 1:8 rw random
c 1:3 rw null
c 5:2 rw ptmx
c 136:* rw
<==== nothing wrong

5. Add duplicated devices (/dev/sdz) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc", "/dev/sdz", "/dev/sdz"
]
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw sdz
c 1:8 rw random
c 5:2 rw ptmx
c 1:9 rw urandom
c 1:5 rw zero
c 1:3 rw null
c 10:232 rw kvm
c 251:0 rw rtc0
c 1:7 rw full
c 136:* rw
<==== nothing wrong, and /dev/sdz exists
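To verify entries like "b 65:144 rw sdz" without looking the numbers up by hand, the expected major:minor can be computed from the device node itself. A small sketch (the helper name device_key is mine):

import os
import stat

def device_key(path):
    # Resolve a device node to (type, major, minor) as printed by
    # get_bpf_map.py, e.g. /dev/sdz -> ('b', 65, 144) on this host.
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode):
        dev_type = 'b'
    elif stat.S_ISCHR(st.st_mode):
        dev_type = 'c'
    else:
        raise ValueError('{0} is not a device node'.format(path))
    return dev_type, os.major(st.st_rdev), os.minor(st.st_rdev)

print(device_key('/dev/null'))  # ('c', 1, 3)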
Scenario 3: Check that a newly attached host device shows up in the vm's bpf map

1. Attach a host block device
[root@hp-dl320eg8-05 bz1717396]# cat disk.xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdbb'/>
  <target dev='vdbd' bus='virtio'/>
</disk>
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw sdz
c 1:8 rw random
c 5:2 rw ptmx
c 1:9 rw urandom
c 1:5 rw zero
c 1:3 rw null
c 10:232 rw kvm
c 251:0 rw rtc0
c 1:7 rw full
c 136:* rw
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk.xml
Device attached successfully
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw sdz
c 1:8 rw random
b 67:80 rw sdbb
<==== sdbb exists as expected
c 5:2 rw ptmx
c 1:9 rw urandom
c 1:5 rw zero
c 1:3 rw null
c 10:232 rw kvm
c 251:0 rw rtc0
c 1:7 rw full
c 136:* rw

2. Attach an image whose backing chain points to a host block device
[root@hp-dl320eg8-05 bz1717396]# qemu-img create -f qcow2 /tmp/layer1.qcow2 1G
Formatting '/tmp/layer1.qcow2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[root@hp-dl320eg8-05 bz1717396]# qemu-img rebase -b /dev/sdy -F raw /tmp/layer1.qcow2
[root@hp-dl320eg8-05 bz1717396]# qemu-img info /tmp/layer1.qcow2 --backing-chain
image: /tmp/layer1.qcow2
file format: qcow2
virtual size: 1 GiB (1073741824 bytes)
disk size: 196 KiB
cluster_size: 65536
backing file: /dev/sdy
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

image: /dev/sdy
file format: raw
virtual size: 50 MiB (52428800 bytes)
disk size: 0 B
[root@hp-dl320eg8-05 bz1717396]# cat disk_with_backing_chain.xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/tmp/layer1.qcow2'/>
  <target dev='vdzz' bus='virtio'/>
</disk>
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk_with_backing_chain.xml
Device attached successfully
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw sdz
c 1:8 rw random
b 65:128 r sdy
<==== sdy exists as expected
b 67:80 rw sdbb
c 5:2 rw ptmx
c 1:9 rw urandom
c 1:5 rw zero
c 1:3 rw null
c 10:232 rw kvm
c 251:0 rw rtc0
c 1:7 rw full
c 136:* rw
With comment 8 to comment 14, the test result is: PASS

All the scenarios were carried out with:
qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
libvirt-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
systemd-239-23.el8.x86_64

And with cgroups v2 enabled:
[root@hp-dl320eg8-05 bz1717396]# mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
Forgot one scenario, adding it.

Scenario 4: When the vm is destroyed, the bpf program and map should be removed

[root@hp-dl320eg8-05 ~]# virsh list
 Id   Name   State
----------------------
 10   vm1    running

[root@hp-dl320eg8-05 ~]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
16       device
[root@hp-dl320eg8-05 ~]# bpftool prog show id 16
16: cgroup_device  tag b95e404d31962705  gpl
        loaded_at 2020-03-03T10:49:58-0500  uid 0
        xlated 616B  jited 350B  memlock 4096B  map_ids 16
[root@hp-dl320eg8-05 ~]# bpftool map show id 16
16: hash  flags 0x0
        key 8B  value 4B  max_entries 64  memlock 12288B

[root@hp-dl320eg8-05 ~]# virsh destroy vm1
Domain vm1 destroyed

[root@hp-dl320eg8-05 ~]# bpftool prog show id 16
Error: get by id (16): No such file or directory
<==== removed as expected
[root@hp-dl320eg8-05 ~]# bpftool map show id 16
Error: get map by id (16): No such file or directory
<==== removed as expected
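This teardown check is also easy to automate. A sketch, assuming bpftool exits non-zero when the id no longer exists (consistent with the errors above, but not verified against bpftool's documentation):

import subprocess

def prog_gone(prog_id):
    # Assumption: bpftool exits non-zero ("Error: get by id ...")
    # when the program id no longer exists.
    res = subprocess.run(['bpftool', 'prog', 'show', 'id', str(prog_id)],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return res.returncode != 0

print(prog_gone(16))  # expected: True after "virsh destroy vm1"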
Hi Pavel,

I met a problem during testing; could you please help to confirm whether this is by design, thx.

Problem: When a device is detached from the vm, the device's key:value pair still exists in the bpf map, but the value is reset to all zeros. Should the whole entry be removed, or is the current behavior expected? thx

Steps:
[root@hp-dl320eg8-05 bz1717396]# virsh start vm1
Domain vm1 started

[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
18       device
[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 18
18: cgroup_device  tag b95e404d31962705  gpl
        loaded_at 2020-03-03T23:33:56-0500  uid 0
        xlated 616B  jited 350B  memlock 4096B  map_ids 18
[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 10 elements

[root@hp-dl320eg8-05 bz1717396]# cat disk.xml
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdbb'/>
  <target dev='vdbd' bus='virtio'/>
</disk>
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk.xml
Device attached successfully

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 50 00 00 00 43 00 00 00  value: 01 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 11 elements
<====== NEW ENTRY IS - key: 50 00 00 00 43 00 00 00 value: 01 00 06 00

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
c 1:7 rw full
c 136:* rw
c 1:3 rw null
c 5:2 rw ptmx
c 1:9 rw urandom
c 251:0 rw rtc0
c 1:8 rw random
b 65:144 rw sdz
b 67:80 rw sdbb
c 1:5 rw zero
c 10:232 rw kvm

[root@hp-dl320eg8-05 bz1717396]# virsh detach-device vm1 disk.xml
Device detached successfully

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 50 00 00 00 43 00 00 00  value: 00 00 00 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 11 elements
<===== Not back to 10 elements

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
c 1:7 rw full
c 136:* rw
c 1:3 rw null
c 5:2 rw ptmx
c 1:9 rw urandom
c 251:0 rw rtc0
c 1:8 rw random
b 65:144 rw sdz
None 67:80
c 1:5 rw zero
c 10:232 rw kvm
<==== Now the major:minor==67:80 device still
exists in the map, but its value has been set to all zeros.

That means that when dumping map id=18, the following entry still exists, but its value has been erased to all zeros:

FROM:
# bpftool map dump id 18
key: 50 00 00 00 43 00 00 00  value: 01 00 06 00
...

TO:
# bpftool map dump id 18
key: 50 00 00 00 43 00 00 00  value: 00 00 00 00
...
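A quick sketch to spot such leftovers in any map (an all-zero value has no permission bits set, so the device remains denied; the script name find_stale.py is mine):

import json
import subprocess
import sys

# List entries whose value is all zeros, i.e. leftovers from detached
# devices. Usage: python find_stale.py <map_id>
out = subprocess.run(['bpftool', '-j', 'map', 'dump', 'id', sys.argv[1]],
                     stdout=subprocess.PIPE).stdout
for entry in json.loads(out):
    if all(int(b, 16) == 0 for b in entry['value']):
        minor = int.from_bytes(bytes(int(b, 16) for b in entry['key'][0:4]), 'little')
        major = int.from_bytes(bytes(int(b, 16) for b in entry['key'][4:8]), 'little')
        print('stale entry for {0}:{1}'.format(major, minor))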
Nice catch of the issue that the entry is not removed. It's a minor issue as the device is still forbidden but it would be nice to fix it. Can you please create a new BZ for that issue? Thanks.
(In reply to Pavel Hrdina from comment #18)
> Nice catch of the issue that the entry is not removed. It's a minor issue
> as the device is still forbidden but it would be nice to fix it. Can you
> please create a new BZ for that issue? Thanks.

Thanks for confirming. New issue reported:
Bug 1810356 - [cgroup2] When detach a device from vm, the device entry not removed from vm's bpf map
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2017