Bug 1717396 - RFE: add cgroups v2 BPF devices support
Summary: RFE: add cgroups v2 BPF devices support
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Pavel Hrdina
QA Contact: yisun
URL:
Whiteboard:
Depends On: 1513930 1656432
Blocks:
 
Reported: 2019-06-05 12:11 UTC by Pavel Hrdina
Modified: 2020-11-06 04:04 UTC (History)
CC List: 22 users

Fixed In Version: libvirt-6.0.0-1.el8
Doc Type: Enhancement
Doc Text:
Clone Of: 1717394
Environment:
Last Closed: 2020-05-05 09:46:14 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:
knoel: mirror+


Attachments
generate disk xml (1.67 KB, text/plain)
2020-03-03 15:41 UTC, yisun
no flags Details
get_bpf_map (1.66 KB, text/plain)
2020-03-03 15:42 UTC, yisun
no flags Details
create_blocks (1.94 KB, text/plain)
2020-03-03 15:43 UTC, yisun
no flags Details

Description Pavel Hrdina 2019-06-05 12:11:20 UTC
Description of problem:

In cgroups v2 the devices controller was dropped in favor of eBPF programs.
We need to implement support for eBPF cgroup programs in order to be able
to filter access to devices.

This is not a critical feature, as we already create a namespace for the QEMU
process to isolate it from the host.

https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://www.kernel.org/doc/Documentation/networking/filter.txt
https://cilium.readthedocs.io/en/v1.3/bpf/

Comment 1 Pavel Hrdina 2020-01-10 16:08:00 UTC
Upstream commits:

8addef2bef vircgroupmock: mock virCgroupV2DevicesAvailable
c359cb9aee vircgroup: workaround devices in hybrid mode
884479b42b vircgroup: introduce virCgroupV2DenyAllDevices
285aefb31c vircgroup: introduce virCgroupV2AllowAllDevices
d5b09ce5d9 vircgroup: introduce virCgroupV2DenyDevice
5d49651912 vircgroup: introduce virCgroupV2AllowDevice
b18b0ce609 vircgroup: introduce virCgroupV2DevicesGetKey
63cfe7b84d vircgroup: introduce virCgroupV2DeviceGetPerms
6a24bd75ed vircgroup: introduce virCgroupV2DevicesRemoveProg
ef747499a5 vircgroup: introduce virCgroupV2DevicesPrepareProg
afa2788662 vircgroup: introduce virCgroupV2DevicesCreateProg
ce11a5c59f vircgroup: introduce virCgroupV2DevicesDetectProg
48423a0b5d vircgroup: introduce virCgroupV2DevicesAttachProg
30b6ddc44c vircgroup: introduce virCgroupV2DevicesAvailable
07946d6e39 util: introduce virbpf helpers

Released in libvirt-5.10.0

Comment 2 yisun 2020-01-16 06:40:45 UTC
(In reply to Pavel Hrdina from comment #1)
> [...]

Hi Pavel,
Could you please give us an example of how eBPF is used in libvirt when cgroups v2 is enabled? For example, how to set the device ACL and where to check it. I have three questions about this; could you please help confirm them before the bug moves to ON_QA? Thanks.

1. Will the cgroup_device_acl setting still take effect?
In /etc/libvirt/qemu.conf there are the following default device ACL settings; will they still take effect, and can they be used to configure eBPF?
# cat /etc/libvirt/qemu.conf | grep cgroup_device_acl -A 9
# cgroup_device_acl = [
# "/dev/null", "/dev/full", "/dev/zero",
# "/dev/random", "/dev/urandom",
# "/dev/ptmx", "/dev/kvm",
# "/dev/rtc","/dev/hpet", "/dev/vfio/vfio"
#]

2. No new virsh command seems to have been introduced, so does that mean all ACLs are set automatically? And where can we check whether the settings took effect?
For example, in cgroups v1, if a vm has a virtual disk pointing to the host's /dev/sdb, whose major:minor is 8:16, then when the vm starts an entry "b 8:16 rw" is added to /sys/fs/cgroup/devices/machine.slice/machine-qemu\\x2dguest.scope/devices.list
Does eBPF do something similar?

3. What else needs to be tested besides the above two areas? (1. the qemu.conf ACL setting; 2. ACLs being set automatically when the vm uses host devices)

Comment 5 Pavel Hrdina 2020-02-27 12:36:23 UTC
(In reply to yisun from comment #2)
> (In reply to Pavel Hrdina from comment #1)
> > [...]
> 
> Hi Pevel,
> Could you please give us an example of how eBPF is used in libvirt when
> cgroups v2 is enabled? For example, how to set the device ACL and where to
> check it. I have three questions about this; could you please help confirm
> them before the bug moves to ON_QA? Thanks.

Hi Yi,

Sure.  First of all, it is used automatically for every VM.  When you start a VM,
a set of default devices is added to the eBPF map based on cgroup_device_acl,
along with all other devices used directly by that VM.

> 1. Will the cgroup_device_acl setting still take effect?
> In /etc/libvirt/qemu.conf there are the following default device ACL settings;
> will they still take effect, and can they be used to configure eBPF?
> # cat /etc/libvirt/qemu.conf | grep cgroup_device_acl -A 9
> # cgroup_device_acl = [
> # "/dev/null", "/dev/full", "/dev/zero",
> # "/dev/random", "/dev/urandom",
> # "/dev/ptmx", "/dev/kvm",
> # "/dev/rtc","/dev/hpet", "/dev/vfio/vfio"
> #]

These are still used with cgroups v2 as well and they are enabled for each VM
as a set of default devices.

> 2. No new virsh command seems to have been introduced, so does that mean all
> ACLs are set automatically? And where can we check whether the settings took
> effect?
> For example, in cgroups v1, if a vm has a virtual disk pointing to the host's
> /dev/sdb, whose major:minor is 8:16, then when the vm starts an entry
> "b 8:16 rw" is added to
> /sys/fs/cgroup/devices/machine.slice/machine-qemu\\x2dguest.scope/devices.list
> Does eBPF do something similar?

The usage of eBPF in libvirt for cgroups v2 is similar: when the VM is started,
a rule is added to the eBPF map for each device.  It is possible to check the
rules, but it's not as simple as for cgroups v1.

> 3. What else needs to be tested besides the above two areas? (1. the
> qemu.conf ACL setting; 2. ACLs being set automatically when the vm uses host
> devices)

I would say we should test everything as for cgroups v1:

1) whether eBPF is actually used (an eBPF program and eBPF map are created for the VM and the map has some entries)
2) whether changes in qemu.conf are properly reflected in eBPF
3) whether devices used by the VM are properly enabled in eBPF
4) whether the eBPF program and eBPF map are removed once the VM is shut off
5) the default eBPF map can hold 64 different devices, so it would be nice to test that adding more than 64 devices results in a new, larger eBPF map that contains all devices correctly, with the old map removed


Now, how to actually check all of this eBPF complexity.  Unfortunately it's not as simple as for cgroups v1.
For eBPF there is a tool developed by the kernel community called "bpftool" which can be used to list eBPF programs and eBPF maps.
You will have to set SELinux to permissive mode (setenforce 0) in order to access the eBPF program and map created by libvirt.

To get the program attached to a specific cgroup you can run:

$ bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dgeneric.scope/
ID       AttachType      AttachFlags     Name           
94       device 

With the ID you can get info about the program including its maps:

$ bpftool prog show id 94
94: cgroup_device  tag b95e404d31962705  gpl
	loaded_at 2020-02-27T09:59:24+0100  uid 0
	xlated 616B  jited 347B  memlock 4096B  map_ids 2

With the map id you can dump the content of the map:

$ bpftool map dump id 2
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: e4 00 00 00 0a 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 81 00 00 00 e2 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
Found 11 elements

All of this can be automated because bpftool can print its output as JSON using the -j option:

$ bpftool -j cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dgeneric.scope/
[{"id":94,"attach_type":"device","attach_flags":"","name":""}]

$ bpftool -j prog show id 94
{"id":94,"type":"cgroup_device","tag":"b95e404d31962705","gpl_compatible":true,"loaded_at":1582793964,"uid":0,"bytes_xlated":616,"jited":true,"bytes_jited":347,"bytes_memlock":4096,"map_ids":[2]}

$ bpftool -j map dump id 2
[{"key":["0x07","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x05","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xe8","0x00","0x00","0x00","0x0a","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xff","0xff","0xff","0xff","0x88","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0xe4","0x00","0x00","0x00","0x0a","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x03","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x00","0x00","0x00","0x00","0xfb","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x81","0x00","0x00","0x00","0xe2","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x08","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x09","0x00","0x00","0x00","0x01","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]},{"key":["0x02","0x00","0x00","0x00","0x05","0x00","0x00","0x00"],"value":["0x02","0x00","0x06","0x00"]}]


Now if you look at the dump of the map, each entry has a "key" and a "value", and here comes the tricky part:

    - the key is a 64-bit unsigned int: the minor number is in the low 32 bits and the major number in the high 32 bits
    - the value is a 32-bit unsigned int: the device type (b/c) is in the low 16 bits and the access bits (rwm) in the high 16 bits

Parsing the values directly from the dump is not possible because the bytes are printed in reverse (little-endian) order.
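
For example, decoding the first entry of the dump above by hand:

    key:   07 00 00 00 01 00 00 00  ->  minor = 0x07 = 7, major = 0x01 = 1
    value: 02 00 06 00              ->  type = 0x02 = char ('c'), access = 0x06 = read (0x2) | write (0x4) = 'rw'

so the entry corresponds to "c 1:7 rw", i.e. /dev/full.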

Here is a simple Python script that prints the rules for a VM in a readable format, the same as cgroups v1:

import json
import subprocess
import sys

if len(sys.argv) != 2:
    print('Usage: {0} <cgroup path>'.format(sys.argv[0]))
    exit(-1)

# Find the eBPF program attached to the given cgroup.
progs = json.loads(subprocess.run(['bpftool', '-j', 'cgroup', 'list', sys.argv[1]], capture_output=True).stdout)

# Show the program details, including the IDs of the maps it uses.
prog = json.loads(subprocess.run(['bpftool', '-j', 'prog', 'show', 'id', str(progs[0]['id'])], capture_output=True).stdout)

# Dump all key/value pairs from the program's device map.
rules = json.loads(subprocess.run(['bpftool', '-j', 'map', 'dump', 'id', str(prog['map_ids'][0])], capture_output=True).stdout)

def list_to_num(key):
    # Convert a list of little-endian hex byte strings into a number;
    # 0xffffffff (-1 stored as unsigned) means "any", printed as '*'.
    ret = 0
    for i, n in enumerate(key):
        ret += int(n, 16) << i * 8
    return '*' if ret >= 2 ** 31 else ret

def get_types(val):
    # The device type is in the first value byte: 1 = block, 2 = char, 3 = all.
    if val == '0x01':
        return 'b'
    elif val == '0x02':
        return 'c'
    elif val == '0x03':
        return 'a'

def get_perms(val):
    # The access bits are in the third value byte: 2 = read, 4 = write, 1 = mknod.
    val = int(val, 16)
    return '{0}{1}{2}'.format(
            'r' if val & 2 else '',
            'w' if val & 4 else '',
            'm' if val & 1 else ''
    )

for rule in rules:
    minor = list_to_num(rule['key'][0:4])  # low 32 bits of the key
    major = list_to_num(rule['key'][4:])   # high 32 bits of the key
    types = get_types(rule['value'][0])
    perms = get_perms(rule['value'][2])

    print('{0} {1}:{2} {3}'.format(types, major, minor, perms))
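
For example, assuming the script is saved as get_bpf_map.py, running it against the cgroup from the bpftool examples above prints one line per map entry in the cgroups v1 devices.list format:

$ python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dgeneric.scope/
c 1:7 rw
c 1:5 rw
c 10:232 rw
...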

Comment 6 Pavel Hrdina 2020-02-27 12:51:31 UTC
After testing it on RHEL 8 the script is not working, but it's easy to fix.

You need to replace 'capture_output=True' with 'stdout=subprocess.PIPE',
because capture_output was introduced in Python 3.7 but RHEL 8 ships Python 3.6.8.
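
For example, the first call becomes:

progs = json.loads(subprocess.run(['bpftool', '-j', 'cgroup', 'list', sys.argv[1]], stdout=subprocess.PIPE).stdout)

and likewise for the other two subprocess.run() calls.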

Comment 7 yisun 2020-03-02 13:56:06 UTC
(In reply to Pavel Hrdina from comment #5)
> [...]
Thanks a million for such a clear introduction and a lovely script to get the bpf map! It will save me a lot of time.

Comment 8 yisun 2020-03-03 15:33:42 UTC
Testing the 64-entry limit first; all Python files used in this comment can be found in the attachments.
Scenario 1: When the bpf map would exceed 64 entries, it is removed and a new map is generated that also holds the newly attached device
1. Generate 60 SCSI disks on the host
[root@hp-dl320eg8-05 bz1717396]# python creaet_blocks.py create 40
[root@hp-dl320eg8-05 bz1717396]# lsscsi
[0:0:0:0]    disk    ATA      MM0500GBKAK      HPGE  /dev/sda
[6:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdb
[7:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdc
[8:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdd
[9:0:0:0]    disk    LIO-ORG  device.iscsi-di  4.0   /dev/sde
[....
[68:0:0:0]   disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdbl
[69:0:0:0]   disk    LIO-ORG  device.iscsi-di  4.0   /dev/sdbm
<==== the new SCSI block devices are /dev/sdb through /dev/sdbm

2. Prepare virtual disk xml
[root@hp-dl320eg8-05 bz1717396]# python generate_disk_xml.py sdc vdb 60
<disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/sdc'/>
    <target dev='vdb' bus='virtio'/>
</disk>
<disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/sdd'/>
    <target dev='vdc' bus='virtio'/>
</disk>
<disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/sde'/>
    <target dev='vdd' bus='virtio'/>
</disk>
...

3. Edit the vm's xml to add the virtual disks, making sure there are 63 entries in the vm's bpf map
[root@hp-dl320eg8-05 bz1717396]# virsh start vm1
[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
9        device

[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 9
9: cgroup_device  tag b95e404d31962705  gpl
	loaded_at 2020-03-03T10:11:30-0500  uid 0
	xlated 616B  jited 350B  memlock 4096B  map_ids 9


[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 9
key: 00 00 00 00 42 00 00 00  value: 01 00 06 00
key: e0 00 00 00 08 00 00 00  value: 01 00 06 00
…
key: 50 00 00 00 42 00 00 00  value: 01 00 06 00
key: d0 00 00 00 41 00 00 00  value: 01 00 06 00
key: 40 00 00 00 43 00 00 00  value: 01 00 06 00
Found 63 elements

4. Prepare an unused disk xml and attach it to the vm
[root@hp-dl320eg8-05 bz1717396]# cat disk1.xml
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/sdbc'/>
      <target dev='vddd' bus='virtio'/>
    </disk>

[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk1.xml
Device attached successfully

5. Now the vm's bpf map will have 64 entries. Since the default bpf map size is 64, the original bpf map is removed and a new, larger one is generated, containing the original entries plus the newly attached entry
[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 9
Error: get map by id (9): No such file or directory
<==== original bpf map is gone

[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
10       device

[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 10
10: cgroup_device  tag b95e404d31962705  gpl
	loaded_at 2020-03-03T10:13:25-0500  uid 0
	xlated 616B  jited 350B  memlock 4096B  map_ids 10

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 10
key: d0 00 00 00 42 00 00 00  value: 01 00 06 00
key: 60 00 00 00 41 00 00 00  value: 01 00 06 00
…
key: 80 00 00 00 41 00 00 00  value: 01 00 06 00
key: 30 00 00 00 41 00 00 00  value: 01 00 06 00
key: 90 00 00 00 42 00 00 00  value: 01 00 06 00
key: 20 00 00 00 42 00 00 00  value: 01 00 06 00
Found 64 elements
<==== new bpf map with ID=10 created, now holding 64 elements
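(The resize can be confirmed with "bpftool map show id 10", which displays the new map's max_entries; that output was not captured here, but per the scenario above it should now be larger than the default 64.)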

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/ | grep sdbc
b 67:96 rw		sdbc
<==== the new host device exists in the vm's bpf map

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope
b 66:208 rw		sdat
b 65:96 rw		sdw
b 65:160 rw		sdaa
b 65:16 rw		sdr
c 10:232 rw		kvm
b 66:192 rw		sdas
...
<==== the original entries still exist (I used vimdiff to check the output for convenience)

Comment 9 yisun 2020-03-03 15:35:00 UTC
(In reply to yisun from comment #8)
> Testing the 64-entry limit first; all Python files used in this comment can
> be found in the attachments.
> Scenario 1: When the bpf map would exceed 64 entries, it is removed and a new
> map is generated that also holds the newly attached device
> 1. Generate 60 SCSI disks on the host
> [root@hp-dl320eg8-05 bz1717396]# python creaet_blocks.py create 40
This should be: # python creaet_blocks.py create 60

Comment 10 yisun 2020-03-03 15:41:35 UTC
Created attachment 1667219 [details]
generate disk xml

Comment 11 yisun 2020-03-03 15:42:48 UTC
Created attachment 1667220 [details]
get_bpf_map

Comment 12 yisun 2020-03-03 15:43:21 UTC
Created attachment 1667221 [details]
create_blocks

Comment 13 yisun 2020-03-03 15:56:38 UTC
Scenario 2: Check that the cgroup_device_acl setting in /etc/libvirt/qemu.conf takes effect:
1. Check qemu.conf default settings
Edit qemu.conf as:
#cgroup_device_acl = [
#    "/dev/null", "/dev/full", "/dev/zero",
#    "/dev/random", "/dev/urandom",
#    "/dev/ptmx", "/dev/kvm",
#    "/dev/rtc","/dev/hpet"
#]

Restart libvirtd
Destroy and start vm 

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/
c 10:228 rw		hpet
c 251:0 rw		rtc0
c 1:9 rw		urandom
c 10:232 rw		kvm
c 1:7 rw		full
c 1:8 rw		random
c 136:* rw
c 5:2 rw		ptmx
c 1:5 rw		zero
c 1:3 rw		null
<==== matches qemu.conf's default settings

2. Remove a device (/dev/hpet) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc"
]

Restart libvirtd
Destroy and start vm 

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/
c 5:2 rw		ptmx
c 136:* rw
c 1:7 rw		full
c 1:8 rw		random
c 1:3 rw		null
c 10:232 rw		kvm
c 1:9 rw		urandom
c 251:0 rw		rtc0
c 1:5 rw		zero
<==== hpet is absent, as expected


3. Add a device (/dev/sdz) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc", "/dev/sdz"
]

Restart libvirtd
Destroy and start vm 

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/
c 5:2 rw		ptmx
b 65:144 rw		sdz
<==== sdz entry exists as expected
c 136:* rw
c 1:9 rw		urandom
c 1:5 rw		zero
c 1:8 rw		random
c 251:0 rw		rtc0
c 10:232 rw		kvm
c 1:3 rw		null
c 1:7 rw		full

4. Add a non-existing device (/dev/non_existing) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc", "/dev/non_existing"
]

Restart libvirtd
Destroy and start vm 

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/
c 251:0 rw		rtc0
c 1:5 rw		zero
c 1:9 rw		urandom
c 10:232 rw		kvm
c 1:7 rw		full
c 1:8 rw		random
c 1:3 rw		null
c 5:2 rw		ptmx
c 136:* rw
<==== no errors; the non-existent device is simply ignored

5. Add a duplicated device (/dev/sdz) in qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/rtc",
    "/dev/sdz", "/dev/sdz"
]

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw		sdz
c 1:8 rw		random
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 1:5 rw		zero
c 1:3 rw		null
c 10:232 rw		kvm
c 251:0 rw		rtc0
c 1:7 rw		full
c 136:* rw
<==== no errors, and /dev/sdz exists (appearing only once)

Comment 14 yisun 2020-03-03 16:18:23 UTC
Scenario 3: Check that a newly attached host device shows up in the vm's bpf map
1. Attach a host block device
[root@hp-dl320eg8-05 bz1717396]# cat disk.xml
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/sdbb'/>
      <target dev='vdbd' bus='virtio'/>
    </disk>
[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw		sdz
c 1:8 rw		random
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 1:5 rw		zero
c 1:3 rw		null
c 10:232 rw		kvm
c 251:0 rw		rtc0
c 1:7 rw		full
c 136:* rw
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk.xml
Device attached successfully

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw		sdz
c 1:8 rw		random
b 67:80 rw		sdbb
<==== sdbb exists as expected
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 1:5 rw		zero
c 1:3 rw		null
c 10:232 rw		kvm
c 251:0 rw		rtc0
c 1:7 rw		full
c 136:* rw

2. Attach an image whose backing chain points to a host block device
[root@hp-dl320eg8-05 bz1717396]# qemu-img create -f qcow2 /tmp/layer1.qcow2 1G
Formatting '/tmp/layer1.qcow2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16

[root@hp-dl320eg8-05 bz1717396]# qemu-img rebase -b /dev/sdy -F raw /tmp/layer1.qcow2

[root@hp-dl320eg8-05 bz1717396]# qemu-img info /tmp/layer1.qcow2 --backing-chain
image: /tmp/layer1.qcow2
file format: qcow2
virtual size: 1 GiB (1073741824 bytes)
disk size: 196 KiB
cluster_size: 65536
backing file: /dev/sdy
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

image: /dev/sdy
file format: raw
virtual size: 50 MiB (52428800 bytes)
disk size: 0 B

[root@hp-dl320eg8-05 bz1717396]# cat disk_with_backing_chain.xml
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/layer1.qcow2'/>
      <target dev='vdzz' bus='virtio'/>
    </disk>

[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk_with_backing_chain.xml
Device attached successfully

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
b 65:144 rw		sdz
c 1:8 rw		random
b 65:128 r		sdy
<==== sdy exists as expected (read-only, since it is a backing file)
b 67:80 rw		sdbb
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 1:5 rw		zero
c 1:3 rw		null
c 10:232 rw		kvm
c 251:0 rw		rtc0
c 1:7 rw		full
c 136:* rw

Comment 15 yisun 2020-03-03 16:27:02 UTC
Per comments 8 through 14, the test result is: PASS

All scenarios were carried out with:
qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
libvirt-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
systemd-239-23.el8.x86_64

And with cgroups v2 enabled:
[root@hp-dl320eg8-05 bz1717396]# mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
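
(For reference: cgroups v2 is not the default on RHEL 8; the host here is assumed to have been booted with the systemd.unified_cgroup_hierarchy=1 kernel parameter, e.g. set via:

# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
# reboot

The exact configuration used on this host was not captured in the comment.)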

Comment 16 yisun 2020-03-04 03:19:52 UTC
Forgot one scenario; adding it here.
Scenario 4: When the vm is destroyed, the bpf program and map should be removed

[root@hp-dl320eg8-05 ~]# virsh list
 Id   Name   State
----------------------
 10   vm1    running

[root@hp-dl320eg8-05 ~]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d10\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
16       device

[root@hp-dl320eg8-05 ~]# bpftool prog show id 16
16: cgroup_device  tag b95e404d31962705  gpl
	loaded_at 2020-03-03T10:49:58-0500  uid 0
	xlated 616B  jited 350B  memlock 4096B  map_ids 16

[root@hp-dl320eg8-05 ~]# bpftool map show id 16
16: hash  flags 0x0
	key 8B  value 4B  max_entries 64  memlock 12288B

[root@hp-dl320eg8-05 ~]# virsh destroy vm1
Domain vm1 destroyed

[root@hp-dl320eg8-05 ~]# bpftool prog show id 16
Error: get by id (16): No such file or directory
<==== removed as expected

[root@hp-dl320eg8-05 ~]# bpftool map show id 16
Error: get map by id (16): No such file or directory
<==== removed as expected

Comment 17 yisun 2020-03-04 04:47:58 UTC
Hi Pavel,
I met a problem during testing; could you please help confirm whether this is by design? Thanks.

Problem:
When a device is detached from the vm, the device's key:value pair still exists in the bpf map, but the value is reset to all zeros. Should the whole entry be removed, or is the current behavior expected? Thanks.

Steps:
[root@hp-dl320eg8-05 bz1717396]# virsh start vm1
Domain vm1 started

[root@hp-dl320eg8-05 bz1717396]# bpftool cgroup list /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
ID       AttachType      AttachFlags     Name
18       device

[root@hp-dl320eg8-05 bz1717396]# bpftool prog show id 18
18: cgroup_device  tag b95e404d31962705  gpl
	loaded_at 2020-03-03T23:33:56-0500  uid 0
	xlated 616B  jited 350B  memlock 4096B  map_ids 18

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 10 elements

[root@hp-dl320eg8-05 bz1717396]# cat disk.xml
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/sdbb'/>
      <target dev='vdbd' bus='virtio'/>
    </disk>
[root@hp-dl320eg8-05 bz1717396]# virsh attach-device vm1 disk.xml
Device attached successfully

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 50 00 00 00 43 00 00 00  value: 01 00 06 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 11 elements
<======  NEW ENTRY IS - key: 50 00 00 00 43 00 00 00  value: 01 00 06 00
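(Decoding the new entry: minor = 0x50 = 80, major = 0x43 = 67, type = 0x01 = block, access = 0x06 = rw, i.e. "b 67:80 rw", which is /dev/sdbb.)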

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
c 1:7 rw		full
c 136:* rw
c 1:3 rw		null
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 251:0 rw		rtc0
c 1:8 rw		random
b 65:144 rw		sdz
b 67:80 rw		sdbb
c 1:5 rw		zero
c 10:232 rw		kvm

[root@hp-dl320eg8-05 bz1717396]# virsh detach-device vm1 disk.xml
Device detached successfully

[root@hp-dl320eg8-05 bz1717396]# bpftool map dump id 18
key: 07 00 00 00 01 00 00 00  value: 02 00 06 00
key: ff ff ff ff 88 00 00 00  value: 02 00 06 00
key: 03 00 00 00 01 00 00 00  value: 02 00 06 00
key: 02 00 00 00 05 00 00 00  value: 02 00 06 00
key: 09 00 00 00 01 00 00 00  value: 02 00 06 00
key: 00 00 00 00 fb 00 00 00  value: 02 00 06 00
key: 08 00 00 00 01 00 00 00  value: 02 00 06 00
key: 90 00 00 00 41 00 00 00  value: 01 00 06 00
key: 50 00 00 00 43 00 00 00  value: 00 00 00 00
key: 05 00 00 00 01 00 00 00  value: 02 00 06 00
key: e8 00 00 00 0a 00 00 00  value: 02 00 06 00
Found 11 elements
<===== Not back to 10 elements

[root@hp-dl320eg8-05 bz1717396]# python get_bpf_map.py /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/
c 1:7 rw		full
c 136:* rw
c 1:3 rw		null
c 5:2 rw		ptmx
c 1:9 rw		urandom
c 251:0 rw		rtc0
c 1:8 rw		random
b 65:144 rw		sdz
None 67:80
c 1:5 rw		zero
c 10:232 rw		kvm
<==== The major:minor == 67:80 device still exists in the map, but its value has been set to all zeros.
That is, when dumping map id=18, the following entry still exists, but its value has been erased to all zeros:
FROM:
# bpftool map dump id 18
key: 50 00 00 00 43 00 00 00  value: 01 00 06 00
...
TO:
# bpftool map dump id 18
key: 50 00 00 00 43 00 00 00  value: 00 00 00 00
...

Comment 18 Pavel Hrdina 2020-03-04 15:59:28 UTC
Nice catch on the entry not being removed.  It's a minor issue, as the device is still forbidden, but it would be nice to fix it. Can you please create a new BZ for that issue? Thanks.

Comment 19 yisun 2020-03-05 04:12:07 UTC
(In reply to Pavel Hrdina from comment #18)
> Nice catch on the entry not being removed.  It's a minor issue, as the device
> is still forbidden, but it would be nice to fix it. Can you please create a
> new BZ for that issue? Thanks.

Thanks for confirming. New issue reported: Bug 1810356 - [cgroup2] When detach a device from vm, the device entry not removed from vm's bpf map

Comment 21 errata-xmlrpc 2020-05-05 09:46:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017

