Description of problem:

I'm trying to run `sudo` inside a ppc64le container that has qemu installed. `sudo` fails every time with a PAM error because of the settings in /etc/security/limits.d/95-kvm-memlock.conf.

Here is the full reproducer:

```
[builder@buildvm-ppc64le-fcos01 coreos-assembler]$ podman run -it --rm registry.fedoraproject.org/fedora:37
Trying to pull registry.fedoraproject.org/fedora:37...
Getting image source signatures
Copying blob e0f3236c0125 done
Copying config 0ca9d427ba done
Writing manifest to image destination
Storing signatures
[root@0d99309b336c /]# sudo dnf install qemu-system-ppc-core --quiet -y
Installed:
  SLOF-20210217-4.git33a7322d.fc36.noarch
  capstone-4.0.2-9.fc36.ppc64le
  cryptsetup-libs-2.4.3-2.fc36.ppc64le
  daxctl-libs-73-1.fc37.ppc64le
  dbus-1:1.14.0-1.fc37.ppc64le
  dbus-broker-29-5.fc36.ppc64le
  dbus-common-1:1.14.0-1.fc37.noarch
  device-mapper-1.02.175-7.fc36.ppc64le
  device-mapper-libs-1.02.175-7.fc36.ppc64le
  diffutils-3.8-2.fc36.ppc64le
  fuse3-libs-3.10.5-4.fc37.ppc64le
  ipxe-roms-qemu-20220210-1.git64113751.fc37.noarch
  kmod-libs-29-7.fc36.ppc64le
  libaio-0.3.111-13.fc36.ppc64le
  libargon2-20171227-9.fc37.ppc64le
  libbpf-2:0.7.0-3.fc37.ppc64le
  libfdisk-2.38-3.fc37.ppc64le
  libfdt-1.6.1-2.fc35.ppc64le
  libibverbs-39.0-1.fc36.ppc64le
  libjpeg-turbo-2.1.3-1.fc37.ppc64le
  libnl3-3.6.0-1.fc37.ppc64le
  libpmem-1.11.1-4.fc36.ppc64le
  libpng-2:1.6.37-12.fc36.ppc64le
  librdmacm-39.0-1.fc36.ppc64le
  libseccomp-2.5.3-2.fc36.ppc64le
  libslirp-4.7.0-1.fc37.ppc64le
  liburing-2.0-3.fc36.ppc64le
  libxkbcommon-1.4.0-1.fc36.ppc64le
  lzo-2.10-6.fc36.ppc64le
  ndctl-libs-73-1.fc37.ppc64le
  numactl-libs-2.0.14-5.fc36.ppc64le
  openbios-1:20200725-4.git7f28286.fc36.noarch
  pixman-0.40.0-5.fc36.ppc64le
  qemu-common-2:7.0.0-1.fc37.ppc64le
  qemu-system-ppc-core-2:7.0.0-1.fc37.ppc64le
  qrencode-libs-4.1.1-2.fc36.ppc64le
  seavgabios-bin-1.16.0-1.fc37.noarch
  snappy-1.1.9-4.fc36.ppc64le
  systemd-251~rc1-3.fc37.ppc64le
  systemd-networkd-251~rc1-3.fc37.ppc64le
  systemd-pam-251~rc1-3.fc37.ppc64le
  systemd-resolved-251~rc1-3.fc37.ppc64le
  xkeyboard-config-2.35.1-1.fc37.noarch
[root@0d99309b336c /]# sudo /bin/true
sudo: pam_open_session: Permission denied
sudo: policy plugin failed session initialization
[root@0d99309b336c /]# rm -f /etc/security/limits.d/95-kvm-memlock.conf
[root@0d99309b336c /]# sudo /bin/true
[root@0d99309b336c /]# echo $?
0
[root@0d99309b336c /]# rpm -q qemu-common
qemu-common-7.0.0-1.fc37.ppc64le
```

Version-Release number of selected component (if applicable):
qemu-common-7.0.0-1.fc37.ppc64le

How reproducible:
Always

Steps to Reproduce:
See description.

Additional info:
This happens on F35 and F36 today as well.
Note this file was added for https://bugzilla.redhat.com/show_bug.cgi?id=1293024
For reference the file contains:

```
* hard memlock 65536
* soft memlock 65536
```

Are there other files in /etc/security/limits* which have memlock entries on this architecture, and what are those limits?
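One way to answer this directly (a sketch, using the paths named above; `-s` silences errors if a path is absent) is to grep every file pam_limits.so would read:

```shell
# List every memlock entry under /etc/security/limits*.
# grep exits non-zero when nothing matches, hence the fallback echo.
grep -rs memlock /etc/security/limits.conf /etc/security/limits.d/ \
    || echo "no memlock entries found"
```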
Also it'd be interesting if you can run “ulimit -a” before and after deleting the file.
So my expectation is that pam_limits.so is attempting to apply the memlock limits. On a real machine there's no problem doing this, as 64 MiB is lost in the noise. In a container, however, even if you are able to become root, there's no guarantee the container configuration will permit you to change limits. Indeed I'd expect changing of memlock limits to be explicitly blocked by default. pam_limits.so is thus getting an EPERM and treating this as a fatal problem.

I'm pretty unsure of the right answer to this problem. It is reasonable for QEMU to want the elevated limits out of the box, otherwise KVM is simply guaranteed broken until someone manually sets the limits. It is reasonable for containers to block raising of limits by default. It is also (historically) reasonable for pam_limits.so to treat the failure as fatal, given that root should be able to raise limits with no trouble.

I think perhaps that needs to change, given our container-centric modern world, but I wonder about unintended consequences of ignoring errors from changing limits. Especially if the limits config was intended to /lower/ limits from their default, it would be a security issue to ignore them. Also, if the container is not granting some memlock allowance by default, then it is impossible to run KVM guests inside the container anyway.
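A rough illustration of why this is an EPERM rather than a misconfiguration (the capability name and the claim about podman's default capability set are my assumptions, not taken from this report): any process may move its soft limit at or below its hard limit, but raising a hard limit requires CAP_SYS_RESOURCE, which default container configurations typically drop even for root:

```shell
# Moving the soft limit at or below the hard limit is always permitted:
ulimit -S -l 1024 && echo "soft memlock set to 1024 KiB: ok"

# Raising the *hard* limit is the privileged operation. In a container
# without CAP_SYS_RESOURCE even root gets EPERM from a call like
#     prlimit --pid $$ --memlock=67108864:67108864
# and that EPERM is what pam_limits.so surfaces as
# "sudo: pam_open_session: Permission denied".
ulimit -H -l   # show the hard memlock limit currently in force
```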
(In reply to Richard W.M. Jones from comment #2)
> For reference the file contains:
>
> * hard memlock 65536
> * soft memlock 65536
>
> Are there other files in /etc/security/limits* which have memlock
> entries on this architecture, and what are those limits?

No. /etc/security/limits.conf just contains examples that are all commented out. There are no other files in /etc/security/limits.d/ than 95-kvm-memlock.conf.

(In reply to Richard W.M. Jones from comment #3)
> Also it'd be interesting if you can run "ulimit -a" before and
> after deleting the file.

```
[root@5e1b08e45daf /]# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 129873
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 129873
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[root@5e1b08e45daf /]#
[root@5e1b08e45daf /]# sudo /bin/true
sudo: pam_open_session: Permission denied
sudo: policy plugin failed session initialization
[root@5e1b08e45daf /]#
[root@5e1b08e45daf /]# rm -f /etc/security/limits.d/95-kvm-memlock.conf
[root@5e1b08e45daf /]#
[root@5e1b08e45daf /]# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 129873
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 129873
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[root@5e1b08e45daf /]#
[root@5e1b08e45daf /]# sudo /bin/true
[root@5e1b08e45daf /]# echo $?
0
```
I find it quite surprising that installing virtualization software ends up changing the PAM configuration. Is this only for unprivileged qemu AKA qemu:///session in libvirt terms? What is it that qemu wants to lock? Why aren't the OS defaults sufficient for that? If this is for qemu:///session, there are things like gnome-keyring that already exist to maintain secrets locked in RAM.
It's the amount of memory that processes can lock. It's PAM that happens to implement this limit at login time, but not really anything to do with PAM. qemu (on ppc64 only) needs to mlock more memory than the usual default on Linux in order to enable VFIO. It seems as if the root cause is that a certain number of pages need to be locked (depending on the number of devices attached to the guest), and page size on ppc64 is much bigger than other architectures (64K), so ppc64 has to lock a lot more memory than other arches. The libvirt patch which originally fixed this should help to understand a bit about what's going on, especially patch 6: https://listman.redhat.com/archives/libvir-list/2015-November/thread.html#121000
(In reply to Richard W.M. Jones from comment #7) > so ppc64 has to lock a lot more memory than other arches. I should add: ... and this happens to be larger than the amount of mlocked memory permitted by default by Linux, so we have to adjust it upwards in order for any guest to be able to run. I think the real solution here is going to be for containers on ppc64 to just raise the default mlock limit as well.
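Until that happens, a per-container workaround seems possible with the standard `--ulimit` flag, which takes soft:hard values in bytes (the conversion below is my own arithmetic from the 65536 KiB in 95-kvm-memlock.conf, not a tested recipe):

```shell
# 65536 KiB from 95-kvm-memlock.conf, converted to bytes for --ulimit.
memlock_bytes=$(( 65536 * 1024 ))
echo "podman run --ulimit memlock=${memlock_bytes}:${memlock_bytes} ..."
```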
(In reply to Richard W.M. Jones from comment #7) > in order to enable VFIO. ... and it's not just VFIO as is clear from patch 6/6 above. VFIO happens to also add a bunch of mlocked memory required, but plain guest devices do too.
(In reply to Colin Walters from comment #6)
> I find it quite surprising that installing virtualization software ends up
> changing the PAM configuration. Is this only for unprivileged qemu AKA
> qemu:///session in libvirt terms?
>
> What is it that qemu wants to lock? Why aren't the OS defaults sufficient
> for that? If this is for qemu:///session, there are things like
> gnome-keyring that already exist to maintain secrets locked in RAM.

There's a fair bit of history on this topic across

https://bugzilla.redhat.com/show_bug.cgi?id=1350735
https://bugzilla.redhat.com/show_bug.cgi?id=1293024
https://listman.redhat.com/archives/libvir-list/2015-November/msg00769.html

TL;DR: this specific comment is the most important:
https://bugzilla.redhat.com/show_bug.cgi?id=1293024#c16

[quote]
The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees. Each guest needs its own hashed page table (HPT). These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image. When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).
[/quote]

So the setting in 95-kvm-memlock.conf allows users to boot KVM guests up to approximately 4 GB in size. Beyond that, the admin will have to modify this file to allow even more locked RAM. The 4 GB limit set up by default at least allows KVM to be somewhat useful out of the box for unprivileged users launching QEMU directly, or via libvirt's unprivileged qemu:///session.
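The "approximately 4 GB" figure follows from the ratios in the quoted comment; a quick back-of-the-envelope check (1/64 being the conservative end of the rule of thumb):

```shell
memlock_kib=65536   # allowance from 95-kvm-memlock.conf, in KiB
# The HPT is roughly 1/128th to 1/64th of guest RAM, so this
# allowance covers guests of roughly:
echo "$(( memlock_kib * 64  / 1024 / 1024 )) GiB at the 1/64 ratio"
echo "$(( memlock_kib * 128 / 1024 / 1024 )) GiB at the 1/128 ratio"
```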
This is something libguestfs or GNOME Boxes would hit on ppc64.

The libvirt qemu:///system path should not need this setting IIUC: since virtqemud is privileged, it can set the memlock limit directly without PAM being involved.

I guess one possible option would be to move this config into 'libvirt-daemon-kvm' instead of qemu-system-ppc-core. This would make it possible to install QEMU without pulling in the limits file. A "default" KVM install of libvirt would get the limits, but a minimal libvirt install can also avoid pulling it in.
This message is a reminder that Fedora Linux 35 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '35'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 35 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13. Fedora Linux 35 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.