Bug 2082149 - ppc64le: /etc/security/limits.d/95-kvm-memlock.conf causes sudo to fail in container
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 35
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-05 13:34 UTC by Dusty Mabe
Modified: 2022-12-13 17:54 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-12-13 17:54:57 UTC
Type: Bug
Embargoed:



Description Dusty Mabe 2022-05-05 13:34:05 UTC
Description of problem:

I'm trying to run `sudo` inside a ppc64le container that has qemu
installed. `sudo` fails every time with a pam error because of the
settings in /etc/security/limits.d/95-kvm-memlock.conf.

Here is the full reproducer:

```
[builder@buildvm-ppc64le-fcos01 coreos-assembler]$ podman run -it --rm registry.fedoraproject.org/fedora:37
Trying to pull registry.fedoraproject.org/fedora:37...
Getting image source signatures
Copying blob e0f3236c0125 done
Copying config 0ca9d427ba done
Writing manifest to image destination
Storing signatures
[root@0d99309b336c /]# sudo dnf install qemu-system-ppc-core --quiet -y

Installed:
  SLOF-20210217-4.git33a7322d.fc36.noarch      capstone-4.0.2-9.fc36.ppc64le             cryptsetup-libs-2.4.3-2.fc36.ppc64le          daxctl-libs-73-1.fc37.ppc64le
  dbus-1:1.14.0-1.fc37.ppc64le                 dbus-broker-29-5.fc36.ppc64le             dbus-common-1:1.14.0-1.fc37.noarch            device-mapper-1.02.175-7.fc36.ppc64le
  device-mapper-libs-1.02.175-7.fc36.ppc64le   diffutils-3.8-2.fc36.ppc64le              fuse3-libs-3.10.5-4.fc37.ppc64le              ipxe-roms-qemu-20220210-1.git64113751.fc37.noarch
  kmod-libs-29-7.fc36.ppc64le                  libaio-0.3.111-13.fc36.ppc64le            libargon2-20171227-9.fc37.ppc64le             libbpf-2:0.7.0-3.fc37.ppc64le
  libfdisk-2.38-3.fc37.ppc64le                 libfdt-1.6.1-2.fc35.ppc64le               libibverbs-39.0-1.fc36.ppc64le                libjpeg-turbo-2.1.3-1.fc37.ppc64le
  libnl3-3.6.0-1.fc37.ppc64le                  libpmem-1.11.1-4.fc36.ppc64le             libpng-2:1.6.37-12.fc36.ppc64le               librdmacm-39.0-1.fc36.ppc64le
  libseccomp-2.5.3-2.fc36.ppc64le              libslirp-4.7.0-1.fc37.ppc64le             liburing-2.0-3.fc36.ppc64le                   libxkbcommon-1.4.0-1.fc36.ppc64le
  lzo-2.10-6.fc36.ppc64le                      ndctl-libs-73-1.fc37.ppc64le              numactl-libs-2.0.14-5.fc36.ppc64le            openbios-1:20200725-4.git7f28286.fc36.noarch
  pixman-0.40.0-5.fc36.ppc64le                 qemu-common-2:7.0.0-1.fc37.ppc64le        qemu-system-ppc-core-2:7.0.0-1.fc37.ppc64le   qrencode-libs-4.1.1-2.fc36.ppc64le
  seavgabios-bin-1.16.0-1.fc37.noarch          snappy-1.1.9-4.fc36.ppc64le               systemd-251~rc1-3.fc37.ppc64le                systemd-networkd-251~rc1-3.fc37.ppc64le
  systemd-pam-251~rc1-3.fc37.ppc64le           systemd-resolved-251~rc1-3.fc37.ppc64le   xkeyboard-config-2.35.1-1.fc37.noarch

[root@0d99309b336c /]# sudo /bin/true
sudo: pam_open_session: Permission denied
sudo: policy plugin failed session initialization
[root@0d99309b336c /]# rm -f /etc/security/limits.d/95-kvm-memlock.conf
[root@0d99309b336c /]# sudo /bin/true
[root@0d99309b336c /]# echo $?
0
[root@0d99309b336c /]# rpm -q qemu-common
qemu-common-7.0.0-1.fc37.ppc64le
```


Version-Release number of selected component (if applicable):

qemu-common-7.0.0-1.fc37.ppc64le


How reproducible:

Always


Steps to Reproduce:

See description.


Additional info:

This happens on F35 and F36 today as well.

Comment 1 Dusty Mabe 2022-05-05 13:35:57 UTC
Note this file was added for https://bugzilla.redhat.com/show_bug.cgi?id=1293024

Comment 2 Richard W.M. Jones 2022-05-05 13:45:12 UTC
For reference the file contains:

*       hard    memlock         65536
*       soft    memlock         65536

Are there other files in /etc/security/limits* which have memlock
entries on this architecture, and what are those limits?
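
To answer the question above on a given system, the limits files can be scanned for memlock entries. What follows is a minimal sketch in Python, assuming the usual four-column `<domain> <type> <item> <value>` layout of limits.conf and treating `#` as the start of a comment. `LimitEntry` and `parse_limits` are hypothetical names, and the sample text is simply the file content quoted above. For memlock the value is in KB, so 65536 means 64 MiB.

```python
from typing import List, NamedTuple


class LimitEntry(NamedTuple):
    domain: str  # user, @group, or "*" for everyone
    kind: str    # "hard", "soft", or "-" for both
    item: str    # e.g. "memlock"
    value: str   # for memlock: a size in KB, or "unlimited"


def parse_limits(text: str) -> List[LimitEntry]:
    """Parse limits.conf-style text, skipping comments and malformed lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: '#' starts a comment
        fields = line.split()
        if len(fields) != 4:
            continue  # blank, comment-only, or not a four-column entry
        entries.append(LimitEntry(*fields))
    return entries


# Content of 95-kvm-memlock.conf as quoted in this bug.
KVM_MEMLOCK = """\
*       hard    memlock         65536
*       soft    memlock         65536
"""

memlock = [e for e in parse_limits(KVM_MEMLOCK) if e.item == "memlock"]
for e in memlock:
    print(e.domain, e.kind, int(e.value) // 1024, "MiB")
```

To check a real system, the same function could be fed the contents of /etc/security/limits.conf and of each file under /etc/security/limits.d/.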

Comment 3 Richard W.M. Jones 2022-05-05 13:47:51 UTC
Also it'd be interesting if you can run “ulimit -a” before and
after deleting the file.

Comment 4 Daniel Berrangé 2022-05-05 14:06:30 UTC
So my expectation is that pam_limits.so is attempting to apply the memlock limits. On a real machine there's no problem doing this, as 64 MiB (the memlock value of 65536 is in KB) is lost in the noise. In a container, however, even if you are able to become root, there's no guarantee the container configuration will permit you to change limits. Indeed I'd expect changing of memlock limits to be explicitly blocked by default. pam_limits.so is thus getting an EPERM and treating this as a fatal problem.

I'm pretty unsure of the right answer to this problem. 

It is reasonable for QEMU to want the elevated limits out of the box, otherwise KVM is simply guaranteed broken until someone manually sets the limits. 

It is reasonable for containers to block raising of limits by default. 

It is also (historically) reasonable for pam_limits.so to treat the failure as fatal, given that root should be able to raise limits without trouble. I think perhaps that needs to change though, given our container-centric modern world, but I wonder about unintended consequences of ignoring errors from changing limits. In particular, if the limits config was intended to /lower/ limits from their default, it would be a security issue to ignore them.

Also if the container is not granting some memlock allowance by default, then it is impossible to run KVM guests inside the container anyway.
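
The failure mode described in this comment follows from the kernel's rule for setrlimit(2): lowering a hard limit is always allowed, but raising it needs CAP_SYS_RESOURCE. Below is a simplified model of that rule in Python. It assumes that root in the container lacks CAP_SYS_RESOURCE, and that the container's hard memlock limit is 8192 KB (the `ulimit -l` value reported later in this bug, which strictly speaking is the soft limit). `setrlimit_outcome` is a hypothetical helper for illustration, not a kernel or PAM API.

```python
import errno

KIB = 1024


def setrlimit_outcome(cur_hard, new_soft, new_hard, has_cap_sys_resource):
    """Return 0 on success, else the errno setrlimit(2) would report.

    Simplified model: finite limits only, no RLIM_INFINITY handling.
    """
    if new_soft > new_hard:
        return errno.EINVAL  # soft limit may never exceed the hard limit
    if new_hard > cur_hard and not has_cap_sys_resource:
        return errno.EPERM   # raising the hard limit needs CAP_SYS_RESOURCE
    return 0


container_hard = 8192 * KIB  # assumption, see the paragraph above
wanted = 65536 * KIB         # value from 95-kvm-memlock.conf (KB)

cases = {
    "raise, unprivileged": setrlimit_outcome(container_hard, wanted, wanted, False),
    "raise, privileged": setrlimit_outcome(container_hard, wanted, wanted, True),
    "lower, unprivileged": setrlimit_outcome(container_hard, 4096 * KIB, 4096 * KIB, False),
}
for name, rc in cases.items():
    print(name, "->", errno.errorcode.get(rc, "ok"))
```

Under these assumptions only the unprivileged raise fails, which is consistent with sudo working again once the file is removed.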

Comment 5 Dusty Mabe 2022-05-05 14:45:51 UTC
(In reply to Richard W.M. Jones from comment #2)
> For reference the file contains:
> 
> *       hard    memlock         65536
> *       soft    memlock         65536
> 
> Are there other files in /etc/security/limits* which have memlock
> entries on this architecture, and what are those limits?

No. /etc/security/limits.conf just contains examples that are all commented out. There are no other files in /etc/security/limits.d/ than 95-kvm-memlock.conf.


(In reply to Richard W.M. Jones from comment #3)
> Also it'd be interesting if you can run “ulimit -a” before and
> after deleting the file.


```
[root@5e1b08e45daf /]# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 129873
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 524288
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 129873
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
[root@5e1b08e45daf /]# 
[root@5e1b08e45daf /]# sudo /bin/true
sudo: pam_open_session: Permission denied
sudo: policy plugin failed session initialization
[root@5e1b08e45daf /]# 
[root@5e1b08e45daf /]# rm -f /etc/security/limits.d/95-kvm-memlock.conf
[root@5e1b08e45daf /]# 
[root@5e1b08e45daf /]# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 129873
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 524288
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 129873
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
[root@5e1b08e45daf /]# 
[root@5e1b08e45daf /]# sudo /bin/true
[root@5e1b08e45daf /]# echo $?
0

```
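
Note that bash's `ulimit -a` prints soft limits unless `-H` is given, so the output above does not show the hard limit that pam_limits.so would have to raise. The following small sketch reads both values with Python's standard `resource` module and prints them in kbytes, like `ulimit -l`. `fmt_kib` is a hypothetical helper.

```python
import resource


def fmt_kib(value):
    """Format a byte limit the way `ulimit -l` does (kbytes or 'unlimited')."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)


# getrlimit() returns (soft, hard) in bytes for RLIMIT_MEMLOCK.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("max locked memory, soft (kbytes):", fmt_kib(soft))
print("max locked memory, hard (kbytes):", fmt_kib(hard))
```

If the hard value printed inside the container is below 65536, applying 95-kvm-memlock.conf means raising a hard limit.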

Comment 6 Colin Walters 2022-05-16 22:02:05 UTC
I find it quite surprising that installing virtualization software ends up changing the PAM configuration.  Is this only for unprivileged qemu AKA qemu:///session in libvirt terms?

What is it that qemu wants to lock?  Why aren't the OS defaults sufficient for that?  If this is for qemu:///session, there are things like gnome-keyring that already exist to maintain secrets locked in RAM.

Comment 7 Richard W.M. Jones 2022-05-17 07:35:22 UTC
It's the amount of memory that processes can lock.  PAM happens to
apply this limit at login time, but the limit itself doesn't really have
anything to do with PAM.

qemu (on ppc64 only) needs to mlock more memory than the usual default on Linux
in order to enable VFIO.  It seems as if the root cause is that a certain number
of pages need to be locked (depending on the number of devices attached to the
guest), and the page size on ppc64 (64K) is much bigger than on other
architectures, so ppc64 has to lock a lot more memory than other arches.

The libvirt patch which originally fixed this should help to understand a bit
about what's going on, especially patch 6:
https://listman.redhat.com/archives/libvir-list/2015-November/thread.html#121000

Comment 8 Richard W.M. Jones 2022-05-17 07:37:32 UTC
(In reply to Richard W.M. Jones from comment #7)
> so ppc64 has to lock a lot more memory than other arches.

I should add: ... and this happens to be larger than the amount of mlocked
memory permitted by default by Linux, so we have to adjust it upwards in
order for any guest to be able to run.

I think the real solution here is going to be for containers on ppc64 to
just raise the default mlock limit as well.

Comment 9 Richard W.M. Jones 2022-05-17 07:40:32 UTC
(In reply to Richard W.M. Jones from comment #7)
> in order to enable VFIO.

... and it's not just VFIO, as is clear from patch 6/6 above.  VFIO
happens to add to the amount of mlocked memory required, but plain
guest devices need some too.

Comment 10 Daniel Berrangé 2022-05-17 13:39:38 UTC
(In reply to Colin Walters from comment #6)
> I find it quite surprising that installing virtualization software ends up
> changing the PAM configuration.  Is this only for unprivileged qemu AKA
> qemu:///session in libvirt terms?
> 
> What is it that qemu wants to lock?  Why aren't the OS defaults sufficient
> for that?  If this is for qemu:///session, there are things like
> gnome-keyring that already exist to maintain secrets locked in RAM.

There's a fair bit of history on this topic across

https://bugzilla.redhat.com/show_bug.cgi?id=1350735
https://bugzilla.redhat.com/show_bug.cgi?id=1293024
https://listman.redhat.com/archives/libvir-list/2015-November/msg00769.html

TL;DR: this specific comment is the most important:

  https://bugzilla.redhat.com/show_bug.cgi?id=1293024#c16

[quote]
The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees.  Each guest needs its own hashed page table (HPT).  These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image.  When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).
[/quote]

So the setting in 95-kvm-memlock.conf allows users to boot KVM guests up to approx 4 GB in size. Beyond that, the admin will have to modify this file to allow even more locked RAM.

The 4 GB limit set up by default at least allows KVM to be somewhat useful out of the box for unprivileged users launching QEMU directly, or via libvirt's unprivileged qemu:///session. This is something libguestfs or GNOME Boxes would hit on ppc64.
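
The rule of thumb quoted above can be checked with some arithmetic. This is a rough sketch in Python, assuming the HPT is the only locked allocation (comment 9 notes that devices add more) and using the 1/128 and 1/64 ratios and the 16 MiB minimum from the quote. `hpt_bytes` is a hypothetical helper.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# 65536 KB from 95-kvm-memlock.conf, i.e. 64 MiB.
MEMLOCK_LIMIT = 65536 * 1024


def hpt_bytes(guest_ram, divisor):
    """Estimated hashed page table size: RAM/divisor, but at least 16 MiB."""
    return max(16 * MIB, guest_ram // divisor)


for gib in (1, 2, 4, 8):
    ram = gib * GIB
    low = hpt_bytes(ram, 128)   # optimistic end of the rule of thumb
    high = hpt_bytes(ram, 64)   # pessimistic end of the rule of thumb
    fits = high <= MEMLOCK_LIMIT
    print(f"{gib} GiB guest: HPT {low // MIB}-{high // MIB} MiB, fits in limit: {fits}")
```

With the 1/64 ratio a 4 GiB guest needs exactly the 64 MiB that the file grants, which matches the approximate 4 GB figure above. An 8 GiB guest would need 128 MiB at that ratio.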

The libvirt qemu:///system instance should not need this setting IIUC: since virtqemud is privileged, it can set the memlock limit directly without PAM being involved.
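
As an illustration of setting the limit without PAM, a launcher can call setrlimit(2) in the child process before exec. This is only a sketch of the general technique in Python, not libvirt's actual code. `run_with_memlock` is a hypothetical helper, and the demo deliberately picks a limit no higher than the current one so that it also works unprivileged. Raising the limit to something like 64 MiB would need CAP_SYS_RESOURCE, which a privileged daemon has.

```python
import resource
import subprocess
import sys


def run_with_memlock(argv, limit_bytes):
    """Run argv with RLIMIT_MEMLOCK soft and hard both set to limit_bytes."""
    def apply_limit():
        # Runs in the child after fork() and before exec().
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True, check=True)


# Pick a limit that never exceeds the current one, so no privilege is needed.
soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
safe_limit = 64 * 1024 * 1024 if soft == resource.RLIM_INFINITY else soft

# The child just reports the soft limit it ended up with, in bytes.
probe = [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_MEMLOCK)[0])"]
result = run_with_memlock(probe, safe_limit)
print("child memlock limit (bytes):", result.stdout.strip())
```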


I guess one possible option would be to move this config into 'libvirt-daemon-kvm' instead of qemu-system-ppc-core. This would make it possible to install QEMU without pulling in the limits file. A "default" KVM install of libvirt would get the limits, but a minimal libvirt install can also avoid pulling it in.

Comment 11 Ben Cotton 2022-11-29 18:52:14 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 12 Ben Cotton 2022-12-13 17:54:57 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

