Bug 1293024 - memory locking limit for regular users is too low to launch guests through libvirt
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: qemu
Version: 24
Hardware: ppc64le
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
Reported: 2015-12-19 04:53 EST by Richard W.M. Jones
Modified: 2016-10-28 15:51 EDT

See Also:
Fixed In Version: qemu-2.6.2-4.fc24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1350735 1350772
Environment:
Last Closed: 2016-10-28 15:51:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Richard W.M. Jones 2015-12-19 04:53:35 EST
Description of problem:

When launching either a ppc64 or ppc64le guest (x86-64 host) I get:

ERROR    internal error: Process exited prior to exec: libvirt:  error : cannot limit locked memory to 46137344: Operation not permitted

Version-Release number of selected component (if applicable):

libvirt-1.3.0-1.fc24.x86_64
kernel 4.2.6-301.fc23.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Run this virt-install command:

virt-install --name=tmp-fed0fb92 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora21 --arch ppc64le --machine pseries --initrd-inject=/tmp/tmp.sVjN8w5nyk '--extra-args=ks=file:/tmp.sVjN8w5nyk console=tty0 console=hvc0 proxy=http://cache.home.annexia.org:3128' --disk fedora-23-ppc64le,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/21/Server/ppc64le/os/ --nographics --noreboot

(The same failure happens with ppc64).
Comment 1 Richard W.M. Jones 2015-12-19 04:56:29 EST
It's OK with an x86-64 guest.
Comment 2 Richard W.M. Jones 2015-12-19 05:00:33 EST
I worked around it by increasing my user account's locked memory
limit (ulimit -l) to unlimited.  I wonder if the error message comes
from qemu?
Comment 3 Richard W.M. Jones 2015-12-19 05:04:44 EST
Smallest reproducer is this command (NB: as NON-root):

$ virt-install --name=tmp-bz1293024 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora22 --disk /var/tmp/fedora-23.img,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/23/Server/ppc64le/os/ --nographics --noreboot --arch ppc64le

Note: If you are playing with ulimit, you have to kill libvirtd
since it could use the previous ulimit from another session.
Comment 4 Jan Kurik 2016-02-24 09:09:40 EST
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Comment 5 Cole Robinson 2016-03-16 19:43:19 EDT
Rich, do you still see this with the latest rawhide?

(the mem locking error comes from libvirt... apparently ppc64 needs some explicit mem locking? that's what the code says, but I didn't dig deeper than that)
Comment 6 Richard W.M. Jones 2016-03-17 12:11:30 EDT
There doesn't appear to be a Rawhide repo for ppc64le yet.

Unless something has changed in libvirt or virt-install to fix
this, I doubt very much that it is fixed.
Comment 7 Cole Robinson 2016-03-17 12:13:51 EDT
Andrea, any thoughts on this? Have you seen this issue?
Comment 8 Richard W.M. Jones 2016-03-24 14:46:38 EDT
Still happening on libvirt-1.3.2-3.fc24.x86_64 (x86-64 host, running
Ubuntu/ppc64le guest).
Comment 9 Andrea Bolognani 2016-03-29 09:11:01 EDT
(In reply to Cole Robinson from comment #7)
> Andrea, any thoughts on this? Have you seen this issue?

I hadn't, thanks for bringing it up.

The issue Rich's seeing is caused by

  https://bugzilla.redhat.com/show_bug.cgi?id=1273480

having been fixed.

Short version is that ppc64 guests always need some amount
of memory to be locked, and that amount is guaranteed to be
more than the default 64 KiB allowance.

libvirt tries to raise the limit to prevent the allocation
from failing, but it can only do that successfully when
running as root.
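
To illustrate the point above, here is a minimal Python sketch of the rule that makes the call fail for regular users: an unprivileged process may move its soft RLIMIT_MEMLOCK anywhere up to the hard limit, but may not raise the hard limit itself. The helper names are hypothetical and this is not libvirt's code; the target value 46137344 is simply the number from the error message in the description.

```python
import resource

# Value taken from the error message in this bug's description.
TARGET_BYTES = 46137344


def fits_within_hard_limit(target_bytes, hard_limit):
    """True if an unprivileged process could set its soft limit to target_bytes."""
    return hard_limit == resource.RLIM_INFINITY or target_bytes <= hard_limit


def try_raise_memlock(target_bytes):
    """Attempt to raise RLIMIT_MEMLOCK to target_bytes; return True on success."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft == resource.RLIM_INFINITY or soft >= target_bytes:
        return True  # already high enough, nothing to do
    # Raising the hard limit is what needs privilege (CAP_SYS_RESOURCE).
    new_hard = hard if fits_within_hard_limit(target_bytes, hard) else target_bytes
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (target_bytes, new_hard))
    except (ValueError, OSError):
        return False
    return True


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("current soft/hard:", soft, hard)
    print("raised to target:", try_raise_memlock(TARGET_BYTES))
```

With the default 64 KiB hard limit, `fits_within_hard_limit(TARGET_BYTES, 64 * 1024)` is false, so a regular user's attempt is expected to be refused.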
Comment 10 Richard W.M. Jones 2016-04-07 15:51:02 EDT
I set the architecture to ppc64le, but in fact it affects
ppc64 also.  In answer to comment 5, it affects Fedora 24 too.
Comment 11 Andrea Bolognani 2016-04-08 04:55:42 EDT
(In reply to Richard W.M. Jones from comment #10)
> I set the architecture to ppc64le, but in fact it affects
> ppc64 also.  In answer to comment 5, it affects Fedora 24 too.

Yeah, this will affect both ppc64 variants and any version of
libvirt from 1.3.0 on.

Unfortunately I don't really see a way to fix this: the memory
locking limit really needs to be quite high on ppc64,
definitely higher than the default: the fact that this was not
enforced before was a bug and could lead to more trouble later
on.

When libvirtd is running as root we can adjust the limit
ourselves quite easily; when it's running as a regular user,
we're of course unable to do that.

At least the error message is IMHO quite clear and hints at
the solution.
Comment 12 Cole Robinson 2016-04-26 17:42:04 EDT
bug 1273480 seems to be all about hostdev assignment, which Rich isn't doing. I see this commit:

commit 16562bbc587add5a03a01c8eb8607c9e05819607
Author: Andrea Bolognani <abologna@redhat.com>
Date:   Fri Nov 13 10:58:07 2015 +0100

    qemu: Always set locked memory limit for ppc64 domains
    
    Unlike other architectures, ppc64 domains need to lock memory
    even when VFIO is not used.


But I don't see where the need for unconditional locked memory is explained... Can you point me to that discussion?
Comment 13 Andrea Bolognani 2016-04-28 08:08:52 EDT
(In reply to Cole Robinson from comment #12)
> bug 1273480 seems to be all about hostdev assignment, which Rich isn't
> doing. I see this commit:
> 
> commit 16562bbc587add5a03a01c8eb8607c9e05819607
> Author: Andrea Bolognani <abologna@redhat.com>
> Date:   Fri Nov 13 10:58:07 2015 +0100
> 
>     qemu: Always set locked memory limit for ppc64 domains
>     
>     Unlike other architectures, ppc64 domains need to lock memory
>     even when VFIO is not used.
> 
> 
> But I don't see where the need for unconditional locked memory is
> explained... Can you point me to that discussion?

See David's detailed explanation[1] from back when the patch
series was posted on libvir-list.

On a related note, there's been some progress recently toward
getting some of that memory actually accounted for.


[1] https://www.redhat.com/archives/libvir-list/2015-November/msg00769.html
Comment 14 Cole Robinson 2016-04-29 08:00:32 EDT
Thanks for the pointer.  So if ppc64 doesn't do this memlocking, do things fail 100% of the time? Or is this a heuristic that maybe is triggering a false positive? Rich maybe you can edit libvirt and figure it out.

If this has the potential to be wrong in the non-VFIO case, I suggest at least making it a non-fatal error if the daemon is unprivileged, and logging a VIR_WARN instead.

An additional thing we could do is have qemu-system-ppc64 ship a /etc/security/limits.d file to up the memlock limit on ppc64 hosts
Comment 15 Andrea Bolognani 2016-05-05 04:51:40 EDT
(In reply to Cole Robinson from comment #14)
> Thanks for the pointer.  So if ppc64 doesn't do this memlocking, do things
> fail 100% of the time? Or is this a heuristic that maybe is triggering a
> false positive? Rich maybe you can edit libvirt and figure it out.
> 
> If this has the potential to be wrong in the non-VFIO case, I suggest at
> least making it a non-fatal error if the daemon is unprivileged, and logging
> a VIR_WARN instead.
> 
> An additional thing we could do is have qemu-system-ppc64 ship a
> /etc/security/limits.d file to up the memlock limit on ppc64 hosts

My understanding is that the consequences of not raising the
memory locking limit appropriately can be pretty severe.

David, can you give us more details please? What could happen
if users ran QEMU with the default memory locking limit of
64 KiB?
Comment 16 David Gibson 2016-05-26 02:08:22 EDT
Cole,

The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees.  Each guest needs its own hashed page table (HPT).  These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
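
As a worked example of that rule of thumb, the following Python sketch computes the rough HPT size range for a given guest RAM size. It only encodes the fractions and the 16MiB minimum quoted above; the function name is hypothetical, and the real size chosen by the kernel or QEMU may differ.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Minimum HPT size quoted in the rule of thumb above.
HPT_MIN_BYTES = 16 * MIB


def hpt_size_range(ram_bytes):
    """Return (low, high) HPT size estimates: RAM/128 and RAM/64, floored at 16MiB."""
    low = max(HPT_MIN_BYTES, ram_bytes // 128)
    high = max(HPT_MIN_BYTES, ram_bytes // 64)
    return low, high


if __name__ == "__main__":
    for ram_gib in (1, 4, 16):
        low, high = hpt_size_range(ram_gib * GIB)
        print(f"{ram_gib} GiB guest: HPT between {low // MIB} and {high // MIB} MiB")
```

For the 4096 MiB guest from the reproducer this gives 32 to 64 MiB, far above a 64 KiB locked memory allowance.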

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image.  When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).

At the moment, the host kernel doesn't actually need the locked memory limit - it allows unprivileged users (with permission to create VMs) to allocate HPTs anyway, but this is really a bug.  As it stands a non-privileged user could create a whole pile of tiny VMs (it doesn't even need to actually execute any instructions in the VMs) and consume an unbounded amount of host memory with those 16MiB HPTs.

So we plan to fix that in the kernel.  In the meantime libvirt treats things as if the kernel enforced that limit even though it doesn't yet, to avoid having yet more ugly kernel version dependencies.


Andrea, would it make any sense to have failure of the setrlimit in libvirt cause only a warning, not a fatal error?  In that case it wouldn't prevent things working in situations where it can for other reasons (old kernel which doesn't enforce limits, PR KVM which doesn't require it..).
Comment 17 Peter Krempa 2016-05-26 03:28:32 EDT
(In reply to David Gibson from comment #16)

[...]

> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error?  In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

Not really. Warnings are not presented to the user, just written to the log file, so they are very likely to be ignored.
Comment 18 Andrea Bolognani 2016-05-26 04:20:07 EDT
(In reply to David Gibson from comment #16)
> Cole,
> 
> The key thing here is that on ppc64, unlike x86, the hardware page tables
> are encoded as a big hash table, rather than a set of radix trees.  Each
> guest needs its own hashed page table (HPT).  These can get quite large - it
> can vary depending on a number of things, but the usual rule of thumb is
> that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
> 
> For PAPR paravirtualized guests this HPT is accessed entirely via hypercall
> and does not exist within the guest's RAM - it needs to be allocated on the
> host above and beyond the guest's RAM image.  When using the "HV" KVM
> implementation (the only one we're targeting) the HPT has to be _host_
> physically contiguous, unswappable memory (because it's read directly by
> hardware).
> 
> At the moment, the host kernel doesn't actually need the locked memory limit
> - it allows unprivileged users (with permission to create VMs) to allocate
> HPTs anyway, but this is really a bug.

So IIUC the bug is that, by not accounting for that memory
properly, the kernel is allowing it to be allocated as
potentially non-contiguous and swappable, which will result
in failure right away (non-contiguous) or as soon as it has
been swapped out (swappable). Is that right?

> As it stands a non-privileged user
> could create a whole pile of tiny VMs (it doesn't even need to actually
> execute any instructions in the VMs) and consume an unbounded amount of host
> memory with those 16MiB HPTs.

That's not really something QEMU specific, though, is it?
The same user could just as easily start a bunch of random
processes, each one allocating 16MiB+ and get the same result.

> So we plan to fix that in the kernel.  In the meantime libvirt treats things
> as if the kernel enforced that limit even though it doesn't yet, to avoid
> having yet more ugly kernel version dependencies.
> 
> 
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error?  In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

I don't think that's a good idea.

First of all, we'd have to be able to tell whether raising
the limit is actually needed or not, which would probably be
tricky - especially considering that libvirt currently doesn't
know anything about the difference between HV and PR KVM.

Most importantly, we'd be allowing users to start guests that
we know full well may run into trouble later. I'd rather error
out early than have the guest behave erratically down the line
for no apparent reason.

Peter's point about warnings having very little visibility is
also a good one.
Comment 19 David Gibson 2016-05-26 18:11:08 EDT
> > At the moment, the host kernel doesn't actually need the locked memory limit
> > - it allows unprivileged users (with permission to create VMs) to allocate
> > HPTs anyway, but this is really a bug.

> So IIUC the bug is that, by not accounting for that memory
> properly, the kernel is allowing it to be allocated as
> potentially non-contiguous and swappable, which will result
> in failure right away (non-contiguous) or as soon as it has
> been swapped out (swappable). Is that right?

No.  The HPT *will* be allocated contiguous and non-swappable (it's allocated with CMA) - it's just not accounted against the process / user's locked memory limit.  That's why this is a security bug.

> > As it stands a non-privileged user
> > could create a whole pile of tiny VMs (it doesn't even need to actually
> > execute any instructions in the VMs) and consume an unbounded amount of host
> > memory with those 16MiB HPTs.

> That's not really something QEMU specific, though, is it?
> The same user could just as easily start a bunch of random
> processes, each one allocating 16MiB+ and get the same result.

No, because in that case the memory would be non-contiguous and swappable.
Comment 20 Andrea Bolognani 2016-06-09 10:26:56 EDT
Got it.

So I guess our options are:

  a) Raise locked memory limit for users to something like
     64 MiB, so they can run guests of reasonable size (4 GiB)
     without running into errors. Appliances created by
     libguestfs are going to be even smaller than that, I
     assume, so they would work

  b) Teach libvirt about the difference between kvm_hv and
     kvm_pr, only try to tweak the locked memory limit when
     using HV, and have libguestfs always use PR

  c) Force libguestfs to use the direct backend on ppc64

  d) Leave things as they are, basically restricting
     libguestfs usage to the root user

a) and c) are definitely hacks, but could be implemented
fairly quickly and removed once a better solution is in
place.

b) looks like it would be the proper solution but, as with
all things libvirt, rushing an implementation without thinking
hard about the design has the potential to paint us into a corner.

d) is probably not acceptable.
Comment 21 David Gibson 2016-06-14 02:01:23 EDT
In the short term, I think we need to go with option (a).  That's the only really feasible way we can handle this in the next RHEL release, I think.

(b).. I really dislike.  We try to avoid explicitly exposing the PR/HV distinction even to qemu as much as possible - instead using explicit capabilities for various features.  Exposing and using that distinction a layer beyond qemu is going to open several new cans of worms.  For one thing, whether the kernel picks HV or PR can depend on a number of details of both host and guest configuration, so you can't really reliably know which one it's going to be before starting it.

(c) I'm not quite sure what "direct mode" entails.

(d) is.. yeah, certainly suboptimal.


Other things we could try:

(e) Change KVM so that if it's unable to allocate the HPT due to locked memory limit, it will fall back to PR-KVM.  In a sense that's the most pedantically correct, but I dislike it, because I suspect the result will be lots of people's VMs going slow for non-obvious reasons.

(f) Put something distinctive in the error qemu reports when it hits the HPT allocation problem, and only have libvirt try to alter the limit and retry if qemu dies with that error.  Involves an extra qemu invocation, which sucks.

(g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts, that will adjust users' locked memory limits based on some sort of # of VMs and max size of VMs values configured by admin.  This is basically a sophisticated version of (a).


Ugh.. none of these are great :/.
Comment 22 Andrea Bolognani 2016-06-14 06:33:32 EDT
(In reply to David Gibson from comment #21)
> In the short term, I think we need to go with option (a).  That's the only
> really feasible way we can handle this in the next RHEL release, I think.

I guess we would have to make qemu-kvm-rhev ship a
/etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
sets the new limit. It wouldn't make sense to raise the
limit for hosts that are not going to act as hypervisors.

> (b).. I really dislike.  We try to avoid explicitly exposing the PR/HV
> distinction even to qemu as much as possible - instead using explicit
> capabilities for various features.  Exposing and using that distinction a
> layer beyond qemu is going to open several new cans of worms.  For one
> thing, whether the kernel picks HV or PR can depend on a number of details
> of both host and guest configuration, so you can't really reliably know
> which one it's going to be before starting it.

Okay then.

> (c) I'm not quite sure what "direct mode" entails.

Basically libguestfs will call QEMU itself instead of going
through libvirt. guestfish will give you this hint:

  libguestfs: error: could not create appliance through libvirt.

  Try running qemu directly without libvirt using this environment variable:
  export LIBGUESTFS_BACKEND=direct

and if you do that you'll of course be able to avoid the error
raised by libvirt.

I don't know what other implications there are to using the
direct backend, though. Rich?

> (d) is.. yeah, certainly suboptimal.
> 
> 
> Other things we could try:
> 
> (e) Change KVM so that if it's unable to allocate the HPT due to locked
> memory limit, it will fall back to PR-KVM.  In a sense that's the most
> pedantically correct, but I dislike it, because I suspect the result will be
> lots of people's VMs going slow for non-obvious reasons.

Yeah, doing this kind of stuff outside of user's control is
never going to end well. Better to fail with a clear error
message than trying to patch things up behind the scenes.

> (f) Put something distinctive in the error qemu reports when it hits the HPT
> allocation problem, and only have libvirt try to alter the limit and retry
> if qemu dies with that error.  Involves an extra qemu invocation, which
> sucks.

libvirt is not really designed in a way that allows you to
just try calling QEMU with some arguments and, if that fails,
call it again with different arguments. So QEMU would have to
expose the information through QMP somehow, for libvirt to
probe beforehand. I'm not sure whether this approach would
even be feasible.

> (g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts,
> that will adjust users' locked memory limits based on some sort of # of VMs
> and max size of VMs values configured by admin.  This is basically a
> sophisticated version of (a).

The limits are per-process, though. So the only thing
that really matters is how much memory you want to allow
for an unprivileged guest. PCI passthrough is not going
to be a factor unless you're root, and in that case you
can set the limit as you please.

> Ugh.. none of these are great :/.
Comment 23 Daniel Berrange 2016-06-14 06:40:48 EDT
(In reply to Andrea Bolognani from comment #22)
> (In reply to David Gibson from comment #21)
> > In the short term, I think we need to go with option (a).  That's the only
> > really feasible way we can handle this in the next RHEL release, I think.
> 
> I guess we would have to make qemu-kvm-rhev ship a
> /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> sets the new limit. It wouldn't make sense to raise the
> limit for hosts that are not going to act as hypervisors.

Such files will have no effect. The limits.conf files are processed by PAM, and when libvirt launches QEMU and sets its UID, PAM is not involved in any way.

IOW, if we need to set limits for QEMU, libvirt has to set them explicitly. The same would apply for other apps launching QEMU, unless they actually use 'su' to run QEMU as a different account, which I don't believe any do.
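
A minimal Python sketch of what "set them explicitly" means for a launcher: the limit is applied in the child between fork and exec, and the exec'd program then inherits it without PAM being involved. The helper names are hypothetical and a Python interpreter stands in for QEMU; this is an illustration of the mechanism, not libvirt's implementation.

```python
import resource
import subprocess
import sys


def pick_allowed_soft_limit(wanted_bytes):
    """Clamp wanted_bytes to the current hard limit so no privilege is needed."""
    _, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if hard == resource.RLIM_INFINITY or wanted_bytes <= hard:
        return wanted_bytes
    return hard


def run_child_with_memlock(soft_bytes):
    """Start a child with its soft RLIMIT_MEMLOCK set before exec.

    Returns the soft limit that the child process itself reports.
    """
    def set_limit():
        # Runs in the child after fork and before exec.
        _, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (soft_bytes, hard))

    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_MEMLOCK)[0])"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        preexec_fn=set_limit,
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    target = pick_allowed_soft_limit(4096)
    print("child sees soft memlock limit:", run_child_with_memlock(target))
```

If the function passed as `preexec_fn` raises, for example because the requested value exceeds the hard limit, the child never reaches exec, which is the same kind of failure as the "Process exited prior to exec" error in the description.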
Comment 24 Andrea Bolognani 2016-06-14 07:14:00 EDT
(In reply to Daniel Berrange from comment #23)
> > I guess we would have to make qemu-kvm-rhev ship a
> > /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> > sets the new limit. It wouldn't make sense to raise the
> > limit for hosts that are not going to act as hypervisors.
> 
> Such files will have no effect. The limits.conf files are processed by PAM,
> and when libvirt launches QEMU and sets its UID, PAM is not involved in any
> way.
> 
> IOW, if we need to set limits for QEMU, libvirt has to set them explicitly.
> The same would apply for other apps launching QEMU, unless they actually use
> 'su' to run QEMU as a different account, which I don't believe any do.

For user sessions, the libvirt daemon is autostarted and
will inherit the user's limits.

I tried dropping

  *       hard    memlock         64000
  *       soft    memlock         64000

in /etc/security/limits.d/qemu-kvm-rhev-memlock.conf and,
after logging out and in again, I was able to install a
guest and use guestfish from my unprivileged account.
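
For reference, limits.conf expresses memlock values in KiB, so 64000 corresponds to 65536000 bytes (62.5 MiB), not exactly 64 MiB. The Python sketch below parses such a line and compares it with the 46137344 bytes from the error message in the description; the parser is a hypothetical helper that only handles the simple four-field form shown above.

```python
# Bytes requested in the error message from this bug's description.
REQUIRED_BYTES = 46137344


def parse_memlock_line(line):
    """Parse '<domain> <type> memlock <value>' and return (domain, type, bytes).

    The value is in KiB; 'unlimited', 'infinity' and '-1' map to None (no limit).
    """
    fields = line.split()
    if len(fields) != 4 or fields[2] != "memlock":
        raise ValueError(f"not a memlock entry: {line!r}")
    domain, kind, _, value = fields
    if value in ("unlimited", "infinity", "-1"):
        return domain, kind, None
    return domain, kind, int(value) * 1024


def limit_covers(limit_bytes, needed_bytes):
    """True if the limit (None meaning unlimited) is at least needed_bytes."""
    return limit_bytes is None or limit_bytes >= needed_bytes


if __name__ == "__main__":
    _, _, limit = parse_memlock_line("*       hard    memlock         64000")
    print("limit in bytes:", limit)
    print("covers the failing request:", limit_covers(limit, REQUIRED_BYTES))
```

Under these assumptions the 64000 KiB entry covers the request that failed for the 4096 MiB guest, while the default 64 KiB does not.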
Comment 25 Richard W.M. Jones 2016-06-14 07:28:43 EDT
(In reply to Andrea Bolognani from comment #22)
> > (c) I'm not quite sure what "direct mode" entails.
> 
> Basically libguestfs will call QEMU itself instead of going
> through libvirt. guestfish will give you this hint:
> 
>   libguestfs: error: could not create appliance through libvirt.
> 
>   Try running qemu directly without libvirt using this environment variable:
>   export LIBGUESTFS_BACKEND=direct
> 
> and if you do that you'll of course be able to avoid the error
> raised by libvirt.
> 
> I don't know what other implications there are to using the
> direct backend, though. Rich?

It's not supported, nor encouraged in RHEL.  In this case it's a DIY
workaround, but it ought to be fixed in libvirt (or qemu, or wherever,
but in any case not by end users).
Comment 26 Andrea Bolognani 2016-06-28 05:01:55 EDT
Moving this to qemu, as the only short-term (and possibly
long-term) solution seems to be the one outlined in
Comment 20 (proposal A) and POC-ed in Comment 24, ie. ship
a /etc/security/limits.d/qemu-memlock.conf file that raises
the memory locking limit to something like 64 MiB, thus
allowing regular users to run smallish guests.
Comment 28 Fedora Update System 2016-10-20 17:58:36 EDT
qemu-2.6.2-3.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a6e707557e
Comment 29 Fedora Update System 2016-10-26 18:26:41 EDT
qemu-2.6.2-4.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-9f5fc14b30
Comment 30 Fedora Update System 2016-10-28 15:51:35 EDT
qemu-2.6.2-4.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
