Description of problem:
When launching either a ppc64 or ppc64le guest (x86-64 host) I get:

  ERROR internal error: Process exited prior to exec: libvirt: error : cannot limit locked memory to 46137344: Operation not permitted

Version-Release number of selected component (if applicable):
libvirt-1.3.0-1.fc24.x86_64
kernel 4.2.6-301.fc23.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run this virt-install command:

  virt-install --name=tmp-fed0fb92 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora21 --arch ppc64le --machine pseries --initrd-inject=/tmp/tmp.sVjN8w5nyk '--extra-args=ks=file:/tmp.sVjN8w5nyk console=tty0 console=hvc0 proxy=http://cache.home.annexia.org:3128' --disk fedora-23-ppc64le,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/21/Server/ppc64le/os/ --nographics --noreboot

(The same failure happens with ppc64).
It's OK with an x86-64 guest.
I worked around it by increasing my user account's locked memory limit (ulimit -l) to unlimited. I wonder if the error message comes from qemu?
Smallest reproducer is this command (NB: as NON-root):

  $ virt-install --name=tmp-bz1293024 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora22 --disk /var/tmp/fedora-23.img,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/23/Server/ppc64le/os/ --nographics --noreboot --arch ppc64le

Note: If you are playing with ulimit, you have to kill libvirtd since it could use the previous ulimit from another session.
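A way to check which limit a running process really has, rather than what the current shell reports, is to read /proc/<pid>/limits on Linux. The following is a minimal Python sketch under that assumption; the helper name `locked_memory_limit` and the sample text are illustrative only and are not part of libvirt or virt-install.

```python
import re
from typing import Optional, Tuple


def locked_memory_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) for 'Max locked memory' from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            # Columns are separated by runs of two or more spaces:
            # name, soft limit, hard limit, units.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None


# Hypothetical sample, laid out like the kernel's /proc/<pid>/limits output.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max locked memory         65536                65536                bytes
Max open files            1024                 4096                 files
"""

if __name__ == "__main__":
    print(locked_memory_limit(SAMPLE))
```

To inspect a session libvirtd that may still carry an old limit, the same function could be fed the contents of /proc/<pid of libvirtd>/limits.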
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Rich, do you still see this with the latest rawhide? (The mem locking error comes from libvirt. Apparently ppc64 needs some explicit mem locking? That's what the code says, but I didn't dig deeper than that.)
There doesn't appear to be a Rawhide repo for ppc64le yet. Unless something has changed in libvirt or virt-install to fix this, I doubt very much that it is fixed.
Andrea, any thoughts on this? Have you seen this issue?
Still happening on libvirt-1.3.2-3.fc24.x86_64 (x86-64 host, running Ubuntu/ppc64le guest).
(In reply to Cole Robinson from comment #7)
> Andrea, any thoughts on this? Have you seen this issue?

I hadn't, thanks for bringing it up. The issue Rich is seeing is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1273480 having been fixed. Short version is that ppc64 guests always need some amount of memory to be locked, and that amount is guaranteed to be more than the default 64 KiB allowance. libvirt tries to raise the limit to prevent the allocation from failing, but it can only do that successfully when running as root.
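The privilege problem described here can be illustrated with the standard RLIMIT_MEMLOCK resource limit. This is a rough Python sketch, not libvirt's actual code: the function names are hypothetical, and the required size is simply the number from the error message in this report. Raising the hard limit needs privileges, so the attempt is expected to fail for a regular user whose hard limit is too low.

```python
import resource

REQUIRED_BYTES = 46137344  # value taken from the error message in this report


def fits_within(limit: int, required: int) -> bool:
    """True if `required` bytes can be locked under `limit`."""
    return limit == resource.RLIM_INFINITY or required <= limit


def try_raise_memlock(required: int) -> bool:
    """Try to make RLIMIT_MEMLOCK large enough; return False if not permitted."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if fits_within(soft, required):
        return True  # nothing to do
    # Keep the hard limit if it is already large enough, otherwise raise it too.
    new_hard = hard if fits_within(hard, required) else required
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (required, new_hard))
    except (ValueError, OSError):
        # An unprivileged process may not raise its hard limit.
        return False
    return True


if __name__ == "__main__":
    print("limit is sufficient:", try_raise_memlock(REQUIRED_BYTES))
```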
I set the architecture to ppc64le, but in fact it affects ppc64 also. In answer to comment 5, it affects Fedora 24 too.
(In reply to Richard W.M. Jones from comment #10)
> I set the architecture to ppc64le, but in fact it affects
> ppc64 also. In answer to comment 5, it affects Fedora 24 too.

Yeah, this will affect both ppc64 variants and any version of libvirt from 1.3.0 on. Unfortunately I don't really see a way to fix this: the memory locking limit really needs to be quite high on ppc64, definitely higher than the default: the fact that this was not enforced before was a bug and could lead to more trouble later on. When libvirtd is running as root we can adjust the limit ourselves quite easily; when it's running as a regular user, we're of course unable to do that. At least the error message is IMHO quite clear and hints at the solution.
Bug 1273480 seems to be all about hostdev assignment, which Rich isn't doing. I see this commit:

  commit 16562bbc587add5a03a01c8eb8607c9e05819607
  Author: Andrea Bolognani <abologna>
  Date: Fri Nov 13 10:58:07 2015 +0100

      qemu: Always set locked memory limit for ppc64 domains

      Unlike other architectures, ppc64 domains need to lock memory
      even when VFIO is not used.

But I don't see where the need for unconditional locked memory is explained... Can you point me to that discussion?
(In reply to Cole Robinson from comment #12)
> bug 1273480 seems to be all about hostdev assignment, which rich isn't
> doing. I see this commit:
>
> commit 16562bbc587add5a03a01c8eb8607c9e05819607
> Author: Andrea Bolognani <abologna>
> Date: Fri Nov 13 10:58:07 2015 +0100
>
> qemu: Always set locked memory limit for ppc64 domains
>
> Unlike other architectures, ppc64 domains need to lock memory
> even when VFIO is not used.
>
>
> But I don't see where the need for unconditional locked memory is
> explained... Can you point me to that discussion?

See David's detailed explanation[1] from back when the patch series was posted on libvir-list. On a related note, there's been some progress recently toward getting some of that memory actually accounted for.

[1] https://www.redhat.com/archives/libvir-list/2015-November/msg00769.html
Thanks for the pointer. So if ppc64 doesn't do this memlocking, do things fail 100% of the time? Or is this a heuristic that maybe is triggering a false positive? Rich, maybe you can edit libvirt and figure it out. If this has the potential to be wrong in the non-VFIO case, I suggest at least making it a non-fatal error if the daemon is unprivileged, and logging a VIR_WARN instead. An additional bit we could do is have qemu-system-ppc64 ship a /etc/security/limits.d file to up the memlock limit on ppc64 hosts.
(In reply to Cole Robinson from comment #14)
> Thanks for the pointer. So if ppc64 doesn't do this memlocking, do things
> fail 100% of the time? Or is this a heuristic that maybe is triggering a
> false positive? Rich maybe you can edit libvirt and figure it out.
>
> If this has the potential to be wrong in the non-VFIO case, I suggest at
> least making it a non-fatal error if the daemon is unprivileged, and logging
> a VIR_WARN instead.
>
> An additional bit we could do is have qemu-system-ppc64 ship a
> /etc/security/limits.d file to up the memlock limit on ppc64 hosts

My understanding is that the consequences of not raising the memory locking limit appropriately can be pretty severe. David, can you give us more details please? What could happen if users ran QEMU with the default memory locking limit of 64 KiB?
Cole,

The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees. Each guest needs its own hashed page table (HPT). These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image. When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).

At the moment, the host kernel doesn't actually need the locked memory limit - it allows unprivileged users (with permission to create VMs) to allocate HPTs anyway, but this is really a bug. As it stands a non-privileged user could create a whole pile of tiny VMs (it doesn't even need to actually execute any instructions in the VMs) and consume an unbounded amount of host memory with those 16MiB HPTs. So we plan to fix that in the kernel. In the meantime libvirt treats things as if the kernel enforced that limit even though it doesn't yet, to avoid having yet more ugly kernel version dependencies.

Andrea, would it make any sense to have failure of the setrlimit in libvirt cause only a warning, not a fatal error? In that case it wouldn't prevent things working in situations where it can for other reasons (old kernel which doesn't enforce limits, PR KVM which doesn't require it..).
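The rule of thumb above (HPT between 1/128th and 1/64th of guest RAM, never below 16MiB) can be turned into a quick estimate. This is only a sketch of that rule of thumb in Python; the function name `hpt_size_range` is made up for illustration, and it is not the exact calculation used by the kernel, QEMU, or libvirt.

```python
from typing import Tuple

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

HPT_MIN = 16 * MIB               # minimum HPT size per the rule of thumb
DEFAULT_MEMLOCK_LIMIT = 64 * KIB  # default allowance mentioned in this report


def hpt_size_range(guest_ram_bytes: int) -> Tuple[int, int]:
    """Estimate (low, high) HPT size in bytes for a guest of the given RAM size."""
    low = max(guest_ram_bytes // 128, HPT_MIN)
    high = max(guest_ram_bytes // 64, HPT_MIN)
    return low, high


if __name__ == "__main__":
    for ram in (1 * GIB, 4 * GIB):
        low, high = hpt_size_range(ram)
        print(f"{ram // GIB} GiB guest: HPT {low // MIB} to {high // MIB} MiB")
```

Even the 16MiB minimum is far above the 64 KiB default allowance, which is why any guest size runs into the limit.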
(In reply to David Gibson from comment #16)
[...]
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error? In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

Not really. Warnings are not presented to the user; they are only written to the log file, so they are very likely to be ignored.
(In reply to David Gibson from comment #16)
> Cole,
>
> The key thing here is that on ppc64, unlike x86, the hardware page tables
> are encoded as a big hash table, rather than a set of radix trees. Each
> guest needs its own hashed page table (HPT). These can get quite large - it
> can vary depending on a number of things, but the usual rule of thumb is
> that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
>
> For PAPR paravirtualized guests this HPT is accessed entirely via hypercall
> and does not exist within the guest's RAM - it needs to be allocated on the
> host above and beyond the guest's RAM image. When using the "HV" KVM
> implementation (the only one we're targeting) the HPT has to be _host_
> physically contiguous, unswappable memory (because it's read directly by
> hardware).
>
> At the moment, the host kernel doesn't actually need the locked memory limit
> - it allows unprivileged users (with permission to create VMs) to allocate
> HPTs anyway, but this is really a bug.

So IIUC the bug is that, by not accounting for that memory properly, the kernel is allowing it to be allocated as potentially non-contiguous and swappable, which will result in failure right away (non-contiguous) or as soon as it has been swapped out (swappable). Is that right?

> As it stands a non-privileged user
> could create a whole pile of tiny VMs (it doesn't even need to actually
> execute any instructions in the VMs) and consume an unbounded amount of host
> memory with those 16MiB HPTs.

That's not really something QEMU specific, though, is it? The same user could just as easily start a bunch of random processes, each one allocating 16MiB+ and get the same result.

> So we plan to fix that in the kernel. In the meantime libvirt treats things
> as if the kernel enforced that limit even though it doesn't yet, to avoid
> having yet more ugly kernel version dependencies.
>
>
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error? In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

I don't think that's a good idea. First of all, we'd have to be able to tell whether raising the limit is actually needed or not, which would probably be tricky - especially considering that libvirt currently doesn't know anything about the difference between HV and PR KVM. Most importantly, we'd be allowing users to start guests that we know full well may run into trouble later. I'd rather error out early than have the guest behave erratically down the line for no apparent reason. Peter's point about warnings having very little visibility is also a good one.
> > At the moment, the host kernel doesn't actually need the locked memory limit
> > - it allows unprivileged users (with permission to create VMs) to allocate
> > HPTs anyway, but this is really a bug.
> So IIUC the bug is that, by not accounting for that memory
> properly, the kernel is allowing it to be allocated as
> potentially non-contiguous and swappable, which will result
> in failure right away (non-contiguous) or as soon as it has
> been swapped out (swappable). Is that right?

No. The HPT *will* be allocated contiguous and non-swappable (it's allocated with CMA) - it's just not accounted against the process / user's locked memory limit. That's why this is a security bug.

> > As it stands a non-privileged user
> > could create a whole pile of tiny VMs (it doesn't even need to actually
> > execute any instructions in the VMs) and consume an unbounded amount of host
> > memory with those 16MiB HPTs.
> That's not really something QEMU specific, though, is it?
> The same user could just as easily start a bunch of random
> processes, each one allocating 16MiB+ and get the same result.

No, because in that case the memory would be non-contiguous and swappable.
Got it. So I guess our options are:

a) Raise locked memory limit for users to something like 64 MiB, so they can run guests of reasonable size (4 GiB) without running into errors. Appliances created by libguestfs are going to be even smaller than that, I assume, so they would work

b) Teach libvirt about the difference between kvm_hv and kvm_pr, only try to tweak the locked memory limit when using HV, and have libguestfs always use PR

c) Force libguestfs to use the direct backend on ppc64

d) Leave things as they are, basically restricting libguestfs usage to the root user

a) and c) are definitely hacks, but could be implemented fairly quickly and removed once a better solution is in place. b) looks like it would be the proper solution but, as with all things libvirt, rushing an implementation without thinking hard about the design has the potential to paint us into a corner. d) is probably not acceptable.
In the short term, I think we need to go with option (a). That's the only really feasible way we can handle this in the next RHEL release, I think.

(b).. I really dislike. We try to avoid explicitly exposing the PR/HV distinction even to qemu as much as possible - instead using explicit capabilities for various features. Exposing and using that distinction a layer beyond qemu is going to open several new cans of worms. For one thing, whether the kernel picks HV or PR can depend on a number of details of both host and guest configuration, so you can't really reliably know which one it's going to be before starting it.

(c) I'm not quite sure what "direct mode" entails.

(d) is.. yeah, certainly suboptimal.

Other things we could try:

(e) Change KVM so that if it's unable to allocate the HPT due to locked memory limit, it will fall back to PR-KVM. In a sense that's the most pedantically correct, but I dislike it, because I suspect the result will be lots of people's VMs going slow for non-obvious reasons.

(f) Put something distinctive in the error qemu reports when it hits the HPT allocation problem, and only have libvirt try to alter the limit and retry if qemu dies with that error. Involves an extra qemu invocation, which sucks.

(g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts, that will adjust users' locked memory limits based on some sort of # of VMs and max size of VMs values configured by admin. This is basically a sophisticated version of (a).

Ugh.. none of these are great :/.
(In reply to David Gibson from comment #21)
> In the short term, I think we need to go with option (a). That's the only
> really feasible way we can handle this in the next RHEL release, I think.

I guess we would have to make qemu-kvm-rhev ship a /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that sets the new limit. It wouldn't make sense to raise the limit for hosts that are not going to act as hypervisors.

> (b).. I really dislike. We try to avoid explicitly exposing the PR/HV
> distinction even to qemu as much as possible - instead using explicit
> capabilities for various features. Exposing and using that distinction a
> layer beyond qemu is going to open several new cans of worms. For one
> thing, whether the kernel picks HV or PR can depend on a number of details
> of both host and guest configuration, so you can't really reliably know
> which one it's going to be before starting it.

Okay then.

> (c) I'm not quite sure what "direct mode" entails.

Basically libguestfs will call QEMU itself instead of going through libvirt. guestfish will give you this hint:

  libguestfs: error: could not create appliance through libvirt.

  Try running qemu directly without libvirt using this environment variable:
  export LIBGUESTFS_BACKEND=direct

and if you do that you'll of course be able to avoid the error raised by libvirt. I don't know what other implications there are to using the direct backend, though. Rich?

> (d) is.. yeah, certainly suboptimal.
>
>
> Other things we could try:
>
> (e) Change KVM so that if it's unable to allocate the HPT due to locked
> memory limit, it will fall back to PR-KVM. In a sense that's the most
> pedantically correct, but I dislike it, because I suspect the result will be
> lots of people's VMs going slow for non-obvious reasons.

Yeah, doing this kind of stuff outside of user's control is never going to end well. Better to fail with a clear error message than trying to patch things up behind the scenes.
> (f) Put something distinctive in the error qemu reports when it hits the HPT
> allocation problem, and only have libvirt try to alter the limit and retry
> if qemu dies with that error. Involves an extra qemu invocation, which
> sucks.

libvirt is not really designed in a way that allows you to just try calling QEMU with some arguments and, if that fails, call it again with different arguments. So QEMU would have to expose the information through QMP somehow, for libvirt to probe beforehand. I'm not sure whether this approach would even be feasible.

> (g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts,
> that will adjust users locked memory limits based on some sort of # of VMs
> and max size of VMs values configured by admin. This is basically a
> sophisticated version of (a).

The limits are per-process, though. So the only thing that really matters is how much memory you want to allow for an unprivileged guest. PCI passthrough is not going to be a factor unless you're root, and in that case you can set the limit as you please.

> Ugh.. none of these are great :/.
(In reply to Andrea Bolognani from comment #22)
> (In reply to David Gibson from comment #21)
> > In the short term, I think we need to go with option (a). That's the only
> > really feasible way we can handle this in the next RHEL release, I think.
>
> I guess we would have to make qemu-kvm-rhev ship a
> /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> sets the new limit. It wouldn't make sense to raise the
> limit for hosts that are not going to act as hypervisors.

Such files will have no effect. The limits.conf files are processed by PAM, and when libvirt launches QEMU and sets its UID, PAM is not involved in any way.

IOW, if we need to set limits for QEMU, libvirt has to set them explicitly. The same would apply for other apps launching QEMU, unless they actually use 'su' to run QEMU as a different account, which I don't believe any do.
(In reply to Daniel Berrange from comment #23)
> > I guess we would have to make qemu-kvm-rhev ship a
> > /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> > sets the new limit. It wouldn't make sense to raise the
> > limit for hosts that are not going to act as hypervisors.
>
> Such files will have no effect. The limits.conf files are processed by PAM,
> and when libvirt launches QEMU and sets its UID, PAM is not involved in any
> way.
>
> IOW, if we need to set limits for QEMU, libvirt has to set them explicitly.
> The same would apply for other apps launching QEMU, unless they actually use
> 'su' to run QEMU as a different account, which I don't believe any do.

For user sessions, the libvirt daemon is autostarted and will inherit the user's limits. I tried dropping

  * hard memlock 64000
  * soft memlock 64000

in /etc/security/limits.d/qemu-kvm-rhev-memlock.conf and, after logging out and in again, I was able to install a guest and use guestfish from my unprivileged account.
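As a sanity check on those numbers: pam_limits documents the memlock item in KB, so 64000 corresponds to about 62.5 MiB, which is above the 46137344 bytes libvirt asked for in the original error. A small Python sketch of that arithmetic follows; the helper `memlock_bytes` is hypothetical and only handles plain numeric values, not keywords such as "unlimited".

```python
REQUESTED_BYTES = 46137344  # from the error message in the original report


def memlock_bytes(limits_line: str) -> int:
    """Convert a '<domain> <type> memlock <value>' limits line to bytes."""
    domain, kind, item, value = limits_line.split()
    if item != "memlock":
        raise ValueError(f"not a memlock entry: {limits_line!r}")
    # pam_limits expresses memlock in KB (units of 1024 bytes).
    return int(value) * 1024


if __name__ == "__main__":
    limit = memlock_bytes("* hard memlock 64000")
    print(limit, "bytes;", "sufficient" if limit >= REQUESTED_BYTES else "too low")
```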
(In reply to Andrea Bolognani from comment #22)
> > (c) I'm not quite sure what "direct mode" entails.
>
> Basically libguestfs will call QEMU itself instead of going
> through libvirt. guestfish will give you this hint:
>
> libguestfs: error: could not create appliance through libvirt.
>
> Try running qemu directly without libvirt using this environment variable:
> export LIBGUESTFS_BACKEND=direct
>
> and if you do that you'll of course be able to avoid the error
> raised by libvirt.
>
> I don't know what other implications there are to using the
> direct backend, though. Rich?

It's not supported, nor encouraged in RHEL. In this case it's a DIY workaround, but it ought to be fixed in libvirt (or qemu, or wherever, but in any case not by end users).
Moving this to qemu, as the only short-term (and possibly long-term) solution seems to be the one outlined in Comment 20 (proposal A) and POC-ed in Comment 24, i.e. ship a /etc/security/limits.d/qemu-memlock.conf file that raises the memory locking limit to something like 64 MiB, thus allowing regular users to run smallish guests.
qemu-2.6.2-3.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a6e707557e
qemu-2.6.2-4.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-9f5fc14b30
qemu-2.6.2-4.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
One possible fallout from this change: https://bugzilla.redhat.com/show_bug.cgi?id=2082149