Red Hat Bugzilla – Bug 1350772
Memory locking is not required for non-KVM ppc64 guests
Last modified: 2016-11-03 14:47:43 EDT
+++ This bug was initially created as a clone of Bug #1293024 +++

Description of problem:
When launching either a ppc64 or ppc64le guest (x86-64 host) I get:

ERROR internal error: Process exited prior to exec: libvirt: error : cannot limit locked memory to 46137344: Operation not permitted

Version-Release number of selected component (if applicable):
libvirt-1.3.0-1.fc24.x86_64
kernel 4.2.6-301.fc23.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run this virt-install command:

virt-install --name=tmp-fed0fb92 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora21 --arch ppc64le --machine pseries --initrd-inject=/tmp/tmp.sVjN8w5nyk '--extra-args=ks=file:/tmp.sVjN8w5nyk console=tty0 console=hvc0 proxy=http://cache.home.annexia.org:3128' --disk fedora-23-ppc64le,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/21/Server/ppc64le/os/ --nographics --noreboot

(The same failure happens with ppc64).

--- Additional comment from Richard W.M. Jones on 2015-12-19 04:56:29 EST ---

It's OK with an x86-64 guest.

--- Additional comment from Richard W.M. Jones on 2015-12-19 05:00:33 EST ---

I worked around it by increasing my user account's locked memory limit (ulimit -l) to unlimited. I wonder if the error message comes from qemu?

--- Additional comment from Richard W.M. Jones on 2015-12-19 05:04:44 EST ---

Smallest reproducer is this command (NB: as NON-root):

$ virt-install --name=tmp-bz1293024 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora22 --disk /var/tmp/fedora-23.img,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/23/Server/ppc64le/os/ --nographics --noreboot --arch ppc64le

Note: If you are playing with ulimit, you have to kill libvirtd since it could use the previous ulimit from another session.

--- Additional comment from Jan Kurik on 2016-02-24 09:09:40 EST ---

This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

--- Additional comment from Cole Robinson on 2016-03-16 19:43:19 EDT ---

Rich, do you still see this with latest rawhide? (The mem locking error comes from libvirt... apparently ppc64 needs some explicit mem locking? That's what the code says, but I didn't dig deeper than that.)

--- Additional comment from Richard W.M. Jones on 2016-03-17 12:11:30 EDT ---

There doesn't appear to be a Rawhide repo for ppc64le yet. Unless something has changed in libvirt or virt-install to fix this, I doubt very much that it is fixed.

--- Additional comment from Cole Robinson on 2016-03-17 12:13:51 EDT ---

Andrea, any thoughts on this? Have you seen this issue?

--- Additional comment from Richard W.M. Jones on 2016-03-24 14:46:38 EDT ---

Still happening on libvirt-1.3.2-3.fc24.x86_64 (x86-64 host, running Ubuntu/ppc64le guest).

--- Additional comment from Andrea Bolognani on 2016-03-29 09:11:01 EDT ---

(In reply to Cole Robinson from comment #7)
> Andrea, any thoughts on this? Have you seen this issue?

I hadn't, thanks for bringing it up.

The issue Rich is seeing is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1273480 having been fixed. Short version is that ppc64 guests always need some amount of memory to be locked, and that amount is guaranteed to be more than the default 64 KiB allowance.
libvirt tries to raise the limit to prevent the allocation from failing, but it can only do that successfully when running as root.

--- Additional comment from Richard W.M. Jones on 2016-04-07 15:51:02 EDT ---

I set the architecture to ppc64le, but in fact it affects ppc64 also. In answer to comment 5, it affects Fedora 24 too.

--- Additional comment from Andrea Bolognani on 2016-04-08 04:55:42 EDT ---

(In reply to Richard W.M. Jones from comment #10)
> I set the architecture to ppc64le, but in fact it affects
> ppc64 also. In answer to comment 5, it affects Fedora 24 too.

Yeah, this will affect both ppc64 variants and any version of libvirt from 1.3.0 on.

Unfortunately I don't really see a way to fix this: the memory locking limit really needs to be quite high on ppc64, definitely higher than the default: the fact that this was not enforced before was a bug and could lead to more trouble later on.

When libvirtd is running as root we can adjust the limit ourselves quite easily; when it's running as a regular user, we're of course unable to do that. At least the error message is IMHO quite clear and hints at the solution.

--- Additional comment from Cole Robinson on 2016-04-26 17:42:04 EDT ---

Bug 1273480 seems to be all about hostdev assignment, which Rich isn't doing. I see this commit:

commit 16562bbc587add5a03a01c8eb8607c9e05819607
Author: Andrea Bolognani <abologna@redhat.com>
Date:   Fri Nov 13 10:58:07 2015 +0100

    qemu: Always set locked memory limit for ppc64 domains

    Unlike other architectures, ppc64 domains need to lock memory
    even when VFIO is not used.

But I don't see where the need for unconditional locked memory is explained... Can you point me to that discussion?

--- Additional comment from Andrea Bolognani on 2016-04-28 08:08:52 EDT ---

(In reply to Cole Robinson from comment #12)
> Bug 1273480 seems to be all about hostdev assignment, which Rich isn't
> doing. I see this commit:
>
> commit 16562bbc587add5a03a01c8eb8607c9e05819607
> Author: Andrea Bolognani <abologna@redhat.com>
> Date:   Fri Nov 13 10:58:07 2015 +0100
>
>     qemu: Always set locked memory limit for ppc64 domains
>
>     Unlike other architectures, ppc64 domains need to lock memory
>     even when VFIO is not used.
>
> But I don't see where the need for unconditional locked memory is
> explained... Can you point me to that discussion?

See David's detailed explanation[1] from back when the patch series was posted on libvir-list.

On a related note, there's been some progress recently toward getting some of that memory actually accounted for.

[1] https://www.redhat.com/archives/libvir-list/2015-November/msg00769.html

--- Additional comment from Cole Robinson on 2016-04-29 08:00:32 EDT ---

Thanks for the pointer. So if ppc64 doesn't do this memlocking, do things fail 100% of the time? Or is this a heuristic that maybe is triggering a false positive? Rich, maybe you can edit libvirt and figure it out.

If this has the potential to be wrong in the non-VFIO case, I suggest at least making it a non-fatal error if the daemon is unprivileged, and logging a VIR_WARN instead.

An additional bit we could do is have qemu-system-ppc64 ship a /etc/security/limits.d file to up the memlock limit on ppc64 hosts.

--- Additional comment from Andrea Bolognani on 2016-05-05 04:51:40 EDT ---

(In reply to Cole Robinson from comment #14)
> Thanks for the pointer. So if ppc64 doesn't do this memlocking, do things
> fail 100% of the time? Or is this a heuristic that maybe is triggering a
> false positive?
> Rich, maybe you can edit libvirt and figure it out.
>
> If this has the potential to be wrong in the non-VFIO case, I suggest at
> least making it a non-fatal error if the daemon is unprivileged, and logging
> a VIR_WARN instead.
>
> An additional bit we could do is have qemu-system-ppc64 ship a
> /etc/security/limits.d file to up the memlock limit on ppc64 hosts.

My understanding is that the consequences of not raising the memory locking limit appropriately can be pretty severe.

David, can you give us more details please? What could happen if users ran QEMU with the default memory locking limit of 64 KiB?

--- Additional comment from David Gibson on 2016-05-26 02:08:22 EDT ---

Cole,

The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees. Each guest needs its own hashed page table (HPT). These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image. When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).

At the moment, the host kernel doesn't actually need the locked memory limit - it allows unprivileged users (with permission to create VMs) to allocate HPTs anyway, but this is really a bug. As it stands a non-privileged user could create a whole pile of tiny VMs (it doesn't even need to actually execute any instructions in the VMs) and consume an unbounded amount of host memory with those 16MiB HPTs. So we plan to fix that in the kernel. In the meantime libvirt treats things as if the kernel enforced that limit even though it doesn't yet, to avoid having yet more ugly kernel version dependencies.

Andrea, would it make any sense to have failure of the setrlimit in libvirt cause only a warning, not a fatal error? In that case it wouldn't prevent things working in situations where it can for other reasons (old kernel which doesn't enforce limits, PR KVM which doesn't require it..).

--- Additional comment from Peter Krempa on 2016-05-26 03:28:32 EDT ---

(In reply to David Gibson from comment #16)
[...]
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error? In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

Not really. Warnings are not presented to the user, just logged to the log file, so it's very likely to be ignored.

--- Additional comment from Andrea Bolognani on 2016-05-26 04:20:07 EDT ---

(In reply to David Gibson from comment #16)
> Cole,
>
> The key thing here is that on ppc64, unlike x86, the hardware page tables
> are encoded as a big hash table, rather than a set of radix trees. Each
> guest needs its own hashed page table (HPT). These can get quite large - it
> can vary depending on a number of things, but the usual rule of thumb is
> that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
>
> For PAPR paravirtualized guests this HPT is accessed entirely via hypercall
> and does not exist within the guest's RAM - it needs to be allocated on the
> host above and beyond the guest's RAM image. When using the "HV" KVM
> implementation (the only one we're targeting) the HPT has to be _host_
> physically contiguous, unswappable memory (because it's read directly by
> hardware).
>
> At the moment, the host kernel doesn't actually need the locked memory limit
> - it allows unprivileged users (with permission to create VMs) to allocate
> HPTs anyway, but this is really a bug.

So IIUC the bug is that, by not accounting for that memory properly, the kernel is allowing it to be allocated as potentially non-contiguous and swappable, which will result in failure right away (non-contiguous) or as soon as it has been swapped out (swappable). Is that right?

> As it stands a non-privileged user
> could create a whole pile of tiny VMs (it doesn't even need to actually
> execute any instructions in the VMs) and consume an unbounded amount of host
> memory with those 16MiB HPTs.

That's not really something QEMU specific, though, is it? The same user could just as easily start a bunch of random processes, each one allocating 16MiB+ and get the same result.

> So we plan to fix that in the kernel. In the meantime libvirt treats things
> as if the kernel enforced that limit even though it doesn't yet, to avoid
> having yet more ugly kernel version dependencies.
>
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error? In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

I don't think that's a good idea.

First of all, we'd have to be able to tell whether raising the limit is actually needed or not, which would probably be tricky - especially considering that libvirt currently doesn't know anything about the difference between HV and PR KVM.

Most importantly, we'd be allowing users to start guests that we know full well may run into trouble later. I'd rather error out early than have the guest behave erratically down the line for no apparent reason.

Peter's point about warnings having very little visibility is also a good one.

--- Additional comment from David Gibson on 2016-05-26 18:11:08 EDT ---

> > At the moment, the host kernel doesn't actually need the locked memory limit
> > - it allows unprivileged users (with permission to create VMs) to allocate
> > HPTs anyway, but this is really a bug.
>
> So IIUC the bug is that, by not accounting for that memory
> properly, the kernel is allowing it to be allocated as
> potentially non-contiguous and swappable, which will result
> in failure right away (non-contiguous) or as soon as it has
> been swapped out (swappable). Is that right?

No. The HPT *will* be allocated contiguous and non-swappable (it's allocated with CMA) - it's just not accounted against the process / user's locked memory limit. That's why this is a security bug.

> > As it stands a non-privileged user
> > could create a whole pile of tiny VMs (it doesn't even need to actually
> > execute any instructions in the VMs) and consume an unbounded amount of host
> > memory with those 16MiB HPTs.
>
> That's not really something QEMU specific, though, is it?
> The same user could just as easily start a bunch of random
> processes, each one allocating 16MiB+ and get the same result.
No, because in that case the memory would be non-contiguous and swappable.

--- Additional comment from Andrea Bolognani on 2016-06-09 10:26:56 EDT ---

Got it. So I guess our options are:

a) Raise locked memory limit for users to something like 64 MiB, so they can run guests of reasonable size (4 GiB) without running into errors. Appliances created by libguestfs are going to be even smaller than that, I assume, so they would work

b) Teach libvirt about the difference between kvm_hv and kvm_pr, only try to tweak the locked memory limit when using HV, and have libguestfs always use PR

c) Force libguestfs to use the direct backend on ppc64

d) Leave things as they are, basically restricting libguestfs usage to the root user

a) and c) are definitely hacks, but could be implemented fairly quickly and removed once a better solution is in place. b) looks like it would be the proper solution but, as with all things libvirt, rushing an implementation without thinking hard about the design has the potential to paint us in a corner. d) is probably not acceptable.

--- Additional comment from David Gibson on 2016-06-14 02:01:23 EDT ---

In the short term, I think we need to go with option (a). That's the only really feasible way we can handle this in the next RHEL release, I think.

(b).. I really dislike. We try to avoid explicitly exposing the PR/HV distinction even to qemu as much as possible - instead using explicit capabilities for various features. Exposing and using that distinction a layer beyond qemu is going to open several new cans of worms. For one thing, whether the kernel picks HV or PR can depend on a number of details of both host and guest configuration, so you can't really reliably know which one it's going to be before starting it.

(c) I'm not quite sure what "direct mode" entails.

(d) is.. yeah, certainly suboptimal.

Other things we could try:

(e) Change KVM so that if it's unable to allocate the HPT due to locked memory limit, it will fall back to PR-KVM. In a sense that's the most pedantically correct, but I dislike it, because I suspect the result will be lots of people's VMs going slow for non-obvious reasons.

(f) Put something distinctive in the error qemu reports when it hits the HPT allocation problem, and only have libvirt try to alter the limit and retry if qemu dies with that error. Involves an extra qemu invocation, which sucks.

(g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts, that will adjust users' locked memory limits based on some sort of # of VMs and max size of VMs values configured by admin. This is basically a sophisticated version of (a).

Ugh.. none of these are great :/.

--- Additional comment from Andrea Bolognani on 2016-06-14 06:33:32 EDT ---

(In reply to David Gibson from comment #21)
> In the short term, I think we need to go with option (a). That's the only
> really feasible way we can handle this in the next RHEL release, I think.

I guess we would have to make qemu-kvm-rhev ship a /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that sets the new limit. It wouldn't make sense to raise the limit for hosts that are not going to act as hypervisors.

> (b).. I really dislike. We try to avoid explicitly exposing the PR/HV
> distinction even to qemu as much as possible - instead using explicit
> capabilities for various features. Exposing and using that distinction a
> layer beyond qemu is going to open several new cans of worms.
> For one thing, whether the kernel picks HV or PR can depend on a number of
> details of both host and guest configuration, so you can't really reliably
> know which one it's going to be before starting it.

Okay then.

> (c) I'm not quite sure what "direct mode" entails.

Basically libguestfs will call QEMU itself instead of going through libvirt. guestfish will give you this hint:

libguestfs: error: could not create appliance through libvirt.

Try running qemu directly without libvirt using this environment variable:
export LIBGUESTFS_BACKEND=direct

and if you do that you'll of course be able to avoid the error raised by libvirt.

I don't know what other implications there are to using the direct backend, though. Rich?

> (d) is.. yeah, certainly suboptimal.
>
> Other things we could try:
>
> (e) Change KVM so that if it's unable to allocate the HPT due to locked
> memory limit, it will fall back to PR-KVM. In a sense that's the most
> pedantically correct, but I dislike it, because I suspect the result will be
> lots of people's VMs going slow for non-obvious reasons.

Yeah, doing this kind of stuff outside of user's control is never going to end well. Better to fail with a clear error message than trying to patch things up behind the scenes.

> (f) Put something distinctive in the error qemu reports when it hits the HPT
> allocation problem, and only have libvirt try to alter the limit and retry
> if qemu dies with that error. Involves an extra qemu invocation, which
> sucks.

libvirt is not really designed in a way that allows you to just try calling QEMU with some arguments and, if that fails, call it again with different arguments. So QEMU would have to expose the information through QMP somehow, for libvirt to probe beforehand. I'm not sure whether this approach would even be feasible.

> (g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts,
> that will adjust users' locked memory limits based on some sort of # of VMs
> and max size of VMs values configured by admin. This is basically a
> sophisticated version of (a).

The limits are per-process, though. So the only thing that really matters is how much memory you want to allow for an unprivileged guest. PCI passthrough is not going to be a factor unless you're root, and in that case you can set the limit as you please.

> Ugh.. none of these are great :/.

--- Additional comment from Daniel Berrange on 2016-06-14 06:40:48 EDT ---

(In reply to Andrea Bolognani from comment #22)
> (In reply to David Gibson from comment #21)
> > In the short term, I think we need to go with option (a). That's the only
> > really feasible way we can handle this in the next RHEL release, I think.
>
> I guess we would have to make qemu-kvm-rhev ship a
> /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> sets the new limit. It wouldn't make sense to raise the
> limit for hosts that are not going to act as hypervisors.

Such files will have no effect. The limits.conf files are processed by PAM, and when libvirt launches QEMU and sets its UID, PAM is not involved in any way.

IOW, if we need to set limits for QEMU, libvirt has to set them explicitly. The same would apply for other apps launching QEMU, unless they actually use 'su' to run QEMU as a different account, which I don't believe any do.
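To illustrate Daniel's point about PAM not being involved: whatever launches QEMU has to set the limit on the child process itself. A minimal sketch using prlimit(1) from util-linux (the 64 MiB value is illustrative; libvirt performs the equivalent internally via setrlimit()/prlimit() rather than by shelling out):

$ # run a child with a 64 MiB RLIMIT_MEMLOCK, regardless of any limits.d setting
$ prlimit --memlock=67108864:67108864 sh -c 'ulimit -l'
65536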
--- Additional comment from Andrea Bolognani on 2016-06-14 07:14:00 EDT ---

(In reply to Daniel Berrange from comment #23)
> > I guess we would have to make qemu-kvm-rhev ship a
> > /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> > sets the new limit. It wouldn't make sense to raise the
> > limit for hosts that are not going to act as hypervisors.
>
> Such files will have no effect. The limits.conf files are processed by PAM,
> and when libvirt launches QEMU and sets its UID, PAM is not involved in any
> way.
>
> IOW, if we need to set limits for QEMU, libvirt has to set them explicitly.
> The same would apply for other apps launching QEMU, unless they actually use
> 'su' to run QEMU as a different account, which I don't believe any do.

For user sessions, the libvirt daemon is autostarted and will inherit the user's limits. I tried dropping

* hard memlock 64000
* soft memlock 64000

in /etc/security/limits.d/qemu-kvm-rhev-memlock.conf and, after logging out and in again, I was able to install a guest and use guestfish from my unprivileged account.

--- Additional comment from Richard W.M. Jones on 2016-06-14 07:28:43 EDT ---

(In reply to Andrea Bolognani from comment #22)
> > (c) I'm not quite sure what "direct mode" entails.
>
> Basically libguestfs will call QEMU itself instead of going
> through libvirt. guestfish will give you this hint:
>
> libguestfs: error: could not create appliance through libvirt.
>
> Try running qemu directly without libvirt using this environment variable:
> export LIBGUESTFS_BACKEND=direct
>
> and if you do that you'll of course be able to avoid the error
> raised by libvirt.
>
> I don't know what other implications there are to using the
> direct backend, though. Rich?

It's not supported, nor encouraged in RHEL. In this case it's a DIY workaround, but it ought to be fixed in libvirt (or qemu, or wherever, but in any case not by end users).

--- Additional comment from Andrea Bolognani on 2016-06-28 05:01:55 EDT ---

Moving this to qemu, as the only short-term (and possibly long-term) solution seems to be the one outlined in Comment 20 (proposal A) and POC-ed in Comment 24, i.e. ship a /etc/security/limits.d/qemu-memlock.conf file that raises the memory locking limit to something like 64 MiB, thus allowing regular users to run smallish guests.
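For context, the ~64 MiB figure lines up with David's rule of thumb from earlier in the thread: the HPT is roughly 1/128th to 1/64th of guest RAM, with a 16 MiB floor, and the memlock value in limits.conf is expressed in KiB (so 64000 is roughly 62.5 MiB). A rough back-of-the-envelope check for the 4 GiB guest from the original report (illustrative shell arithmetic only; libvirt's exact computation may differ):

$ # HPT estimate for a 4096 MiB guest, per the 1/128 - 1/64 rule of thumb
$ echo "$(( 4096 / 128 )) MiB to $(( 4096 / 64 )) MiB"
32 MiB to 64 MiB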
I just realized that the original report was about this failure happening on an x86_64 host.

In the case of TCG guests, regardless of the host architecture, it's my understanding that memory locking should not be required.

David, can you please confirm that?
(In reply to Andrea Bolognani from comment #1)
> I just realized that the original report was about this
> failure happening on an x86_64 host.

That's not (really) right. It was actually launching an L2 guest inside an L1 host, where:

L2 guest (ppc64) - failed because of memory locking
L1 guest (ppc64) - running OK
L0 host (x86-64)

In this case I'm only doing nested virt while waiting for IBM to send me some POWER hardware. Ha ha, not really.
(In reply to Richard W.M. Jones from comment #2)
> > I just realized that the original report was about this
> > failure happening on an x86_64 host.
>
> That's not (really) right. It was actually launching an L2
> guest inside an L1 host, where:
>
> L2 guest (ppc64) - failed because of memory locking
> L1 guest (ppc64) - running OK
> L0 host (x86-64)
>
> In this case I'm only doing nested virt while waiting for
> IBM to send me some POWER hardware. Ha ha, not really.

Of course the first ppc64 guest (L1) had to be using TCG because of the architecture mismatch. But was it started by libvirt? And if so, was it the user daemon or the system one?
(In reply to Andrea Bolognani from comment #3)
> (In reply to Richard W.M. Jones from comment #2)
> > > I just realized that the original report was about this
> > > failure happening on an x86_64 host.
> >
> > That's not (really) right. It was actually launching an L2
> > guest inside an L1 host, where:
> >
> > L2 guest (ppc64) - failed because of memory locking
> > L1 guest (ppc64) - running OK
> > L0 host (x86-64)
> >
> > In this case I'm only doing nested virt while waiting for
> > IBM to send me some POWER hardware. Ha ha, not really.
>
> Of course the first ppc64 guest (L1) had to be using TCG
> because of the architecture mismatch.

For sure.

> But was it started by
> libvirt? And if so, was it the user daemon or the system
> one?

Yes, libvirt, and using the system connection.

However I am not clear if the *L2* guest would be using TCG, or whether qemu emulates enough of POWER that it can emulate KVM too (albeit really slowly, of course). For example if you run an L2 guest on x86-64, even without nested KVM, the L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64 running the L1 guest can emulate AMD's virt extensions.
(In reply to Richard W.M. Jones from comment #4)
> > > > I just realized that the original report was about this
> > > > failure happening on an x86_64 host.
> > >
> > > That's not (really) right. It was actually launching an L2
> > > guest inside an L1 host, where:
> > >
> > > L2 guest (ppc64) - failed because of memory locking
> > > L1 guest (ppc64) - running OK
> > > L0 host (x86-64)
> > >
> > > In this case I'm only doing nested virt while waiting for
> > > IBM to send me some POWER hardware. Ha ha, not really.
> >
> > Of course the first ppc64 guest (L1) had to be using TCG
> > because of the architecture mismatch.
>
> For sure.
>
> > But was it started by
> > libvirt? And if so, was it the user daemon or the system
> > one?
>
> Yes, libvirt, and using the system connection.

The system daemon, running as root, was able to raise the locked memory limit, which is why you didn't run into any error. So it all checks out :)

This BZ is about teaching libvirt that TCG guests don't need to lock memory, which would make you able to run TCG guests, either on x86_64 or ppc64, from the user daemon. Assuming David confirms that TCG doesn't need to lock memory, that is :)

> However I am not clear if the *L2* guest would be using TCG,
> or whether qemu emulates enough of POWER that it can emulate
> KVM too (albeit really slowly, of course). For example if you
> run an L2 guest on x86-64, even without nested KVM, the
> L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64
> running the L1 guest can emulate AMD's virt extensions.

I think the L2 guest will not be able to use kvm_hv, but will fall back to kvm_pr instead.
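For anyone following along, the two daemons Andrea mentions are reached through different connection URIs, and a session daemon keeps whatever memory locking limit it inherited when it was autostarted. A couple of standard commands to check both (the output line and the use of pgrep are illustrative):

$ virsh --connect qemu:///system list --all     # system daemon, runs as root and can raise the limit
$ virsh --connect qemu:///session list --all    # per-user daemon, bound by the user's ulimit -l
$ grep "locked memory" /proc/$(pgrep -n -u $USER libvirtd)/limits
Max locked memory         65536                65536                bytes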
> In the case of TCG guests, regardless of the host
> architecture, it's my understanding that memory locking
> should not be required.
> David, can you please confirm that?

That's correct, unless you're using VFIO devices with TCG, in which case they will need their own memlock quota, as usual.

[Richard]
> However I am not clear if the *L2* guest would be using TCG,
> or whether qemu emulates enough of POWER that it can emulate
> KVM too (albeit really slowly, of course). For example if you
> run an L2 guest on x86-64, even without nested KVM, the
> L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64
> running the L1 guest can emulate AMD's virt extensions.

[Andrea]
> I think the L2 guest will not be able to use kvm_hv, but
> will fall back to kvm_pr instead.

The L2 guest could be using either KVM PR or TCG, I'm not sure.

The L1 guest is a PAPR (paravirtualized) guest, which runs with the HV (hypervisor mode) bit *off*. This has to be the case, because we don't support emulating a bare metal Power machine with full HV mode emulation in qemu. There are patches gradually on the way to add a new "powernv" machine type which will emulate bare metal, but they're not in yet.

Using KVM HV requires a host running in hypervisor mode. Since the L1 guest is not in hypervisor mode, it won't even attempt to use KVM HV.

KVM PR could work for the L2 guest, however, RHEL by default won't load the KVM PR module. So if L1 is RHEL, and you haven't manually loaded the module, I'd expect the L2 guest to be running under TCG instead.

All of which underscores the basic problem here: it's not easy for libvirt to tell what emulation mode a guest will run in until it's running, which is a problem if we need to conditionally adjust the locked memory limit beforehand. I don't have any good ideas about how to deal with that.
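A quick way to check which KVM implementation, if any, is available inside the L1 guest is to look at the loaded modules (a sketch; kvm_hv and kvm_pr are the module names used on ppc64):

$ lsmod | grep kvm     # see which KVM modules, if any, are loaded
# modprobe kvm_pr      # kvm_pr is not loaded by default on RHEL; kvm_hv needs hypervisor mode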
As checked in our CI jobs, the memory lock problem failed with:

libvirt 1.3.5-1.el7
kernel 3.10.0-327.el7
qemu-kvm-rhev 2.6.0-6.el7

Now retest with updated qemu-kvm-rhev packages:

libvirt-1.3.5-1.el7.ppc64le
qemu-kvm-rhev-2.6.0-10.el7.ppc64le
kernel-3.10.0-327.el7.ppc64le

Steps:

# useradd new_user
# su - new_user
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     avocado-vt-vm1                 shut off

$ virsh dumpxml avocado-vt-vm1
<domain type='kvm'>
  <name>avocado-vt-vm1</name>
  <uuid>1c2363d5-90da-4f59-b1f8-25fbb4bec2d8</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='ppc64le' machine='pseries-rhel7.3.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/autotest.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f4:85:91'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <input type='keyboard' bus='usb'/>
    <input type='mouse' bus='usb'/>
    <graphics type='vnc' port='-1' autoport='yes'>
      <listen type='address'/>
    </graphics>
    <video>
      <model type='vga' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
</domain>

$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Failed to connect socket to '/home/new_user/.cache/libvirt/virtlogd-sock': Connection refused

There is some problem with virtlogd (on x86_64 this works); start virtlogd under this user to work around it:

$ virtlogd --daemon
$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: /usr/libexec/qemu-bridge-helper --use-vnet --br=virbr0 --fd=24: failed to communicate with bridge helper: Transport endpoint is not connected
stderr=libvirt: error : internal error: cannot apply process capabilities -1

Another problem: qemu-bridge-helper failed; will file a separate bug for this.

The memory lock problem can't be reproduced with qemu-kvm-rhev-2.6.0-10.el7.ppc64le.
stderr=libvirt: error : internal error: cannot apply process capabilities -1 is likely to be bug 1351995 (ie. a completely different thing)
(In reply to Richard W.M. Jones from comment #9)
> stderr=libvirt: error : internal error: cannot apply process capabilities -1
> is likely to be bug 1351995 (ie. a completely different thing)

Yeah, I already checked on Friday that that was the case. Didn't get around to updating the BZ, though.
As checked in the CI job for 2.0.0-1, with audit 2.6.2-1 (which fixed the bug from comment #8), the bug is now reproduced with these packages:

libvirt 2.0.0-1.el7
kernel 3.10.0-327.el7
qemu-kvm-rhev 2.6.0-11.el7
A fix for this issue has been posted upstream:
https://www.redhat.com/archives/libvir-list/2016-July/msg00072.html
The fix has been committed upstream.

commit cd89d3451b8efcfed05ff1f4a91d9b252dbe26bc
Author: Andrea Bolognani <abologna@redhat.com>
Date:   Wed Jun 29 10:22:32 2016 +0200

    qemu: Memory locking is only required for KVM guests on ppc64

    Due to the way the hardware works, KVM on ppc64 always requires
    memory locking; however, that is not the case for non-KVM ppc64
    guests, eg. ppc64 guests that are running on x86_64 with TCG.

    Only require memory locking for ppc64 guests if they are using
    KVM or, as it's the case for all architectures, they have host
    devices assigned using VFIO.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1350772

v2.0.0-37-gcd89d34
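With this change, whether libvirt touches the locked memory limit on ppc64 comes down to the domain type (plus VFIO hostdev assignment, as before). A quick way to check what a given definition requests (the domain name is taken from the test comments in this bug; the annotation is a summary of the commit message, not libvirt output):

$ virsh dumpxml avocado-vt-vm1 | grep '<domain type'
<domain type='kvm'>
$ # type='kvm'  -> memory locking still required, limit gets raised
$ # type='qemu' -> TCG guest, no memory locking needed after this fix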
*Packages used to reproduce:

qemu-kvm-rhev-2.6.0-11.el7.ppc64le
libvirt-2.0.0-1.el7.ppc64le
kernel-3.10.0-327.el7.ppc64le

*Reproduced with 2 scenarios:

1. Non-root user starts a guest on a PPC host

Steps:
# useradd dzheng
# su - dzheng
$ virsh define /tmp/rpm/libvirt/guest.xml
Domain avocado-vt-vm1 defined from /tmp/rpm/libvirt/guest.xml
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     avocado-vt-vm1                 shut off

$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: Process exited prior to exec: libvirt: error : cannot limit locked memory to 20971520: Operation not permitted

2. libguestfs-test-tool

Install libvirt, qemu-kvm-rhev, and libguestfs packages.
# systemctl restart libvirtd

Log on as root again
# ulimit -l
64
# su - dzheng
$ ulimit -l
64
$ /usr/bin/libguestfs-test-tool
...
Original error from libvirt: internal error: Process exited prior to exec: libvirt: error : cannot limit locked memory to 18874368: Operation not permitted [code=1 int1=-1]
libguestfs-test-tool: failed to launch appliance

See attachment (ulimit_64_dzheng_qemu_11_fail.libguestfs.log)

As workaround, log on as root
# ulimit -l unlimited
# su - dzheng
$ ulimit -l
unlimited
$ /usr/bin/libguestfs-test-tool
...
libguestfs: appliance is up
Guest launched OK.
...
===== TEST FINISHED OK =====

See attachment (ulimit_unlimited_dzheng_qemu_11_pass.libguestfs.log)
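As a side note, the byte values in those error messages line up with the per-guest HPT overhead discussed earlier in the thread; converting them to MiB (illustrative arithmetic only):

$ echo "$(( 20971520 / 1024 / 1024 )) MiB, $(( 18874368 / 1024 / 1024 )) MiB"
20 MiB, 18 MiB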
Created attachment 1179117 [details] libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-11.el7
Created attachment 1179118 [details] libguestfs-test-tool pass output with qemu-kvm-rhev-2.6.0-11.el7 unlimited work around
(In reply to Dan Zheng from comment #16)
> *Packages used to reproduce:
>
> qemu-kvm-rhev-2.6.0-11.el7.ppc64le
> libvirt-2.0.0-1.el7.ppc64le
> kernel-3.10.0-327.el7.ppc64le

The fix has been included in libvirt-2.0.0-2.el7, so the version you're using is too old... Or are you creating a baseline for testing the fix?
Andrea,
Yes, comment 16 is just to reproduce the issue. And I did upgrade qemu to qemu-kvm-rhev-2.6.0-13.el7.ppc64le, but libvirt is still libvirt-2.0.0-1.el7 without the upgrade. Then the scenarios in comment 16 can pass without your new code. So I am thinking about what scenario would prove that your code is required and takes effect.

BTW, it is hard for me to run a virt-install --arch ppc64le ... on an x86 host for the scenario used in the beginning of the bug because I cannot set up TCG using downstream packages, I think. Any other suggestions?
(In reply to Dan Zheng from comment #20)
> Andrea,
> Yes, comment 16 is just to reproduce the issue. And I did upgrade qemu to
> qemu-kvm-rhev-2.6.0-13.el7.ppc64le, but libvirt is still libvirt-2.0.0-1.el7
> without the upgrade. Then the scenarios in comment 16 can pass without your
> new code. So I am thinking about what scenario would prove that your code is
> required and takes effect.

Scenario 1 failed outright; scenario 2 failed until you used a workaround. Those are the failures you're looking for.

You either need to keep qemu-kvm-rhev at version 2.6.0-11.el7, or run

$ ulimit -l 64

to make sure your memory locking limit is very low; additionally, you need to make sure that the guest XML starts with

<domain type='qemu'>

which tells libvirt to use TCG instead of KVM.

After you've done this, scenario 1 should fail with libvirt-2.0.0-1.el7 and succeed with libvirt-2.0.0-2.el7.

Not sure if there's a way to force libguestfs-test-tool to use TCG in order to test scenario 2... Rich? :)

> BTW, it is hard for me to run a virt-install --arch ppc64le ... on an
> x86 host for the scenario used in the beginning of the bug because I cannot
> set up TCG using downstream packages, I think. Any other suggestions?

We don't ship qemu-system-ppc64 on x86_64, so running ppc64 guests on x86_64 hosts is not a relevant use case for downstream. No need to test it.
Yes: http://libguestfs.org/guestfs.3.html#force_tcg
(In reply to Richard W.M. Jones from comment #22)
> Yes: http://libguestfs.org/guestfs.3.html#force_tcg

Sweet! Thank you :)
*Packages used to reproduce:

qemu-kvm-rhev-2.6.0-13.el7.ppc64le
libvirt-2.0.0-2.el7.ppc64le
kernel-3.10.0-461.el7.ppc64le
libguestfs-1.32.6-1.el7.ppc64le

Scenario 1. Non-root user starts a guest on a PPC host

Steps:
# useradd dzheng
# su - dzheng
Edit guest.xml to make sure it uses <domain type='qemu'>
$ virsh define /tmp/rpm/libvirt/guest.xml
Domain guest1 defined from /tmp/rpm/libvirt/guest.xml
$ ulimit -l 64
$ ulimit -l 64
$ virsh start avocado-vt-vm1
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     guest1                         running

User can log on the guest.

--- With libvirt-2.0.0-1.el7.ppc64le, scenario 1 can be reproduced with the same error message as in comment 16. Pass.

*******************************************************

Scenario 2. libguestfs-test-tool

# su - dzheng
$ ulimit -l 64
$ export LIBGUESTFS_BACKEND_SETTINGS=force_tcg
$ /usr/bin/libguestfs-test-tool
[ 1.913667] Rebooting in 1 seconds..libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages
libguestfs: child_cleanup: 0x1000f300340: child process died
libguestfs: error: guestfs_launch failed, see earlier error messages
libguestfs-test-tool: failed to launch appliance

$ ulimit -l 65536
Same error message.

-- With libvirt-2.0.0-1.el7.ppc64le, same error message.

See attachment libguestfs-test-tool-tcg-64-fail.log
Created attachment 1180943 [details] libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-13.el7
Okay, there is a different bug that causes libguestfs-test-tool to fail in this situation - I've just filed it as Bug 1350772. So I think it's fair to ignore the libguestfs-test-tool failure for the moment, and just test that a regular libvirt TCG guest can be started by an unprivileged user with a low memory locking limit.

What kind of OS is installed in the avocado-vt-vm1 guest you tested in Comment 24? According to my tests for Bug 1350772, it would have to be something oldish in order to boot in TCG mode...
Andrea,
The OS in avocado-vt-vm1 is Red Hat Enterprise Linux Server release 7.2 (Maipo), kernel 3.10.0-327.3.1.el7.ppc64le. Is there any other information you need?
That explains it - RHEL 7.2 guests can boot fine on the POWER 7 processor that TCG emulates by default. No more information needed from my side, and I think the bug can be moved to VERIFIED now.
Based on comment 28 above, I am moving this to VERIFIED now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2577.html