Hardware is an HP Moonshot with m400 aarch64 cartridges. The cartridge is running RHEL 7.5 alt. The VM is Fedora 28 armv7 with all updates, acting as a Fedora koji builder.

From time to time (could be caused by builds?) the VM will move to state "paused" and become unresponsive. Trying resume on it gives:

# virsh resume buildvm-armv7-14.arm.fedoraproject.org
error: Failed to resume domain buildvm-armv7-14.arm.fedoraproject.org
error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

Then a reset and resume brings it back, but it rebooted:

# virsh reset buildvm-armv7-14.arm.fedoraproject.org
Domain buildvm-armv7-14.arm.fedoraproject.org was reset

# virsh resume buildvm-armv7-14.arm.fedoraproject.org
Domain buildvm-armv7-14.arm.fedoraproject.org resumed

$ ssh buildvm-armv7-14.arm.fedoraproject.org
# w
 20:37:33 up 0 min,  1 user,  load average: 0.73, 0.16, 0.05

On the host, /var/log/libvirt/qemu/buildvm-armv7-14.arm.fedoraproject.org.log has:

2018-05-09 06:26:51.401+0000: starting up libvirt version: 3.9.0, package: 14.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-07-13:59:11, arm64-017.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-ma-2.10.0-21.el7), hostname: aarch64-c14n1.arm.fedoraproject.org
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=buildvm-armv7-14.arm.fedoraproject.org,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/master-key.aes -machine virt-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,aarch64=off -m 24576 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e2be82ec-efa8-445a-ba2b-d4b2a4f832b3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -kernel /var/lib/libvirt/images/vmlinuz-4.16.6-302.fc28.armv7hl+lpae -initrd /var/lib/libvirt/images/initramfs-4.16.6-302.fc28.armv7hl+lpae.img -append ' root=UUID=3190f023-985e-4e7f-88b5-2f3c576f7987 ro net.ifnames=0 console=ttyAMA0 LANG=en_US.UTF-8' -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 -device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 -device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 -device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 -device qemu-xhci,p2=8,p3=8,id=usb,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive file=/dev/vg_Server/buildvm-armv7-14.arm.fedoraproject.org,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:1e:58,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -serial chardev:charserial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-5-buildvm-armv7-14.arm/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -msg timestamp=on
2018-05-09 06:26:51.402+0000: Domain id=5 is tainted: host-cpu
2018-05-09T06:26:51.483671Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
error: kvm run failed Function not implemented
R00=b6fe6350 R01=b6fe6370 R02=00000006 R03=b6abc138
R04=b6fe6350 R05=b6fe8000 R06=6ffffeff R07=b6abc138
R08=b6c3744c R09=bee2f714 R10=b6fe64a8 R11=bee2f694
R12=b6abc000 R13=bee2f580 R14=b6fbd2f4 R15=b6fc2758
PSR=a0030010 N-C- A usr32

Will try to detect some pattern to the pauses.
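(One extra datapoint that may be worth capturing next time a builder wedges: libvirt records why it paused the domain. A minimal sketch, using the domain name above:)

# Ask libvirt for the pause reason; a guest stopped by a QEMU/KVM
# failure typically reports something like "paused (crashed)"
virsh domstate buildvm-armv7-14.arm.fedoraproject.org --reason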
Kevin, can we make this bug public?
(In reply to Florian Weimer from comment #2)
> Kevin, can we make this bug public?

As far as I am concerned, absolutely!

Some more data: Fedora kernel builds have been failing with a "bus error" on armv7, which seems like it could well be related to this. In those guests, kernel builds leave this in dmesg:

[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94

and there is a core dump:

Message: Process 26075 (sh) of user 1000 dumped core.
Stack trace of thread 26075:
#0 0x00000000b6df4d94 n/a (/usr/lib/libc-2.27.9000.so)

A Fedora 27 armv7 builder works as expected, so that points at perhaps something in the Fedora 28 armv7 install causing this? I suppose this could be a separate issue too, but they seem related.
More info:

f27-builder (where kernel builds complete fine): 4.16.7-200.fc27.armv7hl+lpae
f28-builder (where kernel builds hit the bus error): 4.16.7-300.fc28.armv7hl+lpae

So I don't think this is caused by the kernel. Could it be a glibc issue in f28?
(In reply to Kevin Fenzi from comment #5)
> f27-builder (where kernel builds complete fine): 4.16.7-200.fc27.armv7hl+lpae
>
> f28-builder (where kernel builds hit the bus error): 4.16.7-300.fc28.armv7hl+lpae

Hmm. Are you suggesting that this is not solely a hypervisor issue?

> So I don't think this is caused by the kernel. Could it be a glibc issue in f28?

If it's a glibc issue, I would expect it to be more deterministic.

It could be that the kernel compiled by GCC 8 has issues, though. That would only appear on Fedora 28, not Fedora 27, which still has GCC 7.
I'm not totally convinced that this issue is the root cause of some failures I've been facing when building SSSD[0], but it may be. Maybe it's something related to glibc, as suggested by Kevin (or even a totally different issue happening on armv7hl).

I've done some tests using mock on armv7hl hardware and the SSSD build passes without any issue. However, it's constantly failing under Fedora infra.

[0]: https://koji.fedoraproject.org/koji/taskinfo?taskID=26957996

Florian, if you need some info that may help you investigate a possible failure in glibc, please let me know how I can help.
(In reply to Fabiano Fidêncio from comment #7)
> Florian, if you need some info that may help you investigate a possible
> failure in glibc, please let me know how I can help.

We need a somewhat reliable reproducer outside Koji, to take the KVM-on-aarch64 configuration out of the picture.
(In reply to Florian Weimer from comment #6)
> Hmm. Are you suggesting that this is not solely a hypervisor issue?

There is very likely a hypervisor issue here, given the KVM error message shown in the QEMU logs: "kvm run failed Function not implemented". Based on this error message, though, it is probably an issue that has always existed but was never previously tickled by the guest OS. Thus some change in the F28 toolchain, kernel, or userspace has triggered new codepaths that expose an existing KVM limitation. If we can at least understand what changed in the guest OS to trigger it, it might help identify either a way to avoid the bug or a way to fix the hypervisor.

NB, armv7l-on-aarch64 is a host configuration that gets very little attention from KVM maintainers either upstream or downstream - Fedora is probably the most significant user of it that we know of, in fact.

> If it's a glibc issue, I would expect it to be more deterministic.
>
> It could be that the kernel compiled by GCC 8 has issues, though. That
> would only appear on Fedora 28, not Fedora 27, which still has GCC 7.

Changed/improved GCC code generation sounds plausible as something that could result in this kind of issue.
I installed and booted 4.16.7-200.fc27.armv7hl+lpae on a Fedora 28 instance and a kernel build completes fine. That points more at gcc8/the toolchain building the kernel.
We had one armv7 builder vm 'pause' last night. It looks like it was building glibc:
https://koji.fedoraproject.org/koji/taskinfo?taskID=26975085

The host had:

[Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented

and in /var/log/libvirt/qemu/*.log:

error: kvm run failed Function not implemented
R00=b6fa1370 R01=b6da11b8 R02=b6da1000 R03=b6ee7f38
R04=b6fa1370 R05=b6fa4000 R06=6ffffeff R07=b6da11b8
R08=beec673c R09=b6eeb560 R10=b6fa14c8 R11=beec66bc
R12=6fffffff R13=beec65a8 R14=b6f7963c R15=b6f7e5e8
PSR=a0030010 N-C- A usr32

So, I think glibc builds (and possibly others) are causing the 'pause' and "Function not implemented", where kernel builds just cause a bus error.
This msg is possibly useful:

> [Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented

as it points directly to the kernel code where the fault is triggered:

virt/kvm/arm/mmio.c:	if (kvm_vcpu_dabt_isvalid(vcpu)) {
virt/kvm/arm/mmio.c-		ret = decode_hsr(vcpu, &is_write, &len);
virt/kvm/arm/mmio.c-		if (ret)
virt/kvm/arm/mmio.c-			return ret;
virt/kvm/arm/mmio.c-	} else {
virt/kvm/arm/mmio.c-		kvm_err("load/store instruction decoding not implemented\n");
virt/kvm/arm/mmio.c-		return -ENOSYS;
virt/kvm/arm/mmio.c-	}

and on an aarch64 host, the method being called is:

arch/arm64/include/asm/kvm_emulate.h:static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
arch/arm64/include/asm/kvm_emulate.h-{
arch/arm64/include/asm/kvm_emulate.h-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
arch/arm64/include/asm/kvm_emulate.h-}

So whatever the armv7 guest is doing is triggering that check to be false. It's beyond my knowledge to explain what that check means, though...
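(To make the check's meaning concrete: when the ISV bit is set, the rest of the syndrome tells KVM everything it needs to emulate the access. A minimal sketch of that decode - not the kernel's actual decoder; the field positions follow the architectural ISS encoding for data aborts:)

#include <stdbool.h>
#include <stdint.h>

/* ISS layout for a data abort when ESR_ELx.ISV == 1 */
#define ESR_ISV      (1u << 24)           /* instruction syndrome valid   */
#define ESR_SAS(e)   (((e) >> 22) & 0x3u) /* access size: 0=byte..3=dword */
#define ESR_SRT(e)   (((e) >> 16) & 0x1fu)/* transfer register number     */
#define ESR_WNR(e)   (((e) >> 6) & 0x1u)  /* 1 = write, 0 = read          */

/* Returns true and fills in the decoded access only if the syndrome is
 * valid; the false path is exactly what trips the kvm_err() above. */
static bool decode_mmio_syndrome(uint32_t esr, unsigned *len,
                                 unsigned *reg, bool *is_write)
{
        if (!(esr & ESR_ISV))
                return false;
        *len = 1u << ESR_SAS(esr);
        *reg = ESR_SRT(esr);
        *is_write = ESR_WNR(esr);
        return true;
}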
I have moved all but 2 of the fedora armv7 builders back to f27 so we don't interfere with ongoing builds. I have the 2 builders installed (but disabled for new builds) with f28 and ready for any additional testing anyone would like me to do.
I seem to remember that there are ARM instructions that modify multiple operands in one go, and when those trap (e.g. because they access an MMIO location), they cannot be virtualized easily (because the information needed to emulate them isn't readily available from the trap syndrome). This has happened in edk2 before, and the solution was to tweak the guest code so that the compiler wouldn't generate such "practically unvirtualizable" instructions:

https://github.com/tianocore/edk2/commit/2efbf710e27a

For ARMv7, the R15 register is the program counter (PC). When the guest is stopped due to the emulation failure, would it be possible to get a disassembly from the neighborhood of R15, over QMP or HMP? That might help with (a) confirming the type of the offending instruction, and (b) identifying the guest source code for which gcc-8 generates the offending assembly.

In fact, if the guest doesn't crash totally and just a fault is injected (into the guest kernel, or better yet, the guest userspace process), then the instruction and its location could be easily disassembled with gdb -- I think that's what Kevin nearly did in comment 4, from the core dump?
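(For instance - a sketch, untested here, assuming the paused guest from comment 0 and using the R15 value from its register dump - HMP's 'x' command can dump guest-virtual memory as instructions:)

# On the host, against the paused domain; 0xb6fc2758 is R15 (PC)
# from the comment 0 register dump, so start a little before it
virsh qemu-monitor-command --hmp buildvm-armv7-14.arm.fedoraproject.org \
    'x /16i 0xb6fc2740'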
(In reply to Kevin Fenzi from comment #13)
> I have the 2 builders installed (but disabled for new builds) with f28 and
> ready for any additional testing anyone would like me to do.

Yes, please. I've been attempting to reproduce, but without luck. If one of the f28 builders could reproduce, and then be left in the crashed state, we could attempt to extract a dump. I'll keep trying to reproduce as well, though.
ok. I have a builder in the 'pause' / crashed state. ;)

Can you let me know exactly what you would like me to do? Or if you prefer, we could try and get you access to the vm, but that might take a bit of back and forth to set things up.

 24    buildvm-armv7-24.arm.fedoraproject.org   paused

[Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not implemented

error: kvm run failed Function not implemented
R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
PSR=a0000010 N-C- A usr32
Hi Kevin,

I'd like to get the core and the kernel version from you. If you used libvirt to launch the guest, then we should be able to get the core easily with something like the following run on the host:

$ virsh dump --memory-only --format elf $GUEST_NAME $GUEST_NAME.core

If the guest was launched directly with QEMU, then you'll need to connect to the QEMU monitor and run:

(qemu) dump-guest-memory $GUEST_NAME.core

If you can get the core and put it somewhere for me to download, then, along with telling me the guest kernel version running at the time, I'll attempt to analyze it.

Note, I'm not using any compression options in the above commands on purpose: using anything other than elf doesn't seem to work for AArch32 guests. You may compress the core after extracting it from the guest in order to prepare it for download, though.

Thanks,
drew
Got the dump, sending you email on where to get it (as it may have sensitive info in it).
Apparently the same gcc-8 issue has caught up to 32-bit ARM firmware that runs on virtual machines. The proposed solution is to pass "-fno-lto" to gcc, in order to disable link-time optimization:

[edk2] [PATCH] ArmVirtPkg/ArmVirtQemu ARM: work around KVM limitations in LTO build
https://lists.01.org/pipermail/edk2-devel/2018-June/025476.html

Ard's description of the issue, in the patch linked above, seems to confirm my suspicion in comment 14.
I wonder if it is worth talking to the GCC maintainers to see if there's a viable way for them to avoid emitting these troublesome instructions, even when they do LTO; otherwise it feels like we'll be playing whack-a-mole when they decide some other thing can be optimized in the same way.
Sent the following message to the upstream GCC mailing list:

"code-gen options for disabling multi-operand AArch64 and ARM instructions"
https://gcc.gnu.org/ml/gcc/2018-06/msg00036.html
(In reply to Kevin Fenzi from comment #16)
> [Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not
> implemented
>
> error: kvm run failed Function not implemented
> R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
> R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
> R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
> R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
> PSR=a0000010 N-C- A usr32

This example is rather puzzling, and I wonder whether it is a side effect of some other issue rather than being caused by MMIO being performed using instructions that KVM cannot emulate.

Note that the exception was taken in USR32 mode. This is weird, considering that you wouldn't expect userland to perform MMIO directly, unless it remapped some MMIO region using /dev/mem explicitly.

So given that this does not appear to be kernel code making the access, and considering that rebuilding the (32-bit ARM) world using a new compiler flag is intractable, I would prefer to spend more time gaining a better understanding of the root cause.
Fair enough (and thank you for the analysis!); I think we all missed USR32 until you highlighted it. Drew has a vmcore from Kevin; I hope the vmcore will provide more evidence. Thank you!
Something like:

--- a/virt/kvm/arm/mmio.c
+++ b/virt/kvm/arm/mmio.c
@@ -172,7 +172,8 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		if (ret)
 			return ret;
 	} else {
-		kvm_err("load/store instruction decoding not implemented\n");
+		kvm_err("load/store instruction decoding not implemented (IPA:%pa ESR:0x%08x)\n",
+			&fault_ipa, kvm_vcpu_get_hsr(vcpu));
 		return -ENOSYS;
 	}

would already be rather helpful in diagnosing the situation.

Note that a stage 2 data abort (which is the exception that triggers this error) could be raised for many different conditions, including alignment faults and TLB conflicts in the stage 2 translation tables.

It should also be noted that the ESR_EL2.ISV bit, which is read by kvm_vcpu_dabt_isvalid(), is documented by the ARM ARM as:

"""
This bit is 0 for all faults reported in ESR_EL2 except the following stage 2 aborts:
• AArch64 loads and stores of a single general-purpose register (including the register specified with 0b11111), including those with Acquire/Release semantics, but excluding Load Exclusive or Store Exclusive and excluding those with writeback.
• AArch32 instructions where the instruction:
  — Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT, LDRSB, LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH, STLH, STRHT, STRB, STLB, or STRBT instruction.
  — Is not performing register writeback.
  — Is not using R15 as a source or destination register.
For these cases, ISV is UNKNOWN if the exception was generated in Debug state in memory access mode, and otherwise indicates whether ISS[23:14] hold a valid syndrome.
"""

In other words, the architecture does not require the bit to be set for instructions that *could* be emulated, but rather limits the set of instructions for which you can expect it to ever assume the value '1' in the first place. IOW, it is perfectly acceptable for a core not to set the ISV bit for conditions like TLB conflict aborts etc. at stage 2.

This seems especially relevant due to the error message reported in comment 4:

[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94

where the lockdown abort is triggered by an instruction fetch. I cannot tell you what the condition is that triggers it (it is implementation defined), but it is likely that if the same condition occurred during, e.g., a stage 2 page table walk, we would take a stage 2 data abort exception without a valid syndrome, regardless of whether the instruction was in the 'virtualizable' class or not.
(In reply to ard.biesheuvel from comment #24)
...
> This seems especially relevant due to the error message reported in comment 4
>
> [Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault
> (lockdown abort) (0x234) at 0xb6df4d94
>
> where the lockdown abort is triggered by an instruction fetch.

As pointed out by Christoffer Dall in an unrelated email thread, this code path should never be taken for IMPDEF lockdown abort exceptions, so we really need to instrument the code to get a better handle on this.
Here's a small update:

We haven't put a lot of effort into this debug yet, as it's been quite difficult to reproduce the issue. I ran an environment identical to what was described in comment 0 over a weekend, but nothing happened.

We did get a guest core once from one of the Fedora build machines, which I took a look at. There wasn't really anything interesting in there, but I did note that only one vcpu was running a task at the time, and it was running the assembler 'as'. So, as we expected from the kernel splat, it was running in user space at the time.

We've also provided the reporter with a modified host kernel containing Ard's suggested change from comment 24, but the issue hasn't reproduced (or at least not with the same symptom) since. Since attempting to reproduce with the modified host kernel, the reporter has seen a guest kernel log:

[Thu Jun 28 01:06:51 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6db4bdc

but there was no crash and no host kernel logs.

I'll keep trying to find a reproducer.
Adding bugzillas that record bugs/issues found on certified AArch64 systems and/or significant bzs (panics, data corruptors, etc.) found generically on AArch64 systems. The intent is to focus on bzs on this tracker going forward in RHEL-alt.
So, not sure if this still happens on f28, but it does not seem to happen on f29. I did a scratch glibc build in our staging env, which has f29 builders. The armv7 build failed:

https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90003231

but the guest and host were fine...
Based on comment 29, I'll close this. It, or a new BZ, can always be [re]opened if the issue returns. Thanks
I spoke too soon. This is indeed still happening with RHEL 7.6 alt and Fedora 29 guest vms. It's now happening with kernel, glibc, and qt5 builds. We really need to get to the bottom of this. :(

The kernel and glibc ones don't cause the guest to pause, but do fail with a bus error and

Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0x00acfa00

in dmesg, plus a core dump. The qt5 one does cause the guest to pause, with

[Sat Dec 22 19:16:24 2018] kvm [28942]: load/store instruction decoding not implemented

in dmesg.

Happy to gather more info.
It still looks to me like some kind of incorrectly classified RAS error is delivered to the guest. Whether it results in the prefetch abort or the data abort (which triggers the KVM error) depends on whether the original exception occurs on a data access or on an instruction fetch. Is this reproducible across multiple M400s? Does it occur anywhere else?
Tried this on a Mustang with the RHEL 8 beta and an F29 vm. A kernel build failed with a bus error and

Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6de17c8

as reported in c#31. When using the non-lpae kernel on the builder, the build finished successfully.
So, another datapoint (that hopefully Paul can duplicate):

A Fedora 29 vm with NO updates (i.e. GA F29; do not use or apply the updates repo) does seem to work ok. At least I got a glibc build through with no problems. Perhaps this is why I didn't see it with F29 initially.

I then updated only the kernel in the vm, and the issue returned. So it sounds like the kernel of the Fedora guest has something to do with this?

4.18.16-300.fc29.armv7hl+lpae -> no issue
4.20.7-200.fc29 -> this issue

Paul, can you confirm? Does this help any in isolating this?
Confirmed - using F29 GA with no updates, with mock and deps from the GA repo, I was able to do a number of kernel builds without an error.

Trying with kernel-5.0.1-300.fc29 now.
(In reply to Paul Whalen from comment #35)
> Confirmed - using F29 GA with no updates, with mock and deps from the GA
> repo, I was able to do a number of kernel builds without an error.
>
> Trying with kernel-5.0.1-300.fc29 now.

You may want to try 5.1rc1 too, as reading through the change logs there were quite a few changes/fixes for arm32 virt and related memory/barriers and all sorts of related bits.
5.0.1-300.fc29.armv7hl+lpae failed with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort)". Trying with kernel-5.1.0-0.rc1.git0.1.fc31 now.
5.1.0-0.rc1.git0.1.fc31.armv7hl+lpae also fails in the same way with a kernel build.
5.1.0-0.rc5, 5.1.0-0.rc6 tested, no improvement.
Can you describe exactly how to set up the environment to reproduce this?
(In reply to Jon Masters from comment #40)
> Can you describe exactly how to set up the environment to reproduce this?

RHEL or Fedora host (m400 or Mustang) with a Fedora ARMv7 guest running the lpae kernel. Then attempt to build a kernel, glibc, or qt5. I have a host set up with various vm's if that's easier.
I allocated an m400 last week to test this, but ugh, got pulled off it. As a secondary point though, a virt-manager-started VM on a softiron3k seems to be able to build the fedora kernel package repeatedly without problems (until I ran out of disk space, that is).
(In reply to Kevin Fenzi from comment #34)
> I then updated only the kernel in the vm, and the issue returned.

Might be worth testing an F-30 userspace with the 4.18.16-300.fc29.armv7hl+lpae kernel to see if that theory holds.
(In reply to Jeremy Linton from comment #42)
> As a secondary point though, a virt-manager-started VM on a softiron3k
> seems to be able to build the fedora kernel package repeatedly without
> problems.

Was the guest using the LPAE kernel? I understand that this isn't reproducible with the non-LPAE kernel, which implies something in LPAE guest memory management could be triggering the bug. Also, as Ard points out in a couple of comments above, we're getting stage 2 aborts on both instruction and data accesses, which implies it could be an issue triggered by guest paging in general.

I'm suspicious that this issue may be due to the fact that the host kernel is using 64k pages and the guest is using 4k. If we can find a system and guest where this reliably reproduces, then I would suggest changing the host kernel to one that uses 4k pages to see if it still reproduces.
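(A quick way to check that on a given host/guest pair - a sketch; the config symbol applies only to the aarch64 host kernel, and assumes the config file is installed:)

# Run on both the host and in the guest; 65536 vs 4096 would
# confirm the 64k/4k split described above
getconf PAGESIZE

# On the host, check the running kernel's page-size choice
grep ARM64_64K_PAGES /boot/config-$(uname -r)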
> Was the guest using the LPAE kernel?

Yes, the guest is an LPAE kernel.

> I'm suspicious that this issue may be due to the fact that the host kernel
> is using 64k pages and the guest is using 4k.

Fedora on aarch64 is a 4K kernel; comment #41 states it's reproducible on a RHEL or Fedora host, so that may well rule out a 64K vs 4K page issue.
What doesn't make any sense is how we'd end up with cache blocks in lockdown to begin with. I'm not sure that was even supported on Potenza (X-Gene1). I'm chatting with the design team now.
So when Paul sees a build fail in the guest, we get a lockdown implementation-fault prefetch abort reported to the guest kernel, with an FSR indicating that this was due to a cache maintenance operation but also, contradictorily, that it explicitly wasn't due to a cache operation. So yeah, sure.

LPAE changes the descriptors used for fault reporting - the short non-LPAE form can also convey lockdown faults, but you never see any with a non-LPAE kernel. Not that that means much, other than that I think the lockdown "fault" is in fact bogus. BUT we can work with it, since it's reproducible.

The next step is to start instrumenting the guest kernel to look at the faulting address, instruction page, etc. I'll start poking. Meanwhile, the design team are looking into what could cause a lockdown fault to be reported into a guest.
Confirmed that Potenza DID NOT support lockdown and there's no real reason for this fault to be reported. I personally suspect a silicon bug that is being tickled through a bogus mapping or somesuch. Time to gather more data.
This can quite reliably be reproduced on F30 by simply doing an olddefconfig and clean sequence. Mostly, you'll get a bus error, but sometimes the host will get the load/store unimplemented warning. I've seen reference to writing to /dev/null in perhaps all of the cases so far. Just trying to narrow it down to a single reproducer if possible before I start down the path of filesystem debug or poking at /dev/null.
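(In loop form, that reproducer is roughly the following - a sketch based on my reading of the comment above, since the exact sequence isn't spelled out; it assumes a Fedora kernel source tree inside the armv7 LPAE guest:)

# Inside the guest, in a kernel source tree; iterate until the guest
# hits a bus error or the host logs the load/store warning
while make olddefconfig && make clean; do
    :
done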
Software-managed Access Flag handling seems busted when using LPAE guests on AArch64 hosts. Every time we get the pause mentioned above, it's because KVM's mmio code doesn't know what it's doing, so it assumes it's emulating an MMIO access (and then it spews the warning). But the actual instructions executing are straightforward loads. What they do have in common is that they're trying to happen against pages that have the access flag cleared. Something is going wrong there. I'm going to poke. Will update this BZ with various logs and dumps.
(Please excuse me if the following is dumb/naive.)

(In reply to Jon Masters from comment #55)
> the actual instructions executing are straightforward loads. What they
> do have in common is that they're trying to happen against pages that
> have the access flag cleared.

Does that mean "first access to a recently paged-in page"? Can we perturb that by (a) disabling swap in the guest, *and* (b) locking all guest memory into host RAM?

(The libvirt-generated cmdline from comment 0 includes "-realtime mlock=off", hence question (b). The related libvirt domain XML elements are documented at <https://libvirt.org/formatdomain.html#elementsMemoryBacking>. I believe (a)+(b) wouldn't completely eliminate "first access to a recently paged-in page" in the guest, since that should happen anyway as a part of read()s from regular files, and as a part of explicit mmap()s of regular files by applications. Hence the word "perturb". :))

Thanks.
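(For reference, memory locking per the page cited above would look something like this in the domain XML - a sketch, untested here:)

<domain type='kvm'>
  ...
  <memoryBacking>
    <!-- ask libvirt/QEMU to mlock all guest RAM into host memory -->
    <locked/>
  </memoryBacking>
  ...
</domain>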
gfn = fault_ipa >> PAGE_SHIFT;

Except no. Not when the page size differs between host and guest. Working on a patch.
I'm not quite right about exactly how to fix this, but I'm pretty sure I'm on the right track that it's a page size interaction.
(In reply to Jon Masters from comment #58)
> I'm not quite right about exactly how to fix this, but I'm pretty sure I'm
> on the right track that it's a page size interaction.

I wonder how aarch64 handles this, as we have the same config there: the underlying hypervisor is RHEL and the Fedora guests for aarch64 use 4K pages.
Ok, slightly more nuanced. It looks like the value being reported in HPFAR_EL2 is truncated such that it's missing bits higher than 31, which means that 0x43deb9000 becomes 0x3deb9000. I originally assumed Linux was doing a wrong shift, but I think it's the hardware. I'm tired out of my mind at this point, but I think a manual AT translation will be needed here to work around this if I'm right. Definitely need to look at this with fresh eyes tomorrow.
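(To spell out the arithmetic - a sketch; the shift matches how KVM recovers the fault IPA from HPFAR_EL2, and the example value is the one above:)

#include <stdint.h>
#include <stdio.h>

/* HPFAR_EL2.FIPA holds IPA[47:12] in register bits [39:4]; KVM
 * recovers the faulting IPA as (hpfar & ~0xf) << 8. */
static uint64_t fault_ipa_from_hpfar(uint64_t hpfar)
{
        return (hpfar & ~0xfULL) << 8;
}

int main(void)
{
        uint64_t good_ipa = 0x43deb9000ULL;
        /* A correct HPFAR_EL2 for this IPA encodes IPA >> 8: */
        uint64_t good_hpfar = good_ipa >> 8;                /* 0x43deb90 */
        /* The suspected erratum drops IPA bits above 31, as if the
         * faulting IPA had been 0x3deb9000: */
        uint64_t bad_hpfar = (good_ipa & 0xffffffffULL) >> 8;

        printf("expected IPA: 0x%llx\n",
               (unsigned long long)fault_ipa_from_hpfar(good_hpfar));
        printf("reported IPA: 0x%llx\n",
               (unsigned long long)fault_ipa_from_hpfar(bad_hpfar));
        return 0;
}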
I've requested information from the manufacturer about any errata. It seems as if this /may/ be what is happening:

1. A guest running in AArch32 execution state at EL1 performs a data access or instruction fetch that requires a translation table walk.
2. The guest physical memory translation table is located at a guest IPA in LPAE space beyond 32 bits.
3. One of the pages containing the PTE maps to a host stage 2 translation that is not present.
4. A trap is taken to EL2 with a truncated guest IPA contained within HPFAR_EL2.FIPA[47:12], equivalent to truncating HPFAR_EL2.FIPA into HPFAR_EL2.FIPA[31:12] (the actual HPFAR_EL2 register itself is not simply truncated).
5. The host takes the fault and walks through the memslots that it has for guest physical memory. Seeing the truncated value, it does not realize that the faulting IPA is contained within the guest's physical memory.
6. It mishandles this as an IO memory abort (fallthrough) and generates a spurious warning about load/store instruction emulation.

One possible workaround is to pin guest memory in the qemu process.
More confirmation...

[ 1373.454742] JCM: [piabt] gfn: 0x3e4c
[ 1475.404950] JCM: [piabt] gfn: 0x3dea
[ 1492.926644] JCM: [piabt] gfn: 0x3da4
[ 1493.157918] JCM: [piabt] gfn: 0x3e4b
[ 1652.172442] JCM: fault_ipa: 0x 3dc45838
[ 1652.225778] kvm [6509]: load/store instruction decoding not implemented
[ 1652.305180] JCM: io_mem_abort_failed
[ 1652.348060] JCM: faulting gfn: 0x3dc4
[ 1652.391982] JCM: faulting far: 0xb6fbe838
[ 1652.440076] JCM: occurred while performing S1 PTW
[ 1652.496516] JCM: PGD: 0x40207000
[ 487.128991] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 487.207365] JCM: Guest faulting far: 0xb6dfe1cc (gfn: 0x3f13)
[ 487.276335] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b0fd540
[ 487.343214] JCM: Guest PGD address: 0x5b0fd550
[ 487.396523] JCM: Guest PGD: 0x5b1c1003
[ 487.441484] JCM: Guest PMD address: 0x5b1c1db0
[ 487.494795] JCM: Guest PMD: 0x43f130003
[ 487.540804] JCM: Guest PTE address: 0x43f130ff0
[ 487.595153] JCM: Guest PTE: 0xe0000430bccf5f
[ 487.646378] JCM: Manually translated as: 0xb6dfe1cc->0x430bcc000
[ 487.718467] JCM: fault_ipa: 0x 3f1301cc
[ 487.771775] kvm [6716]: load/store instruction decoding not implemented
[ 487.851163] JCM: io_mem_abort_failed
[ 487.8
[ 487.937955] JCM: faulting far: 0xb6dfe1cc
[ 487.986049] JCM: occurred while performing S1 PTW
[ 488.042485] JCM: PGD: 0xba00005b0fd540

Here you can see that we take a data abort from the guest as we try to perform a stage 1 walk, with an FIPA of 0x3f1301cc, which should probably be 0x43f1301cc. The PTE it's trying to read is very close in address to the manually translated one above for the faulting address currently in the FAR.
It's aliiiiiiiive muwhahahahaha:

[ 143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 143.748447] JCM: Hyper faulting far: 0x3deb0000
[ 143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
[ 143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
[ 143.938649] JCM: Guest PGD address: 0x5b06cc50
[ 143.991962] JCM: Guest PGD: 0x5b150003
[ 144.036925] JCM: Guest PMD address: 0x5b150db0
[ 144.090238] JCM: Guest PMD: 0x43deb0003
[ 144.136241] JCM: Guest PTE address: 0x43deb0e70
[ 144.190604] JCM: Guest PTE: 0x42000043bb72fdf
[ 144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
[ 144.314972] JCM: Faulting IPA page: 0x3deb0000
[ 144.368286] JCM: Faulting PTE page: 0x43deb0000
[ 144.422641] JCM: Fault occurred while performing S1 PTW -fixing
[ 144.493684] JCM: corrected fault_ipa: 0x43deb0000
[ 144.550133] JCM: Corrected gfn: 0x43deb
[ 144.596145] JCM: handle user_mem_abort
[ 144.641155] JCM: ret: 0x1

[ 173.268497] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 173.346903] JCM: Hyper faulting far: 0x3dea8000
[ 173.401265] JCM: Guest faulting far: 0xb6dcf3c4 (gfn: 0x3dea)
[ 173.470236] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b0c1e80
[ 173.537111] JCM: Guest PGD address: 0x5b0c1e90
[ 173.590425] JCM: Guest PGD: 0x5a891003
[ 173.635392] JCM: Guest PMD address: 0x5a891db0
[ 173.688704] JCM: Guest PMD: 0x43dea8003
[ 173.734709] JCM: Guest PTE address: 0x43dea8e78
[ 173.789060] JCM: Guest PTE: 0x42000043bb72fdf
[ 173.841326] JCM: Manually translated as: 0xb6dcf3c4->0x43bb72000
[ 173.913418] JCM: Faulting IPA page: 0x3dea8000
[ 173.966731] JCM: Faulting PTE page: 0x43dea8000
[ 174.021088] JCM: Fault occurred while performing S1 PTW -fixing
[ 174.092138] JCM: corrected fault_ipa: 0x43dea8000
[ 174.148579] JCM: Corrected gfn: 0x43dea
[ 174.194592] JCM: handle user_mem_abort
[ 174.239601] JCM: ret: 0x1
Created attachment 1587164 [details]
0001-virt-arm-correct-hardware-errata-reading-HPFAR_EL2-o.patch

Ugly patch, for debug purposes only, proving that there exists an HPFAR_EL2 erratum on the hardware and correcting it for test purposes ONLY.
Paul ran a test mock kernel build for me (which is still ongoing). The test patch caught and fixed up two faults so far that would normally have crashed the builder:

[ 2290.867797] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 2290.946180] JCM: Hyper faulting far: 0x3d876000
[ 2291.000539] JCM: Guest faulting far: 0xb6dde3c4 (gfn: 0x3d87)
[ 2291.069508] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x525b37c0
[ 2291.136386] JCM: Guest PGD address: 0x525b37d0
[ 2291.189704] JCM: Guest PGD: 0x5aba9003
[ 2291.234674] JCM: Guest PMD address: 0x5aba9db0
[ 2291.287997] JCM: Guest PMD: 0x43d876003
[ 2291.334005] JCM: Guest PTE address: 0x43d876ef0
[ 2291.388367] JCM: Guest PTE: 0x4200004050e3fdf
[ 2291.440636] JCM: Manually translated as: 0xb6dde3c4->0x4050e3000
[ 2291.512728] JCM: Faulting IPA page: 0x3d876000
[ 2291.566038] JCM: Faulting PTE page: 0x43d876000
[ 2291.620398] JCM: Fault occurred while performing S1 PTW -fixing
[ 2291.691445] JCM: corrected fault_ipa: 0x43d876000
[ 2291.747897] JCM: Corrected gfn: 0x43d87
[ 2291.793907] JCM: handle user_mem_abort
[ 2291.838898] JCM: ret: 0x1

[ 2297.938930] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 2298.017310] JCM: Hyper faulting far: 0x3ce00000
[ 2298.071671] JCM: Guest faulting far: 0xb6de13c4 (gfn: 0x3ce0)
[ 2298.140639] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x525e9ec0
[ 2298.207512] JCM: Guest PGD address: 0x525e9ed0
[ 2298.260823] JCM: Guest PGD: 0x5aba9003
[ 2298.305798] JCM: Guest PMD address: 0x5aba9db0
[ 2298.359117] JCM: Guest PMD: 0x43ce00003
[ 2298.405130] JCM: Guest PTE address: 0x43ce00f08
[ 2298.459497] JCM: Guest PTE: 0x4200004050e3fdf
[ 2298.511763] JCM: Manually translated as: 0xb6de13c4->0x4050e3000
[ 2298.583861] JCM: Faulting IPA page: 0x3ce00000
[ 2298.637176] JCM: Faulting PTE page: 0x43ce00000
[ 2298.691535] JCM: Fault occurred while performing S1 PTW -fixing
[ 2298.762583] JCM: corrected fault_ipa: 0x43ce00000
[ 2298.819033] JCM: Corrected gfn: 0x43ce0
[ 2298.865041] JCM: handle user_mem_abort
[ 2298.910027] JCM: ret: 0x1
NOTE on testing the above: guests withstood load testing for a long time (much longer than usual). Eventually, the guests paused again under load. This time, the faulting address was outside the range covered by my patch (in guest kernel memory this time). The guest was getting an interrupt, but the pages for the vgic at stage 2 (host) caused a fault (e.g. an access bit update in the host). This could be an unrelated bug, in that the HPFAR was correct this time but we didn't fix up the access fault correctly in the host. I'll look at that, possibly separately.
I set up a test rig at Ampere and they were able to reproduce this. It's being investigated. Meanwhile, I've got a guest test kernel building with HIGHPTE disabled that I'm hoping will, very temporarily, work around the problem. We'll see whether, practically speaking, we run out of low memory to allocate the PTEs from. If it works well enough, it could be an option for the Fedora builders temporarily.
Thanks Jon (and everyone!) for tracking this down and working on a final fix...
Created attachment 1592060 [details]
force CONFIG_HIGHPTE off for Arm LPAE kernels

This is the patch that I'm currently using. It hasn't rolled over yet.
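(The attachment itself isn't inlined here; a sketch of what such a change might look like against the usual arch/arm/Kconfig HIGHPTE option - the actual patch may differ:)

--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -... +... @@
 config HIGHPTE
 	bool "Allocate 2nd-level pagetables from highmem"
-	depends on HIGHMEM
+	depends on HIGHMEM && !ARM_LPAE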
Applied to rawhide (should have a build on Mon with 5.3-rc1) and the stabilization branch (5.2), which we can test next week as it's 5.2 test week.
Tested here with 5.2.2-200.fc30.armv7hl+lpae. Uptime almost 2 days so far... multiple kernel and other builds, 0 problems. Looking good here!
This looks to be fixed with 5.2.2-200.fc30.armv7hl+lpae; the builder has been up 21 days without issue.