Bug 1576593 - Fedora 29 armv7 guests "pause" from time to time on rhel-alt 7.6 kernel-4.14.0-115.2.2.el7a.aarch64 kernel
Summary: Fedora 29 armv7 guests "pause" from time to time on rhel-alt 7.6 kernel-4.14....
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-alt
Version: 7.6-Alt
Hardware: aarch64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.6-Alt
Assignee: Jon Masters
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Keywords: Reopened
Depends On:
Blocks: ARMTracker 1638451 1645315
 
Reported: 2018-05-09 21:18 UTC by Kevin Fenzi
Modified: 2019-07-19 17:57 UTC (History)
CC List: 31 users

Clone Of:
Last Closed: 2018-12-10 17:53:16 UTC


Attachments
0001-virt-arm-correct-hardware-errata-reading-HPFAR_EL2-o.patch (5.51 KB, patch)
2019-07-03 19:00 UTC, Jon Masters
force CONFIG_HIGHPTE off for Arm LPAE kernels (923 bytes, patch)
2019-07-19 17:12 UTC, Jon Masters

Description Kevin Fenzi 2018-05-09 21:18:56 UTC
The hardware is an HP Moonshot with m400 aarch64 cartridges.

The cart is running RHEL 7.5 alt. 

The vm is Fedora 28 armv7 with all updates acting as a Fedora koji builder.

From time to time (could be caused by builds?) the vm will move to state "paused" and become unresponsive. 

Trying resume on it gives: 
# virsh resume buildvm-armv7-14.arm.fedoraproject.org
error: Failed to resume domain buildvm-armv7-14.arm.fedoraproject.org
error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

A reset followed by a resume brings it back, but the guest has rebooted:
# virsh reset  buildvm-armv7-14.arm.fedoraproject.org                                      
Domain buildvm-armv7-14.arm.fedoraproject.org was reset

# virsh resume buildvm-armv7-14.arm.fedoraproject.org
Domain buildvm-armv7-14.arm.fedoraproject.org resumed

ssh buildvm-armv7-14.arm.fedoraproject.org

# w
 20:37:33 up 0 min,  1 user,  load average: 0.73, 0.16, 0.05

On the host in /var/log/libvirt/qemu/buildvm-armv7-14.arm.fedoraproject.org.log is: 

2018-05-09 06:26:51.401+0000: starting up libvirt version: 3.9.0, package: 14.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-07-13:59:11, arm64-017.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-ma-2.10.0-21.el7), hostname: aarch64-c14n1.arm.fedoraproject.org
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=buildvm-armv7-14.arm.fedoraproject.org,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/master-key.aes -machine virt-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,aarch64=off -m 24576 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e2be82ec-efa8-445a-ba2b-d4b2a4f832b3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -kernel /var/lib/libvirt/images/vmlinuz-4.16.6-302.fc28.armv7hl+lpae -initrd /var/lib/libvirt/images/initramfs-4.16.6-302.fc28.armv7hl+lpae.img -append '     root=UUID=3190f023-985e-4e7f-88b5-2f3c576f7987 ro net.ifnames=0 console=ttyAMA0 LANG=en_US.UTF-8' -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 -device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 -device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 -device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 -device qemu-xhci,p2=8,p3=8,id=usb,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive file=/dev/vg_Server/buildvm-armv7-14.arm.fedoraproject.org,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:1e:58,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -serial chardev:charserial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-5-buildvm-armv7-14.arm/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -msg timestamp=on
2018-05-09 06:26:51.402+0000: Domain id=5 is tainted: host-cpu
2018-05-09T06:26:51.483671Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
error: kvm run failed Function not implemented
R00=b6fe6350 R01=b6fe6370 R02=00000006 R03=b6abc138
R04=b6fe6350 R05=b6fe8000 R06=6ffffeff R07=b6abc138
R08=b6c3744c R09=bee2f714 R10=b6fe64a8 R11=bee2f694
R12=b6abc000 R13=bee2f580 R14=b6fbd2f4 R15=b6fc2758
PSR=a0030010 N-C- A usr32

Will try and detect some pattern to the pauses.

Comment 2 Florian Weimer 2018-05-15 11:43:38 UTC
Kevin, can we make this bug public?

Comment 4 Kevin Fenzi 2018-05-15 15:50:53 UTC
(In reply to Florian Weimer from comment #2)
> Kevin, can we make this bug public?

As far as I am concerned, absolutely!

Some more data: 

Fedora kernel builds have been failing with a "bus error" on armv7, which seems like it could well be related to this. 

In those guests on kernel builds we get: 
[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94
in dmesg, and there is a core dump:


Message: Process 26075 (sh) of user 1000 dumped core.

Stack trace of thread 26075:
#0  0x00000000b6df4d94 n/a (/usr/lib/libc-2.27.9000.so)

A Fedora 27 armv7 builder works as expected, so that points at perhaps something in the Fedora 28 armv7 install that causes this? I suppose this could be a separate issue too, but they seem related.

Comment 5 Kevin Fenzi 2018-05-15 18:37:18 UTC
More info: 

f27-builder (where kernel builds complete fine): 
4.16.7-200.fc27.armv7hl+lpae

f28-builder (where kernel builds bus error): 
4.16.7-300.fc28.armv7hl+lpae

So, I don't think this is caused by the kernel. Could it be a glibc in f28 issue?

Comment 6 Florian Weimer 2018-05-15 18:47:53 UTC
(In reply to Kevin Fenzi from comment #5)
> More info: 
> 
> f27-builder (where kernel builds complete fine): 
> 4.16.7-200.fc27.armv7hl+lpae
> 
> f28-builder (where kernel builds bus error): 
> 4.16.7-300.fc28.armv7hl+lpae

Hmm.  Are you suggesting that this is not solely a hypervisor issue?

> So, I don't think this is caused by the kernel. Could it be a glibc in f28
> issue?

If it's a glibc issue, I would expect it to be more deterministic.

It could be that the kernel compiled by GCC 8 has issues, though.  That would only appear on Fedora 28, not Fedora 27, which still has GCC 7.

Comment 7 Fabiano Fidêncio 2018-05-15 19:41:14 UTC
I'm not totally convinced that this issue is the root cause of some failures that I've been facing when building SSSD[0], but it may be. Maybe it could be something related to glibc, as suggested by Kevin (or even a totally different issue happening on armv7hl).

I've done some tests using mock on armv7hl hardware and the SSSD build passes without any issue. However, it's constantly failing under Fedora infra.

[0]: https://koji.fedoraproject.org/koji/taskinfo?taskID=26957996

Florian,

If you need some info that may help you to investigate a possible failure on glibc, please, let me know how I could help.

Comment 8 Florian Weimer 2018-05-15 19:47:38 UTC
(In reply to Fabiano Fidêncio from comment #7)
> If you need some info that may help you to investigate a possible failure on
> glibc, please, let me know how I could help.

We need a somewhat reliable reproducer outside Koji, to take the KVM-on-aarch64 configuration out of the picture.

Comment 9 Daniel Berrange 2018-05-15 19:58:37 UTC
(In reply to Florian Weimer from comment #6)
> (In reply to Kevin Fenzi from comment #5)
> > More info: 
> > 
> > f27-builder (where kernel builds complete fine): 
> > 4.16.7-200.fc27.armv7hl+lpae
> > 
> > f28-builder (where kernel builds bus error): 
> > 4.16.7-300.fc28.armv7hl+lpae
> 
> Hmm.  Are you suggesting that this is not solely a hypervisor issue?

There is very likely to be a hypervisor issue here, given the KVM error message shown in QEMU logs  "kvm run failed Function not implemented". Based on this error message though, it is probably an issue that has always existed, but was never previously tickled by the guest OS. Thus some change in F28 toolchain, or kernel, or userspace has triggered new codepaths that expose an existing KVM limitation.  If we can at least understand what changed in the guest OS to trigger it, it might help identify either a way to avoid the bug, or a way to fix the hypervisor.

NB, armv7l-on-aarch64 host is something that gets very little attention from KVM  maintainers either upstream or downstream - Fedora is probably the most significant user of it that we know of in fact.

> > So, I don't think this is caused by the kernel. Could it be a glibc in f28
> > issue?
> 
> If it's a glibc issue, I would expect it to be more deterministic.
> 
> It could be that the kernel compiled by GCC 8 has issues, though.  That
> would only appear on Fedora 28, not Fedora 27, which still has GCC 7.

Changed/improved GCC code generation sounds plausible as something that could result in this kind of issue.

Comment 10 Kevin Fenzi 2018-05-16 04:34:33 UTC
I installed and booted 4.16.7-200.fc27.armv7hl+lpae on a fedora 28 instance and a kernel build completes fine. That points more at gcc8/toolchain building the kernel.

Comment 11 Kevin Fenzi 2018-05-16 19:17:09 UTC
We had one armv7 builder vm 'pause' last night. It looks like it was building glibc: https://koji.fedoraproject.org/koji/taskinfo?taskID=26975085

Host had: 
[Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented
in /var/log/libvirt/qemu/*.log: 
error: kvm run failed Function not implemented
R00=b6fa1370 R01=b6da11b8 R02=b6da1000 R03=b6ee7f38
R04=b6fa1370 R05=b6fa4000 R06=6ffffeff R07=b6da11b8
R08=beec673c R09=b6eeb560 R10=b6fa14c8 R11=beec66bc
R12=6fffffff R13=beec65a8 R14=b6f7963c R15=b6f7e5e8
PSR=a0030010 N-C- A usr32

So, I think glibc builds (and possibly others) are causing the 'pause' and Function not implemented, where kernel builds just cause a bus error.

Comment 12 Daniel Berrange 2018-05-16 19:53:10 UTC
This msg is possibly useful:

> [Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented

as it points directly to the kernel code where the fault is triggered:

  virt/kvm/arm/mmio.c:    if (kvm_vcpu_dabt_isvalid(vcpu)) {
  virt/kvm/arm/mmio.c-            ret = decode_hsr(vcpu, &is_write, &len);
  virt/kvm/arm/mmio.c-            if (ret)
  virt/kvm/arm/mmio.c-                    return ret;
  virt/kvm/arm/mmio.c-    } else {
  virt/kvm/arm/mmio.c-            kvm_err("load/store instruction decoding not implemented\n");
  virt/kvm/arm/mmio.c-            return -ENOSYS;
  virt/kvm/arm/mmio.c-    }

and on an aarch64 host, that method being called is

  arch/arm64/include/asm/kvm_emulate.h:static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
  arch/arm64/include/asm/kvm_emulate.h-{
  arch/arm64/include/asm/kvm_emulate.h-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
  arch/arm64/include/asm/kvm_emulate.h-}

So whatever the armv7 guest is doing is causing that check to return false. It's beyond my knowledge to explain what that check means though...
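For readers following along, the check quoted above can be mirrored outside the kernel. This is a minimal sketch, assuming the architectural ESR_EL2 layout (exception class in bits [31:26], IL in bit 25, ISV in bit 24); the function name and the two sample syndrome values are made up for illustration and were not captured from the affected hosts.

```python
# Field layout of ESR_EL2 (HSR on 32-bit hosts), per the Arm architecture.
ESR_EC_SHIFT = 26        # exception class lives in bits [31:26]
ESR_EC_MASK = 0x3F
ESR_IL = 1 << 25         # instruction length bit
ESR_ISV = 1 << 24        # instruction syndrome valid (ESR_ELx_ISV in the kernel)
ESR_ISS_MASK = 0x1FFFFFF  # instruction specific syndrome, bits [24:0]

EC_NAMES = {
    0x20: "instruction abort from a lower EL",
    0x24: "data abort from a lower EL",
}


def decode_esr(esr):
    """Split a raw syndrome value into the fields KVM looks at (hypothetical helper)."""
    ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": bool(esr & ESR_IL),
        # Same test as kvm_vcpu_dabt_isvalid(): is bit 24 set?
        "isv": bool(esr & ESR_ISV),
        "iss": esr & ESR_ISS_MASK,
    }


if __name__ == "__main__":
    # Illustrative values only: a data abort with and without a valid syndrome.
    for value in (0x93000007, 0x9200000B):
        print(hex(value), decode_esr(value))
```

When `isv` is false for a data abort, the kernel code above takes the `else` branch and returns -ENOSYS, which QEMU reports as "Function not implemented".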

Comment 13 Kevin Fenzi 2018-05-16 23:53:25 UTC
I have moved all but 2 of the fedora armv7 builders back to f27 so we don't interfere with ongoing builds. I have the 2 builders installed (but disabled for new builds) with f28 and ready for any additional testing anyone would like me to do.

Comment 14 Laszlo Ersek 2018-05-17 13:59:12 UTC
I seem to remember that there are ARM instructions that modify multiple
operands in one go, and when those trap (e.g. because they access an MMIO
location), they cannot be virtualized easily (because the information to
emulate them isn't readily available from the trap symptoms). This has
happened in edk2 before, and the solution was to tweak the guest code so
that the compiler wouldn't generate such "practically unvirtualizable"
instructions:

https://github.com/tianocore/edk2/commit/2efbf710e27a

It seems that, for ARMv7, the R15 register is the "program counter" (PC).
When the guest is stopped due to the emulation failure, would it be possible
to get a disassembly from the neighborhood of R15, over QMP or HMP? That
might help with (a) confirming the type of the offending instruction, (b)
identifying the guest source code for which gcc-8 generates the offending
assembly.

In fact, if the guest doesn't crash totally, just a fault is injected (into
the guest kernel, or better yet, the guest userspace process), then the
instruction and its location could be easily disassembled with gdb -- I
think that's what Kevin nearly did in comment 4, from the core dump?
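As a starting point for the disassembly suggested above, the register dump that QEMU writes to the libvirt log can be parsed to recover R15. This is a rough sketch: the helper names are hypothetical, the window size is arbitrary, and it assumes the guest was in ARM (not Thumb) state so that instructions are 4 bytes wide.

```python
import re

# Matches entries such as "R15=b6f9a620" but not "PSR=a0000010".
REG_RE = re.compile(r"\bR(\d{2})=([0-9a-fA-F]{8})\b")


def parse_registers(dump):
    """Return {register number: value} from a QEMU AArch32 register dump."""
    return {int(num): int(value, 16) for num, value in REG_RE.findall(dump)}


def disassembly_window(pc, before=16, after=16):
    """Return (start, end) addresses around pc, with start aligned to 4 bytes."""
    start = (pc - before) & ~0x3
    return start, pc + after


if __name__ == "__main__":
    # Dump copied from the log excerpt posted earlier in this bug.
    dump = """
    R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
    R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
    R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
    R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
    PSR=a0000010 N-C- A usr32
    """
    regs = parse_registers(dump)
    start, end = disassembly_window(regs[15])
    print(f"PC={regs[15]:#x}, inspect {start:#x}..{end:#x}")
```

The resulting address range is only a hint for where to point a disassembler; mapping it back to a binary still requires knowing which process and library were loaded at that address.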

Comment 15 Andrew Jones 2018-05-22 06:57:26 UTC
(In reply to Kevin Fenzi from comment #13)
> I have moved all but 2 of the fedora armv7 builders back to f27 so we don't
> interfere with ongoing builds. I have the 2 builders installed (but disabled
> for new builds) with f28 and ready for any additional testing anyone would
> like me to do.

Yes, please. I've been attempting to reproduce, but without luck. If one of the f28 builders could reproduce, and then be left in the crashed state, we could attempt to extract a dump. I'll keep trying to reproduce as well though.

Comment 16 Kevin Fenzi 2018-05-25 19:50:40 UTC
ok. I have a builder in the 'pause' / crashed state. ;) 

Can you let me know exactly what you would like me to do? 
Or if you prefer we could try and get you access to the vm, but that might take a bit of back and forth to set things up. 

 24    buildvm-armv7-24.arm.fedoraproject.org paused

[Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not implemented

error: kvm run failed Function not implemented
R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
PSR=a0000010 N-C- A usr32

Comment 17 Andrew Jones 2018-05-28 13:47:01 UTC
Hi Kevin,

I'd like to get the core and the kernel version from you. If you used libvirt to launch the guest, then we should be able to get the core easily with something like the following, run on the host:

 $ virsh dump --memory-only --format elf $GUEST_NAME $GUEST_NAME.core

If the guest was launched directly with QEMU, then you'll need to connect to the QEMU monitor and run
 
 (qemu) dump-guest-memory $GUEST_NAME.core

If you can get the core and put it somewhere for me to download, then, along with telling me the guest kernel version running at the time, I'll attempt to analyze it. Note, I'm not using any compression options in the above commands on purpose. Using anything other than elf doesn't seem to work for AArch32 guests. You may compress the core after extracting it from the guest in order to prepare it for download though.

Thanks,
drew
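If several paused builders need dumping, the libvirt command above could be generated per guest. A small sketch, assuming only the `virsh dump` options already shown in this comment; the helper name and the default core file name are illustrative.

```python
import shlex


def virsh_dump_command(guest_name, core_path=None):
    """Build the argv for an uncompressed, memory-only ELF dump of a guest."""
    if core_path is None:
        core_path = f"{guest_name}.core"
    return ["virsh", "dump", "--memory-only", "--format", "elf",
            guest_name, core_path]


if __name__ == "__main__":
    argv = virsh_dump_command("buildvm-armv7-24.arm.fedoraproject.org")
    # Print rather than execute, so the command can be reviewed first.
    print(shlex.join(argv))
```

The sketch only prints the command; passing the list to `subprocess.run` on the host would execute it.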

Comment 18 Kevin Fenzi 2018-05-28 18:12:18 UTC
Got the dump, sending you email on where to get it (as it may have sensitive info in it).

Comment 19 Laszlo Ersek 2018-06-04 15:04:40 UTC
Apparently the same gcc-8 issue has caught up to 32-bit ARM firmware that runs on virtual machines. The proposed solution is to pass "-fno-lto" to gcc, in order to disable link-time optimization:

  [edk2] [PATCH] ArmVirtPkg/ArmVirtQemu ARM: work around KVM limitations in
                 LTO build
  https://lists.01.org/pipermail/edk2-devel/2018-June/025476.html

Ard's description of the issue, in the patch linked above, seems to confirm my suspicion in comment 14.

Comment 20 Daniel Berrange 2018-06-04 16:34:26 UTC
I wonder if it is worth talking to GCC maintainers to see if there's a viable way for them to avoid emitting these troublesome instructions, even when they do LTO, otherwise it feels like we'll be playing whack-a-mole when they decide some other thing can be optimized in the same way.

Comment 21 Laszlo Ersek 2018-06-04 18:11:54 UTC
Sent the following message to the upstream GCC mailing list:

"code-gen options for disabling multi-operand AArch64 and ARM instructions"
https://gcc.gnu.org/ml/gcc/2018-06/msg00036.html

Comment 22 ard.biesheuvel 2018-06-05 08:42:35 UTC
(In reply to Kevin Fenzi from comment #16)
> ok. I have a builder in the 'pause' / crashed state. ;) 
> 
> Can you let me know exactly what you would like me to do? 
> Or if you prefer we could try and get you access to the vm, but that might
> take a bit of back and forth to set things up. 
> 
>  24    buildvm-armv7-24.arm.fedoraproject.org paused
> 
> [Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not
> implemented
> 
> error: kvm run failed Function not implemented
> R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
> R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
> R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
> R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
> PSR=a0000010 N-C- A usr32

This example is rather puzzling, and I wonder whether it is a side effect of some other issue rather than being caused by MMIO being performed using instructions that KVM cannot emulate.

Note that the exception was taken in USR32 mode. This is weird, considering that you wouldn't expect userland to perform MMIO directly, unless it remapped some MMIO region using /dev/mem explicitly.

So given that this does not appear to be kernel code making the access, and considering that rebuilding the (32-bit ARM) world using a new compiler flag is intractable, I would prefer to spend more time gaining a better understanding of the root cause.
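The USR32 observation can be checked directly from the PSR values in the logs. The sketch below decodes the condition flags, the Thumb bit and the mode field of an AArch32 PSR using the architectural bit positions; the helper name is hypothetical.

```python
# AArch32 processor mode encodings, PSR bits [4:0].
MODES = {
    0x10: "usr", 0x11: "fiq", 0x12: "irq", 0x13: "svc", 0x16: "mon",
    0x17: "abt", 0x1A: "hyp", 0x1B: "und", 0x1F: "sys",
}


def decode_psr(psr):
    """Decode the fields of an AArch32 PSR that matter for this bug."""
    flags = "".join(
        name if psr & (1 << bit) else "-"
        for name, bit in (("N", 31), ("Z", 30), ("C", 29), ("V", 28))
    )
    mode = psr & 0x1F
    return {
        "flags": flags,
        "thumb": bool(psr & (1 << 5)),   # T bit: set means Thumb state
        "mode": MODES.get(mode, hex(mode)),
        "privileged": mode != 0x10,      # every mode except usr is privileged
    }


if __name__ == "__main__":
    # PSR values taken from the register dumps posted in this bug.
    for psr in (0xA0030010, 0xA0000010):
        print(hex(psr), decode_psr(psr))
```

Both PSR values seen so far decode to user mode in ARM state with flags `N-C-`, matching the `N-C- A usr32` text QEMU prints next to them.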

Comment 23 Laszlo Ersek 2018-06-05 09:00:29 UTC
Fair enough (and thank you for the analysis!); I think we've all missed USR32 until you've highlighted it. Drew has a vmcore from Kevin; I hope the vmcore will provide more evidence. Thank you!

Comment 24 ard.biesheuvel 2018-06-05 09:26:26 UTC
Something like

--- a/virt/kvm/arm/mmio.c
+++ b/virt/kvm/arm/mmio.c
@@ -172,7 +172,8 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
                if (ret)
                        return ret;
        } else {
-               kvm_err("load/store instruction decoding not implemented\n");
+               kvm_err("load/store instruction decoding not implemented (IPA:%pa ESR:0x%08x)\n",
+                       &fault_ipa, kvm_vcpu_get_hsr(vcpu));
                return -ENOSYS;
        }
 

would already be rather helpful in diagnosing the situation. Note that a stage 2 data abort (which is the exception that triggers this error) could be raised for many different conditions, including alignment faults and TLB conflicts in the stage 2 translation tables.

It should also be noted that the ESR_EL2.ISV bit, which is read by kvm_vcpu_dabt_isvalid(), is documented by the ARM ARM as

"""
This bit is 0 for all faults reported in ESR_EL2 except the following stage 2 aborts:
• AArch64 loads and stores of a single general-purpose register (including the register
specified with 0b11111 ), including those with Acquire/Release semantics, but excluding Load
Exclusive or Store Exclusive and excluding those with writeback.
• AArch32 instructions where the instruction:
— Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT, LDRSB,
LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH, STLH, STRHT, STRB,
STLB, or STRBT instruction.
— Is not performing register writeback.
— Is not using R15 as a source or destination register.
For these cases, ISV is UNKNOWN if the exception was generated in Debug state in memory access
mode, and otherwise indicates whether ISS[23:14] hold a valid syndrome.
"""

In other words, the architecture does not require the bit to be set for instructions that *could* be emulated, but rather limits the set of instructions for which you can expect it to ever assume the value '1' in the first place. IOW, it is perfectly acceptable for a core not to set the ISV bit for conditions like TLB conflict aborts etc at stage 2.

This seems especially relevant due to the error message reported in #4

[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94

where the lockdown abort is triggered by an instruction fetch. I cannot tell you what the condition is that triggers it (it is implementation defined), but it is likely that if the same condition occurred during, e.g., a stage 2 page table walk, we would take a stage 2 data abort exception without a valid syndrome, regardless of whether the instruction was in the 'virtualizable' class or not.
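The AArch32 part of the ARM ARM excerpt quoted above can be written as a small predicate. This is only a sketch of the quoted rule: the function name and argument shape are invented here, and a True result means the architecture permits ISV to be set, not that a given core will set it.

```python
# Mnemonics listed in the quoted ARM ARM text for AArch32.
ISV_ELIGIBLE = frozenset(
    "LDR LDA LDRT LDRSH LDRSHT LDRH LDAH LDRHT LDRSB LDRSBT LDRB LDAB LDRBT "
    "STR STL STRT STRH STLH STRHT STRB STLB STRBT".split()
)


def may_report_syndrome(mnemonic, writeback=False, registers=()):
    """Return True if ISV is allowed to be 1 for this AArch32 access.

    mnemonic  -- instruction name without condition suffix, e.g. "LDR"
    writeback -- True if the instruction updates its base register
    registers -- source/destination register names, e.g. ("r0", "r1")
    """
    if mnemonic.upper() not in ISV_ELIGIBLE:
        return False
    if writeback:
        return False
    return not any(reg.lower() in ("r15", "pc") for reg in registers)


if __name__ == "__main__":
    print(may_report_syndrome("LDR", registers=("r0", "r1")))    # single load
    print(may_report_syndrome("LDRD", registers=("r0", "r1")))   # dual load
    print(may_report_syndrome("LDR", writeback=True, registers=("r0", "r1")))
```

Multi-register forms such as LDM or LDRD fall outside the list, which is the class of instruction suspected in the earlier comments.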

Comment 25 ard.biesheuvel 2018-06-07 11:32:59 UTC
(In reply to ard.biesheuvel from comment #24)
...
> 
> This seems especially relevant due to the error message reported in #4
> 
> [Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault
> (lockdown abort) (0x234) at 0xb6df4d94
> 
> where the lockdown abort is triggered by an instruction fetch. I cannot tell
> you what the condition is that triggers it (it is implementation defined),
> but it is likely that if the same condition occurred during, e.g., a stage 2
> page table walk, we would take a stage 2 data abort exception without a
> valid syndrome, regardless of whether the instruction was in the
> 'virtualizable' class or not.

As pointed out by Christoffer Dall in an unrelated email thread, this code path should never be taken for IMPDEF lockdown abort exceptions, so we really need to instrument the code to get a better handle on this.
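For reference, the 0x234 value in the guest message can be decoded by hand. The sketch below assumes the long-descriptor (LPAE) fault status layout, where bit 9 marks the LPAE format and bits [5:0] hold the status code; the helper name is hypothetical and only a few status classes are named.

```python
FSR_LPAE = 1 << 9        # set when the long-descriptor format is in use
FSR_STATUS_MASK = 0x3F   # status code, bits [5:0]

# Status codes whose upper bits select a class and lower two bits the level.
LEVELLED_CLASSES = {
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_lpae_fsr(fsr):
    """Return (status, description) for a long-descriptor format FSR value."""
    if not fsr & FSR_LPAE:
        raise ValueError("short-descriptor format is not handled in this sketch")
    status = fsr & FSR_STATUS_MASK
    if status == 0x34:
        return status, "implementation-defined fault (lockdown)"
    fault_class = LEVELLED_CLASSES.get(status >> 2)
    if fault_class is not None:
        return status, f"{fault_class}, level {status & 0x3}"
    return status, "other"


if __name__ == "__main__":
    status, text = decode_lpae_fsr(0x234)
    print(f"status={status:#x} ({status}): {text}")
```

Status 0x34 is decimal 52, the code the guest kernel reports with the "implementation fault (lockdown abort)" text.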

Comment 26 Andrew Jones 2018-07-03 13:52:49 UTC
Here's a small update:

We haven't put a lot of effort into this debug yet, as it's been quite difficult to reproduce the issue. I ran an environment identical to what was described in comment 0 over a weekend, but nothing happened. We did get a guest core once from one of the Fedora build machines, which I took a look at. There wasn't really anything interesting in there, but I did note that only one vcpu was running a task at the time, and it was running the assembler 'as'. So, as we expected from the kernel splat, it was running in user space at the time. We've also provided the reporter with a modified host kernel containing Ard's suggested change from comment 24, but the issue hasn't reproduced (or at least not with the same symptom) since.

Since attempting to reproduce with the modified host kernel, the reporter has seen a guest kernel log

[Thu Jun 28 01:06:51 2018] Unhandled prefetch abort: implementation
fault (lockdown abort) (0x234) at 0xb6db4bdc

but there was no crash and no host kernel logs.

I'll attempt to find a reproducer some more.

Comment 27 John Feeney 2018-10-22 18:55:34 UTC
Adding bugzillas that record bugs/issues found on certified AArch64 systems and/or significant bzs (panics, data corruptors, etc.) found generically on AArch64 systems. The intent is to focus on bzs on this tracker going forward in RHEL-alt.

Comment 29 Kevin Fenzi 2018-11-25 05:42:57 UTC
So, not sure if this still happens on f28, but it does not seem to happen on f29. 

I did a scratch glibc build in our staging env that has f29 builders. 
The armv7 build failed: 

https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90003231

but the guest and host were fine...

Comment 30 Andrew Jones 2018-12-10 17:53:16 UTC
Based on comment 29, I'll close this. It, or a new BZ, can always be [re]opened if the issue returns. Thanks

Comment 31 Kevin Fenzi 2018-12-22 19:37:53 UTC
I spoke too soon. This is indeed still happening with RHEL 7.6 alt and Fedora 29 guest VMs.

It's now happening with kernel, glibc, and qt5.

We really need to get to the bottom of this. :( 

The kernel and glibc ones don't cause the guest to pause, but do fail with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0x00acfa00" in dmesg and core dump.
The qt5 one does cause the guest pause and "[Sat Dec 22 19:16:24 2018] kvm [28942]: load/store instruction decoding not implemented" in dmesg. 

Happy to gather more info.
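To help gather that data, the two symptoms can be told apart mechanically from dmesg output. A minimal sketch, assuming the message formats quoted in this bug stay the same; the function name and the labels it returns are made up for illustration.

```python
import re

# Host-side message that accompanies a paused guest.
PAUSE_RE = re.compile(
    r"kvm \[\d+\]: load/store instruction decoding not implemented"
)
# Guest-side message that accompanies a bus error in a build.
BUS_ERROR_RE = re.compile(
    r"Unhandled prefetch abort: implementation fault \(lockdown abort\) "
    r"\((0x[0-9a-f]+)\) at (0x[0-9a-f]+)"
)


def classify(line):
    """Label a dmesg line as one of the two symptoms, or None if unrelated."""
    if PAUSE_RE.search(line):
        return ("guest-pause", None)
    match = BUS_ERROR_RE.search(line)
    if match:
        return ("bus-error", int(match.group(2), 16))
    return None


if __name__ == "__main__":
    samples = [
        "[Sat Dec 22 19:16:24 2018] kvm [28942]: load/store instruction "
        "decoding not implemented",
        "Unhandled prefetch abort: implementation fault (lockdown abort) "
        "(0x234) at 0x00acfa00",
        "[Sat Dec 22 19:16:25 2018] some unrelated message",
    ]
    for sample in samples:
        print(classify(sample))
```

Run against host logs it finds the pause signature, and against guest logs the bus-error signature together with the faulting address.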

Comment 32 ard.biesheuvel 2018-12-23 15:26:55 UTC
It still looks to me like some kind of incorrectly classified RAS error is delivered to the guest. Whether it results in the prefetch abort or the data abort (which triggers the KVM error) depends on whether the original exception occurs on a data access or on an instruction fetch.

Is this reproducible across multiple M400s? Does it occur anywhere else?

Comment 33 Paul Whalen 2019-01-29 20:14:45 UTC
Tried this on a Mustang with rhel8 beta and an f29 vm. Kernel build failed with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6de17c8" as reported in c#31. When using the non-lpae kernel on the builder the build finished successfully.

Comment 34 Kevin Fenzi 2019-02-16 03:45:00 UTC
So, another datapoint (that hopefully Paul can duplicate):

A Fedora 29 vm with NO updates (ie, GA f29, do not use or apply the updates repo), does seem to work ok. At least I got a glibc build with no problems with it. 
Perhaps this was why I didn't see it with F29 initially. 

I then only had the updated Fedora kernel in the vm, and the issue returned. 

So, it sounds like the kernel of the Fedora guest has something to do with this?
4.18.16-300.fc29.armv7hl+lpae -> no issue
4.20.7-200.fc29 -> this issue

Paul, can you confirm? Does this help any isolating this?

Comment 35 Paul Whalen 2019-03-18 14:21:39 UTC
Confirmed, using F29 GA with no updates with mock and deps from the GA repo I was able to do a number of kernel builds without an error. 

Trying with kernel-5.0.1-300.fc29 now.

Comment 36 Peter Robinson 2019-03-18 14:27:37 UTC
(In reply to Paul Whalen from comment #35)
> Confirmed, using F29 GA with no updates with mock and deps from the GA repo
> I was able to do a number of kernel builds without an error. 
> 
> Trying with kernel-5.0.1-300.fc29 now.

You may want to try 5.1rc1 too, as reading through the change logs there were quite a few changes/fixes for arm32 virt and related memory/barriers and all sorts of related bits.

Comment 37 Paul Whalen 2019-03-19 15:42:45 UTC
5.0.1-300.fc29.armv7hl+lpae failed with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort)".

Trying with kernel-5.1.0-0.rc1.git0.1.fc31 now.

Comment 38 Paul Whalen 2019-03-19 17:35:05 UTC
5.1.0-0.rc1.git0.1.fc31.armv7hl+lpae also fails in the same way with a kernel build.

Comment 39 Paul Whalen 2019-04-24 15:02:39 UTC
5.1.0-0.rc5, 5.1.0-0.rc6 tested, no improvement.

Comment 40 Jon Masters 2019-04-25 17:36:09 UTC
Can you describe exactly how to set up the environment to reproduce this?

Comment 41 Paul Whalen 2019-04-26 13:18:05 UTC
(In reply to Jon Masters from comment #40)
> Can you describe exactly how to set up the environment to reproduce this?

RHEL or Fedora host (m400 or Mustang) with a Fedora ARMv7 guest running the lpae kernel. Then attempt to build a kernel, glibc, or qt5. I have a host set up with various vm's if that's easier.

Comment 42 Jeremy Linton 2019-04-29 22:48:23 UTC
I allocated an m400 last week to test this, but ugh got pulled off it. As a secondary point though, a virt-manager started VM on softiron3k seems to be able to build the fedora kernel package repeatedly without problems (until I ran out of disk space that is).

Comment 43 Peter Robinson 2019-04-30 09:35:44 UTC
(In reply to Kevin Fenzi from comment #34)
> So, another datapoint (that hopefully Paul can duplicate):
> 
> A Fedora 29 vm with NO updates (ie, GA f29, do not use or apply the updates
> repo), does seem to work ok. At least I got a glibc build with no problems
> with it. 
> Perhaps this was why I didn't see it with F29 initially. 
> 
> I then only had the updated Fedora kernel in the vm, and the issue returned. 

Might be worth testing a F-30 userspace with the 4.18.16-300.fc29.armv7hl+lpae kernel to see if that theory holds.

Comment 44 Andrew Jones 2019-04-30 10:29:44 UTC
(In reply to Jeremy Linton from comment #42)
> I allocated a m400 last week to test this, but ugh got pulled off it. As a
> secondary point though, a virt-manager started VM on softiron3k seems to be
> able to build the fedora kernel package repeatedly without problems (until I
> ran out of disk space that is).

Was the guest using the LPAE kernel? I understand that this isn't reproducible with the non-LPAE kernel, which implies something with the LPAE guest memory management could be triggering the bug. Also, as Ard points out in a couple comments above, we're getting stage2 aborts that are sometimes instructions and sometimes data, which implies it could be an issue triggered from guest paging in general. I'm suspicious that this issue may be due to the fact that the host kernel is using 64k pages and the guest is using 4k. If we can find a system and guest where this reliably reproduces, then I would suggest changing the host kernel to one that uses 4k pages to see if it still reproduces.
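To compare page sizes when testing this theory, the base page size can be read on both sides. A small sketch using the standard `SC_PAGE_SIZE` sysconf name; the helper names are illustrative, and it reports the page size of whichever system it runs on, so it would need to be run once on the host and once in the guest.

```python
import os


def page_size_label(size_bytes):
    """Format a page size in bytes as a short label such as '4K' or '64K'."""
    return f"{size_bytes // 1024}K"


def local_page_size():
    """Return the base page size of the running kernel in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    size = local_page_size()
    print(f"this kernel uses {page_size_label(size)} pages ({size} bytes)")
```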

Comment 45 Peter Robinson 2019-04-30 10:41:11 UTC
> Was the guest using the LPAE kernel? I understand that this isn't

Yes, the guest is a LPAE kernel.

> reproducible with the non-LPAE kernel, which implies something with the LPAE
> guest memory management could be triggering the bug. Also, as Ard points out
> in a couple comments above, we're getting stage2 aborts that are sometimes
> instructions and sometimes data, which implies it could be an issue
> triggered from guest paging in general. I'm suspicious that this issue may
> be due to the fact that the host kernel is using 64k pages and the guest is
> using 4k. If we can find a system and guest where this reliably reproduces,
> then I would suggest changing the host kernel to one that uses 4k pages to
> see if it still reproduces.

Fedora on aarch64 uses a 4K-page kernel, and comment #41 states it's reproducible on a RHEL or Fedora host, so that may well rule out a 64K vs 4K page issue.

Comment 51 Jon Masters 2019-06-18 19:56:02 UTC
What doesn't make any sense is how we'd end up with cache blocks in lockdown to begin with. I'm not sure that was even supported on Potenza (X-Gene1). I'm chatting with the design team now.

Comment 52 Jon Masters 2019-06-18 22:19:15 UTC
So when Paul sees a build fail in the guest, we do so with a lockdown implementation fault prefetch abort reported to the guest kernel, with an FSR indicating that this was due to a cache maintenance operation but also that it explicitly wasn't due to a cache operation. So yeah, sure. LPAE changes the descriptors for fault reporting - the short non-LPAE form also can convey lockdown faults, but you never see any with a non-LPAE kernel. Not that that means much, other than that I think the lockdown "fault" is in fact bogus. BUT we can work with it, since it's reproducible. The next step is to start instrumenting the guest kernel to look at the faulting address instruction page, etc. I'll start poking. Meanwhile, the design team are looking into what could cause a lockdown fault to be reported into a guest.

Comment 53 Jon Masters 2019-06-19 14:08:02 UTC
Confirmed that Potenza DID NOT support lockdown and there's no real reason for this fault to be reported. I personally suspect a silicon bug that is being tickled through a bogus mapping or somesuch. Time to gather more data.

Comment 54 Jon Masters 2019-06-21 20:23:48 UTC
This can quite reliably be reproduced on F30 by simply doing an olddefconfig and clean sequence. Mostly, you'll get a bus error, but sometimes the host will get the load/store unimplemented warning. I've seen reference to writing to /dev/null in perhaps all of the cases so far. Just trying to narrow it down to a single reproducer if possible before I start down the path of filesystem debug or poking at /dev/null.

Comment 55 Jon Masters 2019-06-29 22:49:39 UTC
Software managed Access Flag handling seems busted when using LPAE guests on AArch64 hosts. Every time we get the pause mentioned above it's because KVM's mmio code doesn't know what it's doing so it assumes it's emulating an MMIO access (and then it spews the warning). But the actual instructions executing are straightforward loads. What they do have in common is that they're trying to happen against pages that have the access flag cleared. Something is going wrong there. I'm going to poke. Will update this BZ with various logs and dumps.

Comment 56 Laszlo Ersek 2019-06-30 11:38:19 UTC
(Please excuse me if the following is dumb/naive.)

(In reply to Jon Masters from comment #55)
> the actual instructions executing are straightforward loads. What they
> do have in common is that they're trying to happen against pages that
> have the access flag cleared.

Does that mean "first access to a recently paged-in page"?

Can we perturb that by (a) disabling swap in the guest, *and* (b)
locking all guest memory into host RAM?

(

The libvirt-generated cmdline from comment 0 includes "-realtime
mlock=off", hence question (b). The related libvirt domain XML elements
are documented at
<https://libvirt.org/formatdomain.html#elementsMemoryBacking>.

I believe (a)+(b) wouldn't completely eliminate "first access to a
recently paged-in page" in the guest, since that should happen anyway as
a part of read()s from regular files, and as a part of explicit mmap()s
of regular files by applications. Hence the word "perturb". :)

)

Thanks.

Comment 57 Jon Masters 2019-07-01 05:12:44 UTC
        gfn = fault_ipa >> PAGE_SHIFT;

Except no. Not when the page size differs between host and guest. Working on a patch.

Comment 58 Jon Masters 2019-07-01 06:31:51 UTC
I'm not quite right in exactly how to fix this, but I'm pretty sure I'm on the right track that it's a page size interaction.

Comment 59 Peter Robinson 2019-07-01 06:37:35 UTC
(In reply to Jon Masters from comment #58)
> I'm not quite right in exactly how to fix this, but I'm pretty sure I'm on
> the right track that it's a page size interaction.

I wonder how aarch64 handles this as we have the same config there in that the underlying hypervisor is RHEL and the Fedora guests for aarch64 are 4K page sizes.

Comment 60 Jon Masters 2019-07-01 08:31:13 UTC
Ok, slightly more nuanced. It looks like the value being reported in hpfar_el2 is truncated such that it's missing bits higher than 31. Which means that 0x43deb9000 becomes 0x3deb9000. I originally assumed Linux was doing a wrong shift, but I think it's the hardware. I'm tired out of my mind at this point but I think a manual AT translation will be needed here to work around this if I'm right.

Definitely need to look at this with fresh eyes tomorrow.
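
As a rough illustration of the truncation described above, the following C sketch models "losing every bit above bit 31" as a simple mask. The function name truncate_ipa_to_32_bits is hypothetical and exists only for this example; it says nothing about how the hardware actually produces the value.

```c
#include <stdint.h>

/* Hypothetical model of the suspected behaviour: keep only bits [31:0]. */
uint64_t truncate_ipa_to_32_bits(uint64_t ipa)
{
    return ipa & 0xFFFFFFFFULL;
}
```

Applied to the address quoted above, 0x43deb9000 comes out as 0x3deb9000.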

Comment 61 Jon Masters 2019-07-01 14:11:52 UTC
I've requested information from the manufacturer about any errata. It seems as if this /may/ be happening:

A guest running in AArch32 execution state at EL1 performs a data access or instruction fetch that requires a translation table walk. The guest physical memory translation table is located at a guest IPA in LPAE space beyond 32-bits. One of the pages containing the PTE maps to a host stage2 translation that is not present. A trap is taken to EL2 with a truncated guest IPA contained within HPFAR_EL2.FIPA[47:12] equivalent to truncating HPFAR_EL2.FIPA into HPFAR_EL2.FIPA[31:12] (the actual HPFAR_EL2 register itself is not simply truncated).

The host takes the fault and walks through the memslots that it has for guest physical memory. Seeing the truncated version, it does not realize that the faulting IPA is contained within the guest's physical memory. It mishandles this as an IO memory abort (fallthrough) and generates a spurious warning about load/store instruction emulation.

One possible workaround is to pin guest memory in the qemu process.
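
The memslot lookup described above can be sketched in C as follows. This is a simplified stand-in, not KVM's real data structures: struct slot and ipa_is_guest_ram are hypothetical names, and the single slot used in the usage note (RAM starting at 0x40000000 and 24 GiB long) is an assumed layout chosen only so that the example address from comment 60 falls inside it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a guest memory slot. */
struct slot {
    uint64_t base_ipa;  /* first guest physical address covered */
    uint64_t size;      /* length of the slot in bytes */
};

/* Return true if ipa falls inside any slot, i.e. it is guest RAM. */
bool ipa_is_guest_ram(const struct slot *slots, size_t n, uint64_t ipa)
{
    for (size_t i = 0; i < n; i++) {
        if (ipa >= slots[i].base_ipa &&
            ipa - slots[i].base_ipa < slots[i].size)
            return true;
    }
    return false;
}
```

With that assumed slot, 0x43deb9000 is found in guest RAM, while the truncated 0x3deb9000 lies below the slot base and is not, which is the case that would fall through to the IO memory abort path.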

Comment 62 Jon Masters 2019-07-03 02:00:06 UTC
More confirmation...

[ 1373.454742] JCM: [piabt] gfn: 0x3e4c
[ 1475.404950] JCM: [piabt] gfn: 0x3dea
[ 1492.926644] JCM: [piabt] gfn: 0x3da4
[ 1493.157918] JCM: [piabt] gfn: 0x3e4b
[ 1652.172442] JCM: fault_ipa: 0x        3dc45838
[ 1652.225778] kvm [6509]: load/store instruction decoding not implemented
[ 1652.305180] JCM: io_mem_abort_failed
[ 1652.348060] JCM: faulting gfn: 0x3dc4
[ 1652.391982] JCM: faulting far: 0xb6fbe838
[ 1652.440076] JCM: occurred while performing S1 PTW
[ 1652.496516] JCM: PGD: 0x40207000
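
For reading the log above: the gfn values appear consistent with the host computing gfn = fault_ipa >> PAGE_SHIFT with 64K pages (a shift of 16). A minimal C sketch of that arithmetic, where ipa_to_gfn is a hypothetical helper rather than a kernel function and the page shift is passed in explicitly:

```c
#include <stdint.h>

/* Hypothetical helper: frame number of an IPA for a given page shift. */
uint64_t ipa_to_gfn(uint64_t fault_ipa, unsigned int page_shift)
{
    return fault_ipa >> page_shift;
}
```

With a shift of 16, fault_ipa 0x3dc45838 gives 0x3dc4, matching the "faulting gfn" line; with a 4K shift of 12 the same address would give 0x3dc45.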

Comment 63 Jon Masters 2019-07-03 08:33:12 UTC
[  487.128991] JCM: WARNING: Mismatched FIPA and PA translation detected!
[  487.207365] JCM: Guest faulting far: 0xb6dfe1cc (gfn: 0x3f13)
[  487.276335] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b0fd540
[  487.343214] JCM: Guest PGD address: 0x5b0fd550
[  487.396523] JCM: Guest PGD: 0x5b1c1003
[  487.441484] JCM: Guest PMD address: 0x5b1c1db0
[  487.494795] JCM: Guest PMD: 0x43f130003
[  487.540804] JCM: Guest PTE address: 0x43f130ff0
[  487.595153] JCM: Guest PTE: 0xe0000430bccf5f
[  487.646378] JCM: Manually translated as: 0xb6dfe1cc->0x430bcc000
[  487.718467] JCM: fault_ipa: 0x        3f1301cc
[  487.771775] kvm [6716]: load/store instruction decoding not implemented
[  487.851163] JCM: io_mem_abort_failed
[  487.8 487.937955] JCM: faulting far: 0xb6dfe1cc
[  487.986049] JCM: occurred while performing S1 PTW
[  488.042485] JCM: PGD: 0xba00005b0fd540

Here you can see that we take a data abort from the guest as we try to perform a stage 1 walk, with an FIPA of 0x3f1301cc, which should probably be 0x43f1301cc. The PTE it's trying to read is very close in address to the manually translated ones above for the faulting address currently in the FAR.
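
The addresses in the manual walk above can be reproduced with the usual LPAE index arithmetic. The C sketch below is an illustration only, not the debug patch: the function names are hypothetical, it assumes TTBCR.T0SZ == 0 (so the level 1 index is VA[31:30]), 4K guest pages, and 40-bit output addresses, and it computes entry addresses without reading guest memory.

```c
#include <stdint.h>

#define LPAE_ADDR_MASK 0x000000FFFFFFF000ULL  /* descriptor bits [39:12] */

/* Address of the level 1 (PGD) entry; assumes TTBCR.T0SZ == 0. */
uint64_t lpae_pgd_entry_addr(uint64_t ttbr0, uint32_t va)
{
    uint64_t base = ttbr0 & 0x000000FFFFFFFFE0ULL;  /* drop ASID and low bits */
    return base + ((uint64_t)(va >> 30) & 0x3) * 8;
}

/* Address of the level 2 (PMD) entry, given the level 1 descriptor. */
uint64_t lpae_pmd_entry_addr(uint64_t pgd, uint32_t va)
{
    return (pgd & LPAE_ADDR_MASK) + ((uint64_t)(va >> 21) & 0x1ff) * 8;
}

/* Address of the level 3 (PTE) entry, given the level 2 descriptor. */
uint64_t lpae_pte_entry_addr(uint64_t pmd, uint32_t va)
{
    return (pmd & LPAE_ADDR_MASK) + ((uint64_t)(va >> 12) & 0x1ff) * 8;
}

/* Output page address of a level 3 page descriptor. */
uint64_t lpae_pte_output_page(uint64_t pte)
{
    return pte & LPAE_ADDR_MASK;
}
```

Feeding in the logged values (FAR 0xb6dfe1cc, TTBR0 0x5b0fd540, PGD 0x5b1c1003, PMD 0x43f130003, PTE 0xe0000430bccf5f) yields 0x5b0fd550, 0x5b1c1db0, 0x43f130ff0 and 0x430bcc000 respectively, the same numbers as in the log.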

Comment 64 Jon Masters 2019-07-03 18:55:48 UTC
It's aliiiiiiiive muwhahahahaha:

[  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
[  143.748447] JCM: Hyper faulting far: 0x3deb0000
[  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
[  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
[  143.938649] JCM: Guest PGD address: 0x5b06cc50
[  143.991962] JCM: Guest PGD: 0x5b150003
[  144.036925] JCM: Guest PMD address: 0x5b150db0
[  144.090238] JCM: Guest PMD: 0x43deb0003
[  144.136241] JCM: Guest PTE address: 0x43deb0e70
[  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
[  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
[  144.314972] JCM: Faulting IPA page: 0x3deb0000
[  144.368286] JCM: Faulting PTE page: 0x43deb0000
[  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
[  144.493684] JCM: corrected fault_ipa: 0x43deb0000
[  144.550133] JCM: Corrected gfn: 0x43deb
[  144.596145] JCM: handle user_mem_abort
[  144.641155] JCM: ret: 0x1
[  173.268497] JCM: WARNING: Mismatched FIPA and PA translation detected!
[  173.346903] JCM: Hyper faulting far: 0x3dea8000
[  173.401265] JCM: Guest faulting far: 0xb6dcf3c4 (gfn: 0x3dea)
[  173.470236] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b0c1e80
[  173.537111] JCM: Guest PGD address: 0x5b0c1e90
[  173.590425] JCM: Guest PGD: 0x5a891003
[  173.635392] JCM: Guest PMD address: 0x5a891db0
[  173.688704] JCM: Guest PMD: 0x43dea8003
[  173.734709] JCM: Guest PTE address: 0x43dea8e78
[  173.789060] JCM: Guest PTE: 0x42000043bb72fdf
[  173.841326] JCM: Manually translated as: 0xb6dcf3c4->0x43bb72000
[  173.913418] JCM: Faulting IPA page: 0x3dea8000
[  173.966731] JCM: Faulting PTE page: 0x43dea8000
[  174.021088] JCM: Fault occurred while performing S1 PTW -fixing
[  174.092138] JCM: corrected fault_ipa: 0x43dea8000
[  174.148579] JCM: Corrected gfn: 0x43dea
[  174.194592] JCM: handle user_mem_abort
[  174.239601] JCM: ret: 0x1
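
The correction visible in these logs can be summarized with a small C sketch. This is not the attached patch, only an illustration of the idea under stated assumptions: fixup_fault_ipa and fault_gfn are hypothetical names, the guest uses 4K pages, the host uses 64K pages, and the guest PTE address has already been obtained from a manual walk.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUEST_PAGE_MASK (~0xFFFULL)  /* 4K guest pages (assumed) */
#define HOST_PAGE_SHIFT 16           /* 64K host pages (assumed) */

/*
 * If the fault hit a stage 1 page table walk and the reported IPA equals
 * the low 32 bits of the page holding the guest PTE, use the PTE page
 * instead. Otherwise return the reported IPA unchanged.
 */
uint64_t fixup_fault_ipa(uint64_t reported_ipa, uint64_t pte_addr, bool s1ptw)
{
    uint64_t pte_page = pte_addr & GUEST_PAGE_MASK;

    if (s1ptw && pte_page != reported_ipa &&
        (pte_page & 0xFFFFFFFFULL) == reported_ipa)
        return pte_page;
    return reported_ipa;
}

/* Host frame number for a (corrected) fault IPA. */
uint64_t fault_gfn(uint64_t fault_ipa)
{
    return fault_ipa >> HOST_PAGE_SHIFT;
}
```

For the first fault above, a reported IPA of 0x3deb0000 and a PTE address of 0x43deb0e70 give a corrected fault_ipa of 0x43deb0000 and a gfn of 0x43deb, as logged.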

Comment 65 Jon Masters 2019-07-03 19:00 UTC
Created attachment 1587164 [details]
0001-virt-arm-correct-hardware-errata-reading-HPFAR_EL2-o.patch

Ugly patch for debug purposes only, proving that there exists an HPFAR_EL2 erratum on the hardware and correcting it for test purposes ONLY.

Comment 66 Jon Masters 2019-07-03 19:41:17 UTC
Paul ran a test mock kernel build for me (which is still ongoing). The test patch caught and fixed up two faults that would have normally crashed the builder so far:

[ 2290.867797] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 2290.946180] JCM: Hyper faulting far: 0x3d876000
[ 2291.000539] JCM: Guest faulting far: 0xb6dde3c4 (gfn: 0x3d87)
[ 2291.069508] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x525b37c0
[ 2291.136386] JCM: Guest PGD address: 0x525b37d0
[ 2291.189704] JCM: Guest PGD: 0x5aba9003
[ 2291.234674] JCM: Guest PMD address: 0x5aba9db0
[ 2291.287997] JCM: Guest PMD: 0x43d876003
[ 2291.334005] JCM: Guest PTE address: 0x43d876ef0
[ 2291.388367] JCM: Guest PTE: 0x4200004050e3fdf
[ 2291.440636] JCM: Manually translated as: 0xb6dde3c4->0x4050e3000
[ 2291.512728] JCM: Faulting IPA page: 0x3d876000
[ 2291.566038] JCM: Faulting PTE page: 0x43d876000
[ 2291.620398] JCM: Fault occurred while performing S1 PTW -fixing
[ 2291.691445] JCM: corrected fault_ipa: 0x43d876000
[ 2291.747897] JCM: Corrected gfn: 0x43d87
[ 2291.793907] JCM: handle user_mem_abort
[ 2291.838898] JCM: ret: 0x1
[ 2297.938930] JCM: WARNING: Mismatched FIPA and PA translation detected!
[ 2298.017310] JCM: Hyper faulting far: 0x3ce00000
[ 2298.071671] JCM: Guest faulting far: 0xb6de13c4 (gfn: 0x3ce0)
[ 2298.140639] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x525e9ec0
[ 2298.207512] JCM: Guest PGD address: 0x525e9ed0
[ 2298.260823] JCM: Guest PGD: 0x5aba9003
[ 2298.305798] JCM: Guest PMD address: 0x5aba9db0
[ 2298.359117] JCM: Guest PMD: 0x43ce00003
[ 2298.405130] JCM: Guest PTE address: 0x43ce00f08
[ 2298.459497] JCM: Guest PTE: 0x4200004050e3fdf
[ 2298.511763] JCM: Manually translated as: 0xb6de13c4->0x4050e3000
[ 2298.583861] JCM: Faulting IPA page: 0x3ce00000
[ 2298.637176] JCM: Faulting PTE page: 0x43ce00000
[ 2298.691535] JCM: Fault occurred while performing S1 PTW -fixing
[ 2298.762583] JCM: corrected fault_ipa: 0x43ce00000
[ 2298.819033] JCM: Corrected gfn: 0x43ce0
[ 2298.865041] JCM: handle user_mem_abort
[ 2298.910027] JCM: ret: 0x1

Comment 67 Jon Masters 2019-07-08 04:34:14 UTC
NOTE on testing the above. Guests withstood load testing for a long time (much longer than usual). Eventually, the guests paused again under load. This time, the faulting address was outside the range covered by my patch (in guest kernel memory this time). The guest was getting an interrupt but the pages for the vgic at stage 2 (host) caused a fault (e.g. access bit update in the host). This could be an unrelated bug in that the HPFAR was correct this time but we didn't fix up the access fault correctly in the host. I'll look at that, possibly separately.

Comment 68 Jon Masters 2019-07-18 22:00:00 UTC
I set up a test rig at Ampere and they were able to reproduce this. It's being investigated.

Meanwhile, I've got a guest test kernel building with HIGHPTE disabled that I'm hoping will very temporarily work around the problem. We'll see if practically speaking we run out of low memory to allocate the PTEs from in reality. If it works well enough it could be an option for the Fedora builders temporarily.

Comment 69 Kevin Fenzi 2019-07-19 16:10:00 UTC
Thanks Jon (and everyone!) for tracking this down and working on a final fix...

Comment 70 Jon Masters 2019-07-19 17:12 UTC
Created attachment 1592060 [details]
force CONFIG_HIGHPTE off for Arm LPAE kernels

This is the patch that I'm currently using. It hasn't rolled over yet.

Comment 71 Peter Robinson 2019-07-19 17:57:51 UTC
Applied to rawhide (should have a build on Mon with 5.3-rc1) and the stabilization branch (5.2) which we can test next week as it's 5.2 test week.

