Bug 1576593 - Fedora 29 armv7 guests "pause" from time to time on rhel-alt 7.6 kernel-4.14.0-115.2.2.el7a.aarch64 kernel
Summary: Fedora 29 armv7 guests "pause" from time to time on rhel-alt 7.6 kernel-4.14....
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-alt   
(Show other bugs)
Version: 7.6-Alt
Hardware: aarch64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.6-Alt
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Keywords: Reopened
Depends On:
Blocks: ARMTracker 1638451 1645315
 
Reported: 2018-05-09 21:18 UTC by Kevin Fenzi
Modified: 2019-03-19 17:35 UTC
CC List: 28 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-10 17:53:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

Description Kevin Fenzi 2018-05-09 21:18:56 UTC
Hardware is an HP Moonshot with m400 aarch64 cartridges.

The cartridge is running RHEL 7.5 Alt.

The VM is Fedora 28 armv7 with all updates, acting as a Fedora Koji builder.

From time to time (possibly caused by builds?) the VM will move to the "paused" state and become unresponsive.

Trying to resume it gives:
# virsh resume buildvm-armv7-14.arm.fedoraproject.org
error: Failed to resume domain buildvm-armv7-14.arm.fedoraproject.org
error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

Then a reset and resume brings it back, but it rebooted: 
# virsh reset  buildvm-armv7-14.arm.fedoraproject.org                                      
Domain buildvm-armv7-14.arm.fedoraproject.org was reset

# virsh resume buildvm-armv7-14.arm.fedoraproject.org
Domain buildvm-armv7-14.arm.fedoraproject.org resumed

ssh buildvm-armv7-14.arm.fedoraproject.org

# w
 20:37:33 up 0 min,  1 user,  load average: 0.73, 0.16, 0.05

On the host, /var/log/libvirt/qemu/buildvm-armv7-14.arm.fedoraproject.org.log contains:

2018-05-09 06:26:51.401+0000: starting up libvirt version: 3.9.0, package: 14.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-07-13:59:11, arm64-017.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-ma-2.10.0-21.el7), hostname: aarch64-c14n1.arm.fedoraproject.org
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=buildvm-armv7-14.arm.fedoraproject.org,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/master-key.aes -machine virt-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,aarch64=off -m 24576 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e2be82ec-efa8-445a-ba2b-d4b2a4f832b3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-buildvm-armv7-14.arm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -boot strict=on -kernel /var/lib/libvirt/images/vmlinuz-4.16.6-302.fc28.armv7hl+lpae -initrd /var/lib/libvirt/images/initramfs-4.16.6-302.fc28.armv7hl+lpae.img -append '     root=UUID=3190f023-985e-4e7f-88b5-2f3c576f7987 ro net.ifnames=0 console=ttyAMA0 LANG=en_US.UTF-8' -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 -device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 -device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 -device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 -device qemu-xhci,p2=8,p3=8,id=usb,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive file=/dev/vg_Server/buildvm-armv7-14.arm.fedoraproject.org,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:1e:58,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -serial chardev:charserial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-5-buildvm-armv7-14.arm/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -msg timestamp=on
2018-05-09 06:26:51.402+0000: Domain id=5 is tainted: host-cpu
2018-05-09T06:26:51.483671Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/0 (label charserial0)
error: kvm run failed Function not implemented
R00=b6fe6350 R01=b6fe6370 R02=00000006 R03=b6abc138
R04=b6fe6350 R05=b6fe8000 R06=6ffffeff R07=b6abc138
R08=b6c3744c R09=bee2f714 R10=b6fe64a8 R11=bee2f694
R12=b6abc000 R13=bee2f580 R14=b6fbd2f4 R15=b6fc2758
PSR=a0030010 N-C- A usr32

Will try and detect some pattern to the pauses.

Comment 2 Florian Weimer 2018-05-15 11:43:38 UTC
Kevin, can we make this bug public?

Comment 4 Kevin Fenzi 2018-05-15 15:50:53 UTC
(In reply to Florian Weimer from comment #2)
> Kevin, can we make this bug public?

As far as I am concerned, absolutely!

Some more data: 

Fedora kernel builds have been failing with a "bus error" on armv7, which seems like it could well be related to this. 

In those guests, during kernel builds, we get:
[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94
in dmesg, and there is a core dump:


Message: Process 26075 (sh) of user 1000 dumped core.

Stack trace of thread 26075:
#0  0x00000000b6df4d94 n/a (/usr/lib/libc-2.27.9000.so)

A Fedora 27 armv7 builder works as expected, so perhaps something in the Fedora 28 armv7 install is causing this? I suppose this could be a separate issue too, but they seem related.

Comment 5 Kevin Fenzi 2018-05-15 18:37:18 UTC
More info: 

f27-builder (where kernel builds complete fine): 
4.16.7-200.fc27.armv7hl+lpae

f28-builder (where kernel builds bus error): 
4.16.7-300.fc28.armv7hl+lpae

So, I don't think this is caused by the kernel. Could it be a glibc in f28 issue?

Comment 6 Florian Weimer 2018-05-15 18:47:53 UTC
(In reply to Kevin Fenzi from comment #5)
> More info: 
> 
> f27-builder (where kernel builds complete fine): 
> 4.16.7-200.fc27.armv7hl+lpae
> 
> f28-builder (where kernel builds bus error): 
> 4.16.7-300.fc28.armv7hl+lpae

Hmm.  Are you suggesting that this is not solely a hypervisor issue?

> So, I don't think this is caused by the kernel. Could it be a glibc in f28
> issue?

If it's a glibc issue, I would expect it to be more deterministic.

It could be that the kernel compiled by GCC 8 has issues, though.  That would only appear on Fedora 28, not Fedora 27, which still has GCC 7.

Comment 7 Fabiano Fidêncio 2018-05-15 19:41:14 UTC
I'm not totally convinced that this issue is the root cause of some failures that I've been facing when building SSSD[0], but it may be. It could be something related to glibc, as suggested by Kevin (or even a totally different issue happening on armv7hl).

I've done some tests using mock on armv7hl hardware and the SSSD build passes without any issue. However, it's constantly failing under Fedora infra.

[0]: https://koji.fedoraproject.org/koji/taskinfo?taskID=26957996

Florian,

If you need some info that may help you to investigate a possible failure on glibc, please, let me know how I could help.

Comment 8 Florian Weimer 2018-05-15 19:47:38 UTC
(In reply to Fabiano Fidêncio from comment #7)
> If you need some info that may help you to investigate a possible failure on
> glibc, please, let me know how I could help.

We need a somewhat reliable reproducer outside Koji, to take the KVM-on-aarch64 configuration out of the picture.
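
For instance (a hypothetical sketch; the mock config name and SRPM version
are assumptions), rebuilding one of the affected packages with mock both on
bare-metal armv7 hardware and inside an armv7 KVM guest, then comparing:

 $ mock -r fedora-28-armhfp --rebuild glibc-2.27-8.fc28.src.rpm

If it fails only in the guest, that points back at the hypervisor.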

Comment 9 Daniel Berrange 2018-05-15 19:58:37 UTC
(In reply to Florian Weimer from comment #6)
> (In reply to Kevin Fenzi from comment #5)
> > More info: 
> > 
> > f27-builder (where kernel builds complete fine): 
> > 4.16.7-200.fc27.armv7hl+lpae
> > 
> > f28-builder (where kernel builds bus error): 
> > 4.16.7-300.fc28.armv7hl+lpae
> 
> Hmm.  Are you suggesting that this is not solely a hypervisor issue?

There is very likely a hypervisor issue here, given the KVM error message shown in the QEMU logs: "kvm run failed Function not implemented". Based on this error message, though, it is probably an issue that has always existed but was never previously tickled by the guest OS. Thus some change in the F28 toolchain, kernel, or userspace has triggered new code paths that expose an existing KVM limitation. If we can at least understand what changed in the guest OS to trigger it, it might help identify either a way to avoid the bug or a way to fix the hypervisor.

NB, an armv7l guest on an aarch64 host is something that gets very little attention from KVM maintainers, either upstream or downstream - Fedora is probably the most significant user of it that we know of, in fact.

> > So, I don't think this is caused by the kernel. Could it be a glibc in f28
> > issue?
> 
> If it's a glibc issue, I would expect it to be more deterministic.
> 
> It could be that the kernel compiled by GCC 8 has issues, though.  That
> would only appear on Fedora 28, not Fedora 27, which still has GCC 7.

Changed/improved GCC code generation sounds plausible as something that could result in this kind of issue.

Comment 10 Kevin Fenzi 2018-05-16 04:34:33 UTC
I installed and booted 4.16.7-200.fc27.armv7hl+lpae on a fedora 28 instance and a kernel build completes fine. That points more at gcc8/toolchain building the kernel.

Comment 11 Kevin Fenzi 2018-05-16 19:17:09 UTC
We had one armv7 builder vm 'pause' last night. It looks like it was building glibc: https://koji.fedoraproject.org/koji/taskinfo?taskID=26975085

Host had: 
[Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented
in /var/log/libvirt/qemu/*.log: 
error: kvm run failed Function not implemented
R00=b6fa1370 R01=b6da11b8 R02=b6da1000 R03=b6ee7f38
R04=b6fa1370 R05=b6fa4000 R06=6ffffeff R07=b6da11b8
R08=beec673c R09=b6eeb560 R10=b6fa14c8 R11=beec66bc
R12=6fffffff R13=beec65a8 R14=b6f7963c R15=b6f7e5e8
PSR=a0030010 N-C- A usr32

So, I think glibc builds (and possibly others) are causing the 'pause' and "Function not implemented", whereas kernel builds just cause a bus error.

Comment 12 Daniel Berrange 2018-05-16 19:53:10 UTC
This msg is possibly useful:

> [Tue May 15 12:09:55 2018] kvm [10652]: load/store instruction decoding not implemented

as it points directly to the kernel code where the fault is triggered:

  virt/kvm/arm/mmio.c:    if (kvm_vcpu_dabt_isvalid(vcpu)) {
  virt/kvm/arm/mmio.c-            ret = decode_hsr(vcpu, &is_write, &len);
  virt/kvm/arm/mmio.c-            if (ret)
  virt/kvm/arm/mmio.c-                    return ret;
  virt/kvm/arm/mmio.c-    } else {
  virt/kvm/arm/mmio.c-            kvm_err("load/store instruction decoding not implemented\n");
  virt/kvm/arm/mmio.c-            return -ENOSYS;
  virt/kvm/arm/mmio.c-    }

and on an aarch64 host, the method being called is

  arch/arm64/include/asm/kvm_emulate.h:static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
  arch/arm64/include/asm/kvm_emulate.h-{
  arch/arm64/include/asm/kvm_emulate.h-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
  arch/arm64/include/asm/kvm_emulate.h-}

So whatever the armv7 guest is doing is causing that check to be false. It's beyond my knowledge to explain what that check means, though...

Comment 13 Kevin Fenzi 2018-05-16 23:53:25 UTC
I have moved all but 2 of the fedora armv7 builders back to f27 so we don't interfere with ongoing builds. I have the 2 builders installed (but disabled for new builds) with f28 and ready for any additional testing anyone would like me to do.

Comment 14 Laszlo Ersek 2018-05-17 13:59:12 UTC
I seem to remember that there are ARM instructions that modify multiple
operands in one go, and when those trap (e.g. because they access an MMIO
location), they cannot be virtualized easily (because the information to
emulate them isn't readily available from the trap symptoms). This has
happened in edk2 before, and the solution was to tweak the guest code so
that the compiler wouldn't generate such "practically unvirtualizable"
instructions:

https://github.com/tianocore/edk2/commit/2efbf710e27a
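
As a hypothetical illustration (the address and structure below are made
up), code like the following can tempt the compiler into a multi-register
LDM/STM access to MMIO, while the explicit word-sized variant keeps each
access emulatable:

  #include <stdint.h>

  struct dev_regs { uint32_t a, b, c, d; };

  /* Hypothetical MMIO window; the address is made up. */
  static volatile struct dev_regs *const mmio =
          (volatile struct dev_regs *)0x09000000;

  void write_regs(const struct dev_regs *src)
  {
          /* A whole-struct copy lets the compiler pick the access
           * pattern; on 32-bit ARM it may emit STM (store-multiple),
           * which traps with no valid syndrome (ISV=0) and cannot be
           * emulated by KVM. */
          *mmio = *src;
  }

  void write_regs_split(const struct dev_regs *src)
  {
          /* Explicit word-sized stores keep each access a single-
           * register instruction that KVM can decode. */
          mmio->a = src->a;
          mmio->b = src->b;
          mmio->c = src->c;
          mmio->d = src->d;
  }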

It seems that, for ARMv7, the R15 register is the "program counter" (PC).
When the guest is stopped due to the emulation failure, would it be possible
to get a disassembly from the neighborhood of R15, over QMP or HMP? That
might help with (a) confirming the type of the offending instruction, (b)
identifying the guest source code for which gcc-8 generates the offending
assembly.
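
For example (untested; assuming the domain is still paused under libvirt),
using the R15 value from the register dump in comment 0:

  # virsh qemu-monitor-command buildvm-armv7-14.arm.fedoraproject.org \
      --hmp 'x /8i 0xb6fc2758'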

In fact, if the guest doesn't crash totally, just a fault is injected (into
the guest kernel, or better yet, the guest userspace process), then the
instruction and its location could be easily disassembled with gdb -- I
think that's what Kevin nearly did in comment 4, from the core dump?

Comment 15 Andrew Jones 2018-05-22 06:57:26 UTC
(In reply to Kevin Fenzi from comment #13)
> I have moved all but 2 of the fedora armv7 builders back to f27 so we don't
> interfere with ongoing builds. I have the 2 builders installed (but disabled
> for new builds) with f28 and ready for any additional testing anyone would
> like me to do.

Yes, please. I've been attempting to reproduce, but without luck. If one of the f28 builders could reproduce, and then be left in the crashed state, we could attempt to extract a dump. I'll keep trying to reproduce as well though.

Comment 16 Kevin Fenzi 2018-05-25 19:50:40 UTC
ok. I have a builder in the 'pause' / crashed state. ;) 

Can you let me know exactly what you would like me to do? 
Or if you prefer we could try and get you access to the vm, but that might take a bit of back and forth to set things up. 

 24    buildvm-armv7-24.arm.fedoraproject.org paused

[Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not implemented

error: kvm run failed Function not implemented
R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
PSR=a0000010 N-C- A usr32

Comment 17 Andrew Jones 2018-05-28 13:47:01 UTC
Hi Kevin,

I'd like to get the core and the kernel version from you. If you used libvirt to launch the guest, then we should be able to get the core easily with something like the following, run on the host:

 $ virsh dump --memory-only --format elf $GUEST_NAME $GUEST_NAME.core

If the guest was launched directly with QEMU, then you'll need to connect to the QEMU monitor and run
 
 (qemu) dump-guest-memory $GUEST_NAME.core

If you can get the core and put it somewhere for me to download, then, along with the guest kernel version that was running at the time, I'll attempt to analyze it. Note, I'm purposely not using any compression options in the above commands. Using any format other than elf doesn't seem to work for AArch32 guests. You may compress the core after extracting it from the guest in order to prepare it for download, though.

Thanks,
drew

Comment 18 Kevin Fenzi 2018-05-28 18:12:18 UTC
Got the dump, sending you email on where to get it (as it may have sensitive info in it).

Comment 19 Laszlo Ersek 2018-06-04 15:04:40 UTC
Apparently the same gcc-8 issue has caught up to 32-bit ARM firmware that runs on virtual machines. The proposed solution is to pass "-fno-lto" to gcc, in order to disable link-time optimization:

  [edk2] [PATCH] ArmVirtPkg/ArmVirtQemu ARM: work around KVM limitations in
                 LTO build
  https://lists.01.org/pipermail/edk2-devel/2018-June/025476.html

Ard's description of the issue, in the patch linked above, seems to confirm my suspicion in comment 14.

Comment 20 Daniel Berrange 2018-06-04 16:34:26 UTC
I wonder if it is worth talking to the GCC maintainers to see if there's a viable way for them to avoid emitting these troublesome instructions even when they do LTO; otherwise it feels like we'll be playing whack-a-mole when they decide some other thing can be optimized in the same way.

Comment 21 Laszlo Ersek 2018-06-04 18:11:54 UTC
Sent the following message to the upstream GCC mailing list:

"code-gen options for disabling multi-operand AArch64 and ARM instructions"
https://gcc.gnu.org/ml/gcc/2018-06/msg00036.html

Comment 22 ard.biesheuvel 2018-06-05 08:42:35 UTC
(In reply to Kevin Fenzi from comment #16)
> ok. I have a builder in the 'pause' / crashed state. ;) 
> 
> Can you let me know exactly what you would like me to do? 
> Or if you prefer we could try and get you access to the vm, but that might
> take a bit of back and forth to set things up. 
> 
>  24    buildvm-armv7-24.arm.fedoraproject.org paused
> 
> [Fri May 25 19:38:07 2018] kvm [20652]: load/store instruction decoding not
> implemented
> 
> error: kvm run failed Function not implemented
> R00=b6fbd3b8 R01=b6d0c138 R02=b6fc0000 R03=b6e4f4a8
> R04=b6fbd3b8 R05=6ffffdff R06=6ffffeff R07=b6d0c138
> R08=bef7e3bc R09=b6e57b88 R10=b6fbd514 R11=bef7e33c
> R12=6fffffff R13=bef7e230 R14=b6f95698 R15=b6f9a620
> PSR=a0000010 N-C- A usr32

This example is rather puzzling, and I wonder whether it is a side effect of some other issue rather than being caused by MMIO being performed using instructions that KVM cannot emulate.

Note that the exception was taken in USR32 mode. This is weird, considering that you wouldn't expect userland to perform MMIO directly, unless it remapped some MMIO region using /dev/mem explicitly.

So given that this does not appear to be kernel code making the access, and considering that rebuilding the (32-bit ARM) world with a new compiler flag is intractable, I would prefer to spend more time gaining a better understanding of the root cause.

Comment 23 Laszlo Ersek 2018-06-05 09:00:29 UTC
Fair enough (and thank you for the analysis!); I think we all missed USR32 until you highlighted it. Drew has a vmcore from Kevin; I hope the vmcore will provide more evidence. Thank you!

Comment 24 ard.biesheuvel 2018-06-05 09:26:26 UTC
Something like

--- a/virt/kvm/arm/mmio.c
+++ b/virt/kvm/arm/mmio.c
@@ -172,7 +172,8 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
                if (ret)
                        return ret;
        } else {
-               kvm_err("load/store instruction decoding not implemented\n");
+               kvm_err("load/store instruction decoding not implemented (IPA:%pa ESR:0x%08x)\n",
+                       &fault_ipa, kvm_vcpu_get_hsr(vcpu));
                return -ENOSYS;
        }
 

would already be rather helpful in diagnosing the situation. Note that a stage 2 data abort (which is the exception that triggers this error) could be raised for many different conditions, including alignment faults and TLB conflicts in the stage 2 translation tables.

It should also be noted that the ESR_EL2.ISV bit, which is read by kvm_vcpu_dabt_isvalid(), is documented by the ARM ARM as

"""
This bit is 0 for all faults reported in ESR_EL2 except the following stage 2 aborts:
• AArch64 loads and stores of a single general-purpose register (including the register
specified with 0b11111 ), including those with Acquire/Release semantics, but excluding Load
Exclusive or Store Exclusive and excluding those with writeback.
• AArch32 instructions where the instruction:
— Is an LDR, LDA, LDRT, LDRSH, LDRSHT, LDRH, LDAH, LDRHT, LDRSB,
LDRSBT, LDRB, LDAB, LDRBT, STR, STL, STRT, STRH, STLH, STRHT, STRB,
STLB, or STRBT instruction.
— Is not performing register writeback.
— Is not using R15 as a source or destination register.
For these cases, ISV is UNKNOWN if the exception was generated in Debug state in memory access
mode, and otherwise indicates whether ISS[23:14] hold a valid syndrome.
"""

In other words, the architecture does not require the bit to be set for instructions that *could* be emulated, but rather limits the set of instructions for which you can expect it to ever assume the value '1' in the first place. IOW, it is perfectly acceptable for a core not to set the ISV bit for conditions like TLB conflict aborts etc. at stage 2.

This seems especially relevant due to the error message reported in #4

[Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6df4d94

where the lockdown abort is triggered by an instruction fetch. I cannot tell you what the condition is that triggers it (it is implementation defined), but it is likely that if the same condition occurred during, e.g., a stage 2 page table walk, we would take a stage 2 data abort exception without a valid syndrome, regardless of whether the instruction was in the 'virtualizable' class or not.
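
As a rough aid to reading the ISV definition quoted above, here is a
simplified userspace sketch (not the kernel's actual decode_hsr()) of what
the syndrome provides when ISV is set; the bit positions are from the ARM
ARM:

  #include <errno.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* When ESR_ELx_ISV (bit 24) is clear, none of the fields below are
   * valid, so KVM would have to fetch and decode the trapped instruction
   * itself -- which it does not implement, hence -ENOSYS ("Function not
   * implemented"). */
  int decode_hsr_sketch(uint32_t hsr, bool *is_write,
                        unsigned int *len, unsigned int *reg)
  {
          if (!(hsr & (1u << 24)))          /* ESR_ELx_ISV */
                  return -ENOSYS;

          *is_write = !!(hsr & (1u << 6));  /* WnR: write, not read */
          *len = 1u << ((hsr >> 22) & 3);   /* SAS: access size in bytes */
          *reg = (hsr >> 16) & 0x1f;        /* SRT: register to transfer */
          return 0;
  }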

Comment 25 ard.biesheuvel 2018-06-07 11:32:59 UTC
(In reply to ard.biesheuvel from comment #24)
...
> 
> This seems especially relevant due to the error message reported in #4
> 
> [Mon May 14 21:07:19 2018] Unhandled prefetch abort: implementation fault
> (lockdown abort) (0x234) at 0xb6df4d94
> 
> where the lockdown abort is triggered by an instruction fetch. I cannot tell
> you what the condition is that triggers it (it is implementation defined),
> but it is likely that if the same condition occurred during, e.g., a stage 2
> page table walk, we would take a stage 2 data abort exception without a
> valid syndrome, regardless of whether the instruction was in the
> 'virtualizable' class or not.

As pointed out by Christoffer Dall in an unrelated email thread, this code path should never be taken for IMPDEF lockdown abort exceptions, so we really need to instrument the code to get a better handle on this.

Comment 26 Andrew Jones 2018-07-03 13:52:49 UTC
Here's a small update:

We haven't put a lot of effort into this debugging yet, as it's been quite difficult to reproduce the issue. I ran an environment identical to what was described in comment 0 over a weekend, but nothing happened. We did get a guest core once from one of the Fedora build machines, which I took a look at. There wasn't really anything interesting in there, but I did note that only one vcpu was running a task at the time, and it was running the assembler 'as'. So, as we expected from the kernel splat, it was running in user space at the time. We've also provided the reporter with a modified host kernel containing Ard's suggested change from comment 24, but the issue hasn't reproduced (or at least not with the same symptom) since.

Since attempting to reproduce with the modified host kernel, the reporter has seen a guest kernel log

[Thu Jun 28 01:06:51 2018] Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6db4bdc

but there was no crash and no host kernel logs.

I'll keep trying to find a reproducer.

Comment 27 John Feeney 2018-10-22 18:55:34 UTC
Adding bugzillas that record bugs/issues found on certified AArch64 systems and/or significant bzs (panics, data corruptors, etc.) found generically on AArch64 systems. The intent is to focus on bzs on this tracker going forward in RHEL-alt.

Comment 29 Kevin Fenzi 2018-11-25 05:42:57 UTC
So, not sure if this still happens on f28, but it does not seem to happen on f29. 

I did a scratch glibc build in our staging env that has f29 builders. 
The armv7 build failed: 

https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90003231

but the guest and host were fine...

Comment 30 Andrew Jones 2018-12-10 17:53:16 UTC
Based on comment 29, I'll close this. It, or a new BZ, can always be [re]opened if the issue returns. Thanks

Comment 31 Kevin Fenzi 2018-12-22 19:37:53 UTC
I spoke too soon. This is indeed still happening with RHEL 7.6 Alt and Fedora 29 guest VMs.

It's now happening with kernel, glibc, and qt5.

We really need to get to the bottom of this. :( 

The kernel and glibc ones don't cause the guest to pause, but they do fail with a bus error, "Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0x00acfa00" in dmesg, and a core dump.
The qt5 one does cause the guest to pause, with "[Sat Dec 22 19:16:24 2018] kvm [28942]: load/store instruction decoding not implemented" in dmesg.

Happy to gather more info.

Comment 32 ard.biesheuvel 2018-12-23 15:26:55 UTC
It still looks to me like some kind of incorrectly classified RAS error is delivered to the guest. Whether it results in the prefetch abort or the data abort (which triggers the KVM error) depends on whether the original exception occurs on a data access or on an instruction fetch.

Is this reproducible across multiple M400s? Does it occur anywhere else?

Comment 33 Paul Whalen 2019-01-29 20:14:45 UTC
Tried this on a Mustang with RHEL 8 beta and an F29 VM. A kernel build failed with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort) (0x234) at 0xb6de17c8" as reported in c#31. When using the non-lpae kernel on the builder, the build finished successfully.

Comment 34 Kevin Fenzi 2019-02-16 03:45:00 UTC
So, another datapoint (that hopefully Paul can duplicate):

A Fedora 29 VM with NO updates (i.e., GA F29; do not use or apply the updates repo) does seem to work OK. At least I got a glibc build through with no problems on it.
Perhaps this was why I didn't see it with F29 initially.

I then updated only the Fedora kernel in the VM, and the issue returned.

So, it sounds like the kernel of the Fedora guest has something to do with this?
4.18.16-300.fc29.armv7hl+lpae -> no issue
4.20.7-200.fc29 -> this issue

Paul, can you confirm? Does this help at all in isolating this?

Comment 35 Paul Whalen 2019-03-18 14:21:39 UTC
Confirmed: using F29 GA with no updates, with mock and deps from the GA repo, I was able to do a number of kernel builds without an error.

Trying with kernel-5.0.1-300.fc29 now.

Comment 36 Peter Robinson 2019-03-18 14:27:37 UTC
(In reply to Paul Whalen from comment #35)
> Confirmed: using F29 GA with no updates, with mock and deps from the GA
> repo, I was able to do a number of kernel builds without an error.
> 
> Trying with kernel-5.0.1-300.fc29 now.

You may want to try 5.1rc1 too, as reading through the change logs there were quite a few changes/fixes for arm32 virt and related memory/barrier handling and all sorts of related bits.

Comment 37 Paul Whalen 2019-03-19 15:42:45 UTC
5.0.1-300.fc29.armv7hl+lpae failed with a bus error and "Unhandled prefetch abort: implementation fault (lockdown abort)".

Trying with kernel-5.1.0-0.rc1.git0.1.fc31 now.

Comment 38 Paul Whalen 2019-03-19 17:35:05 UTC
5.1.0-0.rc1.git0.1.fc31.armv7hl+lpae also fails in the same way with a kernel build.

