Bug 1820402
| Summary: | Sometimes hit "error: kvm run failed Bad address" when launching a guest on Power8 | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Yihuang Yu <yihyu> | |
| Component: | kernel | Assignee: | David Gibson <dgibson> | |
| kernel sub component: | KVM | QA Contact: | Zhenyu Zhang <zhenyzha> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | dgibson, juzhang, lagarcia, lvivier, mdeng, mtessun, ngu, qzhang, virt-maint, xianwang, xuma, zhenyzha | |
| Version: | 8.2 | Keywords: | Regression, Triaged, ZStream | |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ | |
| Target Release: | 8.3 | | | |
| Hardware: | ppc64le | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | kernel-4.18.0-200.el8 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1890883 (view as bug list) | Environment: | ||
| Last Closed: | 2020-11-04 01:12:11 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1826160, 1890883 | |||
It seems there isn't "max-cpu-compat=power8" in the command line.

(In reply to Min Deng from comment #1)
> It seems there isn't "max-cpu-compat=power8" in cmdline.

I mean this problem only occurs on a Power8 host.

I've reproduced this myself, with a script that starts a guest repeatedly. Simply because there are very few plausibly relevant patches between -187 and -193, I suspect the problem was introduced by b940de4c6e68468c2817e072fd94a56d5399924a, which is the downstream backport of cd758a9b57ee85f0733c759e60f42b969c81f27b "[kvm] KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page fault handler". I'm working on confirming this now.

Confirmed that I can reproduce the problem with downstream commit b940de4c6e68468c2817e072fd94a56d5399924a. Now trying the parent commit (942a59196ff8cbdaac59f6d9aa02399d2bf4d3a2) to see if I can reproduce there.

Wasn't able to reproduce with 942a59196ff8cbdaac59f6d9aa02399d2bf4d3a2 in a little over 100 attempts, so that seems to confirm that b940de4c6e68468c2817e072fd94a56d5399924a introduced the problem downstream. Next, trying upstream.

Reproduced upstream with cd758a9b57ee85f0733c759e60f42b969c81f27b, though it took quite a while (38th attempt). Also reproduced with upstream master (ae46d2aa6a7fbe8ca0946f24b061b6ccdc6c3f25). Failed to reproduce with upstream cd758a9b57ee85f0733c759e60f42b969c81f27b^ = 1c482452d5db0f52e4e8eed95bd7314eec537d78 in over 200 attempts, so I think we can be pretty confident the problem was introduced by cd758a9b57ee85f0733c759e60f42b969c81f27b and is not yet fixed upstream.

Did a bunch of debugging on the upstream kernel. The fault is occurring in the KVM page fault handler kvmppc_book3s_hv_page_fault(), where we exit because we have a Linux PTE which is cache-inhibited, but a hash PTE which is not.
Specifically:
Linux PTE: 0x400000079788003d
Hash PTE (R): 0x00000000001e1190
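For reference, the two values above can be decoded with the Book3S-64 cacheability bit masks. A minimal sketch follows; the mask constants are transcribed by hand from the upstream arch/powerpc headers and should be treated as an assumption to verify against the exact kernel tree in use:

```python
# Decode the two PTE values from the fault report.
# ASSUMPTION: these mask constants are hand-copied from the upstream
# arch/powerpc Book3S-64 headers; double-check against the kernel in use.

_PAGE_CACHE_CTL      = 0x30  # cache-control field in a Linux PTE
_PAGE_NON_IDEMPOTENT = 0x20  # cache-inhibited, non-idempotent mapping
_PAGE_TOLERANT       = 0x30  # cache-inhibited, tolerant mapping

HPTE_R_I = 0x20              # "I" (cache-inhibited) bit in the hash PTE R word

def linux_pte_cache_inhibited(pte: int) -> bool:
    # The Linux PTE counts as cache-inhibited when its cache-control
    # field is NON_IDEMPOTENT or TOLERANT.
    return (pte & _PAGE_CACHE_CTL) in (_PAGE_NON_IDEMPOTENT, _PAGE_TOLERANT)

def hash_pte_cache_inhibited(hpte_r: int) -> bool:
    return bool(hpte_r & HPTE_R_I)

linux_pte = 0x400000079788003d   # Linux PTE from the report
hpte_r    = 0x00000000001e1190   # hash PTE (R) from the report

print(linux_pte_cache_inhibited(linux_pte))  # True  -> cache-inhibited
print(hash_pte_cache_inhibited(hpte_r))      # False -> not cache-inhibited
```

Run against the reported values, the decode reproduces the mismatch described above: the Linux PTE is cache-inhibited while the hash PTE is not.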
Since this bug is upstream, it will be useful to reference publicly. Making public.

Tried a change suggested by Paul Mackerras, and it seems to be working over >140 attempts so far. I've posted the change upstream.

Now merged upstream as ae49dedaa92b55258544aace7c585094b862ef79 "KVM: PPC: Book3S HV: Handle non-present PTEs in page fault functions".

Brewing test kernel at: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28222807

Patch(es) available in kernel-4.18.0-200.el8.

Test Environment:
Host Distro: RHEL-8.3.0-20200520.n.4 BaseOS ppc64le
Host Kernel: 4.18.0-201.el8.ppc64le
Qemu-kvm: qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420
SLOF: SLOF-20200327-1.git8e012d6f.module+el8.3.0+6612+6b86f0c9
Guest OS: RHEL.8.3.0, RHEL.7.9
log: http://10.0.136.47/4270815/results.html

The fast-virt-6620 acceptance test did not hit this issue with kernel-4.18.0-201.el8.ppc64le on Power8.

Hi David,
Hit this issue on RHEL-8.2.1-z. Since this bug was fixed in RHEL-8.3.0, do we plan to fix it in RHEL-8.2.1-z?

Test Environment:
Host Distro: RHEL-8.2.0 BaseOS ppc64le
Host Kernel: 4.18.0-193.el8.ppc64le
Qemu-kvm: qemu-kvm-4.2.0-29.module+el8.2.1+7712+3c3fe332.2
SLOF: SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd
Guest OS: RHEL.8.2.1, RHEL.7.9
log: http://10.0.136.47/4506164/results.html

Martin, I always forget the support durations on the various y-streams. Do you think we need to fix this for RHEL 8.2?

After discussion on Monday's call, I think this should be backported to 8.2.1.z.

(In reply to David Gibson from comment #28)
> After discussion on Monday's call, I think this should be backported to
> 8.2.1.z

Hi David,
Thanks for the feedback. Since the current bug status is VERIFIED, should we change the bug status to track the 8.2.1.z modification?

Hit this issue on RHEL-8.2.1-z, qemu-kvm-4.2.0-29.module+el8.2.1+7983+71fbcacb.3.

Not yet. We need to wait for the zstream approval, then the bug will be cloned to track the zstream variant.
(In reply to David Gibson from comment #30)
> Not yet. We need to wait for the zstream approval, then the bug will be
> cloned to track the zstream variant.

OK, thanks for the quick reply. Please let me know when you need to clone this bug.

(In reply to David Gibson from comment #30)
> Not yet. We need to wait for the zstream approval, then the bug will be
> cloned to track the zstream variant.

Hi Martin,
Hit this issue again on RHEL AV 8.2.1 - Batch 2 - Compose. So do we still need to wait for zstream approval?

Test Environment:
Host Distro: RHEL-8.2.0 BaseOS ppc64le
Host Kernel: kernel-4.18.0-193.el8.ppc64le
virt repo url: http://download.eng.bos.redhat.com/rhel-8/rel-eng/updates/ADVANCED-VIRT-8/ADVANCED-VIRT-8.2.1-RHEL-8-20201012.0/compose/Advanced-virt/ppc64le/os
qemu-kvm: qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.ppc64le
SLOF: SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd.noarch

07:52:56 DEBUG| Send command: {'execute': 'cont', 'id': 'uhMrPsH4'}
07:52:56 INFO | Boot a guest with stg0("aio=threads,cache=none").
07:52:56 DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 360s)
07:53:19 INFO | [qemu output] error: kvm run failed Bad address
07:53:19 INFO | [qemu output] error: kvm run failed Bad address
07:53:19 INFO | [qemu output] NIP c0000000001c26c8 LR c0000000001a6044 CTR c0000000001bbc80 XER 0000000020000000 CPU#1
07:53:19 INFO | [qemu output] MSR 8000000040009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3
07:53:19 INFO | [qemu output] TB 00000000 00000000 DECR 0
07:53:19 INFO | [qemu output] GPR00 c0000000001a5fe0 c000000007c27b70 c000000001920b00 c00000073dd98c00
07:53:19 INFO | [qemu output] GPR04 c00000073dd98c80 00000000000003fe 0000000000000000 c00000073859d600
07:53:19 INFO | [qemu output] GPR08 0000000000000000 c00000073dd98c00 c000000007bf3180 0000000000000001
07:53:19 INFO | [qemu output] GPR12 0000000000004400 c000000003ecee00 0000000000000000 00007fffab340000
07:53:19 INFO | [qemu output] GPR16 0000000126f2fd38 0000000126f2fcd8 0000000126f2fd08 c000000000245930
07:53:19 INFO | [qemu output] GPR20 00000000d62925ce 0000000000000001 0000000000000000 0000000000000000
07:53:19 INFO | [qemu output] GPR24 0000000000000000 c000000007bf3180 000000073cad0000 c00000000194dd70
07:53:19 INFO | [qemu output] GPR28 c0000000012c8c00 0000000000000008 c00000073dd98c00 c0000000012c8c00
07:53:19 INFO | [qemu output] CR 88004488 [ L L - - G G L L ] RES ffffffffffffffff
07:53:19 INFO | [qemu output] SRR0 0000000126f23088 SRR1 800000000000f033 PVR 00000000004b0201 VRSAVE 00000000ffffffff
07:53:19 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c000000003ecee00 SPRG2 00007fffa94067c0 SPRG3 0000000000000001
07:53:19 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000
07:53:19 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000
07:53:19 INFO | [qemu output] CFAR 0000000000000000
07:53:19 INFO | [qemu output] LPCR 000000000384f001
07:53:19 INFO | [qemu output] SDR1 0000000000000000 DAR 00000100374f2990 DSISR 0000000000000000
07:53:19 INFO | [qemu output] NIP c0000000001c26c8 LR c0000000001a6044 CTR c0000000001bbc80 XER 0000000020000000 CPU#3
07:53:19 INFO | [qemu output] MSR 8000000040009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3
07:53:19 INFO | [qemu output] TB 00000000 00000000 DECR 0
07:53:19 INFO | [qemu output] GPR00 c0000000001a5fe0 c000000007c1fb70 c000000001920b00 c00000073df98c00
07:53:19 INFO | [qemu output] GPR04 c00000073df98c80 00000000000003ff 0000000000000000 c000000738591a00
07:53:19 INFO | [qemu output] GPR08 0000000000000000 c00000073df98c00 c000000007bebc00 0000000000000001
07:53:19 INFO | [qemu output] GPR12 0000000000004400 c000000003ecae00 0000000000000000 00007fffab340000
07:53:19 INFO | [qemu output] GPR16 0000000126f2fd38 0000000126f2fcd8 0000000126f2fd08 c000000000245930
07:53:19 INFO | [qemu output] GPR20 00000000d6292113 0000000000000001 0000000000000000 0000000000000000
07:53:19 INFO | [qemu output] GPR24 0000000000000000 c000000007bebc00 000000073ccd0000 c00000000194dd70
07:53:19 INFO | [qemu output] GPR28 c0000000012c8c00 0000000000000018 c00000073df98c00 c0000000012c8c00
07:53:19 INFO | [qemu output] CR 88004488 [ L L - - G G L L ] RES ffffffffffffffff
07:53:19 INFO | [qemu output] SRR0 0000000126f23088 SRR1 800000000000d033 PVR 00000000004b0201 VRSAVE 00000000ffffffff
07:53:19 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c000000003ecae00 SPRG2 00007fffa40067c0 SPRG3 0000000000000003
07:53:19 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000
07:53:19 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000
07:53:19 INFO | [qemu output] CFAR 0000000000000000
07:53:19 INFO | [qemu output] LPCR 000000000384f001
07:53:19 INFO | [qemu output] SDR1 0000000000000000 DAR d001f7ffffd027b8 DSISR 0000000000000000

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4431
Description of problem:
In recent rounds of acceptance testing, QEMU sometimes reports "error: kvm run failed Bad address" and then the guest stops. If I downgrade the host kernel to 4.18.0-187.el8.ppc64le, the tests pass (200/200).

Version-Release number of selected component (if applicable):
Host kernel version: 4.18.0-193.el8.ppc64le
qemu version: qemu-kvm-4.2.0-17.module+el8.2.0+6131+4e715f3b.ppc64le
guest kernel: 4.18.0-193.el8.ppc64le

How reproducible:
3/100

Steps to Reproduce:
1. Launch a guest

/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-S \
-sandbox on \
-machine pseries \
-nodefaults \
-device VGA,bus=pci.0,addr=0x2 \
-m 27648 \
-smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
-cpu 'host' \
-chardev socket,path=/var/tmp/avocado_05oqo3uk/serial-serial0-20200331-214431-IiTIsZHE,server,id=chardev_serial0,nowait \
-device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 \
-device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
-blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kar/vt_test_images/rhel820-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-device virtio-net-pci,mac=9a:5f:3b:b1:c3:38,id=idVWGTmh,netdev=idGI7vS7,bus=pci.0,addr=0x5 \
-netdev tap,id=idGI7vS7,vhost=on,vhostfd=21,fd=17 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-vnc :0 \
-rtc base=utc,clock=host \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

2.
Resume the guest
{"execute": "cont"}

Actual results:
14:08:10 INFO | [qemu output] error: kvm run failed Bad address
14:08:10 INFO | [qemu output] NIP c0000000001c2198 LR c0000000001a5f54 CTR c0000000001ac5f0 XER 0000000000000000 CPU#6
14:08:10 INFO | [qemu output] MSR 8000000040009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3
14:08:10 INFO | [qemu output] TB 00000000 00000000 DECR 00000000
14:08:10 INFO | [qemu output] GPR00 c0000000001a5e70 c000003e61937740 c000000001920a00 c000003e70298c00
14:08:10 INFO | [qemu output] GPR04 c000003e618eeb00 0000000000000000 0000000000000000 c000003e618eeb00
14:08:10 INFO | [qemu output] GPR08 0000000000000001 c000003e70298c00 0000000000000000 0000000000000001
14:08:10 INFO | [qemu output] GPR12 c0000000001ac5f0 c000000007ff7800 c000003e61937f90 0000000000000000
14:08:10 INFO | [qemu output] GPR16 c0000000019520d8 0000000000000000 0000000000000800 c000000000245250
14:08:10 INFO | [qemu output] GPR20 0000004e5f4a97ea 0000000000000001 0000000000000002 0000000000000000
14:08:10 INFO | [qemu output] GPR24 0000000000000000 c000003e618eeb00 0000003e6efd0000 c00000000194dd70
14:08:10 INFO | [qemu output] GPR28 c0000000012c8c00 0000000000000030 c000003e70298c00 c0000000012c8c00
14:08:10 INFO | [qemu output] CR 88000222 [ L L - - - E E E ] RES ffffffffffffffff
14:08:10 INFO | [qemu output] FPR00 00000000b3004100 ffffffffffffffff ffffffffffffffff ffffffffffffffff
14:08:10 INFO | [qemu output] FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
14:08:10 INFO | [qemu output] FPSCR 00000000b3004100
14:08:10 INFO | [qemu output] SRR0 c0000000000090b0 SRR1 9000000000001033 PVR 00000000004c0100 VRSAVE 0000000000000000
14:08:10 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c000000007ff7800 SPRG2 c000000007ff7800 SPRG3 0000000000000006
14:08:10 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000
14:08:10 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000
14:08:10 INFO | [qemu output] CFAR 0000000000000000
14:08:10 INFO | [qemu output] LPCR 000000000384f001
14:08:10 INFO | [qemu output] SDR1 0000000000000000 DAR 00007ffeeffb0300 DSISR 0000000042000000

qmp logs:
2020-04-02 14:02:15: {"execute": "query-status", "id": "uSSk9GOJ"}
2020-04-02 14:02:15: {"return": {"status": "prelaunch", "singlestep": false, "running": false}, "id": "uSSk9GOJ"}
2020-04-02 14:02:15: {"execute": "cont", "id": "TdnqPfLP"}
2020-04-02 14:02:15: {"timestamp": {"seconds": 1585850535, "microseconds": 592268}, "event": "RESUME"}
2020-04-02 14:02:15: {"return": {}, "id": "TdnqPfLP"}
2020-04-02 14:02:15: {"execute": "query-status", "id": "7mab2AFa"}
2020-04-02 14:02:15: {"return": {"status": "running", "singlestep": false, "running": true}, "id": "7mab2AFa"}
2020-04-02 14:02:15: {"execute": "query-status", "id": "8I8m5aio"}
2020-04-02 14:02:15: {"return": {"status": "running", "singlestep": false, "running": true}, "id": "8I8m5aio"}
2020-04-02 14:37:34: {"timestamp": {"seconds": 1585850577, "microseconds": 761903}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "idrsSzlS", "path": "/machine/peripheral/idrsSzlS/virtio-backend"}}
2020-04-02 14:37:34: {"timestamp": {"seconds": 1585850584, "microseconds": 586535}, "event": "RTC_CHANGE", "data": {"offset": 1}}
2020-04-02 14:37:34: {"timestamp": {"seconds": 1585850890, "microseconds": 68324}, "event": "STOP"}
2020-04-02 14:37:34: {"execute": "query-status", "id": "Aw0K1gee"}
2020-04-02 14:37:34: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "Aw0K1gee"}
2020-04-02 14:37:55: {"execute": "query-status", "id": "dlL6N9GO"}
2020-04-02 14:37:55: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "dlL6N9GO"}
2020-04-02 14:38:16: {"execute": "query-status", "id": "Bk9lLnEf"}
2020-04-02 14:38:16: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "Bk9lLnEf"}
2020-04-02 14:38:37: {"execute": "query-status", "id": "zfAZ1zeZ"}
2020-04-02 14:38:37: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "zfAZ1zeZ"}
2020-04-02 14:38:59: {"execute": "query-status", "id": "vSxsQ53z"}
2020-04-02 14:38:59: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "vSxsQ53z"}
2020-04-02 14:39:20: {"execute": "query-status", "id": "miCF0jm4"}
2020-04-02 14:39:20: {"return": {"status": "internal-error", "singlestep": false, "running": false}, "id": "miCF0jm4"}

Expected results:
The guest should be successfully launched and work fine.

Additional info:
This problem is only hit on Power8, and it's not easy to reproduce, so I cannot provide a clear reproducer yet. I will try to further test these kernel versions (187 - 193) by bisection.
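The QMP trace shows the guest dropping from "running" to "internal-error" after the STOP event, which is how the failed run can be detected when polling query-status in a repro loop. A minimal sketch of that check, fed with two replies copied from the log above (the guest_failed helper is hypothetical, not part of any named harness):

```python
import json

# Two query-status replies copied verbatim from the QMP log above.
RUNNING = json.loads(
    '{"return": {"status": "running", "singlestep": false,'
    ' "running": true}, "id": "7mab2AFa"}')
FAILED = json.loads(
    '{"return": {"status": "internal-error", "singlestep": false,'
    ' "running": false}, "id": "Aw0K1gee"}')

def guest_failed(reply: dict) -> bool:
    """True when a query-status reply shows the guest stopped in the
    "internal-error" state QEMU enters after "kvm run failed"."""
    ret = reply.get("return", {})
    return ret.get("status") == "internal-error" and not ret.get("running", True)

print(guest_failed(RUNNING))  # False
print(guest_failed(FAILED))   # True
```

A repro script can run this check after each "cont" and count attempts until the internal-error state appears, matching how the failure rates above (3/100, 38th attempt) were gathered.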