Bug 1820402
Summary: | Sometimes hit "error: kvm run failed Bad address" when launching a guest on Power8 | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Yihuang Yu <yihyu> | |
Component: | kernel | Assignee: | David Gibson <dgibson> | |
kernel sub component: | KVM | QA Contact: | Zhenyu Zhang <zhenyzha> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | dgibson, juzhang, lagarcia, lvivier, mdeng, mtessun, ngu, qzhang, virt-maint, xianwang, xuma, zhenyzha | |
Version: | 8.2 | Keywords: | Regression, Triaged, ZStream | |
Target Milestone: | rc | |||
Target Release: | 8.3 | |||
Hardware: | ppc64le | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-4.18.0-200.el8 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1890883 (view as bug list) | Environment: | ||
Last Closed: | 2020-11-04 01:12:11 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1826160, 1890883 |
Description
Yihuang Yu
2020-04-03 01:30:31 UTC
It seems there isn't "max-cpu-compat=power8" in cmdline. (In reply to Min Deng from comment #1) > It seems there isn't "max-cpu-compat=power8" in cmdline. I mean this problem only occurs on power8 host. I've reproduce this myself, with a script that starts a guest repeatedly. Simply because there are very few plausibly relevant patches between -187 and -193, I suspect the problem was introduced by b940de4c6e68468c2817e072fd94a56d5399924a which is the downstream backport of cd758a9b57ee85f0733c759e60f42b969c81f27b "[kvm] KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page fault handler". I'm working on confirming this now. Confirmed that I can reproduce the problem with downstream commit b940de4c6e68468c2817e072fd94a56d5399924a. Now trying the parent commit (942a59196ff8cbdaac59f6d9aa02399d2bf4d3a2), to see if I can reproduce there. Wasn't able to reproduce with 942a59196ff8cbdaac59f6d9aa02399d2bf4d3a2 in a little over 100 attempts, so that seems to confirm that b940de4c6e68468c2817e072fd94a56d5399924a introduced the problem downstream. Next trying upstream. Reproduced upstream with cd758a9b57ee85f0733c759e60f42b969c81f27b, though it took quite a while (38th attempt). Also reproduced with upstream master (ae46d2aa6a7fbe8ca0946f24b061b6ccdc6c3f25). Failed to reproduce with upstream cd758a9b57ee85f0733c759e60f42b969c81f27b^ = 1c482452d5db0f52e4e8eed95bd7314eec537d78 on over 200 attempts, so I think we can be pretty confident the problem was introduced by cd758a9b57ee85f0733c759e60f42b969c81f27b and is not yet fixed upstream. Did a bunch of debugging on the upstream kernel. The fault is occuring in the KVM page fault handler kvmppc_book3s_hv_page_fault() where we're exiting because we have a Linux PTE which is cache-inhibited, but we have a hash PTE which is not. Specifically: Linux PTE: 0x400000079788003d Hash PTE (R): 0x00000000001e1190 Since this bug is upstream, it will be useful to reference publically. Making public. Tried a change suggested by Paul Mackerras, and it seems to be working on >140 attempts so far. I've posted the change upstream. Now merged upstream as ae49dedaa92b55258544aace7c585094b862ef79 "KVM: PPC: Book3S HV: Handle non-present PTEs in page fault functions". Brewing test kernel at: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28222807 Patch(es) available on kernel-4.18.0-200.el8 Test Environment: Host Distro: RHEL-8.3.0-20200520.n.4 BaseOS ppc64le Host Kernel: 4.18.0-201.el8.ppc64le Qemu-kvm: qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420 SLOF: SLOF-20200327-1.git8e012d6f.module+el8.3.0+6612+6b86f0c9 Guest OS:RHEL.8.3.0, RHEL.7.9 log: http://10.0.136.47/4270815/results.html fast-virt-6620 acceptance test no hit this issue with kernel-4.18.0-201.el8.ppc64le on Power8 Hi David, Hit this issue on RHEL-8.2.1-z Since this bug was fixed on RHEL-8.3.0, do we plan to fix it on RHEL-8.2.1-z? Test Environment: Host Distro: RHEL-8.2.0 BaseOS ppc64le Host Kernel: 4.18.0-193.el8.ppc64le Qemu-kvm: qemu-kvm-4.2.0-29.module+el8.2.1+7712+3c3fe332.2 SLOF: SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd Guest OS:RHEL.8.2.1, RHEL.7.9 log: http://10.0.136.47/4506164/results.html Martin, I always forget the support durations on the various y-streams. Do you think we need to fix this for RHEL8.2? After discussion on Monday's call, I think this should be backported to 8.2.1.z (In reply to David Gibson from comment #28) > After discussion on Monday's call, I think this should be backported to > 8.2.1.z Hi David, Thanks for the feedback. Since the current bug status is verified, should we change the bug status to track the 8.2.1.z modification? Hit this issue on RHEL-8.2.1-z, qemu-kvm-4.2.0-29.module+el8.2.1+7983+71fbcacb.3. Not yet. We need to wait for the zstream approval, then the bug will be cloned to track the zstream variant. (In reply to David Gibson from comment #30) > Not yet. We need to wait for the zstream approval, then the bug will be > cloned to track the zstream variant. OK, Thanks for the quick reply. Please let me know when you need to clone this bug. (In reply to David Gibson from comment #30) > Not yet. We need to wait for the zstream approval, then the bug will be > cloned to track the zstream variant. Hi Martin, Hit this issue again on RHEL AV 8.2.1 - Batch 2 - Compose So do we still need to wait for zstream approval? Test Environment: Host Distro: RHEL-8.2.0 BaseOS ppc64le Host Kernel: kernel-4.18.0-193.el8.ppc64le virt repo url: http://download.eng.bos.redhat.com/rhel-8/rel-eng/updates/ADVANCED-VIRT-8/ADVANCED-VIRT-8.2.1-RHEL-8-20201012.0/compose/Advanced-virt/ppc64le/os qemu-kvm: qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.ppc64le SLOF: SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd.noarch 07:52:56 DEBUG| Send command: {'execute': 'cont', 'id': 'uhMrPsH4'} 07:52:56 INFO | Boot a guest with stg0("aio=threads,cache=none"). 07:52:56 DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 360s) 07:53:19 INFO | [qemu output] error: kvm run failed Bad address 07:53:19 INFO | [qemu output] error: kvm run failed Bad address 07:53:19 INFO | [qemu output] NIP c0000000001c26c8 LR c0000000001a6044 CTR c0000000001bbc80 XER 0000000020000000 CPU#1 07:53:19 INFO | [qemu output] MSR 8000000040009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3 07:53:19 INFO | [qemu output] TB 00000000 00000000 DECR 0 07:53:19 INFO | [qemu output] GPR00 c0000000001a5fe0 c000000007c27b70 c000000001920b00 c00000073dd98c00 07:53:19 INFO | [qemu output] GPR04 c00000073dd98c80 00000000000003fe 0000000000000000 c00000073859d600 07:53:19 INFO | [qemu output] GPR08 0000000000000000 c00000073dd98c00 c000000007bf3180 0000000000000001 07:53:19 INFO | [qemu output] GPR12 0000000000004400 c000000003ecee00 0000000000000000 00007fffab340000 07:53:19 INFO | [qemu output] GPR16 0000000126f2fd38 0000000126f2fcd8 0000000126f2fd08 c000000000245930 07:53:19 INFO | [qemu output] GPR20 00000000d62925ce 0000000000000001 0000000000000000 0000000000000000 07:53:19 INFO | [qemu output] GPR24 0000000000000000 c000000007bf3180 000000073cad0000 c00000000194dd70 07:53:19 INFO | [qemu output] GPR28 c0000000012c8c00 0000000000000008 c00000073dd98c00 c0000000012c8c00 07:53:19 INFO | [qemu output] CR 88004488 [ L L - - G G L L ] RES ffffffffffffffff 07:53:19 INFO | [qemu output] SRR0 0000000126f23088 SRR1 800000000000f033 PVR 00000000004b0201 VRSAVE 00000000ffffffff 07:53:19 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c000000003ecee00 SPRG2 00007fffa94067c0 SPRG3 0000000000000001 07:53:19 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 07:53:19 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 07:53:19 INFO | [qemu output] CFAR 0000000000000000 07:53:19 INFO | [qemu output] LPCR 000000000384f001 07:53:19 INFO | [qemu output] SDR1 0000000000000000 DAR 00000100374f2990 DSISR 0000000000000000 07:53:19 INFO | [qemu output] NIP c0000000001c26c8 LR c0000000001a6044 CTR c0000000001bbc80 XER 0000000020000000 CPU#3 07:53:19 INFO | [qemu output] MSR 8000000040009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3 07:53:19 INFO | [qemu output] TB 00000000 00000000 DECR 0 07:53:19 INFO | [qemu output] GPR00 c0000000001a5fe0 c000000007c1fb70 c000000001920b00 c00000073df98c00 07:53:19 INFO | [qemu output] GPR04 c00000073df98c80 00000000000003ff 0000000000000000 c000000738591a00 07:53:19 INFO | [qemu output] GPR08 0000000000000000 c00000073df98c00 c000000007bebc00 0000000000000001 07:53:19 INFO | [qemu output] GPR12 0000000000004400 c000000003ecae00 0000000000000000 00007fffab340000 07:53:19 INFO | [qemu output] GPR16 0000000126f2fd38 0000000126f2fcd8 0000000126f2fd08 c000000000245930 07:53:19 INFO | [qemu output] GPR20 00000000d6292113 0000000000000001 0000000000000000 0000000000000000 07:53:19 INFO | [qemu output] GPR24 0000000000000000 c000000007bebc00 000000073ccd0000 c00000000194dd70 07:53:19 INFO | [qemu output] GPR28 c0000000012c8c00 0000000000000018 c00000073df98c00 c0000000012c8c00 07:53:19 INFO | [qemu output] CR 88004488 [ L L - - G G L L ] RES ffffffffffffffff 07:53:19 INFO | [qemu output] SRR0 0000000126f23088 SRR1 800000000000d033 PVR 00000000004b0201 VRSAVE 00000000ffffffff 07:53:19 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c000000003ecae00 SPRG2 00007fffa40067c0 SPRG3 0000000000000003 07:53:19 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 07:53:19 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 07:53:19 INFO | [qemu output] CFAR 0000000000000000 07:53:19 INFO | [qemu output] LPCR 000000000384f001 07:53:19 INFO | [qemu output] SDR1 0000000000000000 DAR d001f7ffffd027b8 DSISR 0000000000000000 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4431 |