Bug 1856283

Summary: hard lockup while kvm tests
Product: [Fedora] Fedora
Reporter: Michel Normand <normand>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 32
CC: acaringi, airlied, bskeggs, dan, hannsj_uhl, hdegoede, ichavero, itamar, jarodwilson, jcajka, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, steved
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64le   
OS: Unspecified   
Last Closed: 2021-05-03 07:49:43 UTC Type: Bug
Bug Blocks: 1071880    
Attachments:
Description Flags
journalctl_abanb_today_20200713.log none

Description Michel Normand 2020-07-13 09:57:29 UTC
Created attachment 1700805
journalctl_abanb_today_20200713.log

1. Please describe the problem:

My local openQA server, running Fedora 32, started to fail after the last dnf distro-sync, done on 20200709 (with kernel 5.7.8-200). It failed on 20200711 when starting to execute a set of openQA tests (using KVM guests).

The open ssh sessions were closed, and I had to power the host off and on via IPMI to recover.

There were two occurrences with similar backtraces;
the attached log is from the second occurrence.
===
Jul 11 06:27:20 abanb.tlslab.ibm.com kernel: watchdog: CPU 0 detected hard LOCKUP on other CPUs 16
... 
Jul 11 06:59:59 abanb.tlslab.ibm.com worker[4273]: [info] Test schedule has changed, reloading test_order.json
-- Reboot -- <= power off/on via ipmi 
Jul 13 07:15:14 localhost.localdomain kernel: dt-cpu-ftrs: setup for ISA 2070
... 
Jul 13 08:30:20 abanb.tlslab.ibm.com kernel: watchdog: CPU 8 detected hard LOCKUP on other CPUs 32
...
===
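
A minimal sketch for locating these watchdog reports in a saved journal, assuming the message format is exactly as quoted above. The function name `find_hard_lockups` is hypothetical and only illustrative; the CPU list is kept as text because it may contain more than one CPU.

```python
import re

# Matches the powerpc watchdog message quoted in this report, e.g.
# "watchdog: CPU 8 detected hard LOCKUP on other CPUs 32"
LOCKUP_RE = re.compile(
    r"watchdog: CPU (?P<detector>\d+) detected hard LOCKUP "
    r"on other CPUs (?P<stuck>[\d,\-]+)"
)


def find_hard_lockups(lines):
    """Yield (detecting_cpu, stuck_cpus_text, line) for each lockup report."""
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            yield int(match.group("detector")), match.group("stuck"), line.rstrip("\n")


if __name__ == "__main__":
    # The file name is the attachment from this report; adjust as needed.
    with open("journalctl_abanb_today_20200713.log", errors="replace") as log:
        for detector, stuck, line in find_hard_lockups(log):
            print(f"CPU {detector} reported lockup on CPUs {stuck}: {line}")
```

Running it over the two excerpts above would report CPU 0 detecting a lockup on CPU 16, and CPU 8 detecting one on CPU 32.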
watchdog: CPU 8 detected hard LOCKUP on other CPUs 32
watchdog: CPU 8 TB:2372958361977, last SMP heartbeat TB:2364766439129 (15999ms ago)
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu:         32-...0: (4 ticks this GP) idle=c56/1/0x4000000000000000 softirq=227906/227908 fqs=2203 
        (detected by 48, t=6002 jiffies, g=422993, q=4057)
Sending NMI from CPU 48 to CPUs 32:
CPU 32 didn't respond to backtrace IPI, inspecting paca.
irq_soft_mask: 0x03 in_mce: 0 in_nmi: 0 current: 13963 (qemu-system-ppc)
Back trace of paca->saved_r1 (0xc000000bf97ab570) (possibly stale):
Call Trace:
[c000000bf97ab570] [c00000000013cc5c] guest_bypass+0x38/0x2c0 (unreliable)
[c000000bf97ab640] [c00000000013bda8] kvmppc_call_hv_entry+0x28/0x9c
[c000000bf97ab6b0] [c008000002eb0a00] __kvmppc_vcore_entry+0xa0/0x104 [kvm_hv]
[c000000bf97ab890] [c008000002eaa444] kvmppc_run_core+0xedc/0x2820 [kvm_hv]
[c000000bf97aba50] [c008000002eaf490] kvmppc_vcpu_run_hv+0x5d8/0xec0 [kvm_hv]
[c000000bf97abb60] [c008000003e4e77c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000000bf97abb80] [c008000003e4a20c] kvm_arch_vcpu_ioctl_run+0x334/0x450 [kvm]
[c000000bf97abc10] [c008000003e39114] kvm_vcpu_ioctl+0x27c/0x760 [kvm]
[c000000bf97abd70] [c000000000568784] sys_ioctl+0xf4/0x150
[c000000bf97abdc0] [c000000000032630] system_call_exception+0xf0/0x180
[c000000bf97abe20] [c00000000000ca70] system_call_common+0xf0/0x278
===
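
As a cross-check of the "15999ms ago" figure, the two timebase (TB) values in the watchdog line can be converted to milliseconds. This is a sketch that assumes a 512 MHz timebase frequency, which is typical for POWER8 but is not stated in the log; the actual value is shown on the "timebase" line of /proc/cpuinfo on the host. The helper name `tb_delta_ms` is hypothetical.

```python
# Assumption: 512 MHz timebase frequency (not taken from the log).
TIMEBASE_HZ = 512_000_000


def tb_delta_ms(tb_now, tb_last, timebase_hz=TIMEBASE_HZ):
    """Convert a difference in timebase ticks to whole milliseconds."""
    return (tb_now - tb_last) * 1000 // timebase_hz


# Values copied from the watchdog line above.
elapsed_ms = tb_delta_ms(2372958361977, 2364766439129)
print(elapsed_ms)  # 15999
```

Under that assumption the difference of 8191922848 ticks is just under 16 seconds, which agrees with the value printed by the kernel.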

2. What is the Version-Release number of the kernel: 5.7.8-200


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

worked before with kernel 5.6.17-300

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

On an IBM-local openQA server running on a P8 (POWER8) host.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

not tested

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

no


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

The attached log is the output of ``journalctl --since today``.
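
For a hang that ends in a forced power cycle, the kernel messages of interest are in the previous boot. A minimal sketch of collecting them with the journalctl options named in the template above; the helper names are hypothetical, and `--boot=-1` is the long form of the ``-b`` flag selecting the previous boot.

```python
import subprocess


def kernel_log_command(boot=0):
    """Build the journalctl command; boot=0 is the current boot, -1 the previous."""
    return ["journalctl", "--no-hostname", "-k", f"--boot={boot}"]


def save_kernel_log(path, boot=0):
    """Write the kernel log of the selected boot to `path`."""
    with open(path, "w") as out:
        subprocess.run(kernel_log_command(boot), stdout=out, check=True)


if __name__ == "__main__":
    # Kernel messages from the boot before the ipmi power off/on.
    save_kernel_log("dmesg.txt", boot=-1)
```

This only works if the journal is persistent across reboots, which is an assumption about the host's journald configuration.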

Comment 1 Fedora Program Management 2021-04-29 16:55:39 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 2 Michel Normand 2021-05-03 07:49:43 UTC
No failures anymore.