Bug 1826160
Summary: [ppc64le][dump] executing kdump test in multiple guests will cause error in both guest and host

Product: Red Hat Enterprise Linux Advanced Virtualization
Component: qemu-kvm (sub component: General)
Version: 8.2
Target Release: 8.3
Target Milestone: rc
Hardware: ppc64le
OS: Linux
Status: CLOSED CURRENTRELEASE
Last Closed: 2020-09-17 00:26:47 UTC
Severity: high
Priority: high
Keywords: Triaged
Type: Bug
Reporter: Min Deng <mdeng>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Min Deng <mdeng>
CC: bfu, bugproxy, dgibson, fnovak, hannsj_uhl, mdeng, ngu, qzhang, virt-maint, xianwang, xuma, yihyu
Bug Depends On: 1820402
Bug Blocks: 1776265
Description
Min Deng
2020-04-21 06:42:57 UTC
I set the ITR to 8.3.0 since this is a high/crash type problem. Feel free to adjust to 8.2.1 if a fix is possible there, or leave it at 8.3.0 if a future rebase is the best way.

AFAICT, qemu isn't doing anything wrong here. The guest kdump kernel is crashing while trying to dump, which causes qemu to report an error. So it looks like the real problem is in kdump. What are the guest kernel and userspace versions?

The build information:
qemu-kvm-common-4.2.0-19.module+el8.2.0+6296+6b821950.ppc64le or qemu-kvm-5.0.0-0.scrmod+el8.3.0+6312+cee4f348.ppc64le
kernel-4.18.0-193.el8.ppc64le
host kernel: kernel-4.18.0-193.9.el8.ppc64le

Tried the issue on P9 as well and hit a similar issue.

rpm -qa | grep SLOF
SLOF-20200327-1.git8e012d6f.el8.noarch
[root@ibm-p9b-42 ~]# rpm -qa | grep qemu-kvm
qemu-kvm-block-curl-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-tests-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-tests-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-debugsource-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-common-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-ssh-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-iscsi-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-rbd-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-curl-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-ssh-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-iscsi-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-block-rbd-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-core-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-core-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le
qemu-kvm-common-debuginfo-5.0.0-0.scrmod+el8.3.0+6312+1f7d6182.ppc64le

I haven't been able to reproduce this, despite matching as many parameters as I could think of that looked relevant. This error message:

[ 0.403976] Failed to execute /init (error -2)

suggests that the kdump initrd has been incorrectly constructed within this guest image and is missing the init file (error -2 is ENOENT). If you rebuild the kdump initrd with "kdumpctl rebuild" inside the guest, does kdump work afterwards?

Hi David,
Please do this before triggering a crash; the problem can be reproduced 100%.
Steps:
1. # service kdump stop
   Redirecting to /bin/systemctl stop kdump.service
2. # echo c >/proc/sysrq-trigger
Actual result: qemu-kvm terminated right away.
Expected result: the guest should keep working; for example, it should be able to reboot, generate a dump file, and so on.
Thanks.

The behavior you describe in comment 6 is expected.

You're explicitly disabling kdump, so the dump service is not active. That means that the guest kernel will report the panic to qemu, which will terminate it.

You should have the same behaviour on x86 if a pvpanic device is attached (POWER guests always have an equivalent device attached; it's part of the firmware functionality).

If you want qemu to keep running with the crashed guest, to trigger a dump using the monitor, for example, you can use the -no-shutdown option.

(In reply to David Gibson from comment #7)
> The behavior you describe in comment 6 is expected.
>
> You're explicitly disabling kdump, so the dump service is not active. That
> means that the guest kernel will report the panic to qemu, which will
> terminate it.
> > You should have the same behaviour on x86, if a pvpanic device is attached > (POWER guests always have an equivalent device attached, it's part of the > firmware functionality). > > If you want qemu to keep running with the crashed guest, to trigger a dump > using the monitor, for example, you can use the -no-shutdown option. Hi David, QE understood above points,thanks for that. QE run some automation test for kdump on multiple guests today,hit one issue,so I had better paste it here for discussion.I will upload some logs to the bug too as well as step's instruction.It is not always reproducible since I hit one time among 4 or 5 times.But I failed to reproduce it manually but I will still try it in the following days.Any problems please let me know,thanks. In automation's log, 03:45:48 DEBUG| Kdump service status before our testing: kdump.service - Crash recovery kernel arming Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled) Active: active (exited) since Thu 2020-05-07 15:45:47 CST; 352ms ago Process: 3438 ExecStop=/usr/bin/kdumpctl stop (code=exited, status=0/SUCCESS) Process: 3447 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS) Main PID: 3447 (code=exited, status=0/SUCCESS) May 07 15:45:43 dhcp19-129-162.pnr.lab.eng.bos.redhat.com systemd[1]: Starting. May 07 15:45:47 dhcp19-129-162.pnr.lab.eng.bos.redhat.com kdumpctl[3447]: Modif May 07 15:45:47 dhcp19-129-162.pnr.lab.eng.bos.redhat.com kdumpctl[3447]: kexec May 07 15:45:47 dhcp19-129-162.pnr.lab.eng.bos.redhat.com kdumpctl[3447]: Start May 07 15:45:47 dhcp19-129-162.pnr.lab.eng.bos.redhat.com systemd[1]: Started . Hint: Some lines were ellipsized, use -l to show in full. 03:45:48 INFO | Triggering crash on vcpu 0 ... 03:45:48 INFO | Context: Kdump Testing, force the Linux kernel to crash 03:45:48 DEBUG| Attempting to log into 'vm2' (timeout 360s) 03:45:48 DEBUG| Found/Verified IP 10.19.129.105 for VM vm2 NIC 0 03:45:49 INFO | [qemu output] error: kvm run failed Bad address 03:45:49 INFO | [qemu output] NIP c000000008008ac0 LR 0000000000000670 CTR c000000008dd3d60 XER 0000000000000000 CPU#35 03:45:49 INFO | [qemu output] MSR c00000000008dc30 HID0 0000000000000000 HF 8000000000000001 iidx 3 didx 3 03:45:49 INFO | [qemu output] TB 00000000 00000000 DECR 0 03:45:49 INFO | [qemu output] GPR00 0000000000000670 c00000000f644590 c000000009966300 0000000000000000 03:45:49 INFO | [qemu output] GPR04 fffffffffffd5704 0000000000000670 0000000000000008 c000000008dd3d60 03:45:49 INFO | [qemu output] GPR08 feeeeeeeeeeeeeee ffffffffff0c576c c0000000081fdc08 0000000000000381 03:45:49 INFO | [qemu output] GPR12 c000000008dd3d60 c00000000ff4c680 c0000000f6cbbf90 0000000000000000 03:45:49 INFO | [qemu output] GPR16 c0000000019a21e0 0000000000000000 0000000000000800 0000000000000001 03:45:49 INFO | [qemu output] GPR20 c000000001265608 0000000000000023 0000000000000000 0000000000000000 03:45:49 INFO | [qemu output] GPR24 0000000000000023 c000000009265808 feeeeeeeeeeeeeee c0000000090310c0 03:45:49 INFO | [qemu output] GPR28 000000000000000b c00000000f6446a0 c00000000f644570 c0000000081fdc08 03:45:49 INFO | [qemu output] CR 88008228 [ L L - - L E E L ] RES ffffffffffffffff 03:45:49 INFO | [qemu output] SRR0 c000000008008ac0 SRR1 c0000000081fdc00 PVR 00000000004e1202 VRSAVE 0000000000000000 03:45:49 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c00000000ff4c680 SPRG2 c00000000ff4c680 SPRG3 0000000000000023 03:45:49 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 
0000000000000000 SPRG7 0000000000000000 03:45:49 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 03:45:49 INFO | [qemu output] CFAR 0000000000000000 03:45:49 INFO | [qemu output] LPCR 0000000003d6f41f 03:45:49 INFO | [qemu output] PTCR 0000000000000000 DAR beeeeeeef815fe8e DSISR 0000000000000000 03:45:49 INFO | [qemu output] error: kvm run failed Bad address 03:45:49 INFO | [qemu output] NIP c000000008008ac0 LR 0000000000000670 CTR c000000008dd3d60 XER 0000000000000000 CPU#26 03:45:49 INFO | [qemu output] MSR c00000000008dc30 HID0 0000000000000000 HF 8000000000000001 iidx 3 didx 3 03:45:49 INFO | [qemu output] TB 00000000 00000000 DECR 0 03:45:49 INFO | [qemu output] GPR00 0000000000000670 c00000000f2a2e40 c000000009966300 0000000000000000 03:45:49 INFO | [qemu output] GPR04 fffffffffffd5704 0000000000000670 0000000000000008 c000000008dd3d60 03:45:49 INFO | [qemu output] GPR08 feeeeeeeeeeeeeee ffffffffff0c576c c0000000081fdc08 0000000000000381 03:45:49 INFO | [qemu output] GPR12 c000000008dd3d60 c00000000ff59e80 c0000000f6cf7f90 0000000000000000 03:45:49 INFO | [qemu output] GPR16 c0000000019a21e0 0000000000000000 0000000000000800 0000000000000001 03:45:49 INFO | [qemu output] GPR20 c000000001265608 000000000000001a 0000000000000000 0000000000000000 03:45:49 INFO | [qemu output] GPR24 000000000000001a 0000000000000000 0000000000000000 c0000000090310c0 03:45:49 INFO | [qemu output] GPR28 000000000000000b c00000000f2a2f50 c00000000f2a2e20 c0000000081fdc08 03:45:49 INFO | [qemu output] CR 88008228 [ L L - - L E E L ] RES ffffffffffffffff 03:45:49 INFO | [qemu output] SRR0 c000000008008ac0 SRR1 c0000000081fdc00 PVR 00000000004e1202 VRSAVE 0000000000000000 03:45:49 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c00000000ff59e80 SPRG2 c00000000ff59e80 SPRG3 000000000000001a 03:45:49 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 03:45:49 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 03:45:49 INFO | [qemu output] CFAR 0000000000000000 03:45:49 INFO | [qemu output] LPCR 0000000003d6f41f 03:45:49 INFO | [qemu output] PTCR 0000000000000000 DAR beeeeeeef815fe8e DSISR 0000000000000000 host console, [82048.653732] KVM: Got unsupported MMU fault [82048.654544] KVM: Got unsupported MMU fault [82947.506516] watchdog: CPU 0 detected hard LOCKUP on other CPUs 3 [82947.506569] watchdog: CPU 0 TB:42545620888830, last SMP heartbeat TB:42537684573427 (15500ms ago) [82947.507429] watchdog: CPU 3 Hard LOCKUP [82947.507434] watchdog: CPU 3 TB:42545621013192, last heartbeat TB:42537428563546 (16000ms ago) [82947.507437] Modules linked in: xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_route_ipv6 nft_chain_nat_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nft_counter nft_chain_route_ipv4 nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_tables vhost_net vhost tap tun nfnetlink bluetooth ecdh_generic rfkill rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache bridge stp llc kvm_hv kvm i2c_dev sunrpc ses at24 ofpart enclosure powernv_flash scsi_transport_sas xts uio_pdrv_genirq mtd uio ipmi_powernv ipmi_devintf ibmpowernv vmx_crypto ipmi_msghandler opal_prd ip_tables xfs libcrc32c sd_mod sg ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i40e aacraid drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod [82947.507579] CPU: 3 PID: 99750 Comm: 
qemu-kvm Kdump: loaded Not tainted 4.18.0-193.13.el8.ppc64le #1 [82947.507582] NIP: c0080000080bd7cc LR: c0080000080bd7cc CTR: c00000000001aba0 [82947.507587] REGS: c000000bf2a27748 TRAP: 0100 Not tainted (4.18.0-193.13.el8.ppc64le) [82947.507589] MSR: 9000000102803033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24422222 XER: 00000000 [82947.507615] CFAR: c000000001b35ca8 IRQMASK: 40000003d6f41f [82947.507621] GPR00: c0080000080bd7cc c000000bf2a278b0 c000000001965300 c000000bf2a27748 [82947.507632] GPR04: c009dff19f2f4198 0000000000000000 0000000000000000 0000000000000003 [82947.507643] GPR08: 0000000000000003 0000000000000000 0000000ffb1d0000 c0080000080cfd10 [82947.507654] GPR12: c00000000001aba0 c000000fffffbb80 00007fffa4e70000 00007ffd737f0000 [82947.507665] GPR16: 00007fffa3d24410 c000200df0a6a558 0000000000000001 c0000000012f5cf0 [82947.507675] GPR20: 0000000000000003 c000000001b35ca8 c000200df0a6a558 0000000000000003 [82947.507686] GPR24: 0000000000000003 ffffffffffffffff 0040000003d6f41f 0000000000000100 [82947.507697] GPR28: c000200df0a60000 c000000bf92c0000 c000000fec3e4c00 c000000bef1d2a40 [82947.507709] NIP [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] [82947.507713] LR [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] [82947.507716] Call Trace: [82947.507719] [c000000bf2a278b0] [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] (unreliable) [82947.507727] [c000000bf2a27980] [c0080000080be748] kvmppc_vcpu_run_hv+0x980/0x1060 [kvm_hv] [82947.507732] [c000000bf2a27a90] [c00800000841de5c] kvmppc_vcpu_run+0x34/0x48 [kvm] [82947.507738] [c000000bf2a27ab0] [c008000008418f8c] kvm_arch_vcpu_ioctl_run+0x364/0x820 [kvm] [82947.507744] [c000000bf2a27ba0] [c008000008403298] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [82947.507749] [c000000bf2a27d10] [c00000000052c490] do_vfs_ioctl+0xe0/0xaa0 [82947.507754] [c000000bf2a27de0] [c00000000052d024] sys_ioctl+0xc4/0x160 [82947.507759] [c000000bf2a27e30] [c00000000000b408] system_call+0x5c/0x70 [82947.507763] Instruction dump: [82947.507767] 614af804 7fa95000 409efcf8 7fe3fb78 48012d0d e8410018 4bfffce8 60000000 [82947.507782] 60000000 2f9b0100 409efbfc 48012549 <e8410018> 4bfffbf0 60000000 60000000 [82947.517421] watchdog: CPU 3 became unstuck TB:42545626476761 [82947.517432] CPU: 3 PID: 99750 Comm: qemu-kvm Kdump: loaded Not tainted 4.18.0-193.13.el8.ppc64le #1 [82947.517445] NIP: c00000000000a8fc LR: c00000000001ae54 CTR: c00000000802dca0 [82947.517467] REGS: c000000bf2a27610 TRAP: 0901 Not tainted (4.18.0-193.13.el8.ppc64le) [82947.517477] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28422244 XER: 20040000 [82947.517496] CFAR: 0000000000000874 IRQMASK: 0 [82947.517496] GPR00: c0080000080bd414 c000000bf2a27890 c000000001965300 0000000000000900 [82947.517496] GPR04: 0000000ffb1d0000 000026b1eca5dc35 000026b022d693ff c000000fffffbb80 [82947.517496] GPR08: 0000000800000000 0000000028008228 b000000000001003 0000000000000009 [82947.517496] GPR12: 0000000024842828 c000000fffffbb80 [82947.517559] NIP [c00000000000a8fc] replay_interrupt_return+0x0/0x4 [82947.517572] LR [c00000000001ae54] arch_local_irq_restore+0x74/0x90 [82947.517582] Call Trace: [82947.517588] [c000000bf2a27890] [c000000fec3e4c00] 0xc000000fec3e4c00 (unreliable) [82947.517603] [c000000bf2a278b0] [c0080000080bd414] kvmhv_run_single_vcpu+0x3ec/0xda0 [kvm_hv] [82947.537629] [c000000bf2a27980] [c0080000080be748] kvmppc_vcpu_run_hv+0x980/0x1060 [kvm_hv] [82947.537660] [c000000bf2a27a90] [c00800000841de5c] kvmppc_vcpu_run+0x34/0x48 
[kvm] [82947.537690] [c000000bf2a27ab0] [c008000008418f8c] kvm_arch_vcpu_ioctl_run+0x364/0x820 [kvm] [82947.537729] [c000000bf2a27ba0] [c008000008403298] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [82947.537764] [c000000bf2a27d10] [c00000000052c490] do_vfs_ioctl+0xe0/0xaa0 [82947.537793] [c000000bf2a27de0] [c00000000052d024] sys_ioctl+0xc4/0x160 [82947.537815] [c000000bf2a27e30] [c00000000000b408] system_call+0x5c/0x70 [82947.537834] Instruction dump: [82947.537850] 7d200026 618c8000 2c030900 4182e7e8 2c030500 4182f2e0 2c030f00 4182f3f8 [82947.537884] 2c030a00 4182ff9c 2c030e60 4182f088 <4e800020> 7c781b78 48000385 4800039d Message from syslogd@ibm-p9b-42 at May 7 04:00:47 ... kernel:watchdog: CPU 0 detected hard LOCKUP on other CPUs 3 Message from syslogd@ibm-p9b-42 at May 7 04:00:47 ... kernel:watchdog: CPU 0 TB:42545620888830, last SMP heartbeat TB:42537684573427 (15500ms ago) Message from syslogd@ibm-p9b-42 at May 7 04:00:47 ... kernel:watchdog: CPU 3 Hard LOCKUP Message from syslogd@ibm-p9b-42 at May 7 04:00:47 ... kernel:watchdog: CPU 3 TB:42545621013192, last heartbeat TB:42537428563546 (16000ms ago) Message from syslogd@ibm-p9b-42 at May 7 04:00:47 ... kernel:watchdog: CPU 3 became unstuck TB:42545626476761 [84119.553375] watchdog: CPU 0 detected hard LOCKUP on other CPUs 2 [84119.553437] watchdog: CPU 0 TB:43145708882094, last SMP heartbeat TB:43137521676261 (15990ms ago) [84119.554277] watchdog: CPU 2 Hard LOCKUP [84119.554281] watchdog: CPU 2 TB:43145709018683, last heartbeat TB:43137516557332 (16000ms ago) [84119.554283] Modules linked in: xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_route_ipv6 nft_chain_nat_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nft_counter nft_chain_route_ipv4 nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_tables vhost_net vhost tap tun nfnetlink bluetooth ecdh_generic rfkill rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache bridge stp llc kvm_hv kvm i2c_dev sunrpc ses at24 ofpart enclosure powernv_flash scsi_transport_sas xts uio_pdrv_genirq mtd uio ipmi_powernv ipmi_devintf ibmpowernv vmx_crypto ipmi_msghandler opal_prd ip_tables xfs libcrc32c sd_mod sg ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i40e aacraid drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod [84119.554421] CPU: 2 PID: 99745 Comm: qemu-kvm Kdump: loaded Not tainted 4.18.0-193.13.el8.ppc64le #1 [84119.554424] NIP: c0080000080bd7cc LR: c0080000080bd7cc CTR: c00000000001aba0 [84119.554428] REGS: c000000d885e3748 TRAP: 0100 Not tainted (4.18.0-193.13.el8.ppc64le) [84119.554430] MSR: 9000000102803033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24422222 XER: 00000000 [84119.554456] CFAR: c000000001b35ca8 IRQMASK: 40000003d6f41f [84119.554462] GPR00: c0080000080bd7cc c000000d885e38b0 c000000001965300 c000000d885e3748 [84119.554472] GPR04: c009dff19f2f4198 0000000000000000 0000000000000000 0000000000000002 [84119.554483] GPR08: 0000000000000003 0000000000000000 0000000ffb120000 c0080000080cfd10 [84119.554494] GPR12: c00000000001aba0 c000000fffffcd00 00007fffa4e70000 00007ffd927d0000 [84119.554503] GPR16: 00007fffa3d24410 c000200df0a6a558 0000000000000001 c0000000012f5cf0 [84119.554514] GPR20: 0000000000000002 c000000001b35ca8 c000200df0a6a558 0000000000000002 [84119.554524] GPR24: 0000000000000002 ffffffffffffffff 0040000003d6f41f 0000000000000100 [84119.554534] GPR28: c000200df0a60000 c000000bfbd40000 c000000fec3ed900 
c000000bef1b66c0 [84119.554547] NIP [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] [84119.554550] LR [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] [84119.554552] Call Trace: [84119.554556] [c000000d885e38b0] [c0080000080bd7cc] kvmhv_run_single_vcpu+0x7a4/0xda0 [kvm_hv] (unreliable) [84119.554564] [c000000d885e3980] [c0080000080be748] kvmppc_vcpu_run_hv+0x980/0x1060 [kvm_hv] [84119.554569] [c000000d885e3a90] [c00800000841de5c] kvmppc_vcpu_run+0x34/0x48 [kvm] [84119.554575] [c000000d885e3ab0] [c008000008418f8c] kvm_arch_vcpu_ioctl_run+0x364/0x820 [kvm] [84119.554579] [c000000d885e3ba0] [c008000008403298] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [84119.554584] [c000000d885e3d10] [c00000000052c490] do_vfs_ioctl+0xe0/0xaa0 [84119.554590] [c000000d885e3de0] [c00000000052d024] sys_ioctl+0xc4/0x160 [84119.554595] [c000000d885e3e30] [c00000000000b408] system_call+0x5c/0x70 [84119.554598] Instruction dump: [84119.554602] 614af804 7fa95000 409efcf8 7fe3fb78 48012d0d e8410018 4bfffce8 60000000 [84119.554616] 60000000 2f9b0100 409efbfc 48012549 <e8410018> 4bfffbf0 60000000 60000000 [84119.564517] watchdog: CPU 2 became unstuck TB:43145714590195 [84119.564531] CPU: 2 PID: 99745 Comm: qemu-kvm Kdump: loaded Not tainted 4.18.0-193.13.el8.ppc64le #1 [84119.564569] NIP: c00000000000a8fc LR: c00000000001ae54 CTR: c00000000802dca0 [84119.564600] REGS: c000000d885e3610 TRAP: 0901 Not tainted (4.18.0-193.13.el8.ppc64le) [84119.564620] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28422244 XER: 20040000 [84119.564660] CFAR: 0000000000000874 IRQMASK: 0 [84119.564660] GPR00: c0080000080bd414 c000000d885e3890 c000000001965300 0000000000000900 [84119.564660] GPR04: 0000000ffb120000 0000273da4addc65 0000273bdadead42 c000000fffffcd00 [84119.564660] GPR08: 0000000800000000 0000000028008268 b000000000001003 0000000000000009 [84119.564660] GPR12: 0000000024842828 c000000fffffcd00 [84119.564778] NIP [c00000000000a8fc] replay_interrupt_return+0x0/0x4 [84119.564809] LR [c00000000001ae54] arch_local_irq_restore+0x74/0x90 [84119.564837] Call Trace: [84119.564843] [c000000d885e3890] [c000000fec3ed900] 0xc000000fec3ed900 (unreliable) [84119.564883] [c000000d885e38b0] [c0080000080bd414] kvmhv_run_single_vcpu+0x3ec/0xda0 [kvm_hv] [84119.564932] [c000000d885e3980] [c0080000080be748] kvmppc_vcpu_run_hv+0x980/0x1060 [kvm_hv] [84119.564973] [c000000d885e3a90] [c00800000841de5c] kvmppc_vcpu_run+0x34/0x48 [kvm] [84119.565013] [c000000d885e3ab0] [c008000008418f8c] kvm_arch_vcpu_ioctl_run+0x364/0x820 [kvm] [84119.565051] [c000000d885e3ba0] [c008000008403298] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [84119.565085] [c000000d885e3d10] [c00000000052c490] do_vfs_ioctl+0xe0/0xaa0 [84119.565105] [c000000d885e3de0] [c00000000052d024] sys_ioctl+0xc4/0x160 [84119.565127] [c000000d885e3e30] [c00000000000b408] system_call+0x5c/0x70 [84119.565160] Instruction dump: [84119.565168] 7d200026 618c8000 2c030900 4182e7e8 2c030500 4182f2e0 2c030f00 4182f3f8 [84119.565207] 2c030a00 4182ff9c 2c030e60 4182f088 <4e800020> 7c781b78 48000385 4800039d Message from syslogd@ibm-p9b-42 at May 7 04:20:19 ... kernel:watchdog: CPU 0 detected hard LOCKUP on other CPUs 2 Message from syslogd@ibm-p9b-42 at May 7 04:20:19 ... kernel:watchdog: CPU 0 TB:43145708882094, last SMP heartbeat TB:43137521676261 (15990ms ago) Message from syslogd@ibm-p9b-42 at May 7 04:20:19 ... kernel:watchdog: CPU 2 Hard LOCKUP Message from syslogd@ibm-p9b-42 at May 7 04:20:19 ... 
kernel:watchdog: CPU 2 TB:43145709018683, last heartbeat TB:43137516557332 (16000ms ago) Message from syslogd@ibm-p9b-42 at May 7 04:20:19 ... kernel:watchdog: CPU 2 became unstuck TB:43145714590195 Created attachment 1686207 [details]
auto-debug-log
Created attachment 1686209 [details]
auto-instruction
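For reference, a minimal sketch of the -no-shutdown workaround David describes above, i.e. keeping qemu alive after a guest panic so a dump can be taken from the host side. The machine type, memory size, and image path are placeholders rather than the command line used by the automation; -no-shutdown and the HMP dump-guest-memory/system_reset commands are standard qemu features.

  /usr/libexec/qemu-kvm -machine pseries -m 8G -smp 8 \
      -drive file=/path/to/guest.qcow2,if=virtio \
      -monitor stdio -no-shutdown
  # ... guest panics (e.g. echo c >/proc/sysrq-trigger with kdump stopped),
  # qemu stays running instead of exiting ...
  (qemu) dump-guest-memory /tmp/guest-vmcore    # write a guest memory dump from the host
  (qemu) system_reset                           # or reboot the crashed guest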
Build ibm-p9b-42.pnr.lab.eng.bos.redhat.com kernel-4.18.0-193.13.el8.ppc64le - host kernel-4.18.0-195.el8.ppc64le - guest SLOF-20200327-1.git8e012d6f.scrmod+el8.3.0+6495+1936fa11.wrb.noarch qemu-kvm-5.0.0-0.scrmod+el8.3.0+6495+1936fa11.wrb200506.ppc64le I see the guest side error described in comment 8: 03:45:49 INFO | [qemu output] error: kvm run failed Bad address This is probably caused by bug 1820402. I have posted a fix for that, but we're still waiting for it to be merged downstream. I'm not sure if the host side errors are related to this or not. I think we need to retest once we have a fix for bug 1820402 in order to check. Min, Now that we have a fix for bug 1820402, can you please retest this. You'll need your *host* kernel updated with the fix from bug 1820402. Already set up the test,will update the final result as soon as getting it,thanks. Fortunately,the similar problem still can be reproduced and uploaded log to this bug too. build information, kernel-4.18.0-200.el8.ppc64le host kernel-4.18.0-201.el8.ppc64le guest qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420.ppc64le May 21 10:37:11 dhcp16-213-134.lab2.eng.bos.redhat.com systemd[1]: Starting Cr. May 21 10:37:14 dhcp16-213-134.lab2.eng.bos.redhat.com kdumpctl[7747]: Modified May 21 10:37:14 dhcp16-213-134.lab2.eng.bos.redhat.com kdumpctl[7747]: kexec: l May 21 10:37:14 dhcp16-213-134.lab2.eng.bos.redhat.com kdumpctl[7747]: Startin] May 21 10:37:15 dhcp16-213-134.lab2.eng.bos.redhat.com systemd[1]: Started Cra. Hint: Some lines were ellipsized, use -l to show in full. 22:37:16 INFO | Triggering crash on vcpu 0 ... 22:37:16 INFO | Context: Kdump Testing, force the Linux kernel to crash 22:37:16 DEBUG| Attempting to log into 'vm2' (timeout 360s) 22:37:16 DEBUG| Found/Verified IP 10.16.213.131 for VM vm2 NIC 0 22:37:16 INFO | [qemu output] error: kvm run failed Bad address 22:37:16 INFO | [qemu output] NIP c000000008008ac0 LR 000000000000066f CTR c000000008ddaff0 XER 0000000000000000 CPU#38 22:37:16 INFO | [qemu output] MSR c00000000008d330 HID0 0000000000000000 HF 8000000000000001 iidx 3 didx 3 22:37:16 INFO | [qemu output] TB 00000000 00000000 DECR 0 22:37:16 INFO | [qemu output] GPR00 000000000000066f c00000000f840c90 c000000009977900 0000000000000000 22:37:16 INFO | [qemu output] GPR04 fffffffaff26a3b8 000000000000066f 0000000000000008 c000000008ddaff0 22:37:16 INFO | [qemu output] GPR08 feeeeeeeeeeeeeee 00000004ffc7e880 c00000000805d338 0000000000000381 22:37:16 INFO | [qemu output] GPR12 9000000000001003 c00000000ff57a80 c0000000f7cd3f90 0000000000000000 22:37:16 INFO | [qemu output] GPR16 c0000000019b21d8 0000000000000001 0000000000000800 0000000000000001 22:37:16 INFO | [qemu output] GPR20 c000000001275608 0000000000000026 0000000000000001 0000000000000000 22:37:16 INFO | [qemu output] GPR24 0000000000000026 c000000009275808 feeeeeeeeeeeeeee c000000009041e40 22:37:16 INFO | [qemu output] GPR28 000000000000000b c00000000f840da0 c00000000f840c70 c00000000805d338 22:37:16 INFO | [qemu output] CR 88008228 [ L L - - L E E L ] RES ffffffffffffffff 22:37:16 INFO | [qemu output] SRR0 c000000008008ac0 SRR1 c00000000805d330 PVR 00000000004e1202 VRSAVE 0000000000000000 22:37:16 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c00000000ff57a80 SPRG2 c00000000ff57a80 SPRG3 0000000000000026 22:37:16 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 22:37:16 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 22:37:16 INFO | [qemu output] CFAR 
0000000000000000 22:37:16 INFO | [qemu output] LPCR 0000000003d6f41f 22:37:16 INFO | [qemu output] PTCR 0000000000000000 DAR beeeeeeef81646f6 DSISR 0000000000000000 22:37:16 INFO | [qemu output] error: kvm run failed Bad address 22:37:16 INFO | [qemu output] NIP c000000008008ac0 LR 000000000000066f CTR c000000008ddaff0 XER 0000000000000000 CPU#31 22:37:16 INFO | [qemu output] MSR c00000000008d5b0 HID0 0000000000000000 HF 8000000000000001 iidx 3 didx 3 22:37:16 INFO | [qemu output] TB 00000000 00000000 DECR 0 22:37:16 INFO | [qemu output] GPR00 000000000000066f c00000000f56dcb0 c000000009977900 0000000000000000 22:37:16 INFO | [qemu output] GPR04 fffffffaff40a638 000000000000066f 0000000000000008 c000000008ddaff0 22:37:16 INFO | [qemu output] GPR08 feeeeeeeeeeeeeee 00000004ffc7e880 c0000000081fd5b8 0000000000000381 22:37:16 INFO | [qemu output] GPR12 9000000000001003 c00000000ff62280 c0000000f7ce7f90 0000000000000000 22:37:16 INFO | [qemu output] GPR16 c0000000019b21d8 0000000000000000 0000000000000800 0000000000000001 22:37:16 INFO | [qemu output] GPR20 c000000001275608 000000000000001f 0000000000000000 0000000000000000 22:37:16 INFO | [qemu output] GPR24 000000000000001f c000000009275808 feeeeeeeeeeeeeee c000000009275808 22:37:16 INFO | [qemu output] GPR28 000000000000000b c00000000f56ddc0 c00000000f56dc90 c0000000081fd5b8 22:37:16 INFO | [qemu output] CR 88008228 [ L L - - L E E L ] RES ffffffffffffffff 22:37:16 INFO | [qemu output] SRR0 c000000008008ac0 SRR1 c0000000081fd5b0 PVR 00000000004e1202 VRSAVE 0000000000000000 22:37:16 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c00000000ff62280 SPRG2 c00000000ff62280 SPRG3 000000000000001f 22:37:16 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 22:37:16 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 Created attachment 1690489 [details]
newlog
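As a side note for reruns like the one above, a quick guest-side sanity check can confirm the crash kernel is actually armed before the sysrq trigger is written. This uses only the stock kdumpctl and kexec interfaces, nothing specific to this bug's automation:

  # kdumpctl status                      # should report that kdump is operational
  # cat /sys/kernel/kexec_crash_loaded   # 1 means a crash kernel is loaded
  # kdumpctl rebuild                     # rebuild the kdump initrd if it is stale or broken
  # echo c >/proc/sysrq-trigger          # then trigger the crash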
QE also tried the issue on older build and hit issues like this,it seems it was hardware issue,which was different with above comments.Also uploaded log to the bug.Thanks. build information, kernel-4.18.0-200.el8.ppc64le host kernel-4.18.0-193.el8.ppc64le guest,there's test result for kernel-4.18.0-201.el8.ppc64le,see above comment. qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420.ppc64le 00:37:33 INFO | Triggering crash on vcpu 0 ... 00:37:33 INFO | Context: Kdump Testing, force the Linux kernel to crash 00:37:33 DEBUG| Attempting to log into 'vm2' (timeout 360s) 00:37:33 DEBUG| Found/Verified IP 10.16.213.142 for VM vm2 NIC 0 00:37:35 INFO | [qemu output] KVM: unknown exit, hardware reason 3b8 00:37:35 INFO | [qemu output] NIP 0000000000000700 LR c0000000081f6580 CTR c000000008dd9528 XER 0000000000000000 CPU#24 00:37:35 INFO | [qemu output] MSR 8000000000001001 HID0 0000000000000000 HF 8000000000000001 iidx 3 didx 3 00:37:35 INFO | [qemu output] TB 00000000 00000000 DECR 0 00:37:35 INFO | [qemu output] GPR00 c0000000081f6580 c00000000fbfe670 00007fff8872ee60 00007fff889005b8 00:37:35 INFO | [qemu output] GPR04 c000000008370de4 0000000000000669 0000000000000008 c000000008da8700 00:37:35 INFO | [qemu output] GPR08 feeeeeeeeeeeeeee 0000000040000000 0000000080000018 0000000000000381 00:37:35 INFO | [qemu output] GPR12 c000000008dd9528 c00000000ff6ca00 c0000000f7cd7f90 0000000000000000 00:37:35 INFO | [qemu output] GPR16 c0000000019520d8 0000000000000000 0000000000000800 0000000000000001 00:37:35 INFO | [qemu output] GPR20 c000000001235608 0000000000000018 c0000000016a7fa8 0000000000000000 00:37:35 INFO | [qemu output] GPR24 0000000000000018 0000000000000000 0000000000000019 c000000008ffc6c8 00:37:35 INFO | [qemu output] GPR28 00007fff889005b8 0000000000000669 0000000000000008 0000000000000000 00:37:35 INFO | [qemu output] CR 48008244 [ G L - - L E G G ] RES ffffffffffffffff 00:37:35 INFO | [qemu output] SRR0 c000000008dd952c SRR1 8000000000001003 PVR 00000000004e1202 VRSAVE 0000000000000000 00:37:35 INFO | [qemu output] SPRG0 0000000000000000 SPRG1 c00000000ff6ca00 SPRG2 c00000000ff6ca00 SPRG3 0000000000000018 00:37:35 INFO | [qemu output] SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 00:37:35 INFO | [qemu output] HSRR0 0000000000000000 HSRR1 0000000000000000 00:37:35 INFO | [qemu output] CFAR 0000000000000000 00:37:35 INFO | [qemu output] LPCR 0000000003d6f41f 00:37:35 INFO | [qemu output] PTCR 0000000000000000 DAR beeeeeeef812fe8e DSISR 0000000000000000 00:37:36 DEBUG| Trying to SCP with command 'scp -r -v -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o PreferredAuthentications=password -P 22 root@\[10.16.213.142\]:/etc/kdump.conf /home/kar/workspace/job-results/job-2020-05-19T23.04-0ddb0eb/test-results/5-Host_RHEL.m8.u3.product_av.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.2.0.ppc64le.io-github-autotest-qemu.kdump.multi_vms/kdump.conf-vm2-test', timeout 600s Created attachment 1690509 [details]
old-build-log
Discussed with mdeng to clarify the situation.

The current issue is the one shown in comment 15. With several guests running in an automated test, one of them is falling over with a Bad Address error (-EFAULT) from KVM_RUN. This is despite having the fix for bug 1820402 present. The comment 17 trace isn't really interesting because that kernel does *not* have the bug 1820402 bug fix present, which means we don't know if it's showing the issue we're still trying to find or the known problem from bug 1820402.

Unfortunately the traces in comment 15 give us almost no useful information. They are all guest state, but an -EFAULT from KVM_RUN almost certainly indicates a host kernel or qemu error.

Next steps: we'll need at least one of these two things to proceed from here:
1) The *host* dmesg logs when the error occurs (unfortunately those weren't captured in the comment 15 case)
2) Instructions on how to set up and run the automated test case that has triggered this problem

(In reply to David Gibson from comment #19)
> Discussed with mdeng to clarify the situation.
>
> The current issue is the one shown in comment 15. With several guests
> running in an automated test, one of them is falling over with a Bad Address
> error (-EFAULT) from KVM_RUN. This is despite having the fix for bug
> 1820402 present. The comment 17 trace isn't really interesting because that
> kernel does *not* have the bug 1820402 bug fix present, which means we don't
> know if it's showing the issue we're still trying to find or the known
> problem from bug 1820402.
>
> Unfortunately the traces in comment 15 give us almost no useful information.
> They are all guest state, but an -EFAULT from KVM_RUN almost certainly
> indicates a host kernel or qemu error.
>
> Next steps: we'll need at least one of these two things to proceed from here:
>
> 1) The *host* dmesg logs when the error occurs (unfortunately those weren't
> captured in the comment 15 case)
> 2) Instructions on how to set up and run the automated test case that has
> triggered this problem

QE will try it later; it will probably take quite some time. As soon as the result is available, the bug will be updated accordingly. Thanks.

Tried it, but still hit the same issue without any explicit error message from the host. I will try it on another host, thanks.
Build information:
qemu-kvm-5.0.0-0.scrmod+el8.3.0+7150+88a2c83e.wrb200624.ppc64le

Min,

Have you been able to gather any more information about how this bug triggers?

Tried the bug on both P8 and P9, and I guess it is related to the number of vCPUs, since in the automation command line every guest consumes half of the host's CPUs.

P8 results: the host has 24 CPUs and each guest has 12 in the command line, but I failed to reproduce it manually.

01:33:32 INFO | Triggering crash on vcpu 0 ...
01:33:32 INFO | Context: Check the vmcore file after triggering a crash 01:33:37 INFO | Context: Check the vmcore file after triggering a crash --> Waiting for kernel crash dump to complete 01:33:37 DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 1200s) 01:33:38 DEBUG| Found/Verified IP 10.0.1.218 for VM avocado-vt-vm1 NIC 0 01:35:08 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:35:31 WARNI| Error occur when update VM address cache: Login timeout expired (output: 'exceeded 10 s timeout') 01:37:35 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:37:57 WARNI| Error occur when update VM address cache: Login timeout expired (output: 'exceeded 10 s timeout') 01:38:32 INFO | [qemu output] qemu-kvm: OS terminated: OS panic: System is deadlocked on memory 01:38:32 INFO | [qemu output] 01:38:32 INFO | [qemu output] (Process terminated with status 0) 01:38:33 WARNI| registers is not alive. Can't query the avocado-vt-vm1 status 01:40:02 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:40:13 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:42:17 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:42:18 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:44:23 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:44:23 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:46:28 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:46:29 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:48:33 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:48:34 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:50:39 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:50:39 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:52:21 WARNI| IPv6 address sniffing is not supported yet by using TShark, please fallback to use other sniffers by uninstalling TShark when testing with IPv6 01:52:44 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 10s) 01:52:44 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' 01:53:47 ERROR| Can't get guest network status information, reason: Client process terminated (status: 1, output: '') 01:53:47 DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 360s) 02:00:03 DEBUG| Attempting to log into 'avocado-vt-vm1' via serial console (timeout 360s) 02:00:03 WARNI| Error occur when update VM address cache: VM is dead detail: 'qemu-kvm: OS terminated: OS panic: System is deadlocked on memory\n\n' P9's results, Got the same issue but with error was as following, ... Message from syslogd@ibm-p9wr-04 at Aug 3 04:26:21 ... kernel:kvmppc_emulate_mmio: emulation failed (7ce01828) Notes, the debug logs also were attached in the bug. Created attachment 1703288 [details]
for P8 and P9
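To make future runs more useful for the host-side dmesg request above, a capture could be left running on the host for the whole multi-guest test. This is only a sketch; the log file names and the point at which the automated test is launched are placeholders, while the journalctl/dmesg options are the standard ones:

  # On the host, before starting the automated multi-guest kdump test:
  journalctl -k -f > /tmp/host-kmsg-$(date +%Y%m%d-%H%M%S).log &
  # ... run the automated kdump test against the guests ...
  # afterwards, also keep a one-shot snapshot for cross-checking:
  dmesg -T > /tmp/host-dmesg-after.log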
Build information, kernel-4.18.0-230.el8.ppc64le qemu-kvm-5.1.0-0.scrmod+el8.3.0+7493+a5e196a4.wrb200729.ppc64le P8's host [root@ibm-p8-11 home]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 192 On-line CPU(s) list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184 Off-line CPU(s) list: 1-7,9-15,17-23,25-31,33-39,41-47,49-55,57-63,65-71,73-79,81-87,89-95,97-103,105-111,113-119,121-127,129-135,137-143,145-151,153-159,161-167,169-175,177-183,185-191 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 4 NUMA node(s): 4 Model: 2.1 (pvr 004b 0201) Model name: POWER8E (raw), altivec supported CPU max MHz: 3923.0000 CPU min MHz: 2061.0000 L1d cache: 64K L1i cache: 32K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0,8,16,24,32,40 NUMA node1 CPU(s): 48,56,64,72,80,88 NUMA node16 CPU(s): 96,104,112,120,128,136 NUMA node17 CPU(s): 144,152,160,168,176,184 [root@ibm-p8-11 home]# free -m total used free shared buff/cache available Mem: 518925 11433 491251 42 16240 504346 Swap: 4095 0 4095 P9 host, [root@ibm-p9wr-04 home]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 4 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Model: 2.2 (pvr 004e 1202) Model name: POWER9, altivec supported CPU max MHz: 3800.0000 CPU min MHz: 2300.0000 L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 10240K NUMA node0 CPU(s): 0-63 NUMA node8 CPU(s): 64-127 [root@ibm-p9wr-04 home]# free -m total used free shared buff/cache available Mem: 257550 4673 241671 33 11205 251164 Swap: 4095 0 4095 Do a summary for this bug, 1.The bug was reproduced on comment15 Build information, kernel-4.18.0-200.el8.ppc64le host kernel-4.18.0-201.el8.ppc64le guest qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420.ppc64le 2.QE also hit other situations on comment23 Build information, kernel-4.18.0-230.el8.ppc64le qemu-kvm-5.1.0-0.scrmod+el8.3.0+7493+a5e196a4.wrb200729.ppc64le please have a look on the attachment if you need. QE tried this bug on the latest build as followings, kernel-4.18.0-236.el8.ppc64le qemu-kvm-5.1.0-6.module+el8.3.0+8041+42ff16b8.ppc64le P9:ibm-p9wr-14.ibm2.lab.eng.bos.redhat.com P8:ibm-p8-garrison-01.rhts.eng.bos.redhat.com After running the test for multiple times [about 10 times], now the bug can't be reproduced. Thanks. Any issues please let me know. Created attachment 1714820 [details]
latest build
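For completeness, the repeated re-verification described in the comment above could be scripted roughly as below. The guest address, iteration count, and sleep time are made-up placeholders, and the vmcore path assumes the default /var/crash layout:

  GUEST=10.0.0.2          # hypothetical guest address
  for i in $(seq 1 10); do
      # trigger a crash in the guest; the ssh session dies when the kernel panics
      ssh root@$GUEST 'kdumpctl status && echo c >/proc/sysrq-trigger' || true
      sleep 180           # give the kdump kernel time to boot and write the vmcore
      # verify a fresh vmcore was written before the next iteration
      ssh root@$GUEST 'ls -lt /var/crash/*/vmcore | head -1'
  done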
Based on comment 27, this can be closed as CURRENTRELEASE. If there are any concerns, please just let me know, thanks a lot.

Great news. Thanks for rechecking this, Min.