Bug 1893787 - Guest displays a call trace when saving a snapshot
Summary: Guest displays a call trace when saving a snapshot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: ppc64le
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Target Release: 8.5
Assignee: Greg Kurz
QA Contact: Xujun Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-02 15:52 UTC by Xujun Ma
Modified: 2021-11-16 08:02 UTC
13 users

Fixed In Version: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 07:51:01 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
Red Hat Product Errata RHBA-2021:4684 (last updated 2021-11-16 07:51:22 UTC)

Description Xujun Ma 2020-11-02 15:52:08 UTC
Description of problem:
Guest crashed when loading snapshot.

Version-Release number of selected component (if applicable):
qemu-kvm-core-5.1.0-14.module+el8.3.0+8438+644aff69.ppc64le
guest:
kernel-4.18.0-240.el8.ppc64le

How reproducible:
6/10

Steps to Reproduce:
1. Boot up the guest with the following command:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pseries  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 115712  \
    -smp 80,maxcpus=80,cores=40,threads=1,sockets=2  \
    -cpu 'host' \
    -chardev socket,path=/var/tmp/avocado_7gc20iw6/monitor-qmpmonitor1-20201029-214645-5WCPsAme,nowait,server,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/var/tmp/avocado_7gc20iw6/monitor-catch_monitor-20201029-214645-5WCPsAme,nowait,server,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -chardev socket,path=/var/tmp/avocado_7gc20iw6/serial-serial0-20201029-214645-5WCPsAme,nowait,server,id=chardev_serial0 \
    -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kar/vt_test_images/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:36:2b:66:0e:bb,id=idhpXzRQ,netdev=idQ43RsY,bus=pci.0,addr=0x5  \
    -netdev tap,id=idQ43RsY,vhost=on,vhostfd=20,fd=16  \
    -vnc :0  \
    -rtc base=utc,clock=host  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm
2. Take a snapshot with the savevm command:
savevm vm_inzifM3A
3. Load the snapshot with the loadvm command:
loadvm vm_inzifM3A

Actual results:
The guest crashed with the following error messages:
2020-10-29 21:50:45: [  210.326329] rcu: INFO: rcu_sched self-detected stall on CPU
2020-10-29 21:50:45: [  210.326651] rcu: 	35-...!: (2 GPs behind) idle=946/0/0x1 softirq=778/782 fqs=0
2020-10-29 21:50:45: [  210.326765] 	(t=18508 jiffies g=2261 q=121)
2020-10-29 21:50:45: [  210.326904] Sending NMI from CPU 35 to CPUs 0:
2020-10-29 21:50:45: [  210.328348] NMI backtrace for cpu 0
2020-10-29 21:50:45: [  210.328350] CPU 0 didn't respond to backtrace IPI, inspecting paca.
2020-10-29 21:50:45: [  210.328354] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/0)
2020-10-29 21:50:45: [  210.329087] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-240.el8.ppc64le #1
2020-10-29 21:50:45: [  210.329214] Back trace of paca->saved_r1 (0xc000001c3ffabea0) (possibly stale):
2020-10-29 21:50:45: [  210.330188] NIP:  c0000000000fb2c4 LR: c000000000b247f8 CTR: c000000002710000
2020-10-29 21:50:45: [  210.330501] Call Trace:
2020-10-29 21:50:45: [  210.332157] REGS: c000000001abfa60 TRAP: 0501   Not tainted  (4.18.0-240.el8.ppc64le)
2020-10-29 21:50:45: [  210.332160] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000824  XER: 00000000
2020-10-29 21:50:45: [  210.332262] [c000001c3ffabea0] [000000003ffabef0] 0x3ffabef0 (unreliable)
2020-10-29 21:50:45: [  210.332308] Sending NMI from CPU 35 to CPUs 1:
2020-10-29 21:50:45: [  210.332337] CFAR: 000000011ae6d93c IRQMASK: 0
2020-10-29 21:50:45: [  210.332337] GPR00: 0000000
2020-10-29 21:51:39: [  210.288655] rcu: INFO: rcu_sched self-detected stall on CPU
2020-10-29 21:51:39: [  210.288749] rcu: 	30-...!: (27 GPs behind) idle=2d6/0/0x1 softirq=388/388 fqs=0
2020-10-29 21:51:39: [  210.288866] 	(t=18505 jiffies g=2261 q=120)
2020-10-29 21:51:39: [  210.289575] Sending NMI from CPU 30 to CPUs 3:
2020-10-29 21:51:39: [  210.290304] NMI backtrace for cpu 3
2020-10-29 21:51:39: [  210.290304] CPU 3 didn't respond to backtrace IPI, inspecting paca.
2020-10-29 21:51:39: [  210.290306] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/3)
2020-10-29 21:51:39: [  210.290836] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 4.18.0-240.el8.ppc64le #1
2020-10-29 21:51:39: [  210.290885] Back trace of paca->saved_r1 (0xc000001c3ff93ea0) (possibly stale):
2020-10-29 21:51:39: [  210.290887] Call Trace:
2020-10-29 21:51:39: [  210.290941] NIP:  c0000000000fb2c4 LR: c000000000b247f8 CTR: c000000ffffbae80
2020-10-29 21:51:39: [  210.291014] [c000001c3ff93ea0] [000000003ff93ef0] 0x3ff93ef0 (unreliable)
2020-10-29 21:51:39: [  210.291035] Sending NMI from CPU 30 to CPUs 6:
2020-10-29 21:51:39: [  210.291065] REGS: c000001c327f7a60 TRAP: 0501   Not tainted  (4.18.0-240.el8.ppc64le)
2020-10-29 21:51:39: [  210.291067] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42000424  XER: 20040000
2020-10-29 21:51:39: [  210.291335] CFAR: c0000000004ca930 IRQMASK: 0
2020-10-29 21:51:39: [  210.291335] GPR00: 0000000022000424 c000001c327f7ce0 c000000001ac1100 0000000000000000
2020-10-29 21:51:39: [  210.291335] GPR04: c000000001749808 0000000000000001 00000001cfd96288 0000000000000000
2020-10-29 21:51:39: [  210.291335] GPR08: 0000000000000000 0000000000000000 c0000000012b6ce0 0000000000000006
2020-10-29 21:51:39: [  210.291335] GPR12: c000000000b246d0 c000000ffffbae80
2020-10-29 21:51:39: [  210.291530] NMI backtrace for cpu 6
2020-10-29 21:51:39: [  210.291532] CPU 6 didn't respond to backtrace IPI, inspecting paca.
2020-10-29 21:51:39: [  210.291534] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/6)
2020-10-29 21:51:39: [  210.291537] Back trace of paca->saved_r1 (0xc000001c3ff7bea0) (possibly stale):
2020-10-29 21:51:39: [  210.291538] Call Trace:
2020-10-29 21:51:39: [  210.291542] [c000001c3ff7bea0] [000000003ff7bef0] 0x3ff7bef0 (unreliable)
2020-10-29 21:51:39: [  210.291573] Sending NMI from CPU 30 to CPUs 21:
2020-10-29 21:51:39: [  210.291793] NMI backtrace for cpu 21
2020-10-29 21:51:39: [  210.291794] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Not tainted 4.18.0-240.el8.ppc64le #1
2020-10-29 21:51:39: [  210.291795] CPU 21 didn't respond to backtrace IPI, inspecting paca.
2020-10-29 21:51:39: [  210.291797] NIP:  c0000000000fb2c4 LR: c000000000b247f8 CTR: c000000ffffa1280
2020-10-29 21:51:39: [  210.291798] REGS: c000001c328afa60 TRAP: 0501   Not tainted  (4.18.0-240.el8.ppc64le)
2020-10-29 21:51:39: [  210.291800] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/21)
2020-10-29 21:51:39: [  210.291802] MSR:  800000000280b033 <
2020-10-29 21:51:39: [  210.291803] Back trace of paca->saved_r1 (0xc000000006cdb990) (possibly stale):
2020-10-29 21:51:39: [  210.291804] SF
2020-10-29 21:51:39: [  210.291805] Call Trace:
2020-10-29 21:51:39: [  210.291805] ,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000424  XER: 20040000
2020-10-29 21:51:39: [  210.291808] CFAR: c000000000057708 IRQMASK: 0
2020-10-29 21:51:39: [  210.291808] GPR00: 0000000024000424 c000001c328afce0 c000000001ac1100 0000000000000000
2020-10-29 21:51:39: [  210.291808] GPR04: c000000001749808 0000000000000001 000000002fa609a8 0000000000000000
2020-10-29 21:51:39: [  210.291808] GPR08: 0000000000000000 0000000000000000 c0000000012b6ce0 0000000000000005
2020-10-29 21:51:39: [  210.291808] GPR12: c000000000b246d0 c000000ffffa1280
2020-10-29 21:51:39: [  210.291817] NIP [c0000000000fb2c4] plpar_hcall_norets+0x1c/0x28
2020-10-29 21:51:39: [  210.291820] LR [c000000000b247f8] shared_cede_loop+0x128/0x190
2020-10-29 21:51:39: [  210.291820] Call Trace:
2020-10-29 21:51:39: [  210.291823] [c000001c328afce0] [c000000000254c4c] tick_nohz_get_sleep_length+0xbc/0x120 (unreliable)
2020-10-29 21:51:39: [  210.291825] [c000001c328afd60] [c000000000b1ff84] cpuidle_enter_state+0xa4/0x810
2020-10-29 21:51:39: [  210.291826] NMI backtrace for cpu 30
2020-10-29 21:51:39: [  210.291828] [c000001c328afde0] [c000000000b20790] cpuidle_enter+0x50/0x70
2020-10-29 21:51:39: [  210.291830] CPU: 30 PID: 0 Comm: swapper/30 Kdump: loaded Not tainted 4.18.0-240.el8.ppc64le #1
2020-10-29 21:51:39: [  210.291832] [c000001c328afe20] [c0000000001b5d90] do_idle+0x400/0x4c0
2020-10-29 21:51:39: [  210.291833] Call Trace:
2020-10-29 21:51:39: [  210.291835] [c000001c328afea0] [c0000000001b6098] cpu_startup_entry+0x38/0x40
2020-10-29 21:51:39: [  210.291839] [c000001c328b3580] [c000000000e10dfc] dump_stack+0xb0/0xf4 (unreliable)
2020-10-29 21:51:39: [  210.291842] [c000001c328afed0] [c0000000000597dc] start_secondary+0x7bc/0x8f0
2020-10-29 21:51:39: [  210.291846] [c000001c328b35c0] [c000000000e1dab8] nmi_trigger_cpumask_backtrace+0x1b8/0x230
2020-10-29 21:51:39: [  210.291849] [c000001c328aff90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
2020-10-29 21:51:39: [  210.291850] Instruction dump:
2020-10-29 21:51:39: [  210.291853] [c000001c328b3660] [c00000000006d598] arch_trigger_cpumask_backtrace+0x28/0x40
2020-10-29 21:51:39: [  210.291854] f9490000 38210060
2020-10-29 21:51:39: [  210.291857] [c000001c328b3680] [c00000000022141c] rcu_dump_cpu_stacks+0xf8/0x154
2020-10-29 21:51:39: [  210.291857] e8010010
2020-10-29 21:51:39: [  210.291858] 7c0803a6 4e800020 3c4c019c 38425e58 7c421378
2020-10-29 21:51:39: [  210.291861] 7c000026 90010008 60000000 44000022 <80010008> 7c0ff120 4e800020
2020-10-29 21:51:39: [  210.291868] [c000001c328b36d0] [c00000000021fbcc] rcu_sched_clock_irq+0x81c/0xa20
2020-10-29 21:51:39: [  210.291868] 7c0802a6
2020-10-29 21:51:39: [  210.291871] [c000001c328b37a0] [c00000000023614c] update_process_times+0x8c/0x100
2020-10-29 21:51:39: [  210.291874] [c000001c328b37d0] [c000000000253884] tick_sched_timer+0x74/0x130
2020-10-29 21:51:39: [  210.291876] [c000001c328b3810] [c0000000002370fc] __hrtimer_run_queues+0x17c/0x480
2020-10-29 21:51:39: [  210.291879] [c000001c328b38b0] [c000000000239094] hrtimer_interrupt+0x144/0x520
2020-10-29 21:51:39: [  210.291881] [c000001c328b3990] [c000000000027e84] timer_interrupt+0x104/0x2f0
2020-10-29 21:51:39: [  210.291884] [c000001c328b39f0] [c0000000000091d4] decrementer_common+0x114/0x120
2020-10-29 21:51:39: [  210.291888] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
2020-10-29 21:51:39: [  210.291888]     LR = shared_cede_loop+0x128/0x190
2020-10-29 21:51:39: [  210.291893] NIP [c0000000000fb2c4] plpar_hcall_norets+0x1c/0x28
2020-10-29 21:51:39: [  210.291898] LR [c000000000b247f8] shared_cede_loop+0x128/0x190
2020-10-29 21:51:39: [  210.291899] [c000001c328b3ce0] [0000000000000800] 0x800 (unreliable)
2020-10-29 21:51:39: [  210.291902] [c000001c328b3d60] [c000000000b1ff84] cpuidle_enter_state+0xa4/0x810
2020-10-29 21:51:39: [  210.291904] [c000001c328b3de0] [c000000000b20790] cpuidle_enter+0x50/0x70
2020-10-29 21:51:39: [  210.291907] [c000001c328b3e20] [c0000000001b5d90] do_idle+0x400/0x4c0
2020-10-29 21:51:39: [  210.291909] [c000001c328b3ea0] [c0000000001b6098] cpu_startup_entry+0x38/0x40
2020-10-29 21:51:39: [  210.291912] [c000001c328b3ed0] [c0000000000597dc] start_secondary+0x7bc/0x8f0
2020-10-29 21:51:39: [  210.291915] [c000001c328b3f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
2020-10-29 21:51:39: [  210.291938] Sending NMI from CPU 30 to CPUs 31:
2020-10-29 21:51:39: [  210.292011] CPU: 6 PID: 0 Comm: swapper/6 Kdump: loaded Not tainted 4.18.0-240.el8.ppc64le #1
2020-10-29 21:51:39: [  210.292241] Call Trace:
2020-10-29 21:51:39: [  210.292372] NIP:  c0000000000fb2c4 LR: c000000000b247f8 CTR: c000000ffffb7a00
2020-10-29 21:51:39: [  210.292698] [c000001c327f7ce0] [c000000000254c4c] tick_nohz_get_sleep_length+0xbc/0x120 (unreliable)
2020-10-29 21:51:39: [  210.292733] REGS: c000001c327c3a60 TRAP: 0501   Not tainted  (4.18.0-240.el8.ppc64le)
2020-10-29 21:51:39: [  210.292823] [c000001c327f7d60] [c000000000b1ff84] cpuidle_enter_state+0xa4/0x810
2020-10-29 21:51:39: [  210.292894] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000824  XER: 00000000
2020-10-29 21:51:39: [  210.293032] [c000001c327f7de0] [c000000000b20790] cpuidle_enter+0x50/0x70
2020-10-29 21:51:39: [  210.293252] CFAR: 00007fff86a3e86c IRQMASK: 0
2020-10-29 21:51:39: [  210.293252] GPR00: 0000000024000824 c000001c327c3ce0 c000000001ac1100 0000000000000000
2020-10-29 21:51:39: [  210.293252] GPR04: c000000001749808 0000000000000001 0000000055b094e0 0000000000000000
2020-10-29 21:51:39: [  210.293252] GPR08: 0000000000000000 0000000000000000 c0000000012b6ce0 0000000000000005
2020-10-29 21:51:39: [  210.293252] GPR12: c000000000b246d0 c000000ffffb7a00
2020-10-29 21:51:39: [  210.293358] [c000001c327f7e20] [c0000000001b5d90] do_idle+0x400/0x4c0
2020-10-29 21:51:39: [  210.293557] NIP [c0000000000fb2c4] plpar_hcall_norets+0x1c/0x28
2020-10-29 21:51:39: [  210.293562] LR [c000000000b247f8] shared_cede_loop+0x128/0x190
2020-10-29 21:51:39: [  210.293674] [c000001c327f7ea0] [c0000000001b609c] cpu_startup_entry+0x3c/0x40
2020-10-29 21:51:39: [  210.293741] Call Trace:
2020-10-29 21:51:39: [  210.293781] [c000001c327f7ed0] [c0000000000597dc] start_secondary+0x7bc/0x8f0
2020-10-29 21:51:39: [  210.294003] [c000001c327c3ce0] [c000000000254c5c] tick_nohz_get_sleep_length+0xcc/0x120 (unreliable)
2020-10-29 21:51:39: [  210.294029] [c000001c327f7f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
2020-10-29 21:51:39: [  210.294105] [c000001c327c3d60] [c000000000b1ff84] cpuidle_enter_state+0xa4/0x810
2020-10-29 21:51:39: [  210.294178] Instruction dump:
2020-10-29 21:51:39: [  210.294524] [c000001c327c3de0] [c000000000b20790] cpuidle_enter+0x50/0x70
2020-10-29 21:51:39: [  210.294626] f9490000 38210060 e8010010 7c0803a6 4e800020 3c4c019c 38425e58 7c421378
2020-10-29 21:51:39: [  210.294682] [c000001c327c3e20] [c0000000001b5d90] do_idle+0x400/0x4c0
2020-10-29 21:51:39: [  210.294705] 7c000026 90010008 60000000 44000022 <80010008> 7c0ff120 4e800020 7c0802a6
2020-10-29 21:51:39: [  210.294780] [c000001c327c3ea0] [c0000000001b609c] cpu_startup_entry+0x3c/0x40
2020-10-29 21:51:39: [  210.300084] [c000001c327c3ed0] [c0000000000597dc] start_secondary+0x7bc/0x8f0
2020-10-29 21:51:39: [  210.300158] [c000001c327c3f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
2020-10-29 21:51:39: [  210.300230] Instruction dump:
2020-10-29 21:51:39: [  210.300269] f9490000 38210060 e8010010 7c0803a6 4e800020 3c4c019c 38425e58 7c421378
2020-10-29 21:51:39: [  210.300345] 7c000026 90010008 60000000 44000022 <80010008> 7c0ff120 4e800020 7c0802a6
2020-10-29 21:51:45: [  215.759229] CPU 31 didn't respond to backtrace IPI, inspecting paca.
2020-10-29 21:51:45: [  215.759299] irq_soft_mask: 0x00 in_mce: 0 in_nmi: 0 current: 0 (swapper/31)
2020-10-29 21:51:45: [  215.759359] Back trace of paca->saved_r1 (0xc000001c10a6f6b0) (possibly stale):
2020-10-29 21:51:45: [  215.759420] Call Trace:

Expected results:
The snapshot loads successfully without errors.

Additional info:

Comment 1 Qunfang Zhang 2020-11-03 00:49:32 UTC
Xujun,

Is this bug a ppc64le-specific issue? Can you re-test on x86 according to our bug filing workflow? I'm also involving Ademar and Kevin from the block area.

Thanks,
Qunfang

Comment 2 Kevin Wolf 2020-11-03 08:47:22 UTC
Can you check whether the same problem is reproducible with migration to a file instead of savevm/loadvm?
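
For reference, a minimal way to run that check from the HMP monitor might look like this (not from the original report; the state file path is a placeholder):

(qemu) stop
(qemu) migrate "exec:cat > /tmp/vmstate"
(qemu) quit

and then restore the state into a fresh QEMU started with the same command line plus -incoming "exec:cat /tmp/vmstate".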

Comment 3 Ademar Reis 2020-11-03 14:12:40 UTC
(In reply to Xujun Ma from comment #0)
> 2.do snapshot with savevm command
> savevm vm_inzifM3A
> 3.do load snapshot with command
> loadvm vm_inzifM3A
> 

Xujun and Qunfang:

savevm and loadvm (internal snapshots) are explicitly not supported in RHEL-AV and declared deprecated in RHEL. Please update the tests.

I'll leave this BZ open for now just because of the needinfo from Kevin to test this with migration to file:

(In reply to Kevin Wolf from comment #2)
> Can you try if the same problem is reproducible with migration to file
> instead of savevm/loadvm?

Comment 4 Xujun Ma 2020-11-05 02:15:27 UTC
(In reply to Ademar Reis from comment #3)
> (In reply to Xujun Ma from comment #0)
> > 2.do snapshot with savevm command
> > savevm vm_inzifM3A
> > 3.do load snapshot with command
> > loadvm vm_inzifM3A
> > 
> 
> Xujun and Qunfang:
> 
> savevm and loadvm (internal snapshots) are explicitly not supported in
> RHEL-AV and declared deprecated in RHEL. Please update the tests.
> 
> I'll leave this BZ open for now just because of the needinfo from Kevin to
> test this with migration to file:
> 
> (In reply to Kevin Wolf from comment #2)
> > Can you try if the same problem is reproducible with migration to file
> > instead of savevm/loadvm?

Hi
This test case was added because of bug 1734287, which IBM filed against RHEL 8.1.

I have tested migration to a file and did not hit this bug there.

Comment 6 Kevin Wolf 2020-11-06 09:06:18 UTC
(In reply to Xujun Ma from comment #4)
> I have tested migrate to file and didn't hit this bug again.

This is surprising; it looked more like a migration bug at first sight.

Does it still happen when you don't use the "loadvm" monitor command in the running QEMU instance, but you restart QEMU with -loadvm?
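
For reference, that scenario would be roughly: save the snapshot in the running instance, quit QEMU, then start a new instance with the same command line plus the -loadvm option (the snapshot tag below is the one from the steps in the description, "<same options as in the description>" is a placeholder):

(qemu) savevm vm_inzifM3A
(qemu) quit

/usr/libexec/qemu-kvm <same options as in the description> -loadvm vm_inzifM3A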

Comment 9 Laurent Vivier 2020-11-06 15:46:24 UTC
@xuma

Did you test on P8 or P9?

I guess it might be on P9, so I would guess a problem with the state of the XIVE (we had one in the past)

Could you test on P8?

@gkurz

Are you aware of a problem with XIVE migration?

Comment 10 Xujun Ma 2020-11-09 02:47:47 UTC
(In reply to Kevin Wolf from comment #6)
> (In reply to Xujun Ma from comment #4)
> > I have tested migrate to file and didn't hit this bug again.
> 
> This is surprising, it did look more like a migration bug at first sight.
> 
> Does it still happen when you don't use the "loadvm" monitor command in the
> running QEMU instance, but you restart QEMU with -loadvm?

Hi
I have confirmed that it happens when using savevm; there is no crash, the guest just displays the call trace messages.
So far it only happens on guests with emulated XIVE.

Comment 11 Xujun Ma 2020-11-09 02:57:35 UTC
(In reply to Ademar Reis from comment #3)
> (In reply to Xujun Ma from comment #0)
> > 2.do snapshot with savevm command
> > savevm vm_inzifM3A
> > 3.do load snapshot with command
> > loadvm vm_inzifM3A
> > 
> 
> Xujun and Qunfang:
> 
> savevm and loadvm (internal snapshots) are explicitly not supported in
> RHEL-AV and declared deprecated in RHEL. Please update the tests.
> 
> I'll leave this BZ open for now just because of the needinfo from Kevin to
> test this with migration to file:
> 
> (In reply to Kevin Wolf from comment #2)
> > Can you try if the same problem is reproducible with migration to file
> > instead of savevm/loadvm?

Hi
I tried the libvirt snapshot command and reproduced this issue.
Although savevm and loadvm are not supported on the QEMU side, users will still use them through libvirt.
So I think this bug should be fixed.
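
For reference, a sketch of the libvirt flow meant here (domain and snapshot names are placeholders); for a running guest with qcow2 disks, this typically creates an internal snapshot that includes the memory state:

virsh snapshot-create-as <domain> snap1
virsh snapshot-revert <domain> snap1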

Comment 12 Xujun Ma 2020-11-09 03:01:14 UTC
(In reply to Laurent Vivier from comment #9)
> @xuma
> 
> Did you test on P8 or P9?
P9 Boston host.
> 
> I guess it might be on P9, so I would guess a problem with the state of the
> XIVE (we had one in the past)
Yes, it happens on guests with emulated XIVE so far.
> 
> Could you test on P8?
OK, I will update with the result later.
> 
> @gkurz
> 
> Are you aware of a problem with XIVE migration?

Comment 13 Xujun Ma 2020-11-09 05:32:12 UTC
(In reply to Laurent Vivier from comment #9)
> @xuma
> 
> Did you test on P8 or P9?
> 
> I guess it might be on P9, so I would guess a problem with the state of the
> XIVE (we had one in the past)
> 
> Could you test on P8?
I found the same problem when doing savevm with a P8 guest on a P8 host
(irqchip: in-kernel XICS).
Call trace message:
[  312.760378] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  312.760538] rcu: 	0-...!: (40 GPs behind) idle=b50/0/0x0 softirq=6285/6285 fqs=0 
[  312.760630] rcu: 	1-...!: (60 GPs behind) idle=1f0/0/0x0 softirq=9650/9650 fqs=0 
[  312.760705] rcu: 	3-...!: (0 ticks this GP) idle=3b6/0/0x1 softirq=10948/10948 fqs=0 
[  312.760779] 	(detected by 2, t=19837 jiffies, g=9229, q=1)
[  312.760839] Sending NMI from CPU 2 to CPUs 0:
[  312.761275] CPU 0 didn't respond to backtrace IPI, inspecting paca.
[  312.761279] NMI backtrace for cpu 0
[  312.761353] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 1832 (gnome-keyring-d)
[  312.761395] CPU: 0 PID: 1832 Comm: gnome-keyring-d Kdump: loaded Not tainted 4.18.0-222.el8.ppc64le #1
[  312.761460] Back trace of paca->saved_r1 (0xc000000001abf990) (possibly stale):
[  312.761529] NIP:  c0000000001a3608 LR: c0000000001a35d4 CTR: 0000000000007ffe
[  312.761604] Call Trace:
[  312.761664] REGS: c0000000feebb4a0 TRAP: 0501   Not tainted  (4.18.0-222.el8.ppc64le)
[  312.761666] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24008248  XER: 00000000
[  312.761698] Sending NMI from CPU 2 to CPUs 1:
[  312.761754] CFAR: c00000000000dd1c IRQMASK: 0 
               GPR00: c0000000001a35d4 c0000000feebb720 c000000001ac0800 0000000000000000 
               GPR04: 00000000fe450000 00000027446fa06e 000000274484aad4 c000000002700000 
               GPR08: 0000000000000000 c0000000eacf0200 c000000005149a4c c000000002700004 
               GPR12: 0000000000008800 c000000002700000 
[  312.761852] NMI backtrace for cpu 1
[  312.761853] CPU 1 didn't respond to backtrace IPI, inspecting paca.
[  312.761853] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/1)
[  312.761855] Back trace of paca->saved_r1 (0xc0000000f1ec3630) (possibly stale):
[  312.761856] Call Trace:
[  312.761858] Sending NMI from CPU 2 to CPUs 3:
[  312.761881] CPU 3 didn't respond to backtrace IPI, inspecting paca.
[  312.761882] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 0 (swapper/3)
[  312.761886] Back trace of paca->saved_r1 (0xc0000000e816f470) (possibly stale):
[  312.761919] NIP [c0000000001a3608] finish_task_switch+0xf8/0x390
[  312.761921] LR [c0000000001a35d4] finish_task_switch+0xc4/0x390
[  312.761922] Call Trace:
[  312.761924] [c0000000feebb720] [c0000000001a35d4] finish_task_switch+0xc4/0x390 (unreliable)
[  312.761935] [c0000000feebb7d0] [c000000000e3648c] __schedule+0x2dc/0x980
[  312.761937] [c0000000feebb8a0] [c000000000e36b98] schedule+0x68/0x130
[  312.761939] [c0000000feebb8d0] [c000000000e3ca98] schedule_hrtimeout_range+0xa8/0x150
[  312.761949] [c0000000feebb970] [c000000000549444] do_sys_poll+0x584/0x760
[  312.761951] [c0000000feebbdc0] [c00000000054a1c4] sys_poll+0x64/0x1b0
[  312.761953] [c0000000feebbe30] [c00000000000b408] system_call+0x5c/0x70
[  312.761954] Instruction dump:
[  312.761969] e92d0960 e92d0960 60000000 2fbe0000 419e0040 e92d0960 e92905d0 7fbe4800 
[  312.761972] 409e0008 813e2488 395e004c 7c0004ac <7d205028> 3129ffff 7d20512d 40c2fff4 
[  312.762151] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 4.18.0-222.el8.ppc64le #1
[  312.762186] Call Trace:
[  312.762235] NIP:  c000000000e3df68 LR: c000000000e3dfe8 CTR: c00000003ffcee80
[  312.762297] rcu: rcu_sched kthread starved for 19837 jiffies! g9229 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  312.762355] REGS: c0000000fab3b500 TRAP: 0501   Not tainted  (4.18.0-222.el8.ppc64le)
[  312.762376] rcu: RCU grace-period kthread stack dump:
[  312.762378] rcu_sched       I    0    10      2 0x00000808
[  312.762418] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24000284  XER: 00000000
[  312.762469] Call Trace:
[  312.762521] CFAR: c000000000e3df74 IRQMASK: 0 
               GPR00: c000000000e3dfe8 c0000000fab3b780 c000000001ac0800 c0000000016d4280 
               GPR04: 0000000000000002 00000000000000a2 0000000000000004 c000000001af1e78 
               GPR08: 0000000000000010 0000000080000002 0000000080000002 0000000000000000 
               GPR12: c00000000021c690 c00000003ffcee80 
[  312.762601] [c0000000fab17890] [00000000ffffb7bf] 0xffffb7bf (unreliable)
[  312.762637] NIP [c000000000e3df68] _raw_spin_lock_irqsave+0x98/0x120
[  312.762640] LR [c000000000e3dfe8] _raw_spin_lock_irqsave+0x118/0x120
[  312.762695] [c0000000fab17a60] [c000000000020630] __switch_to+0x2e0/0x4f0
[  312.762713] Call Trace:
[  312.762787] [c0000000fab17ac0] [c000000000e36484] __schedule+0x2d4/0x980
[  312.762838] [c0000000fab3b780] [c000000000e3dfe8] _raw_spin_lock_irqsave+0x118/0x120 (unreliable)
[  312.762890] [c0000000fab17b90] [c000000000e36b98] schedule+0x68/0x130
[  312.762951] [c0000000fab3b7c0] [c00000000021bc7c] rcu_core+0x10c/0x7c0
[  312.763002] [c0000000fab17bc0] [c000000000e3c478] schedule_timeout+0x248/0x480
[  312.763054] [c0000000fab3b860] [c000000000e3eab8] __do_softirq+0x178/0x420
[  312.763105] [c0000000fab17ce0] [c00000000021abf8] rcu_gp_kthread+0xb18/0x11a0
[  312.763108] [c0000000fab17dc0] [c0000000001946bc] kthread+0x1ac/0x1c0
[  312.763140] [c0000000fab3b960] [c000000000162c34] irq_exit+0x184/0x1c0
[  312.763202] [c0000000fab17e30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80
[  312.763263] [c0000000fab3b990] [c00000000002ac28] timer_interrupt+0x128/0x2f0
[  312.764829] [c0000000fab3b9f0] [c0000000000091d4] decrementer_common+0x114/0x120
[  312.764906] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
                   LR = shared_cede_loop+0x128/0x190
[  312.764992] [c0000000fab3bce0] [c000000000252a5c] tick_nohz_get_sleep_length+0xcc/0x120 (unreliable)
[  312.765068] [c0000000fab3bd60] [c000000000b1cf44] cpuidle_enter_state+0xa4/0x810
[  312.765132] [c0000000fab3bde0] [c000000000b1d750] cpuidle_enter+0x50/0x70
[  312.765187] [c0000000fab3be20] [c0000000001b43a0] do_idle+0x400/0x4c0
[  312.765241] [c0000000fab3bea0] [c0000000001b46a8] cpu_startup_entry+0x38/0x40
[  312.765306] [c0000000fab3bed0] [c00000000005c43c] start_secondary+0x7bc/0x8f0
[  312.765369] [c0000000fab3bf90] [c00000000000ac70] start_secondary_prolog+0x10/0x14
[  312.765431] Instruction dump:
[  312.765464] 60000000 60000000 7c0802a6 fbe10038 7c7f1b78 fba10028 f8010050 8bad098a 
[  312.765525] 7fc3f378 4b1dce8d 60000000 7c210b78 <48000078> 813f0000 2fa90000 409efff0 

> 
> @gkurz
> 
> Are you aware of a problem with XIVE migration?

Comment 14 Greg Kurz 2020-11-12 08:13:39 UTC
(In reply to Laurent Vivier from comment #9)
> 
> @gkurz
> 
> Are you aware of a problem with XIVE migration?

I'm not aware of any.

Comment 15 Qunfang Zhang 2020-11-24 02:56:12 UTC
(In reply to Kevin Wolf from comment #6)
> (In reply to Xujun Ma from comment #4)
> > I have tested migrate to file and didn't hit this bug again.
> 
> This is surprising, it did look more like a migration bug at first sight.
> 
> Does it still happen when you don't use the "loadvm" monitor command in the
> running QEMU instance, but you restart QEMU with -loadvm?

Xujun, there's still a scenario that you haven't tried.  Can you confirm?  Thanks.

Comment 16 Xujun Ma 2020-11-24 03:17:01 UTC
(In reply to Qunfang Zhang from comment #15)
> (In reply to Kevin Wolf from comment #6)
> > (In reply to Xujun Ma from comment #4)
> > > I have tested migrate to file and didn't hit this bug again.
> > 
> > This is surprising, it did look more like a migration bug at first sight.
> > 
> > Does it still happen when you don't use the "loadvm" monitor command in the
> > running QEMU instance, but you restart QEMU with -loadvm?
> 
> Xujun, there's still a scenario that you haven't tried.  Can you confirm? 
> Thanks.

Hi Qunfang

I have replied to confirm that this bug happened when using the savevm command, not loadvm.

Comment 17 Kevin Wolf 2020-11-24 08:58:25 UTC
(In reply to Xujun Ma from comment #16)
> I have replied that confrim this bug happended when using savevm command not
> loadvm.

Oh, sorry, I must have misunderstood then. So you save a snapshot and without loading it again, you already get the messages in the guest log?

In theory, savevm shouldn't change much about the state. What it does change, though, is timing, because the VM is stopped during savevm. I wonder if that could be the problem? Does a stop/cont lasting about as long as savevm takes show the same problem?
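
For reference, such a check could be done from the HMP monitor roughly like this (the wait in the middle is arbitrary, just comparable to how long a savevm takes):

(qemu) stop
  <wait for roughly as long as a savevm takes>
(qemu) cont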

Comment 18 Xujun Ma 2020-12-01 02:42:18 UTC
(In reply to Kevin Wolf from comment #17)
> (In reply to Xujun Ma from comment #16)
> > I have replied that confrim this bug happended when using savevm command not
> > loadvm.
> 
> Oh, sorry, I must have misunderstood then. So you save a snapshot and
> without loading it again, you already get the messages in the guest log?
Sorry for the late reply. Yes, just doing savevm produces the error messages.
> 
> savevm shouldn't in theory change much about the state. What it does change,
> though, is timing because the VM is stopped during savevm. I wonder if that
> could be the problem? Does stop/cont for a similar time as savevm takes show
> the same problem? 
I actually hit a similar problem when stopping/continuing the VM with signals (kill -19/-18 on the QEMU pid) rather than with the HMP stop command.
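
For reference, a sketch of that signal-based stop/cont (the PID lookup and the sleep duration are placeholders):

kill -STOP $(pidof qemu-kvm)   # same as kill -19, freezes the QEMU process
sleep 30                       # pause for roughly as long as a savevm takes
kill -CONT $(pidof qemu-kvm)   # same as kill -18, resumes it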

Comment 19 Kevin Wolf 2020-12-01 14:45:02 UTC
(In reply to Xujun Ma from comment #18)
> I actually hit the similar problem when doing stop/cont vm by signal(kill
> -19/18 pidofqemu) not hmp stop cmdline.

Ok, then it looks like it's neither related to disk snapshots nor to migration, but to paused VMs. Maybe pausing the VM messes with the timekeeping on this machine.

Greg, do you want to take a closer look, or do you know who would be the right person?

Comment 20 Dr. David Alan Gilbert 2020-12-01 15:05:04 UTC
The two are not necessarily the same problem, but they are related.
You've got multiple timers kicking off, and if the guest can see the disruption,
then it can trigger the stall detection.
However, if we get it right on a migration and all the timers get saved and
restored coherently, then the destination might not be able to tell.

Comment 21 Kevin Wolf 2020-12-01 16:33:43 UTC
Note that the "savevm" case shows the problem even without a "loadvm". So in migration terms, even the source has the problem when you resume it after the migration has completed. Which makes sense because in the "savevm" case, the VM is stopped while we're saving the memory content, and I assume the VM state including the timer is still only saved at the end of the whole process when the VM has already been stopped for a significant time. So if just stopping the VM already causes the problem, "savevm" has to cause it, too.

Comment 22 Xujun Ma 2020-12-02 10:46:28 UTC
(In reply to Dr. David Alan Gilbert from comment #20)
> The two are not necessarily the same problem, but related.
> You've got multiple timers kicking off, and if the guest can see the
> distruption
> then it can trigger the stalled problem.
> However, if we get it right on a migration and all the timers get saved and
> restored
> coherently, then the destination might not be able to tell.

Shall we file a new bug for the stop/continue issue?

Comment 23 Xujun Ma 2020-12-02 11:40:52 UTC
It displays call trace messages when continuing a paused VM, as follows:

[   29.513290] fuse: init (API version 7.31)
[   90.414603] rcu: INFO: rcu_sched self-detected stall on CPU
[   90.414663] rcu: 	0-...!: (1 ticks this GP) idle=4be/0/0x1 softirq=7249/7249 fqs=0 
[   90.414708] 	(t=6032 jiffies g=4441 q=20)
[   90.414731] rcu: rcu_sched kthread starved for 6032 jiffies! g4441 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[   90.414785] rcu: RCU grace-period kthread stack dump:
[   90.414815] rcu_sched       R  running task        0    10      2 0x00000808
[   90.414860] Call Trace:
[   90.415047] [c0000000f9aeb890] [0000000000000002] 0x2 (unreliable)
[   90.415131] [c0000000f9aeba60] [c000000000020660] __switch_to+0x2e0/0x4f0
[   90.415216] [c0000000f9aebac0] [c000000000e67604] __schedule+0x2d4/0x9b0
[   90.415256] [c0000000f9aebb90] [c000000000e67d48] schedule+0x68/0x130
[   90.415294] [c0000000f9aebbc0] [c000000000e6d54c] schedule_timeout+0x27c/0x480
[   90.415337] [c0000000f9aebce0] [c000000000228d78] rcu_gp_kthread+0xb98/0x1260
[   90.415381] [c0000000f9aebdc0] [c00000000019ee00] kthread+0x1b0/0x1c0
[   90.415418] [c0000000f9aebe30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80
[   90.415462] NMI backtrace for cpu 0
[   90.415486] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-255.el8.ppc64le #1
[   90.415535] Call Trace:
[   90.415552] [c000000001b1f560] [c000000000e3d40c] dump_stack+0xb0/0xf4 (unreliable)
[   90.415596] [c000000001b1f5a0] [c000000000e4a0a8] nmi_trigger_cpumask_backtrace+0x1b8/0x230
[   90.415653] [c000000001b1f640] [c00000000006e048] arch_trigger_cpumask_backtrace+0x28/0x40
[   90.415700] [c000000001b1f660] [c00000000022df7c] rcu_dump_cpu_stacks+0xf8/0x154
[   90.415748] [c000000001b1f6b0] [c00000000022c240] rcu_sched_clock_irq+0xa20/0xc20
[   90.415797] [c000000001b1f790] [c000000000242dd0] update_process_times+0x90/0x120
[   90.415845] [c000000001b1f7d0] [c000000000260804] tick_sched_timer+0x74/0x130
[   90.415893] [c000000001b1f810] [c000000000243d9c] __hrtimer_run_queues+0x17c/0x480
[   90.415940] [c000000001b1f8b0] [c000000000245d34] hrtimer_interrupt+0x144/0x520
[   90.415988] [c000000001b1f990] [c0000000000280b4] timer_interrupt+0x104/0x2f0
[   90.416036] [c000000001b1f9f0] [c0000000000091d4] decrementer_common+0x114/0x120
[   90.416088] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[   90.416088]     LR = shared_cede_loop+0x128/0x190
[   90.416194] [c000000001b1fce0] [c000000000261b5c] tick_nohz_get_sleep_length+0xcc/0x120 (unreliable)
[   90.416249] [c000000001b1fd60] [c000000000b3a914] cpuidle_enter_state+0xa4/0x810
[   90.416297] [c000000001b1fde0] [c000000000b3b120] cpuidle_enter+0x50/0x70
[   90.416339] [c000000001b1fe20] [c0000000001bea10] do_idle+0x400/0x4c0
[   90.416379] [c000000001b1fea0] [c0000000001bed18] cpu_startup_entry+0x38/0x40
[   90.416426] [c000000001b1fed0] [c0000000000105b4] rest_init+0xe0/0xf8
[   90.416468] [c000000001b1ff00] [c0000000012044a0] start_kernel+0x680/0x6bc
[   90.416516] [c000000001b1ff90] [c00000000000adbc] start_here_common+0x1c/0x560
[   90.416582] Sending NMI from CPU 0 to CPUs 2:
[   90.416660] NMI backtrace for cpu 2
[   90.416661] CPU 2 didn't respond to backtrace IPI, inspecting paca.
[   90.416665] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 1042 (pool)
[   90.416708] CPU: 2 PID: 1042 Comm: pool Kdump: loaded Not tainted 4.18.0-255.el8.ppc64le #1
[   90.416761] Back trace of paca->saved_r1 (0xc0000000fff57ea0) (possibly stale):
[   90.416763] Call Trace:
[   90.416811] NIP:  c000000000101bd8 LR: c0000000000b2904 CTR: c0000000fffcdf00
[   90.416870] [c0000000fff57ea0] [00000000fff57ef0] 0xfff57ef0 (unreliable)
[   90.416930] REGS: c0000000f0147470 TRAP: 0501   Not tainted  (4.18.0-255.el8.ppc64le)
[   90.416951] Sending NMI from CPU 0 to CPUs 3:
[   90.417009] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48004482  XER: 00000000
[   90.417016] CFAR: c0000000000bf254 IRQMASK: 0 
[   90.417016] GPR00: 0000000048004482 c0000000f01476f0 c000000001b23800 0000000000000000 
[   90.417016] GPR04: 0000000000000000 00000000000105a3 00000000021da4df c000000001b51d80 
[   90.417016] GPR08: 0000000000000000 c000000002770000 0000000080000000 00000000fffffffe 
[   90.417016] GPR12: 0000000000004000 c0000000fffcdf00 
[   90.417273] NMI backtrace for cpu 3
[   90.417274] CPU 3 didn't respond to backtrace IPI, inspecting paca.
[   90.417275] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 948 (dbus-daemon)
[   90.417277] Back trace of paca->saved_r1 (0xc0000000fff4fea0) (possibly stale):
[   90.417277] Call Trace:
[   90.417279] [c0000000fff4fea0] [00000000fff4fef0] 0xfff4fef0 (unreliable)
[   90.417285] NIP [c000000000101bd8] plpar_hcall_norets+0x1c/0x28
[   90.417289] LR [c0000000000b2904] __spin_yield+0x94/0xb0
[   90.417530] CPU: 3 PID: 948 Comm: dbus-daemon Kdump: loaded Not tainted 4.18.0-255.el8.ppc64le #1
[   90.417557] Call Trace:
[   90.417594] NIP:  c000000000e6f008 LR: c000000000e6f088 CTR: c0000000fffccb00
[   90.417648] [c0000000f01476f0] [c0000000ffd43320] 0xc0000000ffd43320 (unreliable)
[   90.417653] [c0000000f0147750] [c000000000e6f088] _raw_spin_lock_irqsave+0x118/0x120
[   90.417694] REGS: c0000000047ff970 TRAP: 0501   Not tainted  (4.18.0-255.el8.ppc64le)
[   90.417695] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24022484  XER: 00000000
[   90.417715] [c0000000f0147790] [c00000000022a3b0] rcu_core+0x110/0x640
[   90.417752] CFAR: c000000000e6f014 IRQMASK: 0 
[   90.417752] GPR00: c000000000e6f088 c0000000047ffbf0 c000000001b23800 c000000001723f00 
[   90.417752] GPR04: 0000000000000000 00000000000105be 0000000000000800 c000000001b51d80 
[   90.417752] GPR08: 0000000000000000 0000000080000000 0000000080000000 c000000001b522d8 
[   90.417752] GPR12: 0000000000002000 c0000000fffccb00 
[   90.417799] [c0000000f0147810] [c000000000e6fb4c] __do_softirq+0x16c/0x400
[   90.417830] NIP [c000000000e6f008] _raw_spin_lock_irqsave+0x98/0x120
[   90.417832] LR [c000000000e6f088] _raw_spin_lock_irqsave+0x118/0x120
[   90.417893] [c0000000f0147900] [c00000000016ce54] irq_exit+0x184/0x1c0
[   90.417910] Call Trace:
[   90.417965] [c0000000f0147930] [c0000000000280d8] timer_interrupt+0x128/0x2f0
[   90.418012] [c0000000047ffbf0] [c000000000e6f088] _raw_spin_lock_irqsave+0x118/0x120 (unreliable)
[   90.418080] [c0000000f0147990] [c0000000000091d4] decrementer_common+0x114/0x120
[   90.418127] [c0000000047ffc30] [c00000000022a3b0] rcu_core+0x110/0x640
[   90.418129] [c0000000047ffcb0] [c000000000e6fb4c] __do_softirq+0x16c/0x400
[   90.418198] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[   90.418198]     LR = arch_local_irq_restore+0x74/0x90
[   90.418200] [c0000000f0147c80] [c0000000f0147cc0] 0xc0000000f0147cc0 (unreliable)
[   90.418239] [c0000000047ffda0] [c00000000016ce54] irq_exit+0x184/0x1c0
[   90.418505] [c0000000f0147ca0] [c000000000e6f0e0] _raw_spin_unlock_irqrestore+0x50/0x80
[   90.418543] [c0000000047ffdd0] [c0000000000280d8] timer_interrupt+0x128/0x2f0
[   90.419082] [c0000000f0147cc0] [c0000000001bb668] sched_move_task+0xb8/0x250
[   90.419115] [c0000000047ffe30] [c0000000000091d4] decrementer_common+0x114/0x120
[   90.419150] [c0000000f0147d20] [c0000000001edb90] sched_autogroup_exit_task+0x20/0x40
[   90.419164] Instruction dump:
[   90.419206] [c0000000f0147d40] [c00000000016a9a0] do_exit+0x4e0/0xd60
[   90.419251] 60000000 60000000 7c0802a6 fbe10038 7c7f1b78 fba10028 f8010050 8bad098a 
[   90.419293] [c0000000f0147e10] [c00000000016b284] sys_exit+0x24/0x30
[   90.419326] 7fc3f378 4b1abdad 60000000 7c210b78 <48000078> 813f0000 2fa90000 409efff0 
[   90.419362] [c0000000f0147e30] [c00000000000b408] system_call+0x5c/0x70
[   90.419951] Instruction dump:
[   90.419976] 78840020 38210060 e8010010 7c0803a6 4bfffeb8 3c4c01a2 38421c44 7c421378 
[   90.420018] 7c000026 90010008 60000000 44000022 <80010008> 7c0ff120 4e800020 7c0802a6

Comment 24 Kevin Wolf 2020-12-02 13:34:43 UTC
(In reply to Xujun Ma from comment #22)
> Shall we file a new bug for the stop/continue issue?

No, not for now I would say. I think we should first look for a fix for the stop/cont issue. And when it is fixed, I expect that the savevm bug will automatically be fixed by the same change, too.

If in the end we have a patch and it turns out that it fixed one problem, but not the other, we can still clone the bug.

Comment 25 Greg Kurz 2020-12-02 16:53:54 UTC
(In reply to Kevin Wolf from comment #19)
> (In reply to Xujun Ma from comment #18)
> > I actually hit the similar problem when doing stop/cont vm by signal(kill
> > -19/18 pidofqemu) not hmp stop cmdline.
> 

On POWER we have a VM state change handler that lets us measure the
downtime across a VM stop/cont and adjust the guest timebase, thus
hiding the downtime from the guest.

Stopping the QEMU process for a certain time is a completely
different matter since the adjustment isn't done. It is thus
expected that the guest detects stalls in this case.

> Ok, then it looks like it's neither related to disk snapshots nor to
> migration, but to paused VMs. Maybe pausing the VM messes with the
> timekeeping on this machine.
> 

It might be related to both actually.

> Greg, do you want take a closer look maybe, or do you know who would be the
> right person?

The issue is that we have a pre-save handler for the guest timebase
that basically calls the same code as for VM stop. In the case of
savevm, this replaces the data already recorded when the VM was
stopped. This means that the timebase adjustment will miss the
downtime between VM stop and pre-save. If the guest has a lot of
RAM, like in the description, this is likely to be long enough
to be detected by the guest.

We already addressed a similar issue with live migration of paused
guests:

commit d14f33976282a8744ca1bf1d64e73996c145aa3f
Author: Maxiwell S. Garcia <maxiwell.com>
Date:   Thu Jul 11 16:47:02 2019 -0300

    migration: Do not re-read the clock on pre_save in case of paused guest

I think we should extend the logic to savevm.

Comment 42 Xujun Ma 2021-05-17 06:24:30 UTC
The problem still occurs with the fixed build; it displays error messages as follows:

[  212.303892] rcu: INFO: rcu_sched self-detected stall on CPU
[  212.303950] rcu:     25-....: (5999 ticks this GP) idle=9a2/1/0x4000000000000002 softirq=8607/8607 fqs=3000
[  212.303999]  (t=6000 jiffies g=16533 q=2587)
[  212.304029] NMI backtrace for cpu 25
[  212.304053] CPU: 25 PID: 11064 Comm: Xorg Kdump: loaded Tainted: G                 --------- -t - 4.18.0-305.6.el8.ppc64le #1
[  212.304115] Call Trace:
[  212.304135] [c0000000498a71f0] [c000000000eb78bc] dump_stack+0xb0/0xf4 (unreliable)
[  212.304178] [c0000000498a7230] [c000000000ec44e8] nmi_trigger_cpumask_backtrace+0x1b8/0x230
[  212.304221] [c0000000498a72d0] [c000000000070338] arch_trigger_cpumask_backtrace+0x28/0x40
[  212.304265] [c0000000498a72f0] [c00000000023615c] rcu_dump_cpu_stacks+0xf8/0x154
[  212.304330] [c0000000498a7340] [c000000000234420] rcu_sched_clock_irq+0xa20/0xc20
[  212.304387] [c0000000498a7420] [c00000000024c420] update_process_times+0x90/0x120
[  212.304445] [c0000000498a7460] [c00000000026a3e4] tick_sched_timer+0x74/0x130
[  212.304507] [c0000000498a74a0] [c00000000024d3ec] __hrtimer_run_queues+0x17c/0x480
[  212.304569] [c0000000498a7540] [c00000000024f384] hrtimer_interrupt+0x144/0x520
[  212.304630] [c0000000498a7620] [c0000000000202f4] timer_interrupt+0x104/0x2f0
[  212.304690] [c0000000498a7680] [c0000000000091b0] decrementer_common+0x110/0x120
[  212.304752] --- interrupt: 901 at smp_call_function_single+0x130/0x180
                   LR = smp_call_function_single+0x120/0x180
[  212.304842] [c0000000498a79f0] [c000000000099674] radix__flush_tlb_page_psize+0x364/0x7f0
[  212.304904] [c0000000498a7ab0] [c0000000004876bc] ptep_clear_flush+0xac/0x160
[  212.304965] [c0000000498a7af0] [c000000000469358] wp_page_copy+0x428/0xa90
[  212.305016] [c0000000498a7b80] [c00000000046e568] do_wp_page+0xd8/0xa10
[  212.305067] [c0000000498a7bd0] [c000000000474b3c] __handle_mm_fault+0x9bc/0xdc0
[  212.305127] [c0000000498a7cb0] [c00000000047506c] handle_mm_fault+0x12c/0x310
[  212.305187] [c0000000498a7cf0] [c000000000082e98] __do_page_fault+0x288/0xab0
[  212.305247] [c0000000498a7de0] [c0000000000836f8] do_page_fault+0x38/0xd0
[  212.305298] [c0000000498a7e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[  239.695040] watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [Xorg:11064]
[  239.695085] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c nvme_tcp(t) nvme_fabrics nvme nvme_core sd_mod t10_pi sg xts vmx_crypto virtio_net net_failover virtio_scsi failover dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[  239.695365] CPU: 25 PID: 11064 Comm: Xorg Kdump: loaded Tainted: G                 --------- -t - 4.18.0-305.6.el8.ppc64le #1
[  239.695438] NIP:  c000000000272fa4 LR: c000000000272f90 CTR: c0000000000beeb0
[  239.695494] REGS: c0000000498a76f0 TRAP: 0901   Tainted: G                 --------- -t -  (4.18.0-305.6.el8.ppc64le)
[  239.695569] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44084448  XER: 00000000
[  239.695632] CFAR: c000000000272fb0 IRQMASK: 0
               GPR00: c000000000272f90 c0000000498a7980 c000000001c24300 0000000000000000
               GPR04: c0000000498a79a0 c0000001fd17b000 c000000006793400 c000000001c66fe0
               GPR08: 0000000000000001 0000000000000001 0000000000000001 0000000000000000
               GPR12: c0000000000beeb0 c0000001ffefc300
[  239.695882] NIP [c000000000272fa4] smp_call_function_single+0x134/0x180
[  239.695932] LR [c000000000272f90] smp_call_function_single+0x120/0x180
[  239.695981] Call Trace:
[  239.696003] [c0000000498a7980] [c000000000272f90] smp_call_function_single+0x120/0x180 (unreliable)
[  239.696074] [c0000000498a79f0] [c000000000099674] radix__flush_tlb_page_psize+0x364/0x7f0
[  239.696134] [c0000000498a7ab0] [c0000000004876bc] ptep_clear_flush+0xac/0x160
[  239.696195] [c0000000498a7af0] [c000000000469358] wp_page_copy+0x428/0xa90
[  239.696245] [c0000000498a7b80] [c00000000046e568] do_wp_page+0xd8/0xa10
[  239.696296] [c0000000498a7bd0] [c000000000474b3c] __handle_mm_fault+0x9bc/0xdc0
[  239.696355] [c0000000498a7cb0] [c00000000047506c] handle_mm_fault+0x12c/0x310
[  239.696415] [c0000000498a7cf0] [c000000000082e98] __do_page_fault+0x288/0xab0
[  239.696474] [c0000000498a7de0] [c0000000000836f8] do_page_fault+0x38/0xd0
[  239.696526] [c0000000498a7e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[  239.696584] Instruction dump:
[  239.696616] ebe1fff8 7c0803a6 4e800020 7ca62b78 7c852378 7fe4fb78 4bfffd3d 813f0018
[  239.696678] 71290001 4182001c 60000000 7c210b78 <7c421378> 813f0018 71290001 4082fff0

Comment 43 Xujun Ma 2021-05-18 01:34:53 UTC
Hi Greg

Could you help take a look and confirm whether the problem still exists with QEMU 6.0?

Comment 44 Greg Kurz 2021-05-25 09:51:36 UTC
(In reply to Xujun Ma from comment #43)
> Hi Greg
> 
> Could you help take a look and confirm if it still has problem on qemu6.0?

Hi Xujun,

I've tried both with upstream QEMU and downstream qemu-kvm-6.0.50-16.scrmod+el8.5.0+11030+6b744098.wrb210519.ppc64le,
following this testing scenario:
- start the VM manually with a similar command line as the one used in the description
- let the guest fully boot
- savevm foo on the monitor
- wait a couple of seconds
- loadvm foo

and I don't get any error message in the guest's dmesg.

What is your testing scenario?

Comment 45 Greg Kurz 2021-05-25 10:04:42 UTC
(In reply to Greg Kurz from comment #44)
> (In reply to Xujun Ma from comment #43)
> > Hi Greg
> > 
> > Could you help take a look and confirm if it still has problem on qemu6.0?
> 
> Hi Xujun,
> 
> I've tried both with upstream QEMU and downstream
> qemu-kvm-6.0.50-16.scrmod+el8.5.0+11030+6b744098.wrb210519.ppc64le,
> following this testing scenario :
> - start the VM manually with a similar command line as the one used in the
> description
> - let the guest fully boot
> - savevm foo on the monitor
> - wait a couple of seconds
> - loadvm foo
> 

Note that when loadvm completes, the VM is paused.

(qemu) info status 
VM status: paused (restore-vm)

'cont' is needed to resume execution.

> and I don't get any error message in the guest's dmesg.
> 
> What is your testing scenario ?

Comment 46 Xujun Ma 2021-05-26 09:57:05 UTC
It sometimes happens when doing savevm:

(qemu) savevm bbb
(qemu) [ 1991.034318] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1991.034389] rcu:     21-....: (5999 ticks this GP) idle=546/1/0x4000000000000002 softirq=74695/74695 fqs=2999
[ 1991.034444]  (t=6000 jiffies g=177553 q=1202)
[ 1991.034476] NMI backtrace for cpu 21
[ 1991.034501] CPU: 21 PID: 15192 Comm: gnome-keyring-d Kdump: loaded Tainted: G                 --------- -t - 4.18.0-307.el8.ppc64le #1
[ 1991.034568] Call Trace:
[ 1991.034592] [c0000000421b31f0] [c000000000eb889c] dump_stack+0xb0/0xf4 (unreliable)
[ 1991.034642] [c0000000421b3230] [c000000000ec54d8] nmi_trigger_cpumask_backtrace+0x1b8/0x230
[ 1991.034692] [c0000000421b32d0] [c000000000070278] arch_trigger_cpumask_backtrace+0x28/0x40
[ 1991.034741] [c0000000421b32f0] [c000000000235a4c] rcu_dump_cpu_stacks+0xf8/0x154
[ 1991.034787] [c0000000421b3340] [c000000000233d10] rcu_sched_clock_irq+0xa20/0xc20
[ 1991.034835] [c0000000421b3420] [c00000000024bd10] update_process_times+0x90/0x120
[ 1991.034882] [c0000000421b3460] [c000000000269cd4] tick_sched_timer+0x74/0x130
[ 1991.034929] [c0000000421b34a0] [c00000000024ccdc] __hrtimer_run_queues+0x17c/0x480
[ 1991.034978] [c0000000421b3540] [c00000000024ec74] hrtimer_interrupt+0x144/0x520
[ 1991.035028] [c0000000421b3620] [c0000000000202f4] timer_interrupt+0x104/0x2f0
[ 1991.035078] [c0000000421b3680] [c0000000000091b0] decrementer_common+0x110/0x120
[ 1991.035128] --- interrupt: 901 at smp_call_function_single+0x134/0x180
[ 1991.035128]     LR = smp_call_function_single+0x120/0x180
[ 1991.035203] [c0000000421b39f0] [c000000000099774] radix__flush_tlb_page_psize+0x364/0x7f0
[ 1991.035255] [c0000000421b3ab0] [c000000000486b3c] ptep_clear_flush+0xac/0x160
[ 1991.035305] [c0000000421b3af0] [c0000000004687d8] wp_page_copy+0x428/0xa90
[ 1991.035348] [c0000000421b3b80] [c00000000046d9e8] do_wp_page+0xd8/0xa10
[ 1991.035390] [c0000000421b3bd0] [c000000000473fbc] __handle_mm_fault+0x9bc/0xdc0
[ 1991.035446] [c0000000421b3cb0] [c0000000004744ec] handle_mm_fault+0x12c/0x310
[ 1991.035513] [c0000000421b3cf0] [c000000000082dd8] __do_page_fault+0x288/0xab0
[ 1991.035571] [c0000000421b3de0] [c000000000083638] do_page_fault+0x38/0xd0
[ 1991.035621] [c0000000421b3e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[ 2016.410689] watchdog: BUG: soft lockup - CPU#21 stuck for 21s! [gnome-keyring-d:15192]
[ 2016.410751] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c nvme_tcp(t) nvme_fabrics nvme nvme_core sd_mod t10_pi sg xts virtio_net vmx_crypto net_failover virtio_scsi failover dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 2016.411125] CPU: 21 PID: 15192 Comm: gnome-keyring-d Kdump: loaded Tainted: G                 --------- -t - 4.18.0-307.el8.ppc64le #1
[ 2016.411212] NIP:  c000000000272894 LR: c000000000272880 CTR: c0000000000bf030
[ 2016.411270] REGS: c0000000421b36f0 TRAP: 0901   Tainted: G                 --------- -t -  (4.18.0-307.el8.ppc64le)
[ 2016.411347] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44084448  XER: 00000000
[ 2016.411410] CFAR: c0000000002728a0 IRQMASK: 0
[ 2016.411410] GPR00: c000000000272880 c0000000421b3980 c000000001c14900 0000000000000000
[ 2016.411410] GPR04: c0000000421b39a0 c0000000fd415000 c000000005fc3600 c000000001c56fe0
[ 2016.411410] GPR08: 0000000000000001 0000000000000001 0000000000000001 0000000000000000
[ 2016.411410] GPR12: c0000000000bf030 c0000000fff41280
[ 2016.411655] NIP [c000000000272894] smp_call_function_single+0x134/0x180
[ 2016.411706] LR [c000000000272880] smp_call_function_single+0x120/0x180
[ 2016.411754] Call Trace:
[ 2016.411776] [c0000000421b3980] [c000000000272880] smp_call_function_single+0x120/0x180 (unreliable)
[ 2016.411846] [c0000000421b39f0] [c000000000099774] radix__flush_tlb_page_psize+0x364/0x7f0
[ 2016.411906] [c0000000421b3ab0] [c000000000486b3c] ptep_clear_flush+0xac/0x160
[ 2016.411967] [c0000000421b3af0] [c0000000004687d8] wp_page_copy+0x428/0xa90
[ 2016.412016] [c0000000421b3b80] [c00000000046d9e8] do_wp_page+0xd8/0xa10
[ 2016.412067] [c0000000421b3bd0] [c000000000473fbc] __handle_mm_fault+0x9bc/0xdc0
[ 2016.412128] [c0000000421b3cb0] [c0000000004744ec] handle_mm_fault+0x12c/0x310
[ 2016.412187] [c0000000421b3cf0] [c000000000082dd8] __do_page_fault+0x288/0xab0
[ 2016.412246] [c0000000421b3de0] [c000000000083638] do_page_fault+0x38/0xd0
[ 2016.412296] [c0000000421b3e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[ 2016.412355] Instruction dump:
[ 2016.412386] ebe1fff8 7c0803a6 4e800020 7ca62b78 7c852378 7fe4fb78 4bfffd3d 813f0018
[ 2016.412447] 71290001 4182001c 60000000 7c210b78 <7c421378> 813f0018 71290001 4082fff0

Message from syslogd@dhcp16-215-160 at May 26 17:07:41 ...
 kernel:watchdog: BUG: soft lockup - CPU#21 stuck for 21s! [gnome-keyring-d:15192]
[ 2157.222326] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2157.222385] rcu:     21-....: (24002 ticks this GP) idle=546/1/0x4000000000000002 softirq=74695/74695 fqs=11983
[ 2157.222464]  (t=24003 jiffies g=177553 q=3734)
[ 2157.222504] NMI backtrace for cpu 21
[ 2157.222536] CPU: 21 PID: 15192 Comm: gnome-keyring-d Kdump: loaded Tainted: G             L   --------- -t - 4.18.0-307.el8.ppc64le #1
[ 2157.222629] Call Trace:
[ 2157.222652] [c0000000421b31f0] [c000000000eb889c] dump_stack+0xb0/0xf4 (unreliable)
[ 2157.222711] [c0000000421b3230] [c000000000ec54d8] nmi_trigger_cpumask_backtrace+0x1b8/0x230
[ 2157.222771] [c0000000421b32d0] [c000000000070278] arch_trigger_cpumask_backtrace+0x28/0x40
[ 2157.222831] [c0000000421b32f0] [c000000000235a4c] rcu_dump_cpu_stacks+0xf8/0x154
[ 2157.222891] [c0000000421b3340] [c000000000233d10] rcu_sched_clock_irq+0xa20/0xc20
[ 2157.222951] [c0000000421b3420] [c00000000024bd10] update_process_times+0x90/0x120
[ 2157.223010] [c0000000421b3460] [c000000000269cd4] tick_sched_timer+0x74/0x130
[ 2157.223069] [c0000000421b34a0] [c00000000024ccdc] __hrtimer_run_queues+0x17c/0x480
[ 2157.223129] [c0000000421b3540] [c00000000024ec74] hrtimer_interrupt+0x144/0x520
[ 2157.223187] [c0000000421b3620] [c0000000000202f4] timer_interrupt+0x104/0x2f0
[ 2157.223247] [c0000000421b3680] [c0000000000091b0] decrementer_common+0x110/0x120
[ 2157.223307] --- interrupt: 901 at smp_call_function_single+0x134/0x180
[ 2157.223307]     LR = smp_call_function_single+0x120/0x180
[ 2157.223397] [c0000000421b39f0] [c000000000099774] radix__flush_tlb_page_psize+0x364/0x7f0
[ 2157.223456] [c0000000421b3ab0] [c000000000486b3c] ptep_clear_flush+0xac/0x160
[ 2157.223515] [c0000000421b3af0] [c0000000004687d8] wp_page_copy+0x428/0xa90
[ 2157.223566] [c0000000421b3b80] [c00000000046d9e8] do_wp_page+0xd8/0xa10
[ 2157.223615] [c0000000421b3bd0] [c000000000473fbc] __handle_mm_fault+0x9bc/0xdc0
[ 2157.223674] [c0000000421b3cb0] [c0000000004744ec] handle_mm_fault+0x12c/0x310
[ 2157.223734] [c0000000421b3cf0] [c000000000082dd8] __do_page_fault+0x288/0xab0
[ 2157.223792] [c0000000421b3de0] [c000000000083638] do_page_fault+0x38/0xd0
[ 2157.223842] [c0000000421b3e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[ 2186.263456] watchdog: BUG: soft lockup - CPU#21 stuck for 24s! [gnome-keyring-d:15192]
[ 2186.263531] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c nvme_tcp(t) nvme_fabrics nvme nvme_core sd_mod t10_pi sg xts virtio_net vmx_crypto net_failover virtio_scsi failover dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 2186.263908] CPU: 21 PID: 15192 Comm: gnome-keyring-d Kdump: loaded Tainted: G             L   --------- -t - 4.18.0-307.el8.ppc64le #1
[ 2186.263994] NIP:  c000000000272898 LR: c000000000272880 CTR: c0000000000bf030
[ 2186.264053] REGS: c0000000421b36f0 TRAP: 0901   Tainted: G             L   --------- -t -  (4.18.0-307.el8.ppc64le)
[ 2186.264130] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44084448  XER: 00000000
[ 2186.264191] CFAR: c0000000002728a0 IRQMASK: 0
[ 2186.264191] GPR00: c000000000272880 c0000000421b3980 c000000001c14900 0000000000000000
[ 2186.264191] GPR04: c0000000421b39a0 c0000000fd415000 c000000005fc3600 c000000001c56fe0
[ 2186.264191] GPR08: 0000000000000001 0000000000000001 0000000000000001 0000000000000000
[ 2186.264191] GPR12: c0000000000bf030 c0000000fff41280
[ 2186.264435] NIP [c000000000272898] smp_call_function_single+0x138/0x180
[ 2186.264485] LR [c000000000272880] smp_call_function_single+0x120/0x180
[ 2186.264533] Call Trace:
[ 2186.264553] [c0000000421b3980] [c000000000272880] smp_call_function_single+0x120/0x180 (unreliable)
[ 2186.264617] [c0000000421b39f0] [c000000000099774] radix__flush_tlb_page_psize+0x364/0x7f0
[ 2186.264700] [c0000000421b3ab0] [c000000000486b3c] ptep_clear_flush+0xac/0x160
[ 2186.264758] [c0000000421b3af0] [c0000000004687d8] wp_page_copy+0x428/0xa90
[ 2186.264810] [c0000000421b3b80] [c00000000046d9e8] do_wp_page+0xd8/0xa10
[ 2186.264859] [c0000000421b3bd0] [c000000000473fbc] __handle_mm_fault+0x9bc/0xdc0
[ 2186.264918] [c0000000421b3cb0] [c0000000004744ec] handle_mm_fault+0x12c/0x310
[ 2186.264975] [c0000000421b3cf0] [c000000000082dd8] __do_page_fault+0x288/0xab0
[ 2186.265033] [c0000000421b3de0] [c000000000083638] do_page_fault+0x38/0xd0
[ 2186.265084] [c0000000421b3e20] [c00000000000a6e4] handle_page_fault+0x18/0x38
[ 2186.265142] Instruction dump:
[ 2186.265173] 7c0803a6 4e800020 7ca62b78 7c852378 7fe4fb78 4bfffd3d 813f0018 71290001
[ 2186.265234] 4182001c 60000000 7c210b78 7c421378 <813f0018> 71290001 4082fff0 7c2004ac

Message from syslogd@dhcp16-215-160 at May 26 17:10:45 ...
 kernel:watchdog: BUG: soft lockup - CPU#21 stuck for 24s! [gnome-keyring-d:15192]

Comment 47 Xujun Ma 2021-06-01 01:09:43 UTC
Hi Greg

Can you reproduce this issue?

Comment 48 Xujun Ma 2021-06-04 01:18:40 UTC
I have done a lot of testing and can confirm the problem still occurs on this fixed build, so I am moving the status back to ASSIGNED.

Comment 52 Greg Kurz 2021-06-04 13:52:45 UTC
(In reply to Xujun Ma from comment #47)
> Hi Greg
> 
> Can you reproduce this issue?

I've been running savevm in a loop for hours on a guest with 200G
of RAM and 64 vCPUs and I still can't reproduce...
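
A savevm/loadvm loop of this kind can be driven straight through the QMP socket, roughly along the lines of the sketch below. This is only an illustration under assumed names: the socket path and snapshot tag are placeholders, the timeout/sleep values are arbitrary, and the actual avocado-vt harness drives the monitor differently.

QMP_SOCK=/var/tmp/qmp-monitor.sock   # placeholder: use the path given to -chardev socket,...
TAG=vm_snap_test                     # placeholder snapshot tag

for i in $(seq 1 100); do
    for cmd in "savevm $TAG" "loadvm $TAG"; do
        # savevm/loadvm are HMP commands, so they are wrapped in QMP's
        # human-monitor-command after the mandatory capabilities handshake.
        {
            printf '{"execute":"qmp_capabilities"}\n'
            sleep 1
            printf '{"execute":"human-monitor-command","arguments":{"command-line":"%s"}}\n' "$cmd"
            sleep 60   # crude wait for the snapshot operation to complete
        } | timeout 90 nc -U "$QMP_SOCK"   # timeout guards against nc never exiting
    done
done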

Comment 53 Qunfang Zhang 2021-06-07 06:14:21 UTC
(In reply to Greg Kurz from comment #52)
> (In reply to Xujun Ma from comment #47)
> > Hi Greg
> > 
> > Can you reproduce this issue?
> 
> I've been running savevm in a loop for hours on a guest with 200G
> of RAM and 64 vCPUs and I still can't reproduce...

Hi Xujun,

Can you provide a test environment and a guest image where you can reproduce the issue, and share them with Greg? Maybe some configuration or environment difference is the reason?

Thanks,
Qunfang

Comment 54 Greg Kurz 2021-06-07 06:39:11 UTC
(In reply to Qunfang Zhang from comment #53)
> (In reply to Greg Kurz from comment #52)
> > (In reply to Xujun Ma from comment #47)
> > > Hi Greg
> > > 
> > > Can you reproduce this issue?
> > 
> > I've been running savevm in a loop for hours on a guest with 200G
> > of RAM and 64 vCPUs and I still can't reproduce...
> 
> Hi Xujun,
> 
> Can you provide a test environment and a guest image where you can reproduce
> the issue, and share them with Greg? Maybe some configuration or environment
> difference is the reason?
> 

Since comment #12 mentions a boston system, I finally gave it a try using
emulated XIVE, i.e. kernel_irqchip=off, and I got the trace once (a sketch of
that command-line change follows after this comment).

> Thanks,
> Qunfang
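
For reference, a hedged sketch of what that change looks like relative to the reproducer command line in the description: only the -machine option changes, everything else stays as posted. The ic-mode property shown in the last comment line is an assumption added here for illustration, not something taken from this bug.

# Same reproducer as in the description, but with the in-kernel irqchip
# disabled so QEMU's emulated interrupt controller is used instead.
/usr/libexec/qemu-kvm \
    -machine pseries,kernel_irqchip=off \
    ...   # all remaining options unchanged from the reproducer
# Optionally, ic-mode=xive can be added to pin the controller to XIVE
# explicitly (assumption; not mentioned in this bug report):
#   -machine pseries,ic-mode=xive,kernel_irqchip=off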

Comment 57 Xujun Ma 2021-06-08 09:02:45 UTC
(In reply to Qunfang Zhang from comment #53)
> (In reply to Greg Kurz from comment #52)
> > (In reply to Xujun Ma from comment #47)
> > > Hi Greg
> > > 
> > > Can you reproduce this issue?
> > 
> > I've been running savevm in a loop for hours on a guest with 200G
> > of RAM and 64 vCPUs and I still can't reproduce...
> 
> Hi Xujun,
> 
> Can you provide a test environment and a guest image where you can reproduce
> the issue, and share them with Greg? Maybe some configuration or environment
> difference is the reason?
> 
> Thanks,
> Qunfang

I will provide them later, once I reproduce it again.

Comment 58 Xujun Ma 2021-06-16 01:29:11 UTC
Hi Greg

I did a lot of testing on the boston host and can confirm the problem happens during the loadvm operation, not savevm.
I will keep the host for two days. If you want to retest it, please log in to host 10.16.214.108 as the root user with password "kvmautotest".
I will test it again on a witherspoon host if I get one later.
Test result:
http://10.0.136.47/5472659/results.html

The problem occurred in cases 001-003,008,010,012,015,019,022,024,034,038,039,042-047,051,053,058,060,067,070,073,078,082,089,094.

Test env:
qemu-kvm-6.0.0-19.module+el8.5.0+11385+6e7d542e.ppc64le

Comment 59 Greg Kurz 2021-06-16 08:53:17 UTC
(In reply to Xujun Ma from comment #58)
> Hi Greg
> 
> I did a lot of testing on the boston host and can confirm the problem happens
> during the loadvm operation, not savevm.
> I will keep the host for two days. If you want to retest it, please log in to
> host 10.16.214.108 as the root user with password "kvmautotest".
> I will test it again on a witherspoon host if I get one later.
> Test result:
> http://10.0.136.47/5472659/results.html
> 

I see 2 "watchdog: BUG: soft lockup" errors _only_ out of 100 savevm/loadvm sequences.
It seems that we're far away from the 6/10 reproducibility mentioned in the description.

Does that mean the fix (comment #37) improved things somehow ?

> The problem occurred in cases
> 001-003,008,010,012,015,019,022,024,034,038,039,042-047,051,053,058,060,067,
> 070,073,078,082,089,094.
> 
> Test env:
> qemu-kvm-6.0.0-19.module+el8.5.0+11385+6e7d542e.ppc64le
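
A count like the one above can be pulled out of the guest serial logs with a plain grep, for example as below. The path is a placeholder for wherever the job stores its results; the real harness may well report this differently.

# Total number of soft-lockup messages across all serial logs of a job run
# (placeholder path; adjust to the actual job-results directory):
grep -r --include='serial*' 'watchdog: BUG: soft lockup' /path/to/job-results/ | wc -l

# Per-file counts, to map hits back to individual test cases:
grep -rc --include='serial*' 'watchdog: BUG: soft lockup' /path/to/job-results/ | grep -v ':0$'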

Comment 60 Xujun Ma 2021-06-18 12:13:49 UTC
(In reply to Greg Kurz from comment #59)
> (In reply to Xujun Ma from comment #58)
> > Hi Greg
> > 
> > I did a lot of testing on the boston host and can confirm the problem happens
> > during the loadvm operation, not savevm.
> > I will keep the host for two days. If you want to retest it, please log in to
> > host 10.16.214.108 as the root user with password "kvmautotest".
> > I will test it again on a witherspoon host if I get one later.
> > Test result:
> > http://10.0.136.47/5472659/results.html
> > 
> 
> I see 2 "watchdog: BUG: soft lockup" errors _only_ out of 100 savevm/loadvm
> sequences.
> It seems that we're far away from the 6/10 reproducibility mentioned in the
> description.
> 
> Does that mean the fix (comment #37) improved things somehow ?
> 
> > The problem occurred in cases
> > 001-003,008,010,012,015,019,022,024,034,038,039,042-047,051,053,058,060,067,
> > 070,073,078,082,089,094.
> > 
> > Test env:
> > qemu-kvm-6.0.0-19.module+el8.5.0+11385+6e7d542e.ppc64le

Yes, there is an improvement. I will verify this bug first and file a new one if further issues turn up.

Comment 61 Greg Kurz 2021-06-18 13:50:44 UTC
(In reply to Xujun Ma from comment #60)
> 
> Yes, there is an improvement. I will verify this bug first and file a new one
> if further issues turn up.

I've run two consecutive all-day runs, 350+ iterations each, on a
boston system with pretty much the same command line and I didn't
hit the issue... while it was relatively easy to hit before the
fix on the witherspoon (with more RAM).

It thus seems reasonable to move to VERIFY and maybe open a
separate bug if you see stuck vCPUs again.

Thanks !

Comment 63 errata-xmlrpc 2021-11-16 07:51:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

