Bug 2307510 - ISST-LTE:KOP:1060FW:evelp2 :L2 Guest migration: evelp2g4[L2]: while running NFS guest migration continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed] (Fedora)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 40
Hardware: ppc64le
OS: All
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-08-23 09:10 UTC by IBM Bug Proxy
Modified: 2024-08-28 02:36 UTC (History)

Fixed In Version: qemu-8.2.6-3.fc40
Clone Of:
Environment:
Last Closed: 2024-08-28 02:36:38 UTC
Type: ---
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 208605 0 None None None 2024-08-23 09:10:44 UTC

Description IBM Bug Proxy 2024-08-23 09:10:19 UTC

Comment 1 IBM Bug Proxy 2024-08-23 09:10:33 UTC
Description SEETEENA THOUFEEK 2024-08-23 04:02:23 CDT
== Comment: #0 - Rajanikanth H. Adaveeshaiah <rajanikanth.ha.com> - 2024-05-24 09:54:22 ==
Description :

evelp2g4 is an NFS-backed L2 guest running on the evelp2 L1 host. Both host and guest are installed with Fedora 40, kernel 6.8.10-300.fc40.ppc64le.
While the NFS guest migration is running, the call traces below are dumped continuously on the console and in the guest's dmesg.

Unable to ssh to the guest, and because the console is continuously dumping traces, not able to get any logs.

FYI -

[79198.150752] rcu: Stack dump where RCU GP kthread last ran:
[79198.150817] Sending NMI from CPU 10 to CPUs 9:
[79201.298780] watchdog: BUG: soft lockup - CPU#19 stuck for 399s! [kworker/19:0:230506]
[79201.298969] Modules linked in: nft_compat rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill nf_tables sunrpc binfmt_misc virtio_balloon virtio_net net_failover failover aes_gcm_p10_crypto crct10dif_vpmsum loop fuse nfnetlink zram xfs ibmvscsi scsi_transport_srp vmx_crypto pseries_wdt crc32c_vpmsum
[79201.299894] CPU: 19 PID: 230506 Comm: kworker/19:0 Tainted: G             L     6.8.10-300.fc40.ppc64le #1
[79201.300023] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[79201.300182] Workqueue: rcu_par_gp sync_rcu_exp_select_node_cpus
[79201.300276] NIP:  c0000000002bcbb4 LR: c000000000263514 CTR: c0000000000d192c
[79201.300382] REGS: c00000001a8a7ab0 TRAP: 0900   Tainted: G             L      (6.8.10-300.fc40.ppc64le)
[79201.300506] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48000202  XER: 20040072
[79201.300642] CFAR: 0000000000000000 IRQMASK: 0
GPR00: c000000000263514 c00000001a8a7d50 c0000000020ded00 000000000000000a
GPR04: c0000007416d3480 0000000000000000 0000000000000000 c000000743dd3f00
GPR08: c0000000002631a4 0000000000000001 000000073e480000 c000000003d06188
GPR12: c0000000000d192c c000000743dd3f00 c0000000001b549c c00000000e9a4300
GPR16: 0000000000000000 0000000000000000 000000000000000a 0000000020ecf8dc
GPR20: 0000000000000400 c000000003d01fe8 000000000000000a c0000000002631a4
GPR24: c000000003252d80 c00000000324de78 000000000000ffff 0000000000000080
GPR28: c000000003a982f0 c000000003a98080 c000000741012d80 c00000001a8a7d80
[79201.301577] NIP [c0000000002bcbb4] smp_call_function_single+0x140/0x1bc
[79201.301680] LR [c000000000263514] __sync_rcu_exp_select_node_cpus+0x2ac/0x548
[79201.301797] Call Trace:
[79201.301838] [c00000001a8a7d50] [c0000000002bcbe4] smp_call_function_single+0x170/0x1bc (unreliable)
[79201.301979] [c00000001a8a7dc0] [c000000000263514] __sync_rcu_exp_select_node_cpus+0x2ac/0x548
[79201.302115] [c00000001a8a7e50] [c0000000001a55b8] process_one_work+0x1e8/0x4d8
[79201.302233] [c00000001a8a7ef0] [c0000000001a7e0c] worker_thread+0x3b8/0x578
[79201.302325] [c00000001a8a7f90] [c0000000001b55c8] kthread+0x134/0x13c
[79201.302417] [c00000001a8a7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[79201.302526] Code: 60000000 60000000 60420000 e9470030 3d220117 39294780 7c895214 81240008 71290001 4182001c 60420000 7c40003c <60000000> 81240008 71290001 4082fff0
[79205.161424] watchdog: BUG: soft lockup - CPU#12 stuck for 401s! [sshd:280379]
[79205.161718] Modules linked in: nft_compat rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill nf_tables sunrpc binfmt_misc virtio_balloon virtio_net net_failover failover aes_gcm_p10_crypto crct10dif_vpmsum loop fuse nfnetlink zram xfs ibmvscsi scsi_transport_srp vmx_crypto pseries_wdt crc32c_vpmsum
[79205.163578] CPU: 12 PID: 280379 Comm: sshd Tainted: G             L     6.8.10-300.fc40.ppc64le #1
[79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[79205.163834] NIP:  c0000000002bb7a4 LR: c0000000002bb750 CTR: c0000000000d192c
[79205.163929] REGS: c0000003871cf1b0 TRAP: 0900   Tainted: G             L      (6.8.10-300.fc40.ppc64le)
[79205.165041] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44042222  XER: 20040004
[79205.165266] CFAR: 0000000000000000 IRQMASK: 0
GPR00: c0000000002bbc58 c0000003871cf450 c0000000020ded00 0000000000000009
GPR04: 0000000000000009 0000000000000009 0000000000000080 0000000000000200
GPR08: 00000000000001ff 0000000000000001 c000000740f57ee0 0000000044048222
GPR12: c0000000000d192c c000000743ddc980 0000000000000000 0000000000000000
GPR16: 0000000000000000 c00000000d86e200 0000000000000001 0000000000000001
GPR20: 000000000000000c c000000003d06188 c0000000000ac4d0 c00000000a374e00
GPR24: c000000003d06840 0000000000000000 c000000741193188 c000000741193188
GPR28: c000000741193180 c000000003d06840 0000000000000048 0000000000000009
[79205.171660] NIP [c0000000002bb7a4] smp_call_function_many_cond+0x1e0/0x738
[79205.171752] LR [c0000000002bb750] smp_call_function_many_cond+0x18c/0x738
[79205.171835] Call Trace:
[79205.171869] [c0000003871cf450] [c0000000002bbc58] smp_call_function_many_cond+0x694/0x738 (unreliable)
[79205.171986] [c0000003871cf520] [c0000000000ac4d0] radix__tlb_flush+0x4c/0x140
[79205.173636] [c0000003871cf560] [c00000000052e900] tlb_finish_mmu+0x130/0x1f0
[79205.173754] [c0000003871cf590] [c00000000052a280] exit_mmap+0x1cc/0x574
[79205.173848] [c0000003871cf6c0] [c00000000016ec9c] __mmput+0x54/0x1d4
[79205.173939] [c0000003871cf6f0] [c0000000006385c4] begin_new_exec+0x6dc/0xefc
[79205.174037] [c0000003871cf780] [c0000000006edea8] load_elf_binary+0x4c8/0x1a50
[79205.174136] [c0000003871cf880] [c0000000006361c8] bprm_execve+0x2b4/0x7a0
[79205.174219] [c0000003871cf950] [c000000000637988] do_execveat_common+0x1c0/0x2d8
[79205.174316] [c0000003871cf9f0] [c000000000638e38] sys_execve+0x54/0x6c
[79205.174399] [c0000003871cfa20] [c00000000002fec8] system_call_exception+0x168/0x310
[79205.174497] [c0000003871cfe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[79205.176245] --- interrupt: 3000 at 0x7fff95b10b08
[79205.176326] NIP:  00007fff95b10b08 LR: 00007fff95b10b08 CTR: 0000000000000000
[79205.176438] REGS: c0000003871cfe80 TRAP: 3000   Tainted: G             L      (6.8.10-300.fc40.ppc64le)
[79205.176558] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48044424  XER: 00000000
[79205.176686] IRQMASK: 0
GPR00: 000000000000000b 00007fffe6919aa0 00007fff95c47c00 0000000152598c80
GPR04: 00007fffe6919bf8 00000001525db6e0 ffffffffffffffff 00007fffe6919a20
GPR08: 0000000152598c88 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff969a4220 0000000152585570 0000000000000000
GPR16: 00007fffe6919c48 0000000000000570 0000000152598c80 0000000000000000
GPR20: 0000000000000000 0000000000009998 000000015259a450 0000000152586460
GPR24: 00000001525bca90 00007fffe6919e48 0000000000000000 00000001525db6e0
GPR28: 0000000117e98448 00000001525d0b00 0000000000000000 0000000000100000
[79205.177505] NIP [00007fff95b10b08] 0x7fff95b10b08
[79205.177578] LR [00007fff95b10b08] 0x7fff95b10b08
[79205.177649] --- interrupt: 3000
[79205.177702] Code: e95c0000 283e0800 40800528 3d2201c2 392932e8 7bde1f24 7d29f02a 7d4a4a14 812a0008 71290001 41820018 7c40003c <60000000> 812a0008 71290001 4082fff0
[79205.278771] watchdog: BUG: soft lockup - CPU#18 stuck for 377s! [abrt-dump-journ:1092]
[79205.278912] Modules linked in: nft_compat rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill nf_tables sunrpc binfmt_misc virtio_balloon virtio_net net_failover failover aes_gcm_p10_crypto crct10dif_vpmsum loop fuse nfnetlink zram xfs ibmvscsi scsi_transport_srp vmx_crypto pseries_wdt crc32c_vpmsum
[79205.279542] CPU: 18 PID: 1092 Comm: abrt-dump-journ Tainted: G             L     6.8.10-300.fc40.ppc64le #1
[79205.279667] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[79205.279824] NIP:  c0000000002bb7a4 LR: c0000000002bb7cc CTR: c0000000000d192c
[79205.279930] REGS: c0000000118c34a0 TRAP: 0900   Tainted: G             L      (6.8.10-300.fc40.ppc64le)
[79205.280052] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44042220  XER: 2004006c
[79205.280168] CFAR: 0000000000000000 IRQMASK: 0
GPR00: c0000000002bbac4 c0000000118c3740 c0000000020ded00 0000000000000009
GPR04: 0000000000000009 0000000000000009 0000000000000080 0000000000fbffff
GPR08: 0000000000fbfdff 0000000000000001 c000000740f58060 0000000024008222
GPR12: c0000000000d192c c000000743dd5300 0000000000000000 0000000000000000
GPR16: 0000000000000000 c0000000118c3b98 0000000000000001 0000000000000001
GPR20: 0000000000000012 c000000003d06188 c0000000000aa310 00000001251d0000
GPR24: c000000003d06840 0000000000000000 c000000741613188 c000000741613188
GPR28: c000000741613180 c000000003d06840 0000000000000048 c000000003d01fe8
[79205.281075] NIP [c0000000002bb7a4] smp_call_function_many_cond+0x1e0/0x738
[79205.281169] LR [c0000000002bb7cc] smp_call_function_many_cond+0x208/0x738
[79205.281259] Call Trace:
[79205.281297] [c0000000118c3740] [c0000000002bbac4] smp_call_function_many_cond+0x500/0x738 (unreliable)
[79205.281425] [c0000000118c3810] [c0000000000aa310] flush_type_needed+0x1c8/0x23c
[79205.281534] [c0000000118c3850] [c0000000000ab83c] __radix__flush_tlb_range_psize+0xb4/0x500
[79205.281646] [c0000000118c38f0] [c00000000052e900] tlb_finish_mmu+0x130/0x1f0
[79205.281756] [c0000000118c3920] [c00000000052604c] unmap_region+0x168/0x1c0
[79205.281849] [c0000000118c3a20] [c000000000526f08] do_vmi_align_munmap+0x418/0x5b8
[79205.281958] [c0000000118c3b70] [c00000000052ba24] sys_brk+0x3bc/0x444
[79205.282050] [c0000000118c3c30] [c00000000002fec8] system_call_exception+0x168/0x310
[79205.282161] [c0000000118c3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[79205.282270] --- interrupt: 3000 at 0x7fff85745ae4
[79205.282344] NIP:  00007fff85745ae4 LR: 00007fff85745ae4 CTR: 0000000000000000
[79205.282449] REGS: c0000000118c3e80 TRAP: 3000   Tainted: G             L      (6.8.10-300.fc40.ppc64le)
[79205.282570] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48002224  XER: 00000000
[79205.282705] IRQMASK: 0
GPR00: 000000000000002d 00007fffd6db3560 0000000000100000 00000001251d0000
GPR04: 0000000000031320 0000000000010010 00000001251becf0 0000000000021310
GPR08: 00000001251753f0 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff8587c240 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 00000000000fffff 00007fffd6db36e8 0000000000001000
GPR24: 0000000125133ed0 000000012510cab0 0000000000000000 00000000000085e0
GPR28: 00007fffd6db36f0 00007fff85847580 00000001251e0000 00000001251d0000
[79205.283568] NIP [00007fff85745ae4] 0x7fff85745ae4
[79205.283640] LR [00007fff85745ae4] 0x7fff85745ae4
[79205.283712] --- interrupt: 3000
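
For triage, the soft-lockup lines in a console capture like the one above can be summarized per CPU. The following is a minimal sketch, not part of the original report: the function name `summarize_soft_lockups` is hypothetical, and the regular expression assumes the watchdog message format seen in this log.

```python
import re

# Matches watchdog lines of the form seen in the log above, e.g.
# [79201.298780] watchdog: BUG: soft lockup - CPU#19 stuck for 399s! [kworker/19:0:230506]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {cpu: (longest_stuck_seconds, task_name)} for each CPU reported."""
    worst = {}
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m is None:
            continue
        cpu, secs = int(m.group("cpu")), int(m.group("secs"))
        # Keep only the longest stall reported for each CPU.
        if cpu not in worst or secs > worst[cpu][0]:
            worst[cpu] = (secs, m.group("task"))
    return worst


# The three watchdog lines copied from the console log in this report.
SAMPLE = """\
[79201.298780] watchdog: BUG: soft lockup - CPU#19 stuck for 399s! [kworker/19:0:230506]
[79205.161424] watchdog: BUG: soft lockup - CPU#12 stuck for 401s! [sshd:280379]
[79205.278771] watchdog: BUG: soft lockup - CPU#18 stuck for 377s! [abrt-dump-journ:1092]
"""

for cpu, (secs, task) in sorted(summarize_soft_lockups(SAMPLE).items()):
    print(f"CPU {cpu}: stuck {secs}s in {task}")
```

Feeding a full console capture through the same function would show whether the stalls are confined to a few vCPUs or spread across the guest.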

Guest xml  -

[root@evelp2 ~]# virsh dumpxml evelp2g4
<domain type='kvm' id='35'>
<name>evelp2g4</name>
<uuid>be95b37d-c549-4862-8c28-f461372755ad</uuid>
<maxMemory slots='16' unit='KiB'>67108864</maxMemory>
<memory unit='KiB'>31457280</memory>
<currentMemory unit='KiB'>31457280</currentMemory>
<vcpu placement='static' current='32'>64</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='ppc64le' machine='pseries-8.1'>hvm</type>
<boot dev='hd'/>
<boot dev='network'/>
<boot dev='cdrom'/>
</os>
<cpu mode='custom' match='exact' check='none'>
<model fallback='forbid'>POWER10</model>
<topology sockets='4' dies='1' clusters='1' cores='8' threads='2'/>
<numa>
<cell id='0' cpus='0-63' memory='31457280' unit='KiB'/>
</numa>
</cpu>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>coredump-restart</on_crash>
<devices>
<emulator>/usr/bin/qemu-system-ppc64</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/kvm_pool/evelp2g4_root.qcow2' index='1'/>
<backingStore/>
<target dev='sda' bus='scsi'/>
<alias name='scsi0-0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<controller type='usb' index='0' model='nec-xhci'>
<alias name='usb'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='scsi' index='0' model='ibmvscsi'>
<alias name='scsi0'/>
<address type='spapr-vio' reg='0x00002000'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<model name='spapr-pci-host-bridge'/>
<target index='0'/>
<alias name='pci.0'/>
</controller>
<interface type='direct'>
<mac address='52:54:00:24:ab:bc'/>
<source network='macvtap' portid='bcee096d-cc84-4236-94b5-81d3d114c9c8' dev='eth1' mode='bridge'/>
<target dev='macvtap19'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/4'/>
<target type='spapr-vio-serial' port='0'>
<model name='spapr-vty'/>
</target>
<alias name='serial0'/>
<address type='spapr-vio' reg='0x30000000'/>
</serial>
<console type='pty' tty='/dev/pts/4'>
<source path='/dev/pts/4'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
<address type='spapr-vio' reg='0x30000000'/>
</console>
<audio id='1' type='none'/>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>
<panic model='pseries'/>
</devices>
<seclabel type='dynamic' model='dac' relabel='yes'>
<label>+107:+107</label>
<imagelabel>+107:+107</imagelabel>
</seclabel>
</domain>
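
When comparing guest configurations across hosts, the vCPU settings can be pulled out of a `virsh dumpxml` capture programmatically. This is a rough sketch using only the Python standard library; `vcpu_summary` is a hypothetical helper, and the embedded XML is a trimmed copy of the dump above, not the full domain definition.

```python
import xml.etree.ElementTree as ET


def vcpu_summary(domain_xml):
    """Extract machine type and vCPU counts from libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    topo = root.find("cpu/topology")
    # Multiply the topology attributes to get the vCPU count they describe.
    topo_total = 1
    for key in ("sockets", "dies", "clusters", "cores", "threads"):
        topo_total *= int(topo.get(key, "1"))
    return {
        "machine": root.find("os/type").get("machine"),
        "max_vcpus": int(vcpu.text),
        "current_vcpus": int(vcpu.get("current", vcpu.text)),
        "topology_vcpus": topo_total,
    }


# Trimmed excerpt of the evelp2g4 domain XML shown above.
SAMPLE_XML = """
<domain type='kvm' id='35'>
  <name>evelp2g4</name>
  <vcpu placement='static' current='32'>64</vcpu>
  <os>
    <type arch='ppc64le' machine='pseries-8.1'>hvm</type>
  </os>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='forbid'>POWER10</model>
    <topology sockets='4' dies='1' clusters='1' cores='8' threads='2'/>
  </cpu>
</domain>
"""

print(vcpu_summary(SAMPLE_XML))
```

For this guest the topology describes 4 x 8 x 2 = 64 vCPUs, matching the maximum, while 32 are currently online.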

Steps to reproduce: Install the F40 build on NFS storage (guest kernel 6.8.10-300)

Start the HTX workload - mdt.less

Start the NFS guest migration between the L2 hosts.

Source L2 host : evelp2
Target L2 host  : rinlp1

migration command : virsh migrate --live  --domain $vm_name qemu+ssh://$target_host/system --verbose --undefinesource --persistent --timeout 120

The same NFS storage is shared between the two hosts [here /kvm_pool]
10.33.4.52:/kvm_pool           nfs4      650G  304G  347G  47% /kvm_pool

Test running : HTX

Guest state : up
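
The repeated back-and-forth migration in the steps above could be scripted roughly as follows. This sketch only builds the argument lists and does not run anything; the helper names `build_migrate_cmd` and `ping_pong_plan` are hypothetical, and the `virsh migrate` options are the ones quoted in the migration command in this report.

```python
import shlex


def build_migrate_cmd(vm_name, target_host, timeout=120):
    """Build the virsh argv used in this report for one live migration."""
    return [
        "virsh", "migrate", "--live",
        "--domain", vm_name,
        f"qemu+ssh://{target_host}/system",
        "--verbose", "--undefinesource", "--persistent",
        "--timeout", str(timeout),
    ]


def ping_pong_plan(vm_name, host_a, host_b, rounds):
    """Return [(host_to_run_on, argv), ...], alternating direction each round."""
    plan = []
    src, dst = host_a, host_b
    for _ in range(rounds):
        plan.append((src, build_migrate_cmd(vm_name, dst)))
        src, dst = dst, src  # the guest now lives on the former target
    return plan


# Print the commands for three rounds between the two hosts named above.
for host, argv in ping_pong_plan("evelp2g4", "evelp2", "rinlp1", rounds=3):
    print(f"[run on {host}] {shlex.join(argv)}")
```

Each planned command would need to be executed on the host that currently owns the guest, for example through `subprocess.run` over ssh; that part is left out deliberately.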

------------------------------------------------------------------------------------- --------------------------------------

L2 guest Config:

(1) Problem on  Guest:   evelp2g4

(2) PHYP/ Processor Type:  KVM/P10/Everest

(3) Rootvg Filesystem: EXT4

(5) Network Bridge: Macvtap

(6) IO Disk Type/Driver: qemu-img/ qcow2

(7) Install Disk Type: Single

------------------------------------------------------------------------------------- --------------------------------------

L1 host details :

MDC mode : off

(1) PHYP/ Processor Type:  KVM/P10/Everest

(2) CEC Name: evelp2

(3) Rootvg Filesystem: xfs

(5) Network Interface: Dedicated Network

(6) IO Type: NVME

(8) Multipath Enabled: no

(9) Install Disk Type: Single

(10) MMU: RPT



DUMP Config:

(1) KDUMP configured on Host: yes

(2) KDUMP configured on Guest: yes

(3) DUMP Available: No

------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------

Log Details :

Unable to ssh to the guest, and because the console is continuously dumping traces, not able to get any logs.

(1) Guest SOSREPORT provided  : No,

Host sosreport : yes

Guest virsh dump collected

(2) /var/log/messages provided  : No

(3) console logs provided   Yes

(4) Guest xml - yes

(5) dmesg provided : No

(6) Firmware : FW 1060

Opening a separate bug to integrate the qemu-related patches, as instructed by Fedora and Dev.

Reference bug - 206737 - Red Hat Bugzilla - 2293597

Qemu patches are at
https://lore.kernel.org/qemu-devel/171760304518.1127.12881297254648658843.stgit@ad1b393f0e09/

Comment 2 IBM Bug Proxy 2024-08-23 09:10:51 UTC
------- Comment From sthoufee.com 2024-08-23 05:05 EDT-------
Need to integrate these changes into Fedora 40

Comment 3 Richard W.M. Jones 2024-08-24 18:13:39 UTC
Looks like the upstream commits would be:

b9c0a2e01c0f38bdc4ba8f69cf298eeebfb3738b linux-header: PPC: KVM: Update one-reg ids for DEXCR, HASHKEYR and HASHPKEYR
ca85beb4b783064781a3295feaa7b1a8645f2df9 target/ppc/cpu_init: Synchronize DEXCR with KVM for migration
843b243f8620a92f5ff652550b61fc724e5d520c target/ppc/cpu_init: Synchronize HASHKEYR with KVM for migration
c0840b46d4c8483a93370434f9ea10b8a7b50bde target/ppc/cpu_init: Synchronize HASHPKEYR with KVM for migration

As an aside, it might be more direct if you submit merge requests via
https://src.fedoraproject.org/rpms/qemu

Comment 4 Richard W.M. Jones 2024-08-24 18:18:46 UTC
I also had to pull in 978897a572e975faad912a473815a668a43d9f1f "target/ppc: Restore [H]DEXCR to 64-bits" as a prerequisite, so I hope that's OK.

Comment 6 Fedora Update System 2024-08-24 20:17:39 UTC
FEDORA-2024-d18acd2287 (qemu-8.2.6-3.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-d18acd2287

Comment 7 Fedora Update System 2024-08-25 01:27:46 UTC
FEDORA-2024-d18acd2287 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-d18acd2287`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-d18acd2287

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 8 Fedora Update System 2024-08-28 02:36:38 UTC
FEDORA-2024-d18acd2287 (qemu-8.2.6-3.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.
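
Before re-running the migration test, it may help to confirm that both hosts carry at least the fixed build, qemu-8.2.6-3.fc40. The sketch below is a simplified comparison and an assumption on my part: it does not implement RPM's full version-ordering rules, it only compares dotted numeric versions and the leading release number, and it is meant for builds within the same Fedora release. The helper names and the other package strings in the usage lines are hypothetical.

```python
import re

FIXED_NVR = "qemu-8.2.6-3.fc40"  # "Fixed In Version" from this bug


def parse_nvr(nvr):
    """Split name-version-release into (name, version tuple, release number)."""
    name, version, release = nvr.rsplit("-", 2)
    ver = tuple(int(part) for part in version.split("."))
    rel = int(re.match(r"\d+", release).group())
    return name, ver, rel


def has_fix(installed_nvr, fixed_nvr=FIXED_NVR):
    """True if installed_nvr is at or above fixed_nvr (simplified ordering)."""
    name, ver, rel = parse_nvr(installed_nvr)
    fixed_name, fixed_ver, fixed_rel = parse_nvr(fixed_nvr)
    if name != fixed_name:
        raise ValueError(f"expected package {fixed_name!r}, got {name!r}")
    return (ver, rel) >= (fixed_ver, fixed_rel)


# Hypothetical installed builds, e.g. as reported by `rpm -q qemu`.
for nvr in ("qemu-8.2.2-1.fc40", "qemu-8.2.6-3.fc40"):
    print(nvr, "has fix:", has_fix(nvr))
```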

