Bug 2219526 - Guest hit call trace when reboot after multifd migration
Summary: Guest hit call trace when reboot after multifd migration
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.3
Hardware: All
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Nitesh Narayan Lal
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-04 07:09 UTC by Li Xiaohui
Modified: 2023-07-21 05:09 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker RHELPLAN-161469 (last updated 2023-07-04 07:17:47 UTC)

Description Li Xiaohui 2023-07-04 07:09:32 UTC
Description of problem:
Enable the multifd capability, set multifd channels to 4 and the migration speed to 1M, then start a multifd migration.
Check whether migration is active; if it is, sleep 15s and then cancel the migration.
Finally, restart the multifd migration with multifd enabled, channels set to 2, and the migration speed set to 100M.
After this migration completes, reboot the VM on the dst host: the guest hits a call trace and apparently cannot be shut down by sending commands through the console.


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-327.el9.aarch64+64k && qemu-kvm-8.0.0-5.el9.aarch64
guest info: kernel-5.14.0-331.el9.aarch64+64k


How reproducible:
1/260


Steps to Reproduce:
1.Boot a guest on the src host with '-name debug-threads=on'; for the full QEMU command line see Comment 1 below.
2.Launch QEMU on the destination host with '-name debug-threads=on' and '-incoming defer'.
3.Set migration speed to 1M (steps 3-6 are also sketched as a single QMP session after this step list)
{"execute": "migrate-set-parameters", "arguments": {"max-bandwidth": 1048576}, "id": "BNAlvy9K"}
4.Turn on multifd on both src and dst host
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}, "id": "0VKvXnmp"}
5.Set multifd-channels to 4 on src&dst host before migration
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}, "id": "hwpTLChb"}
6.Migrate guest from src to dst host
dst qmp: {"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}, "id": "gwFlcvSO"}
src qmp: {"execute": "migrate", "arguments": {"uri": "tcp:10.19.241.218:4000", "blk": false, "inc": false, "detach": true, "resume": false}, "id": "Ucz45f2A"}
7.While migration is active, cancel it
{"execute": "query-migrate", "id": "mOVeKcKr"}
{"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 5013, "ram": {"total": 4429512704, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 26550400, "pages-per-second": 1280, "downtime-bytes": 0, "page-size": 4096, "remaining": 4403298304, "postcopy-bytes": 0, "mbps": 42.482080000000003, "transferred": 26550792, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 392, "duplicate": 0, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 26214400, "normal": 6400}}, "id": "mOVeKcKr"}

sleep 15 seconds, then cancel migration:
{"execute": "migrate_cancel", "id": "q3BGoScD"}
8.Restart the guest's QEMU on the dst host with '-incoming defer' and '-name debug-threads=on'
9.Set multifd-channels to 2 in src qemu
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 2}, "id": "uAvrYhFZ"}
10.Turn on multifd on dst host
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}, "id": "eOnaiLYD"}
11.Set migration listening port in dst qemu
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}, "id": "cOK03s5I"}
12.Migrate guest from src to dst host
a. Change migration speed to 100M
{"execute": "migrate-set-parameters", "arguments": {"max-bandwidth": 104857600}, "id": "dFU5hcwV"}
b. start migration
{"execute": "migrate", "arguments": {"uri": "tcp:10.19.241.218:4000", "blk": false, "inc": false, "detach": true, "resume": false}, "id": "3pHhkaxt"}
13.Check guest status after migration finished
a. Check dmesg info in guest
b. reboot guest
[root@guest ~]# reboot
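
For reference, the source-side QMP sequence from steps 3-6 can also be driven as a single session over the '-qmp tcp:0:3333' monitor shown in Comment 1. A minimal shell sketch follows; the nc pipe and the trailing sleep are assumptions for illustration, not part of the original test harness:

# Feed the QMP commands from steps 3-6 into the source QEMU's TCP QMP monitor.
# QMP requires the qmp_capabilities handshake before any other command.
{
  echo '{"execute": "qmp_capabilities"}'
  echo '{"execute": "migrate-set-parameters", "arguments": {"max-bandwidth": 1048576}}'
  echo '{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}'
  echo '{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}}'
  echo '{"execute": "migrate", "arguments": {"uri": "tcp:10.19.241.218:4000"}}'
  sleep 2   # keep the connection open long enough to collect the replies
} | nc 127.0.0.1 3333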


Actual results:
After step 13b, the guest hit a call trace during reboot:
2023-07-01-21:26:17: [  106.753100] systemd-shutdown[1]: Syncing filesystems and block devices.
2023-07-01-21:26:17: [  106.756793] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
2023-07-01-21:26:17: [  106.759583] systemd-journald[746]: Received SIGTERM from PID 1 (systemd-shutdow).
2023-07-01-21:26:17: [  106.802388] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
2023-07-01-21:26:17: [  106.804253] systemd-shutdown[1]: Unmounting file systems.
2023-07-01-21:26:17: [  106.804747] [2268]: Remounting '/' read-only with options 'seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'.
2023-07-01-21:26:17: [  106.818468] systemd-shutdown[1]: All filesystems unmounted.
2023-07-01-21:26:17: [  106.818473] systemd-shutdown[1]: Deactivating swaps.
2023-07-01-21:26:17: [  106.818512] systemd-shutdown[1]: All swaps deactivated.
2023-07-01-21:26:17: [  106.818513] systemd-shutdown[1]: Detaching loop devices.
2023-07-01-21:26:17: [  106.818657] systemd-shutdown[1]: All loop devices detached.
2023-07-01-21:26:17: [  106.818659] systemd-shutdown[1]: Stopping MD devices.
2023-07-01-21:26:17: [  106.818697] systemd-shutdown[1]: All MD devices stopped.
2023-07-01-21:26:17: [  106.818698] systemd-shutdown[1]: Detaching DM devices.
2023-07-01-21:26:17: [  106.819017] systemd-shutdown[1]: Detaching DM /dev/dm-2 (253:2).
2023-07-01-21:26:17: [  106.819152] dm-2: detected capacity change from 64995328 to 0
2023-07-01-21:26:17: [  106.880897] systemd-shutdown[1]: Detaching DM /dev/dm-1 (253:1).
2023-07-01-21:26:17: [  106.881041] dm-1: detected capacity change from 8265728 to 0
2023-07-01-21:27:17: [  167.020750] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2023-07-01-21:27:17: [  167.020756] rcu: 	3-...0: (0 ticks this GP) idle=2104/1/0x4000000000000000 softirq=5599/5599 fqs=1236
2023-07-01-21:27:17: [  167.020763] 	(detected by 2, t=6005 jiffies, g=9105, q=1064 ncpus=4)
2023-07-01-21:27:17: [  167.020766] Task dump for CPU 3:
2023-07-01-21:27:17: [  167.020767] task:systemd-shutdow state:R  running task     stack:0     pid:1     ppid:0      flags:0x0000000e
2023-07-01-21:27:17: [  167.020773] Call trace:
2023-07-01-21:27:17: [  167.020774]  __switch_to+0xc8/0x110
2023-07-01-21:27:17: [  167.020802]  blkg_destroy_all.isra.0+0x6c/0xd0
2023-07-01-21:27:17: [  167.020815]  blkcg_exit_disk+0x20/0x40
2023-07-01-21:27:17: [  167.020818]  disk_release+0x4c/0xe0
2023-07-01-21:27:17: [  167.020821]  device_release+0x38/0x90
2023-07-01-21:27:17: [  167.020829]  kobject_cleanup+0x48/0x180
2023-07-01-21:27:17: [  167.020837]  kobject_put+0x7c/0xcc
2023-07-01-21:27:17: [  167.020840]  put_device+0x18/0x24
2023-07-01-21:27:17: [  167.020844]  put_disk+0x20/0x30
2023-07-01-21:27:17: [  167.020846]  cleanup_mapped_device+0xa8/0xf0 [dm_mod]
2023-07-01-21:27:17: [  167.020887]  __dm_destroy+0x13c/0x28c [dm_mod]
2023-07-01-21:27:17: [  167.020905]  dm_destroy+0x18/0x20 [dm_mod]
2023-07-01-21:27:17: [  167.020922]  dev_remove+0x14c/0x1e0 [dm_mod]
2023-07-01-21:27:17: [  167.020939]  ctl_ioctl+0x218/0x384 [dm_mod]
2023-07-01-21:27:17: [  167.020955]  dm_ctl_ioctl+0x14/0x20 [dm_mod]
2023-07-01-21:27:17: [  167.020970]  __arm64_sys_ioctl+0xa8/0xec
2023-07-01-21:27:17: [  167.020983]  invoke_syscall.constprop.0+0x7c/0xd0
2023-07-01-21:27:17: [  167.020988]  el0_svc_common.constprop.0+0x58/0x174
2023-07-01-21:27:17: [  167.020992]  do_el0_svc+0x38/0x9c
2023-07-01-21:27:17: [  167.020995]  el0_svc+0x38/0x18c
2023-07-01-21:27:17: [  167.020998]  el0t_64_sync_handler+0xb4/0x130
2023-07-01-21:27:17: [  167.021001]  el0t_64_sync+0x178/0x17c


Expected results:
The guest can reboot successfully.


Additional info:
1.After the guest hit the call trace, I tried to shut it down by sending 'shutdown -h now' through the console, but the command failed to execute: there was no response to the shutdown command.

Comment 1 Li Xiaohui 2023-07-04 07:14:03 UTC
Qemu commands:
/usr/libexec/qemu-kvm  \
-name "mouse-vm",debug-threads=on \
-sandbox on \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars,memory-backend=mach-virt.ram \
-cpu host \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-0","multifunction":true,"bus":"pcie.0","addr":"0x1","chassis":1}' \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-1","port":1,"addr":"0x1.0x1","bus":"pcie.0","chassis":2}' \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-2","port":2,"addr":"0x1.0x2","bus":"pcie.0","chassis":3}' \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-3","port":3,"addr":"0x1.0x3","bus":"pcie.0","chassis":4}' \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-4","port":4,"addr":"0x1.0x4","bus":"pcie.0","chassis":5}' \
-device '{"driver":"pcie-root-port","id":"pcie-root-port-5","port":5,"addr":"0x1.0x5","bus":"pcie.0","chassis":6}' \
-device '{"driver":"pcie-root-port","id":"pcie_extra_root_port_0","multifunction":true,"bus":"pcie.0","addr":"0x2","chassis":7}' \
-device '{"driver":"pcie-root-port","id":"pcie_extra_root_port_1","addr":"0x2.0x1","bus":"pcie.0","chassis":8}' \
-device '{"driver":"pcie-root-port","id":"pcie_extra_root_port_2","addr":"0x2.0x2","bus":"pcie.0","chassis":9}' \
-device '{"driver":"pcie-pci-bridge","id":"pcie-pci-bridge-0","bus":"pcie-root-port-0","addr":"0x0"}' \
-device '{"driver":"virtio-gpu-pci","id":"video0","max_outputs":1,"bus":"pcie-root-port-1","addr":"0x0"}' \
-device '{"driver":"qemu-xhci","id":"usb1","bus":"pcie-root-port-2","addr":"0x0"}' \
-device '{"driver":"usb-tablet","id":"usb-tablet1","bus":"usb1.0","port":"1"}' \
-device '{"driver":"usb-kbd","id":"usb-kbd1","bus":"usb1.0","port":"2"}' \
-device '{"driver":"virtio-scsi-pci","id":"virtio_scsi_pci0","bus":"pcie-root-port-3","addr":"0x0"}' \
-device '{"driver":"scsi-hd","id":"image1","device_id":"drive-image1","drive":"drive_image1","bus":"virtio_scsi_pci0.0","channel":0,"scsi-id":0,"lun":0,"bootindex":0,"write-cache":"on"}' \
-device '{"driver":"virtio-net-pci","mac":"9a:0a:71:f3:69:7d","rombar":0,"id":"net0","netdev":"tap0","bus":"pcie-root-port-4","addr":"0x0"}' \
-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pcie-root-port-5","addr":"0x0"}' \
-blockdev '{"driver":"file","auto-read-only":true,"discard":"unmap","aio":"threads","cache":{"direct":true,"no-flush":false},"filename":"/mnt/xiaohli/rhel930-aarch64-64k-virtio-scsi.qcow2","node-name":"drive_sys1"}' \
-blockdev '{"driver":"qcow2","node-name":"drive_image1","read-only":false,"cache":{"direct":true,"no-flush":false},"file":"drive_sys1"}' \
-blockdev '{"node-name":"file_aavmf_code","driver":"file","filename":"/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.qcow2","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"drive_aavmf_code","driver":"qcow2","read-only":true,"file":"file_aavmf_code"}' \
-blockdev '{"node-name":"file_aavmf_vars","driver":"file","filename":"/mnt/xiaohli/rhel930-aarch64-64k-virtio-scsi.qcow2_VARS.fd","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"drive_aavmf_vars","driver":"qcow2","read-only":false,"file":"file_aavmf_vars"}' \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-object '{"qom-type":"memory-backend-ram","id":"mach-virt.ram","size":4294967296}' \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \
-msg timestamp=on \

Comment 2 Li Xiaohui 2023-07-04 07:20:28 UTC
Hi Yihuang, 
please help confirm whether the guest hits the call trace during reboot without migration, on the same kernel and QEMU versions.
You may need to repeat a case that includes a reboot operation about 500 times, since this bug is hard to reproduce.
Thanks in advance.

Comment 3 Peter Xu 2023-07-04 13:07:51 UTC
Yeah worth trying to see whether it's just a kernel bug.  Is this arm-specific?  Maybe we should tag the hardware before it's reproduced elsewhere to be clear?

Comment 4 Li Xiaohui 2023-07-05 02:36:15 UTC
(In reply to Peter Xu from comment #3)
> Yeah worth trying to see whether it's just a kernel bug.  

> Is this arm-specific?  

I'm not sure yet, because the reproduction rate is very low (1/260).

> Maybe we should tag the hardware before it's reproduced elsewhere to be clear?

Done.

Comment 5 Yihuang Yu 2023-07-05 06:37:21 UTC
Currently I cannot reproduce it by directly rebooting 200+ times. Before I can trigger this problem, I hit an edk2 assert issue first; I will continue to test.

Comment 6 Li Xiaohui 2023-07-05 06:47:53 UTC
Also reproduced this bug at a 1/200 rate when shutting down the VM after multifd migration.

This time I only did a basic multifd migration: enable the multifd capability and start the multifd migration. After migration, shut down the VM; the VM then hit the call trace.


Note: For both the description and this comment, each case repeats 3 rounds of migration scenario testing.

Comment 7 Li Xiaohui 2023-07-05 06:52:06 UTC
(In reply to Yihuang Yu from comment #5)
> Currently I cannot reproduce it by directly rebooting 200+ times. Before I
> can trigger this problem, I hit an edk2 assert issue first; I will continue
> to test.

Thank you. Per Comment 6, I think you should repeat the test more than 600 times, since each of my cases runs 3 rounds of the migration test.
So if you want to confirm whether this bug reproduces without migration, you need to test more than 600 times (200 multiplied by 3).

Comment 8 Yihuang Yu 2023-07-06 01:08:39 UTC
(In reply to Li Xiaohui from comment #7)
> (In reply to Yihuang Yu from comment #5)
> > Currently I cannot reproduce it by directly rebooting 200+ times. Before I
> > can trigger this problem, I hit an edk2 assert issue first; I will
> > continue to test.
> 
> Thank you. Per Comment 6, I think you should repeat the test more than 600
> times, since each of my cases runs 3 rounds of the migration test.
> So if you want to confirm whether this bug reproduces without migration, you
> need to test more than 600 times (200 multiplied by 3).

Also, the case passed after 1000 reboot iterations.

[stdlog] 2023-07-05 18:10:26,331 avocado.test DEBUG| Reboot count: 998
[stdlog] 2023-07-05 18:10:26,331 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1'
[stdlog] 2023-07-05 18:10:26,331 avocado.virttest.qemu_vm DEBUG| Send command: reboot
[stdlog] 2023-07-05 18:10:26,331 aexpect.client DEBUG| [192.168.3.90] Sending command: reboot
[stdlog] 2023-07-05 18:10:26,484 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1' --> waiting for guest to go down
[stdlog] 2023-07-05 18:10:33,493 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1' --> logging in after reboot
[stdlog] 2023-07-05 18:10:33,493 avocado.virttest.virt_vm DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 233s)
[stdlog] 2023-07-05 18:10:33,494 avocado.virttest.virt_vm DEBUG| Found/Verified IP 192.168.3.90 for VM avocado-vt-vm1 NIC 0
[stdlog] 2023-07-05 18:11:24,602 avocado.test DEBUG| Reboot count: 999
[stdlog] 2023-07-05 18:11:24,603 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1'
[stdlog] 2023-07-05 18:11:24,603 avocado.virttest.qemu_vm DEBUG| Send command: reboot
[stdlog] 2023-07-05 18:11:24,603 aexpect.client DEBUG| [192.168.3.90] Sending command: reboot
[stdlog] 2023-07-05 18:11:24,754 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1' --> waiting for guest to go down
[stdlog] 2023-07-05 18:11:31,763 avocado.virttest.qemu_vm INFO | Context: Reboot guest 'avocado-vt-vm1'. --> rebooting 'avocado-vt-vm1' --> logging in after reboot
[stdlog] 2023-07-05 18:11:31,764 avocado.virttest.virt_vm DEBUG| Attempting to log into 'avocado-vt-vm1' (timeout 233s)
[stdlog] 2023-07-05 18:11:31,764 avocado.virttest.virt_vm DEBUG| Found/Verified IP 192.168.3.90 for VM avocado-vt-vm1 NIC 0
[stdlog] 2023-07-05 18:12:24,176 avocado.virttest.qemu_vm DEBUG| Destroying VM avocado-vt-vm1 (PID 111416)
[stdlog] 2023-07-05 18:12:24,177 avocado.virttest.qemu_vm DEBUG| Shutting down VM avocado-vt-vm1 (shell)
[stdlog] 2023-07-05 18:12:31,715 avocado.virttest.qemu_vm INFO | [qemu output] (Process terminated with status 0)

serial:
2023-07-05 18:12:31: [   42.514344] dracut Warning: Unmounted /oldroot.
2023-07-05 18:12:31: 42.447658 | /etc/multipath.conf does not exist, blacklisting all devices.
2023-07-05 18:12:31: 42.448942 | You can run "/sbin/mpathconf --enable" to create
2023-07-05 18:12:31: 42.450090 | /etc/multipath.conf. See man mpathconf(8) for more details
2023-07-05 18:12:31: 42.451455 | invalid value for WatchdogSec: "0"
2023-07-05 18:12:31: [   42.560312] dracut: Disassembling device-mapper devices
2023-07-05 18:12:31: [   42.567147] dm-0: detected capacity change from 34414592 to 0
2023-07-05 18:12:31: [   42.655185] dracut: Waiting for mdraid devices to be clean.
2023-07-05 18:12:31: [   42.658159] dracut: Disassembling mdraid devices.
2023-07-05 18:12:31: Powering off.
2023-07-05 18:12:31: [   42.680270] sd 0:0:0:0: [sda] Synchronizing SCSI cache
2023-07-05 18:12:31: [   42.698341] reboot: Power down
2023-07-05 18:12:32: (Process terminated with status 0)

Comment 9 Juan Quintela 2023-07-13 14:18:58 UTC
One question:
guest info: kernel-5.14.0-331.el9.aarch64+64k

Does this mean that we are using 64K pages on the guest?  And what are we using on the host?

Just to make sure what is going on here.

Comment 10 Juan Quintela 2023-07-13 14:19:28 UTC
Can you reproduce it, even on arm with 4KB pages on both host and guest?

Later, Juan.

Comment 12 Li Xiaohui 2023-07-14 02:52:18 UTC
(In reply to Juan Quintela from comment #9)
> One question:
> guest info: kernel-5.14.0-331.el9.aarch64+64k
> 
> Does this mean that we are using 64K pages on the guest?  And what are we
> using on the host?

It should be.
The hosts also use 64K pages here; you can refer to the description of this bug:
*****************************************************************************
Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-327.el9.aarch64+64k && qemu-kvm-8.0.0-5.el9.aarch64
guest info: kernel-5.14.0-331.el9.aarch64+64k


Yihuang, please correct me if I'm wrong.

> 
> Just to make sure what is going on here.

Comment 13 Li Xiaohui 2023-07-14 02:53:09 UTC
(In reply to Juan Quintela from comment #10)
> Can you reproduce it, even on arm with 4KB pages on both host and guest?

I haven't tried with 4KB pages on both host and guest yet.
Will try next week.

> 
> Later, Juan.

Comment 14 Laszlo Ersek 2023-07-14 10:06:58 UTC
(In reply to Li Xiaohui from comment #0)

> 2023-07-01-21:26:17: [  106.881041] dm-1: detected capacity change from 8265728 to 0
> 2023-07-01-21:27:17: [  167.020750] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is where things go wrong.

In the kernel tree, "Documentation/RCU/stallwarn.rst" explains what
problems can cause such messages. In my reading, it more or less boils
down to "stuck/starved CPU".

Note that there is an exactly 1 minute long delay before the
"rcu_preempt detected stalls" message. In the RH kernel, we have:

> redhat/configs/common/generic/CONFIG_RCU_CPU_STALL_TIMEOUT:CONFIG_RCU_CPU_STALL_TIMEOUT=60
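
(A minimal sketch for double-checking this from inside the guest; it assumes
the usual RHEL locations for the build config and the rcupdate module
parameters:)

# Build-time stall timeout the guest kernel was compiled with:
grep CONFIG_RCU_CPU_STALL_TIMEOUT= /boot/config-$(uname -r)
# Effective value used by the running kernel, in seconds:
cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout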

My guess is that either (1) the migration causes too much downtime for
one of the CPUs, or else (2) the migration messes up block device state
in some way, and therefore blkg_destroy_all() gets stuck, with
interrupts disabled. (The documentation says that "The offending
function will usually be near the top of the stack".)

At tag "kernel-5.14.0-331.el9", the function looks like this
[block/blk-cgroup.c]:

> static void blkg_destroy_all(struct gendisk *disk)
> {
> 	struct request_queue *q = disk->queue;
> 	struct blkcg_gq *blkg, *n;
> 	int count = BLKG_DESTROY_BATCH_SIZE;
>
> restart:
> 	spin_lock_irq(&q->queue_lock);
> 	list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
> 		struct blkcg *blkcg = blkg->blkcg;
>
> 		if (hlist_unhashed(&blkg->blkcg_node))
> 			continue;
>
> 		spin_lock(&blkcg->lock);
> 		blkg_destroy(blkg);
> 		spin_unlock(&blkcg->lock);
>
> 		/*
> 		 * in order to avoid holding the spin lock for too long, release
> 		 * it when a batch of blkgs are destroyed.
> 		 */
> 		if (!(--count)) {
> 			count = BLKG_DESTROY_BATCH_SIZE;
> 			spin_unlock_irq(&q->queue_lock);
> 			cond_resched();
> 			goto restart;
> 		}
> 	}
>
> 	q->root_blkg = NULL;
> 	spin_unlock_irq(&q->queue_lock);
> }

BLKG_DESTROY_BATCH_SIZE is 64 [block/blk-cgroup.c]. This code certainly
seems like a good candidate for stalling a CPU for too long, with
interrupts disabled; the BLKG_DESTROY_BATCH_SIZE value is clearly
experimental.

Look in particular at the following (guest) kernel commit:

> commit a731763fc479a9c64456e0643d0ccf64203100c9
> Author: Yu Kuai <yukuai3>
> Date:   Wed Jul 7 09:56:49 2021 +0800
>
>     blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
>
>     We run a test that create millions of cgroups and blkgs, and then trigger
>     blkg_destroy_all(). blkg_destroy_all() will hold spin lock for a long
>     time in such situation. Thus release the lock when a batch of blkgs are
>     destroyed.
>
>     blkcg_activate_policy() and blkcg_deactivate_policy() might have the
>     same problem, however, as they are basically only called from module
>     init/exit paths, let's leave them alone for now.
>
>     Signed-off-by: Yu Kuai <yukuai3>
>     Acked-by: Tejun Heo <tj>
>     Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com
>     Signed-off-by: Jens Axboe <axboe>

(First appeared in v5.14.)

So they had already hit the warning we're seeing now, and mitigated it
by dropping the lock periodically, using an experimental frequency. If
scheduling is different (which I figure can be the case during
migration), their workaround might not work.

Or else, the underlying virtio-scsi-pci disk might indeed be stuck in
some way.

I think this is related to reboot / shutdown only in that we attempt to
deactivate the device (see dev_remove / dm_destroy in the outer frames
of the stack). I guess if we had dozens of disks attached to the guest,
and *stopped* those non-system disks after migration (rather than
shutting down or rebooting the VM), that might reproduce the same
symptom, and probably with higher frequency.

(What "stopping a disk" means in this context is unclear to me. For a
scsi disk, I figure unmounting all filesystems, deactivating all LVs /
PVs, and then ejecting the disk might suffice? I figure if we place the
system disk on virtio-blk, and the dozen data disks on virtio-scsi-pci,
then going as far as unloading the virtio-scsi module should certainly
"stop" those disks.)

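(A minimal sketch of what "stopping" one such data disk could look like from
inside the guest; the mount point, VG name and /dev/sdb are hypothetical
placeholders:)

# Quiesce and detach a non-system virtio-scsi data disk without rebooting:
umount /mnt/data                        # unmount any filesystems on the disk
vgchange -an datavg                     # deactivate the LVs/VG backed by it
echo 1 > /sys/block/sdb/device/delete   # remove the SCSI disk on the guest side
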
Anyway,

- FWIW, bug 2131144 is related to blkg_destroy_all(). Is this a
  regression? It might not pop up with earlier kernels.

- Comment 7 confirms migration is required for triggering the issue;
  does it have to be *multifd* migration though?

- After the "rcu_preempt detected stalls" message, what is the general
  state of the guest? If the reboot (or shutdown) never completes, there
  should be time enough for pausing the guest from the monitor, and then
  poking around in the block device states. (I figure with "info qtree",
  or even with gdb.)

- Does it reproduce with virtio-blk too? (= different front-end)

- Does it reproduce with virtio-scsi-pci, but with "less sophisticated"
  backend settings? ("cache=direct aio=threads" looks quite
  sophisticated.)
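
(For the last point, a "less sophisticated" backend could be something like
the following sketch, i.e. the -blockdev chain from comment 1 with the
direct-cache and aio=threads settings dropped; untested here:)

-blockdev '{"driver":"file","filename":"/mnt/xiaohli/rhel930-aarch64-64k-virtio-scsi.qcow2","node-name":"drive_sys1","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"driver":"qcow2","node-name":"drive_image1","read-only":false,"file":"drive_sys1"}' \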

Comment 15 Li Xiaohui 2023-07-17 09:52:44 UTC
Hi, I also hit the guest call trace once on x86 (hosts: kernel-5.14.0-334.el9.x86_64 && qemu-kvm-8.0.0-7.el9.x86_64) when running test case VIRT-294977. But that case is not a multifd migration and does not enable any migration capability.
The guest (RHEL 9.3, kernel-5.14.0-339.el9.x86_64) hit the call trace after migration.

Can anyone help confirm whether this issue is the same as this bug? If so, I believe we can change the Hardware field to All.

************************************************************************
[ 1947.636787] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1947.636790] rcu: 	1-...0: (0 ticks this GP) idle=237c/1/0x4000000000000000 softirq=11358/11358 fqs=12089
[ 1947.636794] 	(detected by 3, t=60004 jiffies, g=16845, q=95 ncpus=4)
[ 1947.636804] Sending NMI from CPU 3 to CPUs 1:
[ 1947.636833] NMI backtrace for cpu 1
[ 1947.636836] CPU: 1 PID: 1 Comm: systemd-shutdow Kdump: loaded Not tainted 5.14.0-339.el9.x86_64 #1
[ 1947.636838] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20230524-2.el9 05/24/2023
[ 1947.636839] RIP: 0010:blkg_destroy_all.isra.0+0x8d/0xd0
[ 1947.636848] Code: e7 e8 37 66 64 00 48 89 ef e8 4f fc ff ff 4c 89 e7 e8 d7 66 64 00 83 eb 01 74 36 49 8b 47 08 49 8d 57 08 4c 89 fd 48 83 e8 08 <49> 39 d5 75 ba 49 c7 86 70 01 00 00 00 00 00 00 48 8b 3c 24 48 83
[ 1947.636850] RSP: 0018:ffff998b4001bc30 EFLAGS: 00000086
[ 1947.636851] RAX: ffff8a997c054200 RBX: 000000000000003f RCX: 0000000000000046
[ 1947.636853] RDX: ffff8a997c054208 RSI: 0000000000000046 RDI: ffff8a9900c1f4c8
[ 1947.636853] RBP: ffff8a997c054200 R08: 0000000000000238 R09: ffff8a9900c1f4d0
[ 1947.636854] R10: 0000000000000002 R11: ffff8a9915ee6480 R12: ffff8a9912483400
[ 1947.636854] R13: ffff8a990c2b7d78 R14: ffff8a990c2b7c00 R15: ffff8a997c054200
[ 1947.636857] FS:  00007faa828b4b40(0000) GS:ffff8a997fc80000(0000) knlGS:0000000000000000
[ 1947.636859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1947.636860] CR2: 00007fde27c7a758 CR3: 000000010c320000 CR4: 0000000000350ee0
[ 1947.636861] Call Trace:
[ 1947.636866]  <NMI>
[ 1947.636871]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1947.636875]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1947.636877]  ? blkcg_exit_disk+0x12/0x30
[ 1947.636878]  ? nmi_cpu_backtrace.cold+0x1b/0x70
[ 1947.636881]  ? nmi_cpu_backtrace_handler+0xd/0x20
[ 1947.636884]  ? nmi_handle+0x5e/0x120
[ 1947.636887]  ? default_do_nmi+0x40/0x130
[ 1947.636889]  ? exc_nmi+0x111/0x140
[ 1947.636890]  ? end_repeat_nmi+0x16/0x67
[ 1947.636893]  ? blkg_destroy_all.isra.0+0x8d/0xd0
[ 1947.636894]  ? blkg_destroy_all.isra.0+0x8d/0xd0
[ 1947.636895]  ? blkg_destroy_all.isra.0+0x8d/0xd0
[ 1947.636896]  </NMI>
[ 1947.636896]  <TASK>
[ 1947.636897]  blkcg_exit_disk+0x12/0x30
[ 1947.636898]  disk_release+0x46/0xe0
[ 1947.636900]  device_release+0x34/0x90
[ 1947.636904]  kobject_cleanup+0x3a/0x130
[ 1947.636908]  cleanup_mapped_device+0xcd/0x120 [dm_mod]
[ 1947.636922]  __dm_destroy+0x141/0x1e0 [dm_mod]
[ 1947.636927]  dev_remove+0x119/0x1a0 [dm_mod]
[ 1947.636933]  ctl_ioctl+0x1a2/0x290 [dm_mod]
[ 1947.636939]  dm_ctl_ioctl+0xa/0x20 [dm_mod]
[ 1947.636943]  __x64_sys_ioctl+0x8a/0xc0
[ 1947.636946]  do_syscall_64+0x5c/0x90
[ 1947.636948]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1947.636950]  ? do_syscall_64+0x69/0x90
[ 1947.636950]  ? do_syscall_64+0x69/0x90
[ 1947.636951]  ? do_syscall_64+0x69/0x90
[ 1947.636952]  ? do_syscall_64+0x69/0x90
[ 1947.636953]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1947.636954] RIP: 0033:0x7faa82a3ec6b
[ 1947.636965] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[ 1947.636966] RSP: 002b:00007ffc06c9ace8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1947.636967] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007faa82a3ec6b
[ 1947.636968] RDX: 00007ffc06c9ad80 RSI: 00000000c138fd04 RDI: 0000000000000004
[ 1947.636968] RBP: 00007ffc06c9aef0 R08: 0000000000000000 R09: 00007ffc06c9a0c0
[ 1947.636969] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
[ 1947.636969] R13: 0000000000000000 R14: 00007ffc06c9ad40 R15: 0000000000000000
[ 1947.636971]  </TASK>

Comment 16 Peter Xu 2023-07-17 18:39:55 UTC
(In reply to Li Xiaohui from comment #15)
> Can anyone help confirm whether this issue is the same as this bug? If so, I
> believe we can change the Hardware field to All.

Not 100% sure, but I think highly likely.  For now I'd say we can mark it the same bug unless further justified.

I just noticed bz2193077, but it may not be relevant anymore: it is fixed in kernel-5.14.0-329.el9, while this bug is seen with 331+ (guest kernel). This probably also rules out testing on old kernels, because it wouldn't surprise me if the issue could be reproduced on old kernels that lack Ming's fix.

Maybe worth enabling kernel.panic_on_rcu_stall=1 in sysctl before any reboot attempt to get a dump.
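
(A minimal sketch of that setup inside the guest; the kdump check and the
sysctl.d path are assumptions about the usual RHEL tooling:)

kdumpctl status                            # confirm kdump is able to capture a vmcore
sysctl -w kernel.panic_on_rcu_stall=1      # panic (and dump) on the next RCU stall
echo 'kernel.panic_on_rcu_stall = 1' > /etc/sysctl.d/99-rcu-stall.conf   # persist it across the reboot under test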

Comment 17 Juan Quintela 2023-07-18 16:01:25 UTC
Hi Xiaohui

My understanding is that this is the same bug. And since you can reproduce it without multifd, multifd doesn't matter (it never made sense as a factor to begin with). And you are hitting the same function that Laszlo found, blkg_destroy_all().

So we can remove the ARM tag and just note whether it is more or less likely to happen on that architecture.

Later, Juan.

Comment 18 Li Xiaohui 2023-07-20 07:46:49 UTC
(In reply to Peter Xu from comment #16)
> (In reply to Li Xiaohui from comment #15)
> > Can anyone help confirm whether this issue is the same as this bug? If so,
> > I believe we can change the Hardware field to All.
> 
> Not 100% sure, but I think highly likely.  For now I'd say we can mark it
> the same bug unless further justified.
> 
> I just noticed bz2193077, but it may not be relevant anymore: it is fixed in
> kernel-5.14.0-329.el9, while this bug is seen with 331+ (guest kernel). This
> probably also rules out testing on old kernels, because it wouldn't surprise
> me if the issue could be reproduced on old kernels that lack Ming's fix.
> 
> Maybe worth enabling kernel.panic_on_rcu_stall=1 in sysctl before any reboot
> attempt to get a dump.

I will try enabling kernel.panic_on_rcu_stall=1 in sysctl next time.

Based on Peter's and Juan's comments, I will adjust the Hardware field to All.

Thank you all!

Comment 19 Ming Lei 2023-07-21 04:13:18 UTC
(In reply to Li Xiaohui from comment #15)
> Hi, I also hit the guest call trace once on x86 (hosts:
> kernel-5.14.0-334.el9.x86_64 && qemu-kvm-8.0.0-7.el9.x86_64) when running
> test case VIRT-294977. But that case is not a multifd migration and does not
> enable any migration capability.
> The guest (RHEL 9.3, kernel-5.14.0-339.el9.x86_64) hit the call trace after
> migration.
> 
> Can anyone help confirm whether this issue is the same as this bug? If so, I
> believe we can change the Hardware field to All.

It should be the same type of issue, and it is often triggered when there are too many blkcg_gq instances (their number is nr_blkcg for each disk).

Two blkcg fixes have not been backported to RHEL 9 yet; their absence probably keeps blkcg_gq around a bit too long. Can you test the two patches?

9c39b7a905d8 block: make sure local irq is disabled when calling __blkcg_rstat_flush
20cb1c2fb756 blk-cgroup: Flush stats before releasing blkcg_gq
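
(A minimal sketch for pulling the two commits into a test build of the guest
kernel; the upstream remote and the build/boot steps are assumptions:)

# In a checkout of the guest kernel source tree:
git remote add upstream https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git fetch upstream
git cherry-pick 9c39b7a905d8 20cb1c2fb756
# rebuild and boot the guest with the patched kernel, then re-run the reproducer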


Also, I guess the reboot does eventually move on after the warning is dumped, right?


Thanks,

Comment 20 Ming Lei 2023-07-21 05:09:46 UTC
Below is another improvement that may address too many unhashed blkcg_gq entries being left in q->blkg_list. Whether it helps still depends on the workload running in the guest, especially when many blkcg_gq instances are involved.


diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fc49be622e05..960e508c5699 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -575,14 +575,18 @@ static void blkg_destroy_all(struct gendisk *disk)
        struct request_queue *q = disk->queue;
        struct blkcg_gq *blkg, *n;
        int count = BLKG_DESTROY_BATCH_SIZE;
+       LIST_HEAD(tmp);
+
 
 restart:
        spin_lock_irq(&q->queue_lock);
        list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
                struct blkcg *blkcg = blkg->blkcg;
 
-               if (hlist_unhashed(&blkg->blkcg_node))
+               if (hlist_unhashed(&blkg->blkcg_node)) {
+                       list_move_tail(&blkg->q_node, &tmp);
                        continue;
+               }
 
                spin_lock(&blkcg->lock);
                blkg_destroy(blkg);
@@ -601,6 +605,7 @@ static void blkg_destroy_all(struct gendisk *disk)
        }
 
        q->root_blkg = NULL;
+       list_splice_tail(&tmp, &q->blkg_list);
        spin_unlock_irq(&q->queue_lock);
 }

