Bug 2099934 - Guest reboot on destination host after postcopy migration completed
Summary: Guest reboot on destination host after postcopy migration completed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: s390x
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 9.1
Assignee: Thomas Huth
QA Contact: bfu
URL:
Whiteboard:
Depends On:
Blocks: 2066305
 
Reported: 2022-06-22 03:06 UTC by Li Xiaohui
Modified: 2022-11-15 10:19 UTC
CC List: 18 users

Fixed In Version: qemu-kvm-7.0.0-10.el9
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-15 09:54:42 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 112 0 None opened Fix postcopy migration on s390x 2022-08-03 06:01:26 UTC
IBM Linux Technology Center 198917 0 None None None 2022-07-11 12:08:47 UTC
Red Hat Issue Tracker RHELPLAN-125940 0 None None None 2022-06-22 03:16:06 UTC
Red Hat Product Errata RHSA-2022:7967 0 None None None 2022-11-15 09:55:16 UTC

Description Li Xiaohui 2022-06-22 03:06:50 UTC
Description of problem:
Guest reboots automatically on the destination host after postcopy migration completes.
I didn't get useful console information from the automation; I will try to collect it manually.


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-114.el9.s390x & qemu-kvm-7.0.0-6.el9.s390x
guest info: kernel-5.14.0-114.el9.s390x


How reproducible:
1/10


Steps to Reproduce:
1.Boot a guest on source host
2.Boot a guest on destination host with appending '-incoming defer'
3.In guest, run one program to produce dirty page
# cd /home && gcc migration_dirtypage_2.c -o dirty_page
# cd /home && ./dirty_page > /dev/null &
4.Enable postcopy capability on both src and dst qemu, start the migration, then switch to postcopy while the migration is active (see the QMP sketch after these steps)
5.After postcopy migration completes, check the guest on the destination host
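
For reference, steps 4-5 correspond to a QMP sequence roughly like the one below. This is only a sketch modeled on the commands visible in the automation log in comment 9; the port and destination host are placeholders for whatever the test harness actually uses.

on both src and dst QMP monitors (enable postcopy):
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "postcopy-ram", "state": true}]}}

on dst (which was started with '-incoming defer'):
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}}

on src (start the migration, then switch to postcopy while it is still active):
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-host>:4000"}}
{"execute": "migrate-start-postcopy"}

on src (poll until "status" is "completed"):
{"execute": "query-migrate"}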


Actual results:
Guest seems to have crashed after postcopy migration completed, since 'dmesg' cannot be executed in the guest to check the kernel messages


Expected results:
Guest works well on destination host after postcopy migration completed


Additional info:
1.Will collect guest crash info through the console manually
2.Will check whether this bug is a regression later.
3.Will check whether other architectures can reproduce this bug

Comment 1 smitterl 2022-06-22 10:38:09 UTC
(In reply to Li Xiaohui from comment #0)
> Description of problem:
> Guest seems crash on destination host after postcopy migration completed.
> I don't get useful console information by automation, I will try to collect
> by manually
> 
> 
> Version-Release number of selected component (if applicable):
> hosts info: kernel-5.14.0-114.el9.s390x & qemu-kvm-7.0.0-6.el9.s390x
> guest info: kernel-5.14.0-114.el9.s390x
> 
> 
> How reproducible:
> 1/10
> 
> 
> Steps to Reproduce:
> 1.Boot a guest on source host
> 2.Boot a guest on destination host with appending '-incoming defer'
> 3.In guest, run one program to produce dirty page
> # cd /home && gcc migration_dirtypage_2.c -o dirty_page
> # cd /home && ./dirty_page > /dev/null &
> 4.Enable postcopy capability on both src and dst qemu, do migration, switch
> to postcopy during migration is active
> 5.After postcopy migration completed, check guest on destination host
> 
> 
> Actual results:
> Guest seems crash after postcopy migration completed since can't execute
> 'dmesg' to check dmesg info in guest
> 
> 
> Expected results:
> Guest works well on destination host after postcopy migration completed
> 
> 
> Additional info:
> 1.Will collect guest crash info through console by manually
> 2.Will check whether this bug is a regression later.
> 3.Will check whether other architectures can reproduce this bug

Thank you for this Li!

Could you share some more details about which commands you use to run the migration and to start the vm with?

For CTC1, we tested several live migration scenarios with postcopy and they passed.

Comment 2 smitterl 2022-06-22 10:39:25 UTC
(In reply to smitterl from comment #1)
> (In reply to Li Xiaohui from comment #0)
> > Description of problem:
> > Guest seems crash on destination host after postcopy migration completed.
> > I don't get useful console information by automation, I will try to collect
> > by manually
> > 
> > 
> > Version-Release number of selected component (if applicable):
> > hosts info: kernel-5.14.0-114.el9.s390x & qemu-kvm-7.0.0-6.el9.s390x
> > guest info: kernel-5.14.0-114.el9.s390x
Note, we used the following component versions for the migration testing:
libvirt-8.4.0-2.el9.s390x
qemu-kvm-7.0.0-6.el9.s390x
kernel-5.14.0-106.el9.s390x

Comment 3 smitterl 2022-06-22 10:44:59 UTC
I checked the linked Polarion test case; IIUC, you're using hugepage memory backing during the live migration, is that correct?

I have not found coverage for that in our test suite; Fangge, do you happen to know if we have a test scenario for this at the libvirt level?

Comment 4 Fangge Jin 2022-06-22 11:11:56 UTC
There is a test case at the libvirt level:
VIRT-17395 - [Migration] Migrate vm with hugepage memory + numa
But there is no "produce dirty page in vm" step in this case.

Comment 5 Li Xiaohui 2022-06-22 13:14:19 UTC
Hi Sebastian, the reason I didn't provide the qemu command line or say anything about huge pages is that I guessed we could reproduce this bug without huge pages. It turns out that my guess was correct.

In summary, I can also reproduce this bug with plain postcopy migration without huge pages, as in the Description; the reproduction rate is roughly 1/50, and it is higher (1/10) with huge pages configured.

The qemu command line is:
/usr/libexec/qemu-kvm  \
-name "mouse-vm" \
-sandbox on \
-machine s390-ccw-virtio,memory-backend=mem-machine_mem \
-cpu host \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device virtio-scsi-ccw,id=virtio_scsi_ccw0,devno=fe.0.0000 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on,bus=virtio_scsi_ccw0.0 \
-device virtio-net-ccw,mac=9a:fc:6e:3c:a6:f0,id=idE6xm3i,netdev=idnKxBqM,devno=fe.0.0001 \
-device virtio-mouse-ccw,id=input_mouse1,devno=fe.0.0002 \
-device virtio-keyboard-ccw,id=input_keyboard1,devno=fe.0.0003 \
-device virtio-balloon-ccw,id=balloon0,devno=fe.0.0004 \
-blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/xiaohli/rhel910-s390x-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
-netdev tap,id=idnKxBqM,vhost=on \
-m 4096 \
-object memory-backend-ram,id=mem-machine_mem,size=4096M \
-smp 1,maxcpus=1,cores=1,threads=1,sockets=1 \
-nographic  \
-rtc base=utc,clock=host,driftfix=slew \
-boot strict=on \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \

And migration_dirtypage_2.c is as below: 
# cat migration_dirtypage_2.c
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>

void wakeup();

int main()
{
    /* stop dirtying memory and exit cleanly after 120 seconds */
    signal(SIGALRM, wakeup);
    alarm(120);

    /* 40960 pages of 4 KiB each (160 MiB of zeroed memory) */
    char *buf = (char *) calloc(40960, 4096);
    while (1) {
        int i;
        /* touch one byte every 1 KiB (four writes per 4 KiB page),
           so the whole buffer is dirtied over and over again */
        for (i = 0; i < 40960 * 4; i++) {
            buf[i * 4096 / 4]++;
        }
        printf(".");
    }
}

void wakeup()
{
    exit(0);
}

Comment 6 Li Xiaohui 2022-06-22 13:25:51 UTC
I have tried this case more than 100 times on x86 (RHEL 9.1.0, kernel-5.14.0-110.el9.x86_64 & qemu-kvm-7.0.0-6.el9.x86_64); all runs pass and do not reproduce this bug. So it seems to be an s390x-only issue.

Comment 7 Dr. David Alan Gilbert 2022-06-22 14:42:36 UTC
OK, so we need to know whether this is a regression in either qemu or the kernel, or whether it has always been there.

Comment 8 Li Xiaohui 2022-06-24 07:27:58 UTC
Tried 100 times to reproduce this bug on RHEL 8.7.0 (kernel-4.18.0-402.el8.s390x & qemu-kvm-6.2.0-15.module+el8.7.0+15644+189a21f6.s390x) with the same RHEL 9.1.0 guest, but didn't hit the issue. Instead, I hit a new call trace at a 1/100 rate when booting the guest on the source host with huge pages configured. Do we need to open a new bug?
2022-06-23-10:34:30: [  243.884523] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.884534]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.884584]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.884586]  [<000000001ab2f362>] rwsem_down_read_slowpath+0x182/0x400 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.884589]  [<000003ff80283ada>] xlog_cil_commit+0x4a/0x180 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885033]  [<000003ff8027b56c>] __xfs_trans_commit+0x9c/0x330 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885106]  [<000003ff8026d966>] xfs_remove+0x276/0x400 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885178]  [<000003ff802679aa>] xfs_vn_unlink+0x5a/0xb0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885248]  [<000000001a57260c>] vfs_rmdir+0x9c/0x200 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885268]  [<000000001a578d74>] do_rmdir+0x164/0x1d0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885271]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885282]  [<000000001ab330a8>] system_call+0x78/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885292] INFO: task systemd:971 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885294]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885298] task:systemd         state:D stack:    0 pid:  971 ppid:     0 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885301] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885302]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885305]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885306]  [<000000001ab2f362>] rwsem_down_read_slowpath+0x182/0x400 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885309]  [<000003ff80283ada>] xlog_cil_commit+0x4a/0x180 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885383]  [<000003ff8027b56c>] __xfs_trans_commit+0x9c/0x330 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885469]  [<000003ff8026d966>] xfs_remove+0x276/0x400 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885540]  [<000003ff802679aa>] xfs_vn_unlink+0x5a/0xb0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885611]  [<000000001a57260c>] vfs_rmdir+0x9c/0x200 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885614]  [<000000001a578d74>] do_rmdir+0x164/0x1d0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885615]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885618]  [<000000001ab330a8>] system_call+0x78/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885623] INFO: task kworker/u2:0:8 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885625]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885628] task:kworker/u2:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885633] Workqueue: writeback wb_workfn (flush-253:0)
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885653] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885656]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885658]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885660]  [<000000001ab32222>] schedule_timeout+0x122/0x170 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885663]  [<000000001ab2d5ce>] __wait_for_common+0xbe/0x250 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885665]  [<000003ff80251bb4>] xfs_buf_iowait+0x34/0xf0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885735]  [<000003ff80254104>] __xfs_buf_submit+0x114/0x1f0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885811]  [<000003ff80254a74>] xfs_buf_read_map+0xe4/0x210 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885887]  [<000003ff80293856>] xfs_trans_read_buf_map+0x126/0x2e0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.885977]  [<000003ff801fef6e>] xfs_alloc_read_agfl+0x8e/0xd0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886052]  [<000003ff802018a6>] xfs_alloc_fix_freelist+0x1a6/0x410 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886122]  [<000003ff80201cb0>] xfs_alloc_vextent+0x130/0x560 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886190]  [<000003ff80211b6a>] xfs_bmap_btalloc+0x18a/0x790 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886257]  [<000003ff80212234>] xfs_bmapi_allocate+0xc4/0x2f0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886325]  [<000003ff8021335a>] xfs_bmapi_convert_delalloc+0x25a/0x490 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886392]  [<000003ff8024c490>] xfs_map_blocks+0x200/0x420 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886462]  [<000000001a5f09ca>] iomap_writepage_map+0x14a/0x540 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886483]  [<000000001a5f1c9e>] iomap_do_writepage+0x8e/0x230 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886486]  [<000000001a48c370>] write_cache_pages+0x1e0/0x430 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886520]  [<000000001a5ee706>] iomap_writepages+0x36/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886523]  [<000003ff8024c9f2>] xfs_vm_writepages+0x92/0xd0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886786]  [<000000001a48ee62>] do_writepages+0x62/0x1f0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886789]  [<000000001a5a664c>] __writeback_single_inode+0x4c/0x3e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886791]  [<000000001a5a6ed6>] writeback_sb_inodes+0x246/0x5b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886794]  [<000000001a5a729a>] __writeback_inodes_wb+0x5a/0x130 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886796]  [<000000001a5a75de>] wb_writeback+0x26e/0x3a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886801]  [<000000001a5a89a0>] wb_workfn+0x3a0/0x610 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886803]  [<000000001a2c7146>] process_one_work+0x216/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886819]  [<000000001a2c7904>] worker_thread+0x64/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886821]  [<000000001a2d044c>] kthread+0x17c/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886824]  [<000000001a2575dc>] __ret_from_fork+0x3c/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886839]  [<000000001ab330da>] ret_from_fork+0xa/0x30 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886844] INFO: task kworker/0:1:17 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886846]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886850] task:kworker/0:1     state:D stack:    0 pid:   17 ppid:     2 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886856] Workqueue: xfs-sync/dm-0 xfs_log_worker [xfs]
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886931] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886934]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886936]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886938]  [<000000001ab32222>] schedule_timeout+0x122/0x170 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886941]  [<000000001ab2d5ce>] __wait_for_common+0xbe/0x250 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886943]  [<000000001a2c5e9a>] __flush_work.isra.0+0x17a/0x2b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.886946]  [<000003ff80283c8c>] xlog_cil_force_seq+0x7c/0x240 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887020]  [<000003ff80280ede>] xfs_log_force+0x8e/0x270 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887093]  [<000003ff80281116>] xfs_log_worker+0x56/0xa0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887166]  [<000000001a2c7146>] process_one_work+0x216/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887169]  [<000000001a2c7904>] worker_thread+0x64/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887171]  [<000000001a2d044c>] kthread+0x17c/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887173]  [<000000001a2575dc>] __ret_from_fork+0x3c/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887178]  [<000000001ab330da>] ret_from_fork+0xa/0x30 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887185] INFO: task kworker/u2:1:35 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887187]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887190] task:kworker/u2:1    state:D stack:    0 pid:   35 ppid:     2 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887194] Workqueue: writeback wb_workfn (flush-253:0)
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887196] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887200]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887202]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887204]  [<000000001ab2ca12>] io_schedule+0x52/0x80 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887206]  [<000000001a703350>] rq_qos_wait+0xc0/0x130 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887223]  [<000000001a725a24>] wbt_wait+0x94/0x110 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887231]  [<000000001a702e3e>] __rq_qos_throttle+0x3e/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887234]  [<000000001a6ed86a>] blk_mq_get_new_requests+0xda/0x1a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887243]  [<000000001a6f2eec>] blk_mq_submit_bio+0x31c/0x4c0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887246]  [<000000001a6e204c>] __submit_bio_noacct+0x6c/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887260]  [<000000001a5ee68e>] iomap_submit_ioend+0x7e/0xc0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887262]  [<000000001a5f0b52>] iomap_writepage_map+0x2d2/0x540 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887264]  [<000000001a5f1c9e>] iomap_do_writepage+0x8e/0x230 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887266]  [<000000001a48c370>] write_cache_pages+0x1e0/0x430 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887269]  [<000000001a5ee706>] iomap_writepages+0x36/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887271]  [<000003ff8024c9f2>] xfs_vm_writepages+0x92/0xd0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887344]  [<000000001a48ee62>] do_writepages+0x62/0x1f0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887346]  [<000000001a5a664c>] __writeback_single_inode+0x4c/0x3e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887349]  [<000000001a5a6ed6>] writeback_sb_inodes+0x246/0x5b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887351]  [<000000001a5a729a>] __writeback_inodes_wb+0x5a/0x130 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887353]  [<000000001a5a75de>] wb_writeback+0x26e/0x3a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887355]  [<000000001a5a89a0>] wb_workfn+0x3a0/0x610 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887358]  [<000000001a2c7146>] process_one_work+0x216/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887360]  [<000000001a2c7904>] worker_thread+0x64/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887363]  [<000000001a2d044c>] kthread+0x17c/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887394]  [<000000001a2575dc>] __ret_from_fork+0x3c/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887396]  [<000000001ab330da>] ret_from_fork+0xa/0x30 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887403] INFO: task kworker/u2:3:177 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887405]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887407] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887408] task:kworker/u2:3    state:D stack:    0 pid:  177 ppid:     2 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887416] Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887490] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887494]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887496]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887498]  [<000000001ab2ca12>] io_schedule+0x52/0x80 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887500]  [<000000001a703350>] rq_qos_wait+0xc0/0x130 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887503]  [<000000001a725a24>] wbt_wait+0x94/0x110 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887505]  [<000000001a702e3e>] __rq_qos_throttle+0x3e/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887507]  [<000000001a6ed86a>] blk_mq_get_new_requests+0xda/0x1a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887510]  [<000000001a6f2eec>] blk_mq_submit_bio+0x31c/0x4c0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887513]  [<000000001a6e204c>] __submit_bio_noacct+0x6c/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887516]  [<000003ff8028336e>] xlog_cil_push_work+0x1fe/0x790 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887595]  [<000000001a2c7146>] process_one_work+0x216/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887597]  [<000000001a2c7af0>] worker_thread+0x250/0x4a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887599]  [<000000001a2d044c>] kthread+0x17c/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887601]  [<000000001a2575dc>] __ret_from_fork+0x3c/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887604]  [<000000001ab330da>] ret_from_fork+0xa/0x30 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887607] INFO: task plymouthd:511 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887609]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887612] task:plymouthd       state:D stack:    0 pid:  511 ppid:     1 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887615] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887619]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887621]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887623]  [<000000001ab2fa98>] rwsem_down_write_slowpath+0x258/0x540 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887626]  [<000003ff80267dac>] xfs_vn_update_time+0xbc/0x1c0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887701]  [<000000001a587d06>] file_update_time+0xd6/0x140 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887708]  [<000003ff8025a1be>] xfs_file_write_checks+0x24e/0x2f0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887782]  [<000003ff8025a8ec>] xfs_file_buffered_write+0x7c/0x2c0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887853]  [<000000001a55ffa0>] new_sync_write+0x100/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887860]  [<000000001a562c58>] vfs_write+0x1e8/0x280 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887863]  [<000000001a562f68>] ksys_write+0x68/0x100 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887868]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887870]  [<000000001ab330a8>] system_call+0x78/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887882] INFO: task auditd:900 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887884]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887885] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887887] task:auditd          state:D stack:    0 pid:  900 ppid:     1 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887890] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887891]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887894]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887896]  [<000000001ab2f362>] rwsem_down_read_slowpath+0x182/0x400 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887898]  [<000003ff80283ada>] xlog_cil_commit+0x4a/0x180 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.887972]  [<000003ff8027b56c>] __xfs_trans_commit+0x9c/0x330 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888044]  [<000003ff80267e10>] xfs_vn_update_time+0x120/0x1c0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888116]  [<000000001a587d06>] file_update_time+0xd6/0x140 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888118]  [<000003ff8025a1be>] xfs_file_write_checks+0x24e/0x2f0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888189]  [<000003ff8025a8ec>] xfs_file_buffered_write+0x7c/0x2c0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888267]  [<000000001a55ffa0>] new_sync_write+0x100/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888270]  [<000000001a562c58>] vfs_write+0x1e8/0x280 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888272]  [<000000001a562f68>] ksys_write+0x68/0x100 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888274]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888277]  [<000000001ab330a8>] system_call+0x78/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888299] INFO: task rs:main Q:Reg:937 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888301]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888304] task:rs:main Q:Reg   state:D stack:    0 pid:  937 ppid:     1 flags:0x00000000
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888307] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888312]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888314]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888316]  [<000000001ab32222>] schedule_timeout+0x122/0x170 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888319]  [<000000001ab2ecce>] __down+0x9e/0x110 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888324]  [<000000001a313e8c>] down+0x8c/0x90 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888337]  [<000003ff802531f6>] xfs_buf_lock+0x46/0x100 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888407]  [<000003ff802534da>] xfs_buf_find+0x22a/0x3d0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888477]  [<000003ff80253860>] xfs_buf_get_map+0x50/0x350 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888548]  [<000003ff802549da>] xfs_buf_read_map+0x4a/0x210 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888625]  [<000003ff80293856>] xfs_trans_read_buf_map+0x126/0x2e0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888698]  [<000003ff80217780>] xfs_btree_read_buf_block.constprop.0+0xa0/0xe0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888765]  [<000003ff80219590>] xfs_btree_lookup_get_block+0xa0/0x190 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888833]  [<000003ff8021deec>] xfs_btree_visit_blocks+0x9c/0x150 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888900]  [<000003ff8020fb7e>] xfs_iread_extents+0xbe/0x150 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.888967]  [<000003ff80266b46>] xfs_buffered_write_iomap_begin+0xf6/0x920 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889039]  [<000000001a5ee078>] iomap_apply+0x68/0x2c0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889041]  [<000000001a5ee460>] iomap_file_buffered_write+0x80/0xb0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889043]  [<000003ff8025a926>] xfs_file_buffered_write+0xb6/0x2c0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889114]  [<000000001a55ffa0>] new_sync_write+0x100/0x190 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889116]  [<000000001a562c58>] vfs_write+0x1e8/0x280 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889121]  [<000000001a562f68>] ksys_write+0x68/0x100 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889124]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889126]  [<000000001ab330a8>] system_call+0x78/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889130] INFO: task firewalld:962 blocked for more than 122 seconds.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889132]       Not tainted 5.14.0-114.el9.s390x #1
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889135] task:firewalld       state:D stack:    0 pid:  962 ppid:     1 flags:0x00000002
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889138] Call Trace:
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889143]  [<000000001ab2c4c2>] __schedule+0x2c2/0x6b0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889149]  [<000000001ab2c904>] schedule+0x54/0xf0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889150]  [<000000001ab32222>] schedule_timeout+0x122/0x170 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889153]  [<000000001ab2d5ce>] __wait_for_common+0xbe/0x250 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889155]  [<000003ff80251bb4>] xfs_buf_iowait+0x34/0xf0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889226]  [<000003ff80254104>] __xfs_buf_submit+0x114/0x1f0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889296]  [<000003ff80254a74>] xfs_buf_read_map+0xe4/0x210 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889367]  [<000003ff80293856>] xfs_trans_read_buf_map+0x126/0x2e0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889441]  [<000003ff8023ad06>] xfs_imap_to_bp+0x66/0x80 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889510]  [<000003ff8025fc7c>] xfs_iget_cache_miss+0x1ac/0x350 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889580]  [<000003ff802600e0>] xfs_iget+0x1d0/0x2e0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889651]  [<000003ff8026c1e0>] xfs_lookup+0xe0/0x110 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889723]  [<000003ff80267b22>] xfs_vn_lookup+0x62/0xa0 [xfs] 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889799]  [<000000001a57198a>] __lookup_slow+0x9a/0x170 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889802]  [<000000001a575f6a>] walk_component+0x14a/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889805]  [<000000001a5761f6>] link_path_walk.part.0.constprop.0+0x1f6/0x350 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889808]  [<000000001a57645c>] path_lookupat+0x4c/0x1f0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889810]  [<000000001a577ffa>] filename_lookup+0xaa/0x1a0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889813]  [<000000001a578212>] user_path_at_empty+0x72/0xa0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889816]  [<000000001a569872>] vfs_statx+0x82/0x130 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889818]  [<000000001a569b94>] __do_sys_newfstatat+0x34/0x60 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889820]  [<000000001ab25518>] __do_syscall+0x1b8/0x1e0 
2022-06-23-10:34:30: 
2022-06-23-10:34:30: [  243.889822]  [<000000001ab330a8>] system_call+0x78/0xa

Comment 9 bfu 2022-06-24 10:42:14 UTC
Hit this when I triggered the stable guest ABI test on s390x as well.

After the guest migrated back to the src side, the guest seems to have crashed.

Test log:
2022-06-24-03:12:48: Host(10.0.160.25) Sending qmp command   : {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "postcopy-ram", "state": true}]}, "id": "BCLFqkWC"}
2022-06-24-03:12:49: Host(10.0.160.25) Responding qmp command: {"return": {}, "id": "BCLFqkWC"}
2022-06-24-03:12:49: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate-capabilities", "id": "bHD83ePi"}
2022-06-24-03:12:49: Host(10.0.160.25) Responding qmp command: {"return": [{"state": false, "capability": "xbzrle"}, {"state": false, "capability": "rdma-pin-all"}, {"state": false, "capability": "auto-converge"}, {"state": false, "capability": "zero-blocks"}, {"state": false, "capability": "compress"}, {"state": false, "capability": "events"}, {"state": true, "capability": "postcopy-ram"}, {"state": false, "capability": "x-colo"}, {"state": false, "capability": "release-ram"}, {"state": false, "capability": "return-path"}, {"state": false, "capability": "pause-before-switchover"}, {"state": false, "capability": "multifd"}, {"state": false, "capability": "dirty-bitmaps"}, {"state": false, "capability": "postcopy-blocktime"}, {"state": false, "capability": "late-block-activate"}, {"state": false, "capability": "x-ignore-shared"}, {"state": false, "capability": "validate-uuid"}, {"state": false, "capability": "background-snapshot"}], "id": "bHD83ePi"}
2022-06-24-03:12:49: Succeed to set migration capabilities: postcopy-ram
2022-06-24-03:12:49: Host(10.0.160.24) Sending qmp command   : {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "postcopy-ram", "state": true}]}, "id": "HJpFNe7F"}
2022-06-24-03:12:49: Host(10.0.160.24) Responding qmp command: {"return": {}, "id": "HJpFNe7F"}
2022-06-24-03:12:49: Host(10.0.160.24) Sending qmp command   : {"execute": "query-migrate-capabilities", "id": "rHY0W1i5"}
2022-06-24-03:12:49: Host(10.0.160.24) Responding qmp command: {"return": [{"state": false, "capability": "xbzrle"}, {"state": false, "capability": "rdma-pin-all"}, {"state": false, "capability": "auto-converge"}, {"state": false, "capability": "zero-blocks"}, {"state": false, "capability": "compress"}, {"state": false, "capability": "events"}, {"state": true, "capability": "postcopy-ram"}, {"state": false, "capability": "x-colo"}, {"state": false, "capability": "release-ram"}, {"state": false, "capability": "return-path"}, {"state": false, "capability": "pause-before-switchover"}, {"state": false, "capability": "multifd"}, {"state": false, "capability": "dirty-bitmaps"}, {"state": false, "capability": "postcopy-blocktime"}, {"state": false, "capability": "late-block-activate"}, {"state": false, "capability": "x-ignore-shared"}, {"state": false, "capability": "validate-uuid"}, {"state": false, "capability": "background-snapshot"}], "id": "rHY0W1i5"}
2022-06-24-03:12:49: Succeed to set migration capabilities: postcopy-ram
2022-06-24-03:12:49: Host(10.0.160.24) Sending qmp command   : {"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}, "id": "sPk3duNY"}
2022-06-24-03:12:50: Host(10.0.160.24) Responding qmp command: {"return": {}, "id": "sPk3duNY"}
2022-06-24-03:12:50: Host(10.0.160.25) Sending qmp command   : {"execute": "migrate", "arguments": {"uri": "tcp:10.0.160.24:4000"}, "id": "YCJ4a63J"}
2022-06-24-03:12:50: Host(10.0.160.25) Responding qmp command: {"return": {}, "id": "YCJ4a63J"}
2022-06-24-03:12:50: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "z4IBS1ZW"}
2022-06-24-03:12:50: Host(10.0.160.25) Responding qmp command: {"return": {"status": "setup"}, "id": "z4IBS1ZW"}
2022-06-24-03:12:50: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "WNYwVDit"}
2022-06-24-03:12:50: Host(10.0.160.25) Responding qmp command: {"return": {"status": "setup"}, "id": "WNYwVDit"}
2022-06-24-03:12:50: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "knhb2XR9"}
2022-06-24-03:12:50: Host(10.0.160.25) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 418, "total-time": 424, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 0, "downtime-bytes": 0, "page-size": 4096, "remaining": 4286066688, "postcopy-bytes": 0, "mbps": 0, "transferred": 8795157, "precopy-bytes": 8795157, "duplicate": 29, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 8777728, "normal": 2143}}, "id": "knhb2XR9"}
2022-06-24-03:12:55: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "KIrihPs8"}
2022-06-24-03:12:55: Host(10.0.160.25) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 418, "total-time": 5427, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 32710, "downtime-bytes": 0, "page-size": 4096, "remaining": 1616863232, "postcopy-bytes": 0, "mbps": 1073.93616, "transferred": 594242276, "precopy-bytes": 594242276, "duplicate": 510156, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 588500992, "normal": 143677}}, "id": "KIrihPs8"}
2022-06-24-03:13:00: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "3CpRTBtr"}
2022-06-24-03:13:00: Host(10.0.160.25) Responding qmp command: {"return": {"expected-downtime": 1851, "status": "active", "setup-time": 418, "total-time": 10435, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 32710, "downtime-bytes": 0, "page-size": 4096, "remaining": 40591360, "postcopy-bytes": 0, "mbps": 1073.93616, "transferred": 1218196374, "precopy-bytes": 1218196374, "duplicate": 800398, "dirty-pages-rate": 6274, "skipped": 0, "normal-bytes": 1208631296, "normal": 295076}}, "id": "3CpRTBtr"}
2022-06-24-03:13:00: Host(10.0.160.25) Sending qmp command   : {"execute": "migrate-start-postcopy", "id": "BqYu0cZm"}
2022-06-24-03:13:00: Host(10.0.160.25) Responding qmp command: {"return": {}, "id": "BqYu0cZm"}
2022-06-24-03:13:00: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "fGu3bqI7"}
2022-06-24-03:13:00: Host(10.0.160.25) Responding qmp command: {"return": {"expected-downtime": 1851, "status": "active", "setup-time": 418, "total-time": 10436, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 32710, "downtime-bytes": 0, "page-size": 4096, "remaining": 40591360, "postcopy-bytes": 0, "mbps": 1073.93616, "transferred": 1218196374, "precopy-bytes": 1218196374, "duplicate": 800398, "dirty-pages-rate": 6274, "skipped": 0, "normal-bytes": 1208631296, "normal": 295076}}, "id": "fGu3bqI7"}
2022-06-24-03:13:05: Host(10.0.160.25) Sending qmp command   : {"execute": "query-migrate", "id": "MXGGD6WW"}
2022-06-24-03:13:06: Host(10.0.160.25) Responding qmp command: {"return": {"status": "completed", "setup-time": 418, "downtime": 372, "total-time": 11429, "ram": {"total": 4294967296, "postcopy-requests": 8, "dirty-sync-count": 3, "multifd-bytes": 0, "pages-per-second": 46451, "downtime-bytes": 0, "page-size": 4096, "remaining": 0, "postcopy-bytes": 192621379, "mbps": 1025.2246626101171, "transferred": 1410817753, "precopy-bytes": 1218196374, "duplicate": 800409, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1400877056, "normal": 342011}}, "id": "MXGGD6WW"}
2022-06-24-03:13:08: ======= Step 8.Repeat step6 =======
2022-06-24-03:13:08: Connect to monitor(10.0.160.24:4444) successfully.
2022-06-24-03:13:08: Closed the monitor(10.0.160.24:4444).
2022-06-24-03:13:08: [root@guest ~]# dmesg
Traceback (most recent call last):
  File "/home/ipa/runner.py", line 212, in _run
    getattr(self._case_dict[case], "run_case")(self._params)
  File "/home/ipa/virtkvmqe/stable_guest_abi_for_s390x_test_plan/test_scenarios/rhel_292442.py", line 213, in run_case
    src_serial.serial_check_dmesg()
  File "/home/ipa/monitor.py", line 753, in serial_check_dmesg
    getattr(self, '_check_dmesg_%s' % (
  File "/home/ipa/monitor.py", line 745, in _check_dmesg_linux
    output = self.serial_cmd_output(cmd=cmd, timeout=timeout)
  File "/home/ipa/monitor.py", line 633, in serial_cmd_output
    output = RemoteMonitor.recv_data_timeout(
  File "/home/ipa/monitor.py", line 265, in recv_data_timeout
    RemoteMonitor.test_error(self, err_info)
  File "/home/ipa/vm.py", line 93, in test_error
    raise usr_exceptions.Error(info)
usr_exceptions.Error: Failed to run "dmesg" under 60 sec, error info: "[    0.782221] Loading iSCSI transport class v2.0-870.

Comment 10 Li Xiaohui 2022-06-28 06:11:29 UTC
Didn't reproduce this bug on qemu-kvm-6.2.0-13.el9.s390x.rpm with the same kernel version as the Description: tried 60 times with huge pages configured, and all cases pass.

So it seems to be a regression bug in qemu.

Comment 11 Thomas Huth 2022-06-28 06:46:10 UTC
(In reply to Li Xiaohui from comment #10)
> Didn't reproduce this bug on qemu-kvm-6.2.0-13.el9.s390x.rpm with same
> kernel version as Description: try 60 times with huge pages configured, all
> cases pass.
> 
> So seems it's a regression bug in qemu

I just wanted to ask whether it's reproducible in an older version - thanks for reading my mind :-) ... this will make it a lot easier to find out where the problem comes from.

Comment 12 Thomas Huth 2022-07-05 11:15:09 UTC
(In reply to Li Xiaohui from comment #8)
> Tried 100 times to reproduce this bug on RHEL 8.7.0
> (kernel-4.18.0-402.el8.s390x &
> qemu-kvm-6.2.0-15.module+el8.7.0+15644+189a21f6.s390x) with same RHEL 9.1.0
> guest, but didn't hit the issue, hit a new call trace with 1/100 rate when
> boot guest on source host with huge page configured. Do we need to open a
> new bug?

Yes, please open a new BZ, since this seems to be an unrelated issue. Thanks!

Comment 13 Thomas Huth 2022-07-05 14:52:51 UTC
FWIW, I tried to reproduce this the whole day, and so far I've only seen the guest crashing once with:

[   61.492895] User process fault: interruption code 0010 ilc:3 in libc.so.6[3ffa5880000+1c8000]
[   61.492923] Failing address: 0000000000000000 TEID: 0000000000000800
[   61.492925] Fault in primary space mode while using user ASCE.
[   61.492928] AS:0000000082fc41c7 R3:0000000080bfc007 S:0000000000000020 
[   61.492934] CPU: 0 PID: 1217 Comm: migration_dirty Kdump: loaded Not tainted 5.14.0-114.el9.s390x #1
[   61.492939] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[   61.492941] User PSW : 0705200180000000 000003ffa590c062
[   61.492944]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[   61.492948] User GPRS: 0000000000000003 000003ffa590c050 000000000000002e 000000000000002e
[   61.492950]            00000000fbad2a84 0000000000000001 0000000001001df0 000003ffdcd7f368
[   61.492952]            00000000010007e0 0000000000000000 0000000000000001 000003ffdcd7efc8
[   61.492954]            000003ffa5b2ef78 000003ffa5b2f018 0000000001000868 000003ffdcd7ef28
[   61.492962] User Code: 000003ffa590c050: eb6ff0300024	stmg	%r6,%r15,48(%r15)
[   61.492962]            000003ffa590c056: c498000a06b5	lgrl	%r9,000003ffa5a4cdc0
[   61.492962]           #000003ffa590c05c: e3f0ff60ff71	lay	%r15,-160(%r15)
[   61.492962]           >000003ffa590c062: e3b090000004	lg	%r11,0(%r9)
[   61.492962]            000003ffa590c068: b90400a2		lgr	%r10,%r2
[   61.492962]            000003ffa590c06c: 9180b002		tm	2(%r11),128
[   61.492962]            000003ffa590c070: a7740043		brc	7,000003ffa590c0f6
[   61.492962]            000003ffa590c074: b24f0080		ear	%r8,%a0
[   61.492981] Last Breaking-Event-Address:
[   61.492982]  [<0000000000000001>] 0x1
[   61.494193] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[   61.494399] list_del corruption. next->prev should be 0000000083492f98, but was 0000000000000000
[   61.494416] ------------[ cut here ]------------
[   61.494417] kernel BUG at lib/list_debug.c:54!
[   61.494489] monitor event: 0040 ilc:2 [#1] SMP 
[   61.494493] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfio_ccw mdev vfio_iommu_type1 vfio drm fb fuse font i2c_core drm_panel_orientation_quirks xfs libcrc32c ghash_s390 prng aes_s390 sd_mod t10_pi sg des_s390 libdes sha3_512_s390 virtio_scsi virtio_net net_failover failover sha3_256_s390 pkey zcrypt
[   61.494518] CPU: 0 PID: 579 Comm: auditd Kdump: loaded Not tainted 5.14.0-114.el9.s390x #1
[   61.494520] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[   61.494521] Krnl PSW : 0704c00180000000 000000006d754012 (__list_del_entry_valid+0x82/0xb0)
[   61.494531]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[   61.494533] Krnl GPRS: c0000000ffffefff 0000000000000027 0000000000000054 00000000ffffefff
[   61.494534]            000003800077b6a8 000003800077b6a0 0000000085235a00 ffffffffffffff7f
[   61.494536]            ffffffff00000000 000000006e221150 0000000083492f98 0000000083492e80
[   61.494537]            0000000080b4e900 000003800077bdb0 000000006d75400e 000003800077b8c8
[   61.494541] Krnl Code: 000000006d754002: c020003095a2	larl	%r2,000000006dd66b46
[   61.494541]            000000006d754008: c0e5001ed240	brasl	%r14,000000006db2e488
[   61.494541]           #000000006d75400e: af000000		mc	0,0
[   61.494541]           >000000006d754012: b9040043		lgr	%r4,%r3
[   61.494541]            000000006d754016: b9040032		lgr	%r3,%r2
[   61.494541]            000000006d75401a: c02000309577	larl	%r2,000000006dd66b08
[   61.494541]            000000006d754020: c0e5001ed234	brasl	%r14,000000006db2e488
[   61.494541]            000000006d754026: af000000		mc	0,0
[   61.494551] Call Trace:
[   61.494553]  [<000000006d754012>] __list_del_entry_valid+0x82/0xb0 
[   61.494555] ([<000000006d75400e>] __list_del_en

Reproducibility is really low for me, even with huge pages activated ... do you see a way to increase the reproducibility, or could you provide me with some instructions for easy automation of this issue (so far I've only run it manually)?

Comment 16 IBM Bug Proxy 2022-07-18 18:00:50 UTC
------- Comment From jjherne.com 2022-07-18 13:51 EDT-------

So you tried to reproduce this bug with RHEL9.1, everything default, except that you downgraded qemu-kvm from qemu-kvm-6.2.0-15 to qemu-kvm-6.2.0-13? And when you did this, you could not reproduce the problem? I want to ensure I'm understanding exactly what you changed to make the problem go away.

If this is true, can you run this environment for a longer period of time to be really sure that the crash won't happen?

Comment 17 Li Xiaohui 2022-07-19 09:02:49 UTC
(In reply to IBM Bug Proxy from comment #16)
> ------- Comment From jjherne.com 2022-07-18 13:51 EDT-------
> 
> So you tried to reproduce this bug with RHEL9.1, everything default, except
> that you downgraded qemu-kvm from qemu-kvm-6.2.0-15 to qemu-kvm-6.2.0-13?
> And when you did this, you could not reproduce the problem? I want to ensure
> I'm understanding exactly what you changed to make the problem go away.

Not quite right. See the Description and comment 10: on RHEL 9.1.0 I can reproduce the bug with qemu-kvm-7.0.0-6.el9.s390x, but can't reproduce it on qemu-kvm-6.2.0-13, so I think qemu-kvm 7.0 introduced this problem.

> 
> If this is true, can you run this environment for a longer period of time to
> be really sure that the crash won't happen?

I would try more times on qemu-kvm-6.2.0-13 to ensure the crash won't happen.

Comment 18 Thomas Huth 2022-07-26 09:46:51 UTC
(In reply to Li Xiaohui from comment #17)
> I would try more times on qemu-kvm-6.2.0-13 to ensure the crash won't happen.

I have now finally got an environment set up for reproducing the problem (thanks to bfu), and I think I can confirm the findings so far:

I ran the test 15 times with qemu-kvm-7.0.0-8.el9.s390x on the target host, and it failed 3 times.

Then I ran the test 15 times with qemu-kvm-6.2.0-13.el9.s390x on the target host, and all runs were successful in that case.

I'll try to reproduce the issue with upstream QEMU - if it also happens there, I should be able to bisect the issue.

Comment 19 IBM Bug Proxy 2022-07-27 15:00:51 UTC
------- Comment From jjherne.com 2022-07-27 10:55 EDT-------
(In reply to comment #8)
> (In reply to Li Xiaohui from comment #17)
> > I would try more times on qemu-kvm-6.2.0-13 to ensure the crash won't happen.
> I have now finally also an environment set up for reproducing the problem
> (thanks to bfu), and I think I can confirm the findings so far:
> I ran the test 15 times with qemu-kvm-7.0.0-8.el9.s390x on the target host
> and 3 times it was failing.
> Then I ran the test 15 times with qemu-img-6.2.0-13.el9.s390x on the target
> host, and all runs were successful in that case.
> I'll try to reproduce the issue with upstream QEMU - if it also happens
> there, I should be able to bisect the issue.

Thomas,
Thanks for taking a look at this :). I was unable to quickly reproduce the bug. I'll hold off on reproducing the target environment since you seem to have the environment nailed down. But I'm happy to jump in if you need anything. Just keep me posted on your progress. Thanks again :)

Comment 20 Thomas Huth 2022-07-28 12:10:10 UTC
FWIW, I think we definitely want this commit in downstream, too:

 https://gitlab.com/qemu-project/qemu/-/commit/007e179ef0e97eafda4c9ff
 "multifd: Copy pages before compressing them with zlib"

Not sure yet whether that's the only issue that we're currently facing, or whether there are more bugs lurking around ... I'm still testing...

Comment 21 Thomas Huth 2022-07-29 16:11:43 UTC
The zlib patch was indeed not the culprit - it seems to break with one of the commits at the end of the merge request that has been merged with commit ID d90e6f665d3ac197f83d93.

So far, two of the lengthy bisect runs point to this commit here being the first bad commit:

cfd66f30fb0f735df06ff4220e5000290a43dad3 is the first bad commit                                                                    
commit cfd66f30fb0f735df06ff4220e5000290a43dad3                                               
Author: Peter Xu <peterx>                                                                                               
Date:   Wed Jan 19 16:09:19 2022 +0800                                                                                             
                                                                                                                                    
    migration: Simplify unqueue_page()

Peter, David, Juan, could it be that this modification had some unforeseen side effects on s390x?

Comment 22 Peter Xu 2022-07-29 23:12:36 UTC
Thomas & all, sorry to have caused such a problem.

What's the host (getpagesize()) / guest (TARGET_PAGE_SIZE) page size for s390x?  That's the most suspicious thing I'd like to know, given that it can reproduce without huge pages.  I'd expect that for the whole process of remote page faults we should be using the host page size all along the way (e.g. UFFDIO_COPY uses qemu_ram_pagesize(), which AFAICT is the host page size, and IIUC that's the only way we can do atomic pgtable updates - we can only atomically update the pgtable at host page size granularity), but maybe there's something I overlooked so they can differ.

Comment 23 Thomas Huth 2022-08-01 06:14:59 UTC
FWIW, I've done another bisection run over the weekend, this time with even more retries for each commit (50 instead of 20), just to be sure that I ended up at the right place. And also this run showed that "migration: Simplify unqueue_page()" is the first bad commit, so the problem must have been introduced here indeed.
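
For anyone repeating this: the bisection can be driven along the lines below. This is only a sketch -- the "bad" endpoint is the merge commit from comment 21, the "good" endpoint is the last working commit mentioned in comment 25, and run-test.sh stands for a (hypothetical) script that runs the postcopy reproducer the chosen number of times (20-50 here) and exits non-zero on the first guest crash.

git bisect start d90e6f665d3ac197f83d93 a1fe28df7547120b
git bisect run ./run-test.sh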

> What's the host (getpagesize() / guest (TARGET_PAGE_SIZE) page size for s390x? 

Just to be sure, I added some debugging code to the unqueue_page() function, and all values are 4096: qemu_ram_pagesize(block) returns 4096, getpagesize() returns 4096, and TARGET_PAGE_SIZE is 4096.
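
(That check boils down to a throw-away debug print inside unqueue_page() in migration/ram.c, along these lines -- a sketch, not the exact code that was added:)

    /* temporary debug output: compare RAM block, host and target page sizes */
    fprintf(stderr, "unqueue_page: ram_pagesize=%zu host_pagesize=%d target_pagesize=%d\n",
            qemu_ram_pagesize(block), getpagesize(), (int)TARGET_PAGE_SIZE);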

Comment 24 Dr. David Alan Gilbert 2022-08-01 08:24:10 UTC
I'd be tempted to

  1) add back the tracing after unqueue_page that prints whether it's dirty or not.
     The other loops should still try again as far as I can tell

  2) When it hangs, is there anything left in the queue on the source?

Dave

Comment 25 Thomas Huth 2022-08-01 09:40:01 UTC
(In reply to Dr. David Alan Gilbert from comment #24)
> I'd be tempted to
> 
>   1) add back the tracing after unqueue_page that prints whether it's dirty
> or not.
>      The other loops should still try again as far as I can tell

I checked out commit a1fe28df7547120b (the last working one) and added some test there (since I don't know how to get to the trace points in the complicated test harness), and indeed, I can see that there are 7 cases where "block" is not NULL and "dirty" is true.
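
(Conceptually, the added test re-creates the dirty-bit check that the removed trace point used to report, roughly like the snippet below. The local names follow the upstream code in migration/ram.c; this is a sketch, not the exact instrumentation:)

    RAMBlock *block = unqueue_page(rs, &offset);
    if (block) {
        /* is the requested page still marked dirty in this block's migration bitmap? */
        bool dirty = test_bit(offset >> TARGET_PAGE_BITS, block->bmap);
        fprintf(stderr, "unqueue_page: block=%s offset=0x%" PRIx64 " dirty=%d\n",
                block->idstr, (uint64_t)offset, dirty);
    }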

>   2) When it hangs, is there anything left in the queue on the source?

It doesn't hang - the migration finishes "successfully" and then the guest sometimes crashes later.

Given that this problem also occurs upstream, and upstream QEMU is already in freeze mode before the release, should we maybe simply revert the patch upstream first (and then downstream), and have another try with this optimization in the QEMU 7.2 release cycle instead?

 Thomas

Comment 26 Dr. David Alan Gilbert 2022-08-01 09:50:31 UTC
(In reply to Thomas Huth from comment #25)
> (In reply to Dr. David Alan Gilbert from comment #24)
> > I'd be tempted to
> > 
> >   1) add back the tracing after unqueue_page that prints whether it's dirty
> > or not.
> >      The other loops should still try again as far as I can tell
> 
> I checked out commit a1fe28df7547120b (the last working one) and added some
> test there (since I don't know how to get to the trace points in the
> complicated test harness), and indeed, I can see that there are 7 cases
> where "block" is not NULL and "dirty" is true.

What I'm curious about is whether, in your failing case, you're seeing any block!=NULL, dirty=false

> >   2) When it hangs, is there anything left in the queue on the source?
> 
> It doesn't hang - the migration finishes "successfully" and then the guest
> sometimes crashes later.

But at the point the migration goes to complete on the source, is that queue empty?

> 
> Given the fact that this problem also occurs in upstream, and QEMU is in
> freeze mode before the release already there, should we maybe simply revert
> the patch upstream first (and then downstream), and have another try with
> this optimization in the QEMU 7.2 release cycle instead?

Does it cleanly revert though? I thought there were loads of other patches on top that were reorganising that code?

>  Thomas

Comment 27 Thomas Huth 2022-08-01 10:25:22 UTC
(In reply to Dr. David Alan Gilbert from comment #26)
> (In reply to Thomas Huth from comment #25)
> > (In reply to Dr. David Alan Gilbert from comment #24)
> > > I'd be tempted to
> > > 
> > >   1) add back the tracing after unqueue_page that prints whether it's dirty
> > > or not.
> > >      The other loops should still try again as far as I can tell
> > 
> > I checked out commit a1fe28df7547120b (the last working one) and added some
> > test there (since I don't know how to get to the trace points in the
> > complicated test harness), and indeed, I can see that there are 7 cases
> > where "block" is not NULL and "dirty" is true.
> 
> What I'm curious of, is in your failing case whether you're seeing any
> block!=NULL, dirty=false

Yes, there were also two cases with block!=NULL and dirty=false !

> > >   2) When it hangs, is there anything left in the queue on the source?
> > 
> > It doesn't hang - the migration finishes "successfully" and then the guest
> > sometimes crashes later.
> 
> But at the point the migration goes to complete on the source, is that queue
> empty?

Sorry, now you've lost me ... I thought we were talking about the QEMU on the destination side only here, since that's the one where it matters whether the bad commit is included or not?

Anyway, maybe it's easier if I provide you with access to the system ... ping me on IRC if you want to have a look on your own there.

> > 
> > Given the fact that this problem also occurs in upstream, and QEMU is in
> > freeze mode before the release already there, should we maybe simply revert
> > the patch upstream first (and then downstream), and have another try with
> > this optimization in the QEMU 7.2 release cycle instead?
> 
> Does it cleanly revert though? I thought there were loads of other patches
> on top that were reorganising that code?

The bad commit is pretty much at the end of that merge request, so it's fortunately rather on top of the reorganising patches. As far as I can see, it reverts fine apart from the changes to the trace-events file, but that should be easy to fix up manually. I'll check whether a revert of the patch fixes the problem on QEMU's master branch...

Comment 28 Dr. David Alan Gilbert 2022-08-01 11:42:33 UTC
(In reply to Thomas Huth from comment #27)
> (In reply to Dr. David Alan Gilbert from comment #26)
> > (In reply to Thomas Huth from comment #25)
> > > (In reply to Dr. David Alan Gilbert from comment #24)
> > > > I'd be tempted to
> > > > 
> > > >   1) add back the tracing after unqueue_page that prints whether it's dirty
> > > > or not.
> > > >      The other loops should still try again as far as I can tell
> > > 
> > > I checked out commit a1fe28df7547120b (the last working one) and added some
> > > test there (since I don't know how to get to the trace points in the
> > > complicated test harness), and indeed, I can see that there are 7 cases
> > > where "block" is not NULL and "dirty" is true.
> > 
> > What I'm curious of, is in your failing case whether you're seeing any
> > block!=NULL, dirty=false
> 
> Yes, there were also two cases with block!=NULL and dirty=false !
> 
> > > >   2) When it hangs, is there anything left in the queue on the source?
> > > 
> > > It doesn't hang - the migration finishes "successfully" and then the guest
> > > sometimes crashes later.
> > 
> > But at the point the migration goes to complete on the source, is that queue
> > empty?
> 
> Sorry, now you've lost me ... I thought we were talking about the QEMU on
> the destination side only here, since that's the one where it matters
> whether the bad commit is included or not?

That patch only touches sending side code!

Dave

> Anyway, maybe it's easier if I provide you with access to the system ...
> ping me on IRC if you want to have a look on your own there.
> 
> > > 
> > > Given the fact that this problem also occurs in upstream, and QEMU is in
> > > freeze mode before the release already there, should we maybe simply revert
> > > the patch upstream first (and then downstream), and have another try with
> > > this optimization in the QEMU 7.2 release cycle instead?
> > 
> > Does it cleanly revert though? I thought there were loads of other patches
> > on top that were reorganising that code?
> 
> The bad commit is pretty much at the end of that merge request, so it's
> fortunately rather on top of the reorganising patches. As far as I can see,
> it reverts fine apart from the changes to the trace-events file, but that
> should be easy to fix up manually. I'll check whether a revert of the patch
> fixes the problem on QEMU's master branch...

Comment 29 Thomas Huth 2022-08-01 15:49:14 UTC
(In reply to Dr. David Alan Gilbert from comment #28)
> (In reply to Thomas Huth from comment #27)
[...]
> > Sorry, now you've lost me ... I thought we were talking about the QEMU on
> > the destination side only here, since that's the one where it matters
> > whether the bad commit is included or not?
> 
> That patch only touches sending side code!

For the record: the problem is indeed occurring on the sending side - I missed the fact that the test script does a ping-pong migration, first from the source to the destination and then back to the source.

Comment 30 Thomas Huth 2022-08-02 05:10:41 UTC
Reverting cfd66f30fb0f735df06ff4220e5000290a43dad3 on top of QEMU's master branch fixes the issue in upstream QEMU. I did some test runs: with the clean, unmodified master branch it already fails multiple times out of 5 runs, but with the patch reverted I got 49 successful runs out of 50, and the single failure was due to a bug in the test script instead.

Thus I'd recommend reverting the patch upstream for QEMU 7.1, and maybe reintroducing it in a fixed form later, once we understand what was really going wrong here. I'll send a patch.

Comment 33 Li Xiaohui 2022-08-03 02:53:47 UTC
Tested the VIRT-86251 case 150 times on qemu-img-7.0.0-9.el9._bz2099934v2.s390x; all runs passed without hitting any issues.

Comment 36 Li Xiaohui 2022-08-16 01:19:32 UTC
Ran the VIRT-86251 case 200 times on qemu-kvm-7.0.0-10.el9.s390x; all runs passed. 

We can mark this bug verified per the test results.

Comment 39 bfu 2022-08-16 07:21:18 UTC
RESULTS [RHEL-292436]:
==>TOTAL : 80
==>PASS : 60
   1: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (13 min 10 sec)
   2: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 45 sec)
   3: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 36 sec)
   4: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 9 sec)
   5: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 9 sec)
   6: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 17 sec)
   7: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (13 min 3 sec)
   8: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (13 min 22 sec)
   9: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (13 min 12 sec)
   10: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 10 sec)
   11: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 13 sec)
   12: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 9 sec)
   13: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (12 min 49 sec)
   14: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 21 sec)
   15: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 34 sec)
   16: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min
 9 sec)
   17: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 9 sec)
   18: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 13 sec)
   19: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (12 min 26 sec)
   20: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 41 sec)
   21: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 20 sec)
   22: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 9 sec)
   23: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 5 sec)
   24: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 5 sec)
   25: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (12 min 29 sec)
   26: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 21 sec)
   27: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (13 min 22 sec)
   28: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 9 sec)
   29: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 9 sec)
   30: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 9 sec)
   31: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (12 min 37 sec)
   32: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 15 sec)
   33: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 21 sec)
   34: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 9 sec)
   35: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 9 sec)
   36: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 13 sec)
   37: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (10 min 42 sec)
   38: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 16 sec)
   39: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 41 sec)
   40: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 5 sec)
   41: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 5 sec)
   42: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 9 sec)
   43: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (12 min 9 sec)
   44: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 11 sec)
   45: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 11 sec)
   46: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 9 sec)
   47: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 9 sec)
   48: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 9 sec)
   49: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (13 min 10 sec)
   50: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (12 min 5 sec)
   51: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (12 min 1 sec)
   52: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 5 sec)
   53: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 23 sec)
   54: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (9 min 20 sec)
   55: RHEL-292438-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.4.0" machine type (11 min 19 sec)
   56: RHEL-292441-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.5.0" machine type (11 min 22 sec)
   57: RHEL-292443-[stable guest abi][s390x]forward and backward migration with "s390-ccw-virtio-rhel8.6.0" machine type (13 min 50 sec)
   58: RHEL-292440-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.4.0" machine type (8 min 29 sec)
   59: RHEL-292442-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.5.0" machine type (8 min 21 sec)
   60: RHEL-292444-[stable guest abi][s390x]forward and backward postcopy migration with "s390-ccw-virtio-rhel8.6.0" machine type (8 min 22 sec)

Based on these test results, setting this bz to verified.

Comment 41 errata-xmlrpc 2022-11-15 09:54:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7967

