Bug 1653569 - Stress guest and stop it, then do live migration, guest hit call trace on destination end
Summary: Stress guest and stop it, then do live migration, guest hit call trace on des...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.0
Hardware: s390x
OS: Linux
high
medium
Target Milestone: rc
: 8.0
Assignee: David Hildenbrand
QA Contact: xianwang
URL:
Whiteboard:
Depends On:
Blocks: 1661167
TreeView+ depends on / blocked
 
Reported: 2018-11-27 07:15 UTC by xianwang
Modified: 2019-06-14 00:48 UTC (History)
16 users (show)

Fixed In Version: qemu-kvm-2.12.0-54.module+el8+2640+20cc81e1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1659127 1661167 (view as bug list)
Environment:
Last Closed: 2019-06-14 00:48:03 UTC
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
core file for comment7 (384.79 KB, application/octet-stream)
2018-11-28 08:20 UTC, xianwang
no flags Details

Description xianwang 2018-11-27 07:15:16 UTC
Description of problem:
Stress guest and stop guest, then do live migration, guest hit call trace on destination end. This issue also could be hit on local migration. If without "stress" in guest or without "stop" guest, this issue can't be reproduced.

Version-Release number of selected component (if applicable):
Host:
4.18.0-40.el8.s390x
qemu-kvm-2.12.0-42.module+el8+2173+537e5cb5.s390x

Guest:
4.18.0-44.el8.s390x

How reproducible:
100%

Steps to Reproduce:
1.Boot a guest on src end
2.Boot a guest on dst end with incoming mode
3.In guest: execute "stress --cpu 2 --io 2 --vm 2 --vm-bytes 64M" 
4.On src, stop guest and then do live migration
{"execute":"stop"}
{"execute": "migrate", "arguments": {"uri": "tcp:10.16.106.172:5801"}}
5.After migration completed, execute "cont" on dst end
{"execute":"cont"}

Actual results:
Guest hit call trace as following,"dd" and "reboot" work well.

[ 2667.765246] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2667.765269] rcu: 	1-...!: (1 ticks this GP) idle=3c2/1/0x4000000000000002 softirq=84260/84260 fqs=0 
[ 2667.765271] rcu: 	 (t=17921 jiffies g=20885 q=1)
[ 2667.765276] rcu: rcu_sched kthread starved for 17921 jiffies! g20885 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 2667.765278] rcu: RCU grace-period kthread stack dump:
[ 2667.765280] rcu_sched       I    0    10      2 0x00000000
[ 2667.765284] Call Trace:
[ 2667.765294] ([<0000000000784066>] __schedule+0x2ce/0x8e8)
[ 2667.765296]  [<00000000007846ca>] schedule+0x4a/0xb8 
[ 2667.765299]  [<0000000000788586>] schedule_timeout+0x1ce/0x458 
[ 2667.765306]  [<00000000001bf9fc>] rcu_gp_kthread+0x2fc/0x630 
[ 2667.765310]  [<00000000001696ac>] kthread+0x14c/0x168 
[ 2667.765313]  [<0000000000789672>] kernel_thread_starter+0xa/0x10 
[ 2667.765315]  [<0000000000789668>] kernel_thread_starter+0x0/0x10 
[ 2667.765320] Task dump for CPU 1:
[ 2667.765321] stress          R  running task        0  6995   6989 0x02000000
[ 2667.765324] Call Trace:
[ 2667.765330] ([<000000007fff3700>] 0x7fff3700)
[ 2667.765331]  [<0000000077123c70>] 0x77123c70 
[ 2667.765335] ([<00000000002d2aa6>] clear_huge_page+0x1ce/0x320)
[ 2667.765340]  [<0000000000314fdc>] do_huge_pmd_anonymous_page+0x1bc/0x6c0 
[ 2667.765342]  [<00000000002d228a>] __handle_mm_fault+0x9ea/0xa98 
[ 2667.765344]  [<00000000002d243a>] handle_mm_fault+0x102/0x1c0 
[ 2667.765349]  [<00000000001248b4>] do_dat_exception+0x194/0x3e0 
[ 2667.765351]  [<0000000000789818>] pgm_check_handler+0x1a0/0x1f4 

Expected results:
Guest should not call trace.

Additional info:
1)I have tried this scenario on power9, it does not exist this issue
qemu cli:
/usr/libexec/qemu-kvm \
   -name 'avocado-vt-vm1'  \
   -sandbox off  \
   -machine s390-ccw-virtio-rhel7.6.0  \
   -nodefaults  \
   -nographic \
   -mon chardev=serial_id_serial0,mode=readline \
   -chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
   -device sclpconsole,id=serial111,chardev=serial_id_serial0 \
   -object iothread,id=iothread0 \
   -device virtio-scsi-ccw,id=virtio_scsi_ccw0,iothread=iothread0 \
   -drive id=drive1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/xianwang/mount_point/rhel80-s390x-virtio-scsi.qcow2 \
   -device scsi-hd,bus=virtio_scsi_ccw0.0,drive=drive1,id=hd1,bootindex=1 \
   -m 2048 \
   -smp 2  \
   -cpu 'host'  \
   -vnc :1  \
   -qmp tcp:0:8881,server,nowait \
   -rtc base=utc,clock=host,driftfix=slew \
   -boot menu=on \
   -enable-kvm \

Comment 1 xianwang 2018-11-27 08:08:14 UTC
I have tried several times on power9(both two hosts migration and local migration) and didn't hit this issue.
Host:
4.18.0-32.el8.ppc64le
qemu-kvm-2.12.0-41.el8+2104+3e32e6f8.ppc64le
SLOF-20170724-5.git89f519f.el8+5.ppc64le

Guest:
4.18.0-32.el8.ppc64le

Comment 2 David Hildenbrand 2018-11-27 08:58:56 UTC
Just to make sure, are you running the exact same kernel in the source and target host?

Comment 3 David Hildenbrand 2018-11-27 09:15:20 UTC
Another set of questions:

1. Can it be reproduced without the stop?
2. How long is the guest stopped? (time between stop and cont on destination)

Comment 4 David Hildenbrand 2018-11-27 09:27:41 UTC
I am pretty sure this is related to us not loading/restoring the actual TOD clock when stopping/starting the VM. PPC does this. My assumption is true if getting rid of the stop fixes the issue.

Comment 5 xianwang 2018-11-27 09:32:12 UTC
> Another set of questions:
> 
> 1. Can it be reproduced without the stop?
No, if without "stop" it can't be hit; if without "stress", it also can't be hit.
> 2. How long is the guest stopped? (time between stop and cont on destination)
Although I didn't calculate the time, because there is "migration" between "stop" and "cont", I think maybe it is about 2 minutes. 

what's more, the kernel and qemu build between src and dst hosts are same; this issue is also reproduced on local migration(src host is dst host).

Comment 6 David Hildenbrand 2018-11-27 11:01:35 UTC
Okay, so I am pretty sure the problem is that we don't actually stop the guest clock (TOD) while not running the guest. I am working on a QEMU fix.

If so, the problem can be reproduced by not migrating but simply doing

{"execute":"stop"}
... wait 2 minutes ...
{"execute":"cont"}

Nice finding! :)

Comment 7 xianwang 2018-11-28 08:05:51 UTC
(In reply to David Hildenbrand from comment #6)
> Okay, so I am pretty sure the problem is that we don't actually stop the
> guest clock (TOD) while not running the guest. I am working on a QEMU fix.
> 
> If so, the problem can be reproduced by not migrating but simply doing
> 
> {"execute":"stop"}
> ... wait 2 minutes ...
> {"execute":"cont"}
> 
> Nice finding! :)

Yes, you are right, I have tried {"execute":"stop"}--wait 7 minutes--{"execute":"cont"}, without migration, then reproduce it as following and there is a core file generated:

[  885.086312] rcu: INFO: rcu_sched self-detected stall on CPU
[  885.086335] rcu: 	0-...!: (1 GPs behind) idle=a46/1/0x4000000000000002 softirq=19869/19872 fqs=0 
[  885.086337] rcu: 	 (t=70414 jiffies g=3441 q=939)
[  885.086345] rcu: rcu_sched kthread starved for 70414 jiffies! g3441 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[  885.086346] rcu: RCU grace-period kthread stack dump:
[  885.086349] rcu_sched       I    0    10      2 0x00000000
[  885.086353] Call Trace:
[  885.086368] ([<0000000000784066>] __schedule+0x2ce/0x8e8)
[  885.086370]  [<00000000007846ca>] schedule+0x4a/0xb8 
[  885.086373]  [<0000000000788586>] schedule_timeout+0x1ce/0x458 
[  885.086382]  [<00000000001bf9fc>] rcu_gp_kthread+0x2fc/0x630 
[  885.086386]  [<00000000001696ac>] kthread+0x14c/0x168 
[  885.086390]  [<0000000000789672>] kernel_thread_starter+0xa/0x10 
[  885.086392]  [<0000000000789668>] kernel_thread_starter+0x0/0x10 
[  885.086397] Task dump for CPU 0:
[  885.086398] gdbus           R  running task        0  2102   2094 0x02000000
[  885.086403] Call Trace:
[  885.086408] ([<04000000007004c0>] 0x4000000007004c0)
[  885.086409] Task dump for CPU 1:
[  885.086411] gnome-shell     R  running task        0  2105   2098 0x02000000
[  885.086414] Call Trace:
[  885.086416] ([<000000007b753800>] 0x7b753800)
[  889.892214] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
[  889.892576] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[  889.894103] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[  889.894173] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[  889.894330] systemd[1]: Stopped Flush Journal to Persistent Storage.
[  889.894357] systemd[1]: Stopping Flush Journal to Persistent Storage...
[  889.894363] systemd[1]: Stopped Journal Service.
[  889.897644] systemd[1]: Starting Journal Service...
[  889.918854] systemd-journald[2133]: File /run/log/journal/42b29c3f22be43d1b910415752bf6f31/system.journal corrupted or uncleanly shut down, renaming and replacing.
[  890.098641] systemd[1]: Started Journal Service.
[  892.212559] systemd-coredump[2113]: MESSAGE=Process 757 (systemd-journal) of user 0 dumped core.
[  892.212568] systemd-coredump[2113]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.c38ccab115aa4e07871f3e480d2f9742.757.1543392185000000.lz4
[  892.212572] systemd-coredump[2113]: Stack trace of thread 757:
[  892.212575] systemd-coredump[2113]: #0  0x000003ffbaeff7ee epoll_wait (libc.so.6)
[  892.212578] systemd-coredump[2113]: #1  0x000003ffbb364604 sd_event_wait (libsystemd-shared-239.so)
[  892.212581] systemd-coredump[2113]: #2  0x000003ffbb365628 sd_event_run (libsystemd-shared-239.so)
[  892.212584] systemd-coredump[2113]: #3  0x000002aa33e8767a main (systemd-journald)
[  892.212587] systemd-coredump[2113]: #4  0x000003ffbae24172 __libc_start_main (libc.so.6)
[  892.212590] systemd-coredump[2113]: #5  0x000002aa33e8789c _start (systemd-journald)

Comment 8 xianwang 2018-11-28 08:20:20 UTC
Created attachment 1509410 [details]
core file for comment7

Comment 16 Danilo Cesar Lemes de Paula 2019-01-08 18:14:35 UTC
Fix included in qemu-kvm-2.12.0-54.module+el8+2640+20cc81e1

Comment 18 xianwang 2019-01-11 09:06:22 UTC
Bug verify:
Host:
4.18.0-56.el8.s390x
qemu-kvm-2.12.0-55.module+el8+2645+07371192.s390x

Guest:
4.18.0-34.el8.s390x

steps are same with bug report.

result:
migration completed, vm works well, no call trace.

So, this bug is fixed.


Note You need to log in before you can comment on or make changes to this bug.