Bug 1396546 - migration/postcopy: Serial hang after migration [NEEDINFO]
Summary: migration/postcopy: Serial hang after migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: pre-dev-freeze
Target Release: 8.2
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-18 15:35 UTC by Dr. David Alan Gilbert
Modified: 2021-03-22 11:00 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 07:27:39 UTC
Type: Bug
Target Upstream Version:
xiaohli: needinfo? (dgilbert)



Description Dr. David Alan Gilbert 2016-11-18 15:35:38 UTC
Description of problem:
Processes writing to the serial console in the guest are stuck after a postcopy migration.  I've got stressapptest running in the (f20) guest and it's not giving any output.  I can ssh into the guest and look at its stack, and it's stuck in tty output.
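(For reference, a sketch of how the stuck writer can be inspected from inside the guest over ssh, assuming a single stressapptest process; exact stack contents will vary.)
    # cat /proc/$(pidof stressapptest)/wchan    # kernel function the task is sleeping in
    # cat /proc/$(pidof stressapptest)/stack    # full kernel stack; here it was blocked in the tty output path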

Version-Release number of selected component (if applicable):
2.6.0.27
(I can see it happens on upstream 2.5.1 and 2.6.2, but it seems to be fixed in 2.7.0)

How reproducible:
100%

Steps to Reproduce:
1. Start an f20 guest with a qemu command line like:
    -machine pc,accel=kvm -m 8192 -smp 6 -nographic -drive id=image,file=/home/vms/f20.qcow2,cache=none -serial pipe:/tmp/guest-serial-35728 -monitor pipe:/tmp/guest-hmp-35728
2. In the guest, run something that writes to the console, e.g. stressapptest
3. Postcopy migrate (see the HMP sketch below)
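(For reference, a minimal sketch of the postcopy sequence over HMP; the destination host name "desthost" and port 4444 are placeholders, and the destination QEMU is assumed to be started with the same cli plus "-incoming tcp:0:4444". The capability is named postcopy-ram in 2.6 and later; older builds may use an experimental name.)
On both the source and destination monitors:
    (qemu) migrate_set_capability postcopy-ram on
On the source monitor:
    (qemu) migrate -d tcp:desthost:4444
    (qemu) migrate_start_postcopy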

Actual results:
The VM is alive, but anything writing to the serial console is stuck.

Expected results:
Serial output comes out.

Additional info:

Comment 6 Ademar Reis 2020-02-05 22:43:12 UTC
QEMU has recently been split into sub-components and, as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Comment 9 RHEL Program Management 2020-12-01 07:27:39 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 10 Li Xiaohui 2021-01-26 13:30:01 UTC
Didn't reproduce this bz on rhel-8.4.0-av hosts (kernel-4.18.0-277.el8.x86_64 & qemu-kvm-5.2.0-3.module+el8.4.0+9499+42e58f08.x86_64):


test steps:
1. Boot a guest with a serial console:
/usr/libexec/qemu-kvm \
-machine q35 \
-cpu EPYC \
-nodefaults \
-device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel840-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096,slots=1,maxmem=6G \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-enable-kvm  \
-device VGA \
-vnc :10 \
-rtc base=localtime,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-qmp tcp:0:3333,server,nowait \
-serial pipe:/home/test/guest-serial-35728 \                 ----> the serial device; /home/test is shared with the dst host
-monitor stdio \
2. Connect the serial console on the src host:
# cat /home/test/guest-serial-35728
3. Boot a guest on the dst host with the same cli plus "-incoming defer";
4. Connect the serial console on the dst host:
# cat /home/test/guest-serial-35728
5. After the guest starts, run stressapptest in the guest (connected via remote-viewer):
# stressapptest -M 80 -s 1000000
6. Enable the postcopy capability on src and dst, then start postcopy migration (see the QMP sketch below).
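(A sketch of the QMP sequence for step 6, using the "-qmp tcp:0:3333" socket from step 1 and "-incoming defer" from step 3; the migration port 4444 and the destination host name "dst-host" are placeholders.)
On both the src and dst QMP connections:
    {"execute": "qmp_capabilities"}
    {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "postcopy-ram", "state": true}]}}
On the dst QMP connection:
    {"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4444"}}
On the src QMP connection:
    {"execute": "migrate", "arguments": {"uri": "tcp:dst-host:4444"}}
    {"execute": "migrate-start-postcopy"}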


Actual result:
Postcopy migration succeeds; the serial console on the src host before migration and the serial console on the dst host after migration both work well, with no serial hang.


I have run the above test twice and both runs passed.
Dave, could you check the above steps? If there is no problem, could we close this bz as CURRENTRELEASE since I couldn't reproduce it now?

Comment 11 Dr. David Alan Gilbert 2021-01-26 17:14:51 UTC
(In reply to Li Xiaohui from comment #10)
> Didn't reproduce this bz on rhel-8.4.0-av hosts
> (kernel-4.18.0-277.el8.x86_64 & qemu-kvm-5.2.0-3.module+el8.4.0+9499+42e58f08.x86_64):
> 
> [...]
> 
> I have run the above test twice and both runs passed.
> Dave, could you check the above steps? If there is no problem, could we close
> this bz as CURRENTRELEASE since I couldn't reproduce it now?

Try dropping the video connection, so that you actually log in and have the console on serial.

Comment 12 Li Xiaohui 2021-01-30 14:11:04 UTC
(In reply to Dr. David Alan Gilbert from comment #11)
> (In reply to Li Xiaohui from comment #10)
> > Didn't reproduce this bz on rhel-8.4.0-av hosts
> > (kernel-4.18.0-277.el8.x86_64 & qemu-kvm-5.2.0-3.module+el8.4.0+9499+42e58f08.x86_64):
> 
> > I have run the above test twice and both runs passed.
> > Dave, could you check the above steps? If there is no problem, could we close
> > this bz as CURRENTRELEASE since I couldn't reproduce it now?
> 
> Try dropping the video connection, so that you actually log in and have the
> console on serial.

Thank you, Dave.


Tried again with the correct cli and steps:
1. Boot a guest with '-nographic' and a serial console (a quick check of the virtserialport channel is sketched after step 4):
[root@hp-dl385g10-09 home]# sh 1.sh 
/usr/libexec/qemu-kvm  \
-machine q35,accel=kvm \
-cpu EPYC \
-nographic \
-device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel840-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096,slots=1,maxmem=6G \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-qmp tcp:0:3333,server,nowait \
-device virtio-serial-pci,bus=root3 \
-chardev pipe,id=ch0,path=/home/test/guest-serial-35728 \
-device virtserialport,chardev=ch0,name=serial1 \
2. After step 1, the session from step 1 is the console session; I could log in to the guest and run stressapptest:
Red Hat Enterprise Linux 8.4 Beta (Ootpa)
Kernel 4.18.0-259.el8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

localhost login: root
Password: 
Last login: Sat Jan 30 21:43:27 from 10.72.13.227
[root@localhost ~]# stressapptest -M 200 -s 1000000
2021/01/30-21:43:40(CST) Log: Commandline - stressapptest -M 200 -s 1000000
2021/01/30-21:43:40(CST) Stats: SAT revision 1.0.9_autoconf, 64 bit binary
2021/01/30-21:43:40(CST) Log: root @ localhost.localdomain on Sat Jan 30 21:35:47 CST 20e
2021/01/30-21:43:40(CST) Log: 1 nodes, 4 cpus.
2021/01/30-21:43:40(CST) Log: Defaulting to 4 copy threads
2021/01/30-21:43:40(CST) Log: Prefer plain malloc memory allocation.
2021/01/30-21:43:40(CST) Log: Using mmap() allocation at 0x7f1faf800000.
2021/01/30-21:43:40(CST) Stats: Starting SAT, 200M, 1000000 seconds
2021/01/30-21:43:40(CST) Log: region number 8 exceeds region count 1
2021/01/30-21:43:40(CST) Log: Region mask: 0x1
2021/01/30-21:43:50(CST) Log: Seconds remaining: 999990
2021/01/30-21:44:00(CST) Log: Seconds remaining: 999980
2021/01/30-21:44:10(CST) Log: Seconds remaining: 999970
3. Boot a guest on the dst host with '-incoming defer':
[root@hp-dl385g10-10 home]# sh 1.sh 
4. Send qmp_capabilities on the src and dst QMP connections, enable the postcopy capability on both, then start postcopy migration.
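(For reference, a quick sanity check of the virtserialport channel configured in step 1. This is a sketch with two assumptions: udev exposes the port in the guest as /dev/virtio-ports/serial1, following the name=serial1 property, and the host-side pipe at /home/test/guest-serial-35728 is a single FIFO.)
In the guest:
    # echo hello > /dev/virtio-ports/serial1
On the host:
    # cat /home/test/guest-serial-35728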


Actual result:
The console on the dst host works well; the stressapptest log prints correctly:
[root@hp-dl385g10-10 home]# sh 1.sh 
2021/01/30-21:45:40(CST) Log: Seconds remaining: 999880

2021/01/30-21:45:50(CST) Log: Seconds remaining: 999870
2021/01/30-21:46:00(CST) Log: Seconds remaining: 999860
2021/01/30-21:46:10(CST) Log: Seconds remaining: 999850
...


I didn't use the same qemu cli "-serial pipe:/home/test/guest-serial-35728" (but used another serial pipe setup, as in step 1) since the guest usually couldn't start successfully with it. I will ask the corresponding QE for help checking this question.
And I think it doesn't affect the bz reproduction.
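(One guess at the start failure with "-serial pipe:...", not verified: QEMU's pipe chardev expects the named pipe(s) to already exist when it starts. A minimal sketch of preparing them on both hosts before launching QEMU, using either the .in/.out pair or a single bidirectional FIFO:)
    # mkfifo /home/test/guest-serial-35728.in /home/test/guest-serial-35728.out
or:
    # mkfifo /home/test/guest-serial-35728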


So I would still suggest closing this bz as CURRENTRELEASE if you have no other advice. Thank you.

