Bug 1451631

Summary: Keyboard does not work after migration
Product: Red Hat Enterprise Linux 7 Reporter: xianwang <xianwang>
Component: qemu-kvm-rhev Assignee: Laurent Vivier <lvivier>
Status: CLOSED ERRATA QA Contact: xianwang <xianwang>
Severity: high Docs Contact:
Priority: high    
Version: 7.4 CC: aliang, chayang, coli, dgibson, dgilbert, hachen, juzhang, knoel, kraxel, lmiksik, lvivier, mdeng, michen, mrezanin, qzhang, virt-maint, xianwang, xuma
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.9.0-10.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-02 04:38:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1376765    

Description xianwang 2017-05-17 08:13:46 UTC
Description of problem:
Lock the screen of the VM, then do a local migration. After the migration completes, the keyboard does not work, while the mouse works well. This issue is hit for both local migration and migration between two hosts.

Version-Release number of selected component (if applicable):
Host:
3.10.0-666.el7.ppc64le
qemu-kvm-rhev-2.9.0-4.el7.ppc64le
SLOF-20170303-3.git66d250e.el7.noarch

How reproducible:
4/5

Steps to Reproduce:
1. Boot a guest with the following qemu command line on a host:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries \
    -vga std \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
    -drive file=/root/RHEL74_1.qcow2,format=qcow2,if=none,id=drive_blk1,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive_blk1,id=blk-disk1,bootindex=1,bus=pci.0,addr=06 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=10 \
    -netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4G \
    -smp 2,maxcpus=4,cores=2,threads=2,sockets=1 \
    -cpu host \
    -device usb-kbd \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=c,menu=on,strict=off \
    -enable-kvm

2. In the guest, lock the screen by pressing the button in the upper-right corner.
3. Then launch the destination in listening mode on the same host, with the same qemu command line as above plus "-incoming tcp:0:5801".
4. Do a local migration:
(qemu) migrate -d tcp:10.16.69.69:5801
5. After the migration has completed, unlock the screen by typing the password on the keyboard.
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off 
Migration status: completed


Actual results:
The keyboard does not work, while the mouse works well. I can still ssh into the guest and the network works; (qemu) system_reset and (qemu) system_powerdown both work as well.

Expected results:


Additional info:

Comment 2 xianwang 2017-05-17 08:17:47 UTC
This bug affects only powerpc. I have tested this scenario on x86_64, and the bug does not occur there.
Host:
3.10.0-666.el7.x86_64
qemu-kvm-rhev-2.9.0-4.el7.x86_64

Guest:
3.10.0-666.el7.x86_64

Comment 3 xianwang 2017-05-17 09:52:44 UTC
1) Whether the screen is locked or not, the keyboard does not work.
2) This bug is a regression; it does not exist with the following versions:
3.10.0-666.el7.ppc64le
qemu-kvm-rhev-2.6.0-27.el7.ppc64le
SLOF-20170303-3.git66d250e.el7.noarch

Comment 4 David Gibson 2017-05-18 01:26:35 UTC
Hi Xianwang, can you add the following info:

 * What guest image was in use?
   * In particular what was the guest kernel version?
 * Re comment 2: could you check and see if it is the host kernel, qemu or guest kernel change which causes the regression?

Comment 5 David Gibson 2017-05-18 01:27:59 UTC
Sorry, I misread your posts, I see that only the qemu version has changed between working and non-working versions.  Did the guest image change between working and non-working runs?

Comment 6 xianwang 2017-05-18 02:43:34 UTC
(In reply to David Gibson from comment #4)
> Hi Xianwang, can you add the following info:
> 
>  * What guest image was in use?
>    * In particular what was the guest kernel version?
>  * Re comment 2: could you check and see if it is the host kernel, qemu or
> guest kernel change which causes the regression?

I am sorry, I forgot to give the guest kernel version. The guest version is
3.10.0-666.el7.ppc64le for both the working and non-working runs, and the guest image did not change between them.

What's more, I tried to find the first bad commit with "git bisect". It seems that I found a first bad commit, but the failure is not the same as in this bug. The result is as follows:

The host and guest versions and the test steps are all the same as in this bug.
# git bisect good 29ba0cdc1fd1300f910d150c03a0f74236083bf7
# git bisect bad ddc371e5a0a569b9c02522bc6ec26ce16f6e126c
Bisecting: 912 revisions left to test after this (roughly 10 steps)
[dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e] virtio: check for vring setup in virtio_queue_empty

compile and test.

test result:
after migration:
src end:
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off 
Migration status: completed
dst end:
[root@ibm-p8-rhevm-05 ~]# sh boot_d.sh 
QEMU 2.8.50 monitor - type 'help' for more information
(qemu) info status 
VM status: paused (inmigrate)
(qemu) 2017-05-18T02:12:36.451619Z qemu-system-ppc64: VQ 0 size 0x80 < last_avail_idx 0x1bf0 - used_idx 0x0
2017-05-18T02:12:36.451679Z qemu-system-ppc64: Failed to load virtio-blk:virtio
2017-05-18T02:12:36.451684Z qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:05.0/virtio-blk'
2017-05-18T02:12:36.451976Z qemu-system-ppc64: load of migration failed: Operation not permitted

# git branch -r --contains dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e
  origin/preview/2.9.0-rc4
  origin/rhv7/master-2.9.0

So, the first bad commit is:
dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e
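
The "VQ 0 size ... < last_avail_idx ... - used_idx ..." message in the log above reads like a consistency check on the migrated virtqueue state. The following is a minimal Python sketch of such a check, modelled only on the message text; it is not QEMU's C code, and the helper name `check_vq_inuse` is made up for illustration. It assumes the ring indices are 16-bit counters, so the difference wraps.

```python
def check_vq_inuse(size, last_avail_idx, used_idx):
    """Return None if the state looks sane, else an error string.

    Hypothetical helper modelled on the log message; the number of
    in-flight descriptors is the 16-bit difference of the two indices.
    """
    inuse = (last_avail_idx - used_idx) & 0xFFFF
    if inuse > size:
        return "VQ 0 size 0x%x < last_avail_idx 0x%x - used_idx 0x%x" % (
            size, last_avail_idx, used_idx)
    return None


# Values from the log: 0x1bf0 descriptors "in use" on a 0x80-entry queue.
print(check_vq_inuse(0x80, 0x1bf0, 0x0))
# A plausible healthy state: 0x20 descriptors in flight.
print(check_vq_inuse(0x80, 0x1bf0, 0x1bd0))
```

With the values from the log, the in-flight count is far larger than the queue size, which suggests the destination saw ring state that did not match what the source had.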

Comment 7 xianwang 2017-05-18 02:49:24 UTC
(In reply to xianwang from comment #6)
> (In reply to David Gibson from comment #4)
> > Hi Xianwang, can you add the following info:
> > 
> >  * What guest image was in use?
> >    * In particular what was the guest kernel version?
> >  * Re comment 2: could you check and see if it is the host kernel, qemu or
> > guest kernel change which causes the regression?
> 
> I am sorry I forgot to describe the guest kernel version, the guest version
> is:
> 3.10.0-666.el7.ppc64le, both for working and non-working versions, and the
> guest img do not change between working and non-working runs.
> 
> what's more, I tried to find the first bad commit by "bisect", it seems that
> I find out the first bad commit, but the result is not same with this bug,
> the result is as following:
> 
> the version of host and guest ,and the test steps are all same as bugs.
> # git bisect good 29ba0cdc1fd1300f910d150c03a0f74236083bf7
> # git bisect bad ddc371e5a0a569b9c02522bc6ec26ce16f6e126c
> Bisecting: 912 revisions left to test after this (roughly 10 steps)
> [dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e] virtio: check for vring setup in
> virtio_queue_empty
> 
> compile and test.
> 
> test result:
> after migration:
> src end:
> (qemu) info migrate
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks:
> off compress: off events: off postcopy-ram: off x-colo: off release-ram: off 
> Migration status: completed
> dst end:
> [root@ibm-p8-rhevm-05 ~]# sh boot_d.sh 
> QEMU 2.8.50 monitor - type 'help' for more information
> (qemu) info status 
> VM status: paused (inmigrate)
> (qemu) 2017-05-18T02:12:36.451619Z qemu-system-ppc64: VQ 0 size 0x80 <
> last_avail_idx 0x1bf0 - used_idx 0x0
> 2017-05-18T02:12:36.451679Z qemu-system-ppc64: Failed to load
> virtio-blk:virtio
> 2017-05-18T02:12:36.451684Z qemu-system-ppc64: error while loading state for
> instance 0x0 of device 'pci@800000020000000:05.0/virtio-blk'
> 2017-05-18T02:12:36.451976Z qemu-system-ppc64: load of migration failed:
> Operation not permitted
> 
> # git branch -r --contains dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e
>   origin/preview/2.9.0-rc4
>   origin/rhv7/master-2.9.0
> 
> so, the first commit is:
> dd3dd4ba7b949662d2c67a4c041549b3d79c4b0e

For this test, the qemu command line was:
/root/qemu-kvm/ppc64-softmmu/qemu-system-ppc64 \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries \
    -vga std \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=04 \
    -drive file=/root/RHEL74_1.qcow2,format=qcow2,if=none,id=drive_blk1,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive_blk1,id=blk-disk1,bootindex=1,bus=pci.0,addr=05 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=06 \
    -netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4G \
    -smp 2,maxcpus=4,cores=2,threads=2,sockets=1 \
    -cpu host \
    -device usb-kbd \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=c,menu=on,strict=off \
    -enable-kvm

Comment 8 xianwang 2017-05-19 05:50:13 UTC
1) On the same host as in this bug report:
I re-tested this scenario with another guest image, installed by avocado on the same host. After migration, the keyboard works well. The qemu command line and the guest kernel version are the same; only the image differs. The image that I installed manually is the bad one.

2) On another host (host2), different from the one in this bug report:
The result is the same as in 1): the image that I installed manually reproduces this bug, while the image installed by avocado does not.

Hi David, do you think there is something wrong with my manually installed image?

Comment 9 David Gibson 2017-05-19 07:48:55 UTC
It does seem like there's something wrong with the image, although I can't quite think what it could be.

I was also unable to reproduce the problem with my own image (so far, anyway).

Using the ssh connection, with the broken image, can you show me the output of "lsusb" on the guest while in the broken state.  The output from before the migration would also be useful for comparison.


Re: comment 6.  Thanks for attempting a bisect, however unless I'm misunderstanding the comment it looks like you didn't complete the bisect, just did the first step.  Completing a bisect generally requires testing a number of different versions.
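
To illustrate why a complete bisect needs several build-and-test rounds, here is a small Python sketch of the binary search that git bisect performs. It is a simplified model over a linear list of revisions (real histories are graphs), and the revision numbers and the `is_bad` predicate are stand-ins, not data from this bug.

```python
def bisect_first_bad(revisions, is_bad):
    """Binary-search an ordered revision list for the first bad entry.

    Assumes revisions[0] is known good and revisions[-1] is known bad.
    Returns (first_bad, number_of_builds_tested).
    """
    lo, hi = 0, len(revisions) - 1   # lo is good, hi is bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1                  # one compile-and-migrate test per step
        if is_bad(revisions[mid]):
            hi = mid                 # first bad commit is at mid or earlier
        else:
            lo = mid                 # first bad commit is after mid
    return revisions[hi], tested


# Stand-in for the 912 revisions left in comment 6; revision 600 is
# arbitrarily chosen as the one that introduced the regression.
revs = list(range(913))
first_bad, steps = bisect_first_bad(revs, lambda r: r >= 600)
print(first_bad, steps)
```

Each test halves the remaining range, which matches git's estimate of roughly 10 more steps for 912 revisions. A single test only narrows the range; it does not identify the commit.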

Comment 10 xianwang 2017-05-22 11:44:33 UTC
(In reply to David Gibson from comment #9)
> It does seem like there's something wrong with the image, although I can't
> quite think what it could be.
> 
> I was also unable to reproduce the problem with my own image (so far,
> anyway).
> 
> Using the ssh connection, with the broken image, can you show me the output
> of "lsusb" on the guest while in the broken state.  The output from before
> the migration would also be useful for comparison.
> 
> 
> Re: comment 6.  Thanks for attempting a bisect, however unless I'm
> misunderstanding the comment it looks like you didn't complete the bisect,
> just did the first step.  Completing a bisect generally requires testing a
> number of different versions.

Here is the "lsusb" output on the guest in the broken state; it is the same as before the migration:
[root@dhcp70-148 ~]# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 0627:0001 Adomax Technology Co., Ltd 
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
dst end:
(qemu) info usb
  Device 0.1, Port 1, Speed 480 Mb/s, Product QEMU USB Keyboard
  Device 0.2, Port 2, Speed 480 Mb/s, Product QEMU USB Mouse

I also tried "dmesg | grep usb"; it shows no error information, but the keyboard does not work while the other functions work well.

Comment 11 David Gibson 2017-05-24 06:01:04 UTC
Ok, that lsusb looks as expected, so it doesn't appear the device is disappearing entirely across the migration.

Unless we find a reproducible way of making an image where this doesn't work, I'm inclined to close this as WORKSFORME.

Comment 12 David Gibson 2017-05-25 05:07:12 UTC
Given the updates on bug 1448810, I'm less inclined to drop this.  I think there may be a real bug, even if the triggering circumstances are unclear.

Xianwang, are you able to put your disk image which shows this problem somewhere I can access?

Comment 15 David Gibson 2017-05-29 02:48:06 UTC
Unfortunately, even with the image from xianwang, I haven't been able to reproduce this.  I do have a few differences from the setup described by xianwang, although none of them seem likely to cause this problem:

 * I'm using 'user' network instead of 'tap'
 * I'm using a slightly different hypervisor console and qemu monitor configuration
 * I'm doing a local migration, which requires that the source and destination have different vnc and qmp ports

I also have a slightly newer qemu package:

qemu-kvm-rhev-2.9.0-6.el7.ppc64le

xianwang, could you see if you're able to reproduce this with the newer qemu?
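
For a local migration, the destination command line can be derived from the source one. The Python sketch below shows one way to do that; the helper name `destination_args`, the VNC display ":2" and the QMP port 8882 are assumptions chosen for illustration, while the "-incoming tcp:0:5801" address is the one from the description.

```python
def destination_args(source_args, incoming_uri, vnc_display, qmp_addr):
    """Copy the source argv, give the VNC and QMP listeners new
    addresses, and append -incoming so the copy waits for migration."""
    dest = list(source_args)
    for i, arg in enumerate(dest[:-1]):
        if arg == "-vnc":
            dest[i + 1] = vnc_display
        elif arg == "-qmp":
            dest[i + 1] = qmp_addr
    return dest + ["-incoming", incoming_uri]


# Shortened source command line; the full one is in the description.
src = ["/usr/libexec/qemu-kvm", "-machine", "pseries",
       "-qmp", "tcp:0:8881,server,nowait", "-vnc", ":1"]
dst = destination_args(src, "tcp:0:5801", ":2", "tcp:0:8882,server,nowait")
print(" ".join(dst))
```

Everything else, including the device list and PCI addresses, is left identical on purpose, since the destination has to present the same machine to the guest.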

Comment 16 Dr. David Alan Gilbert 2017-05-30 14:17:08 UTC
xianwang: 
  A few other suggestions (after you've tried David Gibson's):
    a) Does the qemu sendkey command work after migration - e.g.
           sendkey x
       or  sendkey ctrl-alt-f4
    b) If the guest is at a text-console (e.g. ctrl-alt-f4 before migration) does it work?
    c) After the migrate can you do the qemu command:
       info mice

       I suspect the usb-mouse isn't really working and the mouse input is going a different way.

Dave
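
The same checks can also be scripted over the QMP socket that the command line in the description opens with "-qmp tcp:0:8881,server,nowait". The Python sketch below only builds the JSON messages; the socket handling is left out, and the helper names `send_key_command` and `encode` are made up for illustration.

```python
import json


def send_key_command(*qcodes):
    """Build the QMP 'send-key' request for a key combination."""
    keys = [{"type": "qcode", "data": q} for q in qcodes]
    return {"execute": "send-key", "arguments": {"keys": keys}}


def encode(cmd):
    """Serialize one QMP command as a JSON line ready to be sent."""
    return (json.dumps(cmd) + "\r\n").encode("ascii")


# Commands to send, in order, after reading the server greeting.
session = [
    {"execute": "qmp_capabilities"},        # leave negotiation mode
    send_key_command("x"),                  # like: (qemu) sendkey x
    send_key_command("ctrl", "alt", "f4"),  # like: (qemu) sendkey ctrl-alt-f4
    {"execute": "query-mice"},              # like: (qemu) info mice
]
for cmd in session:
    print(encode(cmd).decode("ascii"), end="")
```

Whether the key actually arrives still has to be checked inside the guest, for example on a text console.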

Comment 17 xianwang 2017-05-31 09:00:20 UTC
(In reply to David Gibson from comment #15)
> Unfortunately, even with the image from xianwang, I haven't been able to
> reproduce this.  I do have a few differences from the setup described from
> xianwang, although none of them seem likely to cause this problem:
> 
>  * I'm using 'user' network instead of 'tap'
>  * I'm using a slightly different hypervisor console and qemu monitor
> configuration
>  * I'm doing a local migration, which requires that the source and
> destination have different vnc and qmp ports
> 
> I also have a slightly newer qemu package:
> 
> qemu-kvm-rhev-2.9.0-6.el7.ppc64le
> 
> xianwang, could you see if you're able to reproduce this with the newer qemu?

I re-tested it with the following versions:
Host:
3.10.0-663.el7.ppc64le
qemu-kvm-rhev-2.9.0-6.el7.ppc64le
SLOF-20170303-4.git66d250e.el7.noarch

Guest:
3.10.0-666.el7.ppc64le

The test result is as follows:
a) Yes, with the above host version qemu-kvm-rhev-2.9.0-6.el7.ppc64le, this bug exists.

b) Following Dave's advice, after migration I tried:
(qemu) sendkey x              (does not work)
(qemu) sendkey ctrl-alt-f4    (does not work)
(qemu) sendkey x              (does not work)
(qemu) sendkey ctrl-alt-f4    (does not work)
(qemu) sendkey ctrl-alt-f2    (does not work)
(qemu) sendkey ctrl-alt-f3    (does not work)
(qemu) info mice 
* Mouse #2: QEMU HID Mouse

in guest:
# lsusb -v | grep HID
    iConfiguration          6 HID Mouse
        HID Device Descriptor:
          bcdHID               0.01
    iConfiguration          8 HID Keyboard
        HID Device Descriptor:
          bcdHID               1.11

Comment 18 Laurent Vivier 2017-06-06 17:59:36 UTC
Xianwang,

could you check the following build fixes the problem:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13356214

It reverts the commit found by BZ1448810

Thanks

Comment 19 xianwang 2017-06-07 07:14:03 UTC
(In reply to Laurent Vivier from comment #18)
> Xianwang,
> 
> could you check the following build fixes the problem:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13356214
> 
> It reverts the commit found by BZ1448810
> 
> Thanks

Yes. With the same problematic guest image (3.10.0-666.el7.ppc64le) as in comment 8, I re-tested this scenario with qemu-kvm-rhev-2.9.0-8.el7.lvivier201706061743.ppc64le, the build specified by Laurent, and this bug does not occur.

On the same host, with the same host kernel version and image as before, this bug exists with qemu-kvm-rhev-2.9.0-7.el7.ppc64le.

Host:
3.10.0-675.el7.ppc64le
SLOF-20170303-4.git66d250e.el7.noarch

Guest:
3.10.0-666.el7.ppc64le

Comment 20 Laurent Vivier 2017-06-07 08:32:51 UTC
*** Bug 1448810 has been marked as a duplicate of this bug. ***

Comment 21 Laurent Vivier 2017-06-07 08:35:54 UTC
Involved commit found in BZ1448810:

Bisected to:

commit 243afe858b95765b98d16a1f0dd50dca262858ad
Author: Gerd Hoffmann <kraxel>
Date:   Fri Mar 31 12:25:21 2017 +0200

    xhci: flush dequeue pointer to endpoint context
    
    When done processing a endpoint ring we must update the dequeue pointer
    in the endpoint context in guest memory.  This is needed to make sure
    the guest has a correct view of things and also to make live migration
    work properly, because xhci post_load restores alot of the state from
    xhci data structures in guest memory.
    
    Add xhci_set_ep_state() call to do that.
    
    The recursive calls stopped by commit
    ddb603ab6c981c1d67cb42266fc700c33e5b2d8f had the (unintentional) side
    effect to hiding this bug.  xhci_set_ep_state() was called before
    processing, to set the state to running, which updated the dequeue
    pointer too.
    
    Reported-by: Dr. David Alan Gilbert <dgilbert>
    Signed-off-by: Gerd Hoffmann <kraxel>
    Tested-by: Dr. David Alan Gilbert <dgilbert>
    Message-id: 20170331102521.29253-1-kraxel

Comment 22 Dr. David Alan Gilbert 2017-06-07 08:43:02 UTC
and that commit was put in to fix

https://bugzilla.redhat.com/show_bug.cgi?id=1436616

Comment 23 Laurent Vivier 2017-06-07 16:24:55 UTC
After migration, in xhci_kick_epctx(), xhci_kick_epctx() always returns 0 and stops the processing loop.
xhci_kick_epctx() returns 0 because the XHCITRB structures retrieved with pci_dma_read() are completely cleared. It seems the value of the pointer to the ring is not the right one: it changes between the source and the destination.
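
The effect described above can be pictured with a toy model. The Python sketch below is not QEMU's xhci code: the function names, the addresses and the rule that unmapped addresses read back as zeros are all assumptions made for illustration, and real TRB handling (cycle bits, link TRBs) is left out. It only shows that a ring walk starting from a wrong dequeue pointer finds cleared TRBs and processes nothing.

```python
TRB_SIZE = 16  # size of one TRB in bytes


def read_trb(guest_mem, addr):
    """Stand-in for a DMA read: unmapped addresses read back as zeros."""
    return guest_mem.get(addr, bytes(TRB_SIZE))


def kick(guest_mem, dequeue):
    """Walk the ring from `dequeue`; return the number of TRBs processed."""
    processed = 0
    while any(read_trb(guest_mem, dequeue)):  # an all-zero TRB ends the loop
        dequeue += TRB_SIZE
        processed += 1
    return processed


# Two TRBs queued by the guest at a hypothetical ring address.
ring = {0x1000: b"\x01" * TRB_SIZE, 0x1010: b"\x02" * TRB_SIZE}
print(kick(ring, 0x1000))  # pointer matches the ring the guest filled
print(kick(ring, 0x8000))  # pointer changed across migration
```

In the second call the guest has queued work, but the walk never sees it, which is consistent with a keyboard that stays silent while nothing is reported as an error.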

Comment 24 Laurent Vivier 2017-06-08 07:29:14 UTC
Gerd has fixed the bug with commit from:
https://www.kraxel.org/cgit/qemu/log/?h=work/xhci-hid-migration

I am preparing a build for QE.

Comment 30 Gerd Hoffmann 2017-06-13 10:11:56 UTC
*** Bug 1454580 has been marked as a duplicate of this bug. ***

Comment 31 Miroslav Rezanina 2017-06-13 16:34:58 UTC
Fix included in qemu-kvm-rhev-2.9.0-10.el7

Comment 32 xianwang 2017-06-14 07:45:18 UTC
This bug is reproduced with qemu-kvm-rhev-2.9.0-5.el7.ppc64le and verified as fixed with qemu-kvm-rhev-2.9.0-10.el7.ppc64le.

Bug reproduction:
Host:
3.10.0-679.el7.ppc64le
qemu-kvm-rhev-2.9.0-5.el7.ppc64le
SLOF-20170303-4.git66d250e.el7.noarch

Guest:
3.10.0-666.el7.ppc64le

steps:
1. Boot a guest with the following qemu command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries \
    -vga std \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=04 \
    -drive file=/root/RHEL74_1_bug1451631.qcow2,format=qcow2,if=none,id=drive_blk1,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive_blk1,id=blk-disk1,bootindex=1,bus=pci.0,addr=05 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=06 \
    -netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4G \
    -smp 2,maxcpus=4,cores=2,threads=2,sockets=1 \
    -cpu host \
    -device usb-hub,id=hub1,port=1 \
    -device usb-mouse,id=usbmouse,port=1.1 \
    -device usb-kbd,id=usbkbd,port=1.2 \
    -device usb-tablet,id=usbtablet,port=1.3 \
    -device usb-storage,id=storage1,port=1.4,drive=drive1 \
    -drive file=/root/data1.qcow2,id=drive1,if=none \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=c,menu=on,strict=off \
    -enable-kvm
2. Launch the destination in listening mode on the same host by appending:
-incoming tcp:0:58001
3. Do a local migration:
(qemu) migrate -d tcp:127.0.0.1:5802
4. Check that the mouse, keyboard and usb-storage work.

Result:
The mouse and keyboard do not work, but the usb-storage works well ("fdisk -l" shows /dev/sda).

Bug verification:
Host:
3.10.0-679.el7.ppc64le
qemu-kvm-rhev-2.9.0-10.el7.ppc64le
SLOF-20170303-4.git66d250e.el7.noarch

Guest:
3.10.0-666.el7.ppc64le

The steps are the same as for the bug reproduction.

Result:
The mouse, keyboard and usb-storage all work well.

So, this bug is verified as fixed.

Comment 35 xianwang 2017-06-14 08:04:33 UTC
I have also verified this bug for x86_64 with qemu-kvm-rhev-2.9.0-10.el7.x86_64; the result is pass.
Host:
3.10.0-679.el7.x86_64
qemu-kvm-rhev-2.9.0-10.el7.x86_64

guest:
3.10.0-679.el7.x86_64

The qemu command line and steps are the same as in comment 32.

Comment 40 errata-xmlrpc 2017-08-02 04:38:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392