Description of problem:
Inconsistent guest index found on the target host when rebooting a guest with multiple virtio videos while doing migration.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-5.el7.x86_64
libvirt-4.4.0-2.el7.x86_64

How reproducible:
10%

Steps to Reproduce:
1. Start a guest with multiple virtio videos:
#virsh dumpxml iommu1
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  ...
  <video>
    <model type='virtio' heads='1' primary='yes'>
      <acceleration accel3d='no'/>
    </model>
    <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </video>
  <video>
    <model type='virtio' heads='1'>
      <acceleration accel3d='no'/>
    </model>
    <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
  </video>

2. Do migration while rebooting the guest:
#virsh reboot iommu1; virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --verbose --p2p --tunnelled
Migration: [ 98 %]error: internal error: qemu unexpectedly closed the monitor:
2018-06-27T12:38:48.500703Z qemu-kvm: VQ 0 size 0x40 Guest index 0xbf55 inconsistent with Host index 0x43a: delta 0xbb1b
2018-06-27T12:38:48.500726Z qemu-kvm: Failed to load virtio-gpu:virtio
2018-06-27T12:38:48.500737Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-gpu'
2018-06-27T12:38:48.500779Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501780Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501873Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501940Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.502026Z qemu-kvm: load of migration failed: Operation not permitted
2018-06-27 12:38:48.715+0000: shutting down, reason=failed

Actual results:
Migration fails when rebooting a guest with multiple virtio videos.

Expected results:
Migration should succeed.

Additional info:
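Since the failure only reproduces about 10% of the time, the reboot+migrate sequence in step 2 has to be retried in a loop until it is hit. A minimal sketch of such a loop, assuming the domain name and destination URI from this report; the `retry_until_fail` helper is hypothetical, not part of any tool mentioned here:

```shell
#!/bin/sh
# retry_until_fail BUDGET CMD...: repeat CMD until it fails or BUDGET
# successful runs have been made; print the number of successful runs.
retry_until_fail() {
    budget=$1; shift
    n=0
    while [ "$n" -lt "$budget" ] && "$@"; do
        n=$((n + 1))
    done
    echo "$n"
}

# Intended usage against the guest from this report (not run here):
#   attempt() {
#       virsh reboot iommu1
#       virsh migrate iommu1 qemu+ssh://10.66.4.101/system \
#           --live --verbose --p2p --tunnelled
#   }
#   retry_until_fail 50 attempt
```

If the helper returns a count below the budget, the last attempt is the failing one and the source host logs can be inspected.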
Can you attach the complete domain xml please?
Created attachment 1478026 [details] domain xml
(In reply to Gerd Hoffmann from comment #2)
> Can you attach the complete domain xml please?

Please see the domain xml in the attachment.
<domain type='kvm' id='5'>
  <name>iommu1</name>
  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>
  <controller type='pci' index='2' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='2' port='0x9'/>
    <alias name='pci.2'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
  </controller>
  <controller type='pci' index='7' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='7' port='0xc'/>
    <alias name='pci.7'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
  </controller>
  <video>
    <model type='virtio' heads='1' primary='yes'>
      <acceleration accel3d='no'/>
    </model>
    <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </video>
  <video>
    <model type='virtio' heads='1'>
      <acceleration accel3d='no'/>
    </model>
    <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
  </video>

Ok, the primary is bus 7, which is root port 00:1.4.
The secondary is bus 2, which is root port 00:1.1.
So the secondary comes first in PCI scan order.
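The scan-order observation can be illustrated directly: a PCI bus walk enumerates devices in ascending bus:slot.fn order, which for these two addresses is plain lexical order. A trivial sketch using the two addresses from the XML (inside the guest, something like `lspci -nn` would show the same ordering):

```shell
#!/bin/sh
# Sort the two virtio-video addresses the way a PCI bus walk visits them
# (ascending bus:slot.fn); the secondary on bus 02 comes out first.
printf '0000:07:00.0 primary\n0000:02:00.0 secondary\n' | sort
```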
Can you configure a serial console for the guest, log the serial console output on the source host to a file, then try to reproduce it?

The kernel log hopefully gives us a clue where exactly in the shutdown or boot process the guest kernel is when this bug happens.
Created attachment 1479680 [details] console log
(In reply to Gerd Hoffmann from comment #6)
> Can you configure a serial console for the guest, log the serial console
> output on the source host to a file, then try to reproduce it?
>
> The kernel log hopefully gives us a clue where exactly in the shutdown or
> boot process the guest kernel is when this bug happens.

Please see the log in the attachment.
(In reply to yafu from comment #8)
> Please see the log in the attachment.

Can you please remove the "quiet" from the kernel command line so all the kernel messages are in the log too?

The log looks like the guest is fully booted. Is this the log of a migration failure? The initial comment says 10% reproducible, so I assumed you have to hit the right moment in the shutdown or boot process to actually hit it. Is this correct?
(In reply to Gerd Hoffmann from comment #9)
> Can you please remove the "quiet" from the kernel command line so all the
> kernel messages are in the log too?
>
> The log looks like the guest is fully booted. Is this the log of a
> migration failure?

Yes, it's the log of a migration failure. The guest can boot successfully even if the migration failed. I will attach a log without "quiet" on the kernel command line.
> Yes, it's the log of a migration failure. The guest can boot successfully
> even if the migration failed.

Ah, right, the guest is restarted on the source host then, so the log does not stop at the point where the migration was tried. But that is exactly what I want to know: where in the boot process is Linux when the migration fails. Hmm ...
Ok, a stop & go approach should help debugging this. Can you try this:

(1) Reboot the guest.
(2) Pause the guest.
(3) Try to migrate the guest.
(4a) When migration fails: we have found the guest state which breaks migration, done ;)
(4b) When migration succeeds: unpause, let it run for a moment, pause again, continue with (3).
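The steps above can be scripted as a rough sketch; the `stop_and_go` helper is hypothetical, and the pause/migrate/resume commands are passed in as strings so the same loop works for any domain (step (1)'s reboot is done once up front):

```shell
#!/bin/sh
# stop_and_go PAUSE_CMD MIGRATE_CMD RESUME_CMD TRIES: run the stop & go
# procedure until migration fails or TRIES attempts have been made.
stop_and_go() {
    pause=$1; migrate=$2; resume=$3; tries=$4
    i=0
    while [ "$i" -lt "$tries" ]; do
        i=$((i + 1))
        $pause                          # (2) pause the guest
        if ! $migrate; then             # (3) try to migrate
            echo "failed on try $i"     # (4a) state that breaks migration
            return 1
        fi
        $resume                         # (4b) unpause ...
        sleep 0.2                       # ... let it run for a moment
    done
    echo "no failure in $tries tries"
}

# Intended usage (not run here), after "virsh reboot iommu1":
#   stop_and_go "virsh suspend iommu1" \
#       "virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --p2p --tunnelled" \
#       "virsh resume iommu1" 20
```

When the helper reports a failed try, the guest is still paused in exactly the state that broke migration, which is what makes the state inspectable.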
Created attachment 1479716 [details] boot log -2
(In reply to Gerd Hoffmann from comment #12)
> Ok, stop & go approach should help debugging this. Can you try this:
>
> (1) reboot the guest.
> (2) pause the guest.
> (3) try migrate the guest.
> (4a) when migration fails: found the guest state which breaks migration,
> done ;)
> (4b) when migration succeeds: unpause, let it run for a moment, pause again,
> continue with (3).

Please see the log in attachment 'boot log -2'. I paused the guest after migration failed.
> Please see the log in attachment 'boot log -2'. I paused the guest after
> migration failed.

Ok, so it happens after booting the kernel but before loading the virtio-gpu driver.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944

Can you try whether this build works?
Created attachment 1479959 [details] boot log -3
(In reply to Gerd Hoffmann from comment #16)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944
>
> Can you try whether this build works?

I can still reproduce the issue with this build.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458

How about this one?
(In reply to Gerd Hoffmann from comment #19)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458
> How about this one?

I can still reproduce the issue with this build.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769

Can you test please?
(In reply to Gerd Hoffmann from comment #21)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769
> Can you test please?

I ran the test in 200 loops and cannot reproduce the issue any more.
Patches merged upstream:

8ea90ee690eb78bbe6644cae3a7eff857f8b4569
3912e66a3febdea3b89150f923ca9be3f02f7ae3
0be00346d1e3d96b839832809d7042db8c7d4300 (optional cleanup)
Oops, missed this one. Patches are ready. Scratch build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=22120034

I guess the real question is whether this qualifies for an exception. If not, it'll be 7.8 anyway.
This is not critical and we're not releasing qemu-kvm-rhev in 7.8, so deferring it to RHEL8-AV (where it's fixed already, given the upstream commits are in qemu-4.0).
Tested against qemu-kvm-4.1.0-10.module+el8.1.0+4234+33aa4f57.x86_64, following the steps from comment 12:

1) Reboot the vm.
2) Pause the vm.
3) Migrate the vm to the destination.
4) After the migration finishes, resume the vm.

Repeat 1) -> 4).

With the above steps, I was not able to reproduce the issue in 20 rounds of ping-pong migration.

Also, with the same steps, I can easily reproduce the issue with the 8.0.1 qemu-kvm: qemu-kvm-3.1.0-30.module+el8.0.1+3755+6782b0ed.x86_64
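The ping-pong verification above can be sketched as a loop that alternates the migration direction each round; the `ping_pong` helper and the two migrate command strings are placeholders, not part of virsh:

```shell
#!/bin/sh
# ping_pong ROUNDS MIGRATE_BACK_CMD MIGRATE_AWAY_CMD: alternate migrating the
# guest away (odd rounds) and back (even rounds), stopping on any failure.
ping_pong() {
    rounds=$1; migrate_back=$2; migrate_away=$3
    r=0
    while [ "$r" -lt "$rounds" ]; do
        r=$((r + 1))
        if [ $((r % 2)) -eq 1 ]; then
            $migrate_away || { echo "failed round $r"; return 1; }
        else
            $migrate_back || { echo "failed round $r"; return 1; }
        fi
    done
    echo "passed $rounds rounds"
}

# Intended usage (not run here), with reboot/pause/resume wrapped around each
# round as in the steps above:
#   ping_pong 20 \
#       "virsh migrate iommu1 qemu+ssh://SRC_HOST/system --live" \
#       "virsh migrate iommu1 qemu+ssh://DST_HOST/system --live"
```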
Created attachment 1617052 [details] vm xml used
Verified per comment 29 - 30
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3723