Bug 1597621

Summary: Inconsistent guest index found on target host when rebooting a guest with multiple virtio video devices during migration
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: yafu <yafu>
Component: qemu-kvm
Assignee: Gerd Hoffmann <kraxel>
Status: CLOSED ERRATA
QA Contact: Guo, Zhiyi <zhguo>
Severity: medium
Docs Contact:
Priority: medium
Version: ---
CC: areis, chayang, coli, ddepaula, fjin, jinzhao, juzhang, kraxel, marcandre.lureau, ngu, virt-maint, yafu, yuhuang, zhguo
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-4.1.0-10.module+el8.1.0+4234+33aa4f57
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-06 07:11:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  domain xml (flags: none)
  console log (flags: none)
  boot log -2 (flags: none)
  boot log -3 (flags: none)
  vm xml used (flags: none)

Description yafu 2018-07-03 10:09:16 UTC
Description of problem:
Inconsistent guest index found on the target host when rebooting a guest with multiple virtio video devices during migration.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-5.el7.x86_64
libvirt-4.4.0-2.el7.x86_64

How reproducible:
10%

Steps to Reproduce:
1. Start a guest with multiple virtio video devices:
#virsh dumpxml iommu1
<os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
  </os>
...
    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </video>
    <video>
      <model type='virtio' heads='1'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </video>

2. Migrate the guest while it is rebooting:
#virsh reboot iommu1; virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --verbose --p2p --tunnelled
Migration: [ 98 %]error: internal error: qemu unexpectedly closed the monitor: 
2018-06-27T12:38:48.500703Z qemu-kvm: VQ 0 size 0x40 Guest index 0xbf55 inconsistent with Host index 0x43a: delta 0xbb1b
2018-06-27T12:38:48.500726Z qemu-kvm: Failed to load virtio-gpu:virtio
2018-06-27T12:38:48.500737Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-gpu'
2018-06-27T12:38:48.500779Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501780Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501873Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501940Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.502026Z qemu-kvm: load of migration failed: Operation not permitted
2018-06-27 12:38:48.715+0000: shutting down, reason=failed

Actual results:
Migration fails when rebooting a guest with multiple virtio video devices.

Expected results:
Migration should complete successfully.

Additional info:

Comment 2 Gerd Hoffmann 2018-08-15 11:31:41 UTC
Can you attach the complete domain xml please?

Comment 3 yafu 2018-08-23 02:58:46 UTC
Created attachment 1478026 [details]
domain xml

Comment 4 yafu 2018-08-23 02:59:42 UTC
(In reply to Gerd Hoffmann from comment #2)
> Can you attach the complete domain xml please?

Please see the domain xml in the attachment.

Comment 5 Gerd Hoffmann 2018-08-23 07:53:02 UTC
<domain type='kvm' id='5'>
  <name>iommu1</name>
  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>

    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>

    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xc'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>

    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </video>

    <video>
      <model type='virtio' heads='1'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </video>

Ok, the primary is bus 7, which is behind root port 00:01.4.
The secondary is bus 2, which is behind root port 00:01.1.

So the secondary comes first in PCI scan order.
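
If it helps to double-check from inside the guest, something like the following should show the enumeration order (just a suggestion; it assumes a Linux guest with pciutils installed, where the virtio GPUs show up as "Virtio GPU" display controllers):

#lspci -tv
#lspci -nn | grep -i gpu

The tree view shows which root port each device sits behind, and the flat listing reflects the scan order.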

Comment 6 Gerd Hoffmann 2018-08-23 07:57:12 UTC
Can you configure a serial console for the guest, log the serial console output on the source host to a file, then try to reproduce it?

The kernel log hopefully gives us a clue where exactly in the shutdown or boot process the guest kernel is when this bug happens.
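
For example (only a sketch, assuming the domain already has a serial/console device and the guest kernel logs to it, e.g. console=ttyS0 on its command line), the output could be captured on the source host with:

#virsh console iommu1 --force | tee /tmp/iommu1-console.log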

Comment 7 yafu 2018-08-30 03:57:54 UTC
Created attachment 1479680 [details]
console log

Comment 8 yafu 2018-08-30 03:59:04 UTC
(In reply to Gerd Hoffmann from comment #6)
> Can you configure a serial console for the guest, log the serial console
> output on the source host to a file, then try to reproduce it?
> 
> The kernel log hopefully gives us a clue where exactly in the shutdown or
> boot process the guest kernel is when this bug happens.

Please see the log in the attachment.

Comment 9 Gerd Hoffmann 2018-08-30 06:18:06 UTC
(In reply to yafu from comment #8)
> (In reply to Gerd Hoffmann from comment #6)
> > Can you configure a serial console for the guest, log the serial console
> > output on the source host to a file, then try to reproduce it?
> > 
> > The kernel log hopefully gives us a clue where exactly in the shutdown or
> > boot process the guest kernel is when this bug happens.
> 
> Please see the log in the attachment.

Can you please remove the "quiet" from the kernel command line so all the kernel messages are in the log too?

The log looks like the guest is fully booted.  Is this the log of a migration failure?  The initial comment says 10% reproducible, so I assume you have to hit the right moment in the shutdown or boot process to actually hit it.  Is that correct?
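
For example, on a RHEL-like guest using GRUB2 something like this should do it (a suggestion, not a tested command line), followed by a guest reboot:

#grubby --update-kernel=ALL --remove-args=quiet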

Comment 10 yafu 2018-08-30 06:35:20 UTC
(In reply to Gerd Hoffmann from comment #9)
> (In reply to yafu from comment #8)
> > (In reply to Gerd Hoffmann from comment #6)
> > > Can you configure a serial console for the guest, log the serial console
> > > output on the source host to a file, then try to reproduce it?
> > > 
> > > The kernel log hopefully gives us a clue where exactly in the shutdown or
> > > boot process the guest kernel is when this bug happens.
> > 
> > Please see the log in the attachment.
> 
> Can you please remove the "quiet" from the kernel command line so all the
> kernel messages are in the log too?
> 
> The log looks like the guest is fully booted.  Is this the log of a
> migration failure?  The initial comment says 10% reproducable, so I assumed
> you have to hit the right moment in the shutdown or boot process to actually
> hit it.  Is this correct?

Yes, it's the log of a migration failure. The guest can still boot successfully even when the migration fails.

I will attach a log without "quiet" on the kernel command line.

Comment 11 Gerd Hoffmann 2018-08-30 06:49:14 UTC
> Yes, It's the log of a migration failure. The guest can boot successfully
> even migration failed.

Ah, right, the guest is restarted on the source host then, so the log does not stop at the point where the migration was tried.  But that is exactly what I want to know: where in the boot process Linux is when the migration fails.  Hmm ...

Comment 12 Gerd Hoffmann 2018-08-30 07:00:14 UTC
Ok, a stop & go approach should help debug this. Can you try the following (a rough scripted version is sketched below)?

(1) Reboot the guest.
(2) Pause the guest.
(3) Try to migrate the guest.
(4a) When migration fails: you have found the guest state that breaks migration, done ;)
(4b) When migration succeeds: unpause, let it run for a moment, pause again, continue with (3).
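
A rough scripted version of that loop could look like this (only a sketch, not a tested script: the guest name and destination URI are copied from the repro command in the description and will need adjusting; after a successful migration the guest runs on the destination, so repeating step (3) means either migrating back or rerunning the commands against the destination host):

GUEST=iommu1                            # guest name from the description
DEST=qemu+ssh://10.66.4.101/system      # destination URI from the repro command

virsh reboot "$GUEST"                                                        # (1) reboot the guest
virsh suspend "$GUEST"                                                       # (2) pause the guest
if virsh migrate "$GUEST" "$DEST" --live --verbose --p2p --tunnelled; then   # (3) try to migrate
    virsh --connect "$DEST" resume "$GUEST"                                  # (4b) unpause on the destination ...
    sleep 5                                                                  # ... let it run for a moment
    virsh --connect "$DEST" suspend "$GUEST"                                 # ... pause again, then retry (3) there
else
    echo "(4a) migration failed: this paused guest state breaks migration"
fi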

Comment 13 yafu 2018-08-30 08:04:01 UTC
Created attachment 1479716 [details]
boot log -2

Comment 14 yafu 2018-08-30 08:05:59 UTC
(In reply to Gerd Hoffmann from comment #12)
> Ok, stop & go approach should help debugging this.  Can you try this:
> 
> (1) reboot the guest.
> (2) pause the guest.
> (3) try migrate the guest.
> (4a) when migration fails: found the guest state which breaks migration,
> done ;)
> (4b) when migration succeeds: unpause, let it run for a moment, pause again,
> continue with (3).

Please see the log in attachment 'boot log -2'. I paused the guest after migration failed.

Comment 15 Gerd Hoffmann 2018-08-30 09:18:10 UTC
> Please see the log in attachment 'boot log -2'. I paused the guest after
> migration failed.

Ok, so it happens after booting the kernel but before loading the virtio-gpu driver.

Comment 16 Gerd Hoffmann 2018-08-30 09:52:12 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944

Can you try whether this build works?

Comment 17 yafu 2018-08-31 05:56:36 UTC
Created attachment 1479959 [details]
boot log -3

Comment 18 yafu 2018-08-31 05:57:58 UTC
(In reply to Gerd Hoffmann from comment #16)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944
> 
> Can you try whenever this build works?

The issue is still reproducible with this build.

Comment 19 Gerd Hoffmann 2018-11-27 10:10:00 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458
How about this one?

Comment 20 yafu 2018-11-30 05:24:14 UTC
(In reply to Gerd Hoffmann from comment #19)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458
> How about this one?

The issue is still reproducible with this build.

Comment 21 Gerd Hoffmann 2019-03-04 12:16:37 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769
Can you test please?

Comment 22 yafu 2019-03-07 03:12:49 UTC
(In reply to Gerd Hoffmann from comment #21)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769
> Can you test please?

I ran the test in 200 loops and cannot reproduce the issue anymore.

Comment 23 Gerd Hoffmann 2019-03-13 07:38:08 UTC
patches merged upstream:
8ea90ee690eb78bbe6644cae3a7eff857f8b4569
3912e66a3febdea3b89150f923ca9be3f02f7ae3
0be00346d1e3d96b839832809d7042db8c7d4300 (optional cleanup)
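
For reference, with a local qemu.git checkout the first upstream tag containing a fix can be checked with, e.g.:

#git describe --contains 8ea90ee690eb78bbe6644cae3a7eff857f8b4569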

Comment 25 Gerd Hoffmann 2019-06-12 14:13:55 UTC
Oops, missed this one.  Patches are ready.  Scratch build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=22120034

I guess the real question is whether this qualifies for an exception.
If not, it'll be 7.8 anyway.

Comment 27 Ademar Reis 2019-08-19 19:56:28 UTC
This is not critical and we're not releasing qemu-kvm-rhev in 7.8, so we're deferring it to RHEL8-AV (where it's already fixed, given the upstream commits are in qemu-4.0).

Comment 29 Guo, Zhiyi 2019-09-20 07:51:25 UTC
Tested against qemu-kvm-4.1.0-10.module+el8.1.0+4234+33aa4f57.x86_64, following the steps from comment 12 (a rough scripted version of this loop is sketched below):

1) Reboot the VM
2) Pause the VM
3) Migrate the VM to the destination
4) After migration finishes, resume the VM

Repeat 1) -> 4)
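
Roughly scripted, this ping-pong loop might look like the following (only a sketch: the guest name is assumed from the original description, the host URIs are placeholders, and the migration flags are reused from the repro command; the actual setup is in the attached 'vm xml used'):

GUEST=iommu1                    # assumed guest name
A=qemu+ssh://hostA/system       # placeholder source URI
B=qemu+ssh://hostB/system       # placeholder destination URI

for i in $(seq 1 20); do
    # odd iterations go A->B, even iterations go B->A
    if (( i % 2 )); then FROM=$A; TO=$B; else FROM=$B; TO=$A; fi
    virsh --connect "$FROM" reboot "$GUEST"                                          # 1) reboot the VM
    virsh --connect "$FROM" suspend "$GUEST"                                         # 2) pause the VM
    virsh --connect "$FROM" migrate "$GUEST" "$TO" --live --verbose --p2p --tunnelled  # 3) migrate
    virsh --connect "$TO" resume "$GUEST"                                            # 4) resume after migration
done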

With the above steps, I'm not able to reproduce the issue in 20 rounds of ping-pong migration.

With the same steps, I can easily reproduce the issue with the 8.0.1 qemu-kvm: qemu-kvm-3.1.0-30.module+el8.0.1+3755+6782b0ed.x86_64.

Comment 30 Guo, Zhiyi 2019-09-20 07:53:17 UTC
Created attachment 1617052 [details]
vm xml used

Comment 31 Guo, Zhiyi 2019-09-20 07:54:04 UTC
Verified per comments 29-30.

Comment 33 errata-xmlrpc 2019-11-06 07:11:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723