Bug 1357468

Summary: Cross migration fails with error qemu-kvm: ... 'pci@800000020000000:00.0/ohci'
Product: Red Hat Enterprise Linux 7
Reporter: Dan Zheng <dzheng>
Component: libvirt
Assignee: Andrea Bolognani <abologna>
Status: CLOSED CANTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact: Jiri Herrmann <jherrman>
Priority: unspecified
Version: 7.3
CC: abologna, bugproxy, dgilbert, dyuan, dzheng, fjin, gsun, hannsj_uhl, jdenemar, jsuchane, mdeng, michal.skrivanek, mkolaja, mzhan, qzhang, rbalakri, zpeng
Target Milestone: rc
Keywords: Reopened
Target Release: 7.3
Hardware: ppc64le
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Migration of certain guests from Red Hat Enterprise Linux 7.2 to 7.3 hosts is not possible

Prior to this update, the PCI address of any USB controller that did not have an explicitly specified `model` value was ignored on IBM Power guest virtual machines. This bug has been fixed, but as a consequence of the fix, it is not possible to perform a live migration of guests that use the described USB controllers from a Red Hat Enterprise Linux 7.2 host to a Red Hat Enterprise Linux 7.3 host, due to the different PCI addresses of the USB controller.

To work around this problem, edit the guest XML file and add a `model` attribute with the `pci-ohci` value to the USB <controller> element, for example as follows:

  <controller type='usb' model='pci-ohci' index='0'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
  </controller>

Afterwards, shut down the guest and start it again for the changes to take effect. As a result, the guest can be migrated from Red Hat Enterprise Linux 7.2 to 7.3.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-07 15:57:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1230910, 1287890, 1289202, 1299681, 1359843, 1362179, 1369086    
Attachments:
  Description: qemu command line log for local host (above star line) and remote host (under star line)
  Flags: none
  Description: guest can not start without usb controller model in libvirt-2.0.0-6.el7.abologna.bz1357468.ppc64le
  Flags: none

Description Dan Zheng 2016-07-18 09:54:33 UTC
Created attachment 1181007 [details]
qemu command line log for local host (above star line) and remote host (under star line)

Description of problem:
Live migration of a ppc64le RHEL 7 guest from a RHEL 7.2 host to a RHEL 7.3 host fails with a qemu error.

Version-Release number of selected component (if applicable):

Host 1: [RHEL7.2.z]
OS tree: RHEL7.2-20151030.0
kernel-3.10.0-327.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7_2.18.ppc64le
SLOF-20150313-5.gitc89b0df.el7.noarch
libvirt-1.2.17-13.el7_2.5.ppc64le

Host 2:(RHEL7.3)
OS: RHEL-7.3-20160707.2
kernel-3.10.0-461.el7.ppc64le
qemu-kvm-rhev-2.6.0-11.el7.ppc64le
SLOF-20160223-4.gitdbbfda4.el7.noarch
libvirt-2.0.0-1.el7.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Setup NFS and start guest
2. virsh migrate avocado-vt-vm1 --live --verbose --unsafe qemu+ssh://10.19.112.45:22/system
root@10.19.112.45's password: 
Migration: [100 %]error: internal error: qemu unexpectedly closed the monitor: 2016-07-18T09:34:03.327713Z qemu-kvm: Unknown savevm section or instance 'pci@800000020000000:00.0/ohci' 0
2016-07-18T09:34:03.328136Z qemu-kvm: load of migration failed: Invalid argument
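
The quoted id in the message above can be picked apart when triaging similar failures. The following is a minimal sketch, not part of any tool mentioned in this report: the function name `parse_savevm_error` is hypothetical, and reading the id as "device path, slash, section name" is an assumption made for illustration.

```python
import re

# Pattern for the qemu-kvm message quoted above. Treating the quoted id as
# "<device path>/<section name>" is an assumption made for illustration.
SAVEVM_ERROR = re.compile(
    r"Unknown savevm section or instance '(?P<idstr>[^']+)' (?P<instance>\d+)"
)


def parse_savevm_error(line):
    """Return (device_path, section, instance_id), or None if no match."""
    match = SAVEVM_ERROR.search(line)
    if match is None:
        return None
    # Split on the last slash: everything before it is taken as the path.
    device_path, _, section = match.group("idstr").rpartition("/")
    return device_path, section, int(match.group("instance"))


log_line = (
    "2016-07-18T09:34:03.327713Z qemu-kvm: Unknown savevm section or "
    "instance 'pci@800000020000000:00.0/ohci' 0"
)
print(parse_savevm_error(log_line))
```

Under that assumption, the path part ends in `00.0`, which may be worth comparing against the PCI address given for the USB controller in the guest XML.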


Actual results:
See above.
The guest is still running on the local host. No guest is migrated to the remote host.

Expected results:
Migration succeeds.

Additional info:

qemu command line on local host after starting guest:

qemu command line on remote host after migration:

See attachment qemu_command_line.log

Comment 2 Jaroslav Suchanek 2016-07-19 13:00:44 UTC
Adding Jirka to CC.

Comment 3 Andrea Bolognani 2016-07-19 14:18:51 UTC
Fix proposed upstream:

  https://www.redhat.com/archives/libvir-list/2016-July/msg00727.html

Comment 4 Jaroslav Suchanek 2016-08-03 12:46:11 UTC
Michal, can you please comment on how RHEV uses the USB controller model? Is it specified in the guest configuration? Left blank? Can you please estimate what the impact of this bz on RHEV could be? Thanks.

Comment 5 Michal Skrivanek 2016-08-03 13:43:05 UTC
(In reply to Jaroslav Suchanek from comment #4)
> Michal, can you please comment on how RHEV uses the USB controller model? Is
> it specified in the guest configuration? Left blank? Can you please estimate
> what the impact of this bz on RHEV could be? Thanks.

We let libvirt pick a model.

Comment 6 Andrea Bolognani 2016-08-03 16:21:36 UTC
So, just to confirm, the XML for a RHEV guest looks like

  <controller type='usb' index='0'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  </controller>

with no 'model' attribute for the <controller> element,
right?

Comment 7 Michal Skrivanek 2016-08-04 08:41:53 UTC
right

Comment 8 Andrea Bolognani 2016-08-15 09:05:04 UTC
After discussing this upstream[1], we have concluded that
making live migration between RHEL 7.2 and 7.3 work when the
guest is using the default USB controller is not possible
without reintroducing a broken behavior. Closing as CANTFIX.

As a workaround, it is possible to edit the guest XML and
add a 'model' attribute with value 'pci-ohci' to the relevant
<controller> element, so that it looks like

  <controller type='usb' model='pci-ohci' index='0'>
    <address type='pci' domain='0x0000'
             bus='0x00' slot='0x03' function='0x0'/>
  </controller>

The PCI address might of course be different.

After a full power cycle (shutdown followed by start), the
guest will be live-migratable to RHEL 7.3.

Please note that this will cause a change in guest ABI,
because the USB controller will now be at e.g. PCI address
00:03.0 instead of 00:00.0.


[1] https://www.redhat.com/archives/libvir-list/2016-July/msg00727.html
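
The XML edit described in this comment can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `add_usb_model` and the trimmed-down example domain are hypothetical, and the sketch only rewrites XML text. It does not talk to libvirt.

```python
import xml.etree.ElementTree as ET


def add_usb_model(domain_xml, model="pci-ohci"):
    """Set `model` on every USB <controller> that lacks one; return new XML."""
    root = ET.fromstring(domain_xml)
    for controller in root.iter("controller"):
        # Controllers that already name a model are left untouched.
        if controller.get("type") == "usb" and controller.get("model") is None:
            controller.set("model", model)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed guest definition for illustration only.
example = """
<domain type='kvm'>
  <devices>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
  </devices>
</domain>
""".strip()

print(add_usb_model(example))
```

The result would still have to be applied to the guest definition (for example through `virsh edit`) and followed by the full power cycle described above. The guest ABI change noted in this comment applies to the scripted edit as well.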

Comment 9 Dr. David Alan Gilbert 2016-08-24 09:10:43 UTC
Andrea: Wouldn't applying that fix downstream only solve the problem?

Comment 10 Jaroslav Suchanek 2016-08-24 16:14:43 UTC
*** Bug 1368972 has been marked as a duplicate of this bug. ***

Comment 11 Andrea Bolognani 2016-08-24 16:30:21 UTC
(In reply to Dr. David Alan Gilbert from comment #9)
> Andrea: Wouldn't applying that fix downstream only solve the problem?

It would. But

  1) we would have to keep that patch around forever, because
     reverting it at any time would again break migration, and
     it's clear that getting it upstream is just not happening

  2) most importantly, doing so would *undo* the fix that
     allows users to pick whatever PCI address they want for
     the USB controller: once again, the address would be
     ignored by libvirt

Comment 12 Dr. David Alan Gilbert 2016-08-25 10:26:11 UTC
(In reply to Andrea Bolognani from comment #11)
> (In reply to Dr. David Alan Gilbert from comment #9)
> > Andrea: Wouldn't applying that fix downstream only solve the problem?
> 
> It would. But
> 
>   1) we would have to keep that patch around forever, because
>      reverting it at any time would again break migration, and
>      it's clear that getting it upstream is just not happening

Yep, we've got a bunch of those in qemu - it's not pretty, but it does mean that the users can keep their migration compatibility.  We can drop old ones when we no longer support migration from old VMs.

>   2) most importantly, doing so would *undo* the fix that
>      allows users to pick whatever PCI address they want for
>      the USB controller: once again, the address would be
>      ignored by libvirt

That is the nice thing about machine types at the qemu level; we can tie broken things to machine type versions and say that something is only fixed on new machine types.

Comment 13 Andrea Bolognani 2016-09-02 16:38:31 UTC
> > (In reply to Dr. David Alan Gilbert from comment #9)
> > > Andrea: Wouldn't applying that fix downstream only solve the problem?
> > 
> > It would. But
> > 
> >   1) we would have to keep that patch around forever, because
> >      reverting it at any time would again break migration, and
> >      it's clear that getting it upstream is just not happening
> 
> Yep, we've got a bunch of those in qemu - it's not pretty, but it does mean
> that the users can keep their migration compatibility.  We can drop old ones
> when we no longer support migration from old VMs.
> 
> >   2) most importantly, doing so would *undo* the fix that
> >      allows users to pick whatever PCI address they want for
> >      the USB controller: once again, the address would be
> >      ignored by libvirt
> 
> That is the nice thing about machine types at the qemu level; we can tie
> broken things to machine type versions and say that something is only fixed
> on new machine types.

We don't usually key stuff off machine type versions in
upstream libvirt, because it's simply impossible to implement
in a way that works reliably for both upstream *and*
downstream versioned machine types.

That said, I think it's okay to perform such a check in a
downstream-only patch. I've posted an implementation of the
approach you suggested, that doesn't involve reverting the
fix for Bug 1297020, for downstream review.

Comment 15 Dan Zheng 2016-09-06 10:18:07 UTC
Tested the 6 scenarios below, and all PASS. No product bug was found.
The 7.3->7.2 scenarios will be reported later, once they are done.

1. 7.2->7.3, ppc64le, guest os 7.2, without model
2. 7.2->7.3, ppc64,  guest os 7.2, without model
3. 7.2->7.3, ppc64,  guest os 6.8, without model
4. 7.2->7.3, ppc64le,  guest os 7.2, with model
5. 7.2->7.3, ppc64,  guest os 7.2, with model
6. 7.2->7.3, ppc64,  guest os 6.8, with model

Details:

Case 1: 7.2->7.3, ppc64le, guest os 7.2, without model
"21:36:16 (1/12) virsh.migrate_vm.positive_testing.live_migration.pause_vm Result: PASS 74.41 s
21:37:31 (2/12) virsh.migrate_vm.positive_testing.live_migration.cpuset Result: PASS 100.50 s
21:39:13 (3/12) virsh.migrate_vm.positive_testing.live_migration.with_hugepages Result: FAIL 119.79 s
21:41:13 (4/12) virsh.migrate_vm.positive_testing.p2p_migration.listen_address.with_tcp Result: PASS 75.83 s
21:42:30 (5/12) virsh.migrate_vm.positive_testing.migration_with_ipv6.with_tls Result: PASS 102.56 s
21:44:13 (6/12) virsh.migrate_vm.positive_testing.migration_with_devices.attach_virtual_nic Result: PASS 76.87 s
21:45:30 (7/12) virsh.migrate_vm.positive_testing.cross_rhel_platform_migration.with_io_throttling.total_bytes_sec Result: PASS 73.22 s
21:46:44 (8/12) virsh.migrate_vm.positive_testing.live_storage_migration.backing_file_with_copy_storage_inc Result: SKIP 68.16 s
21:47:53 (9/12) virsh.migrate_vm.negative_testing.live_migration.noexist_xml Result: PASS 60.88 s
21:48:55 (10/12) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 72.91 s
21:50:09 (11/12) virsh.migrate_vm.negative_testing.p2p_migration.unreachable_destenation.with_tcp Result: PASS 187.12 s
21:53:16 (12/12) virsh.migrate_vm.negative_testing.p2p_migration.invalid_listen_address.with_ssh Result: PASS 65.86 s
"

Case 2: 7.2->7.3, ppc64, 7.2, without model

"22:33:03 (1/7) virsh.migrate_vm.positive_testing.p2p_migration.basic.with_tls Result: PASS 100.06 s
22:34:44 (2/7) virsh.migrate_vm.positive_testing.cross_rhel_platform_migration.with_watchdog.i6300esb Result: PASS 98.84 s
22:36:23 (3/7) virsh.migrate_vm.negative_testing.live_migration.stop_libvirtd_remotely Result: PASS 66.73 s
22:37:31 (4/7) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 70.38 s
22:38:42 (5/7) virsh.migrate_vm.negative_testing.live_migration.cancel_migration Result: PASS 65.67 s
22:39:48 (6/7) virsh.migrate_vm.negative_testing.p2p_migration.invalid_listen_address.with_tls Result: PASS 54.40 s
22:40:43 (7/7) virsh.migrate_vm.negative_testing.rdma_migration.no_rdma_env_rdma_pin_all Result: PASS 43.59 s
"

Case 3: 7.2->7.3, ppc64, 6.8, without model

"00:59:12 (1/6) virsh.migrate_vm.positive_testing.live_migration.track_statistics Result: FAIL 84.03 s
01:00:37 (2/6) virsh.migrate_vm.positive_testing.p2p_migration.with_keepalive_protocol.default_conf_less_than_keepalive_time Result: PASS 661.27 s
01:11:39 (3/6) virsh.migrate_vm.positive_testing.tunnelled_migration.basic.with_ssh Result: PASS 65.04 s
01:12:45 (4/6) virsh.migrate_vm.positive_testing.tunnelled_migration.basic.with_tls Result: PASS 86.59 s
01:14:12 (5/6) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 67.75 s
01:15:20 (6/6) virsh.migrate_vm.negative_testing.live_storage_migration.no_create_target_image.simple Result: PASS 37.20 s
"

Case 4: 7.2->7.3, ppc64le, 7.2, with model
"01:30:16 (1/5) virsh.migrate_vm.positive_testing.live_migration.listen_address Result: PASS 72.19 s
01:31:29 (2/5) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 70.76 s
01:32:41 (3/5) virsh.migrate_vm.negative_testing.live_migration.restart_local_libvirtd Result: PASS 199.96 s
01:36:01 (4/5) virsh.migrate_vm.negative_testing.p2p_migration.unreachable_destenation.with_ssh Result: PASS 184.35 s
01:39:07 (5/5) virsh.migrate_vm.negative_testing.live_storage_migration.mutually_exclusive_options Result: PASS 49.03 s
"

Case 5: 7.2->7.3, ppc64, 7.2, with model

"02:13:21 (1/4) virsh.migrate_vm.positive_testing.live_migration.timeout Result: PASS 79.19 s
02:14:40 (2/4) virsh.migrate_vm.negative_testing.live_migration.unprivileged_user Result: PASS 60.97 s
02:15:42 (3/4) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 70.87 s
02:16:54 (4/4) virsh.migrate_vm.negative_testing.rdma_migration.no_rdma_env_turn_off_rdma_pin_all Result: PASS 42.04 s
"

Case 6: 7.2->7.3, ppc64, 6.8, with model
"02:25:48 (1/5) virsh.migrate_vm.positive_testing.live_migration.reboot_vm Result: PASS 98.78 s
02:27:27 (2/5) virsh.migrate_vm.positive_testing.p2p_migration.listen_address.with_ssh Result: PASS 64.10 s
02:28:32 (3/5) virsh.migrate_vm.positive_testing.tunnelled_migration.basic.with_tcp Result: PASS 65.57 s
02:29:39 (4/5) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 63.54 s
02:30:43 (5/5) virsh.migrate_vm.negative_testing.tunnelled_migration.restart_local_libvirtd Result: PASS 164.42 s
"

Comment 16 Dan Zheng 2016-09-07 07:30:27 UTC
The tests in comment 15 and comment 16 use the environment below.

[7.2.z]
OS tree: RHEL7.2-20151030.0 + updated with z repo
kernel-3.10.0-327.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7_2.21.ppc64le
SLOF-20150313-5.gitc89b0df.el7.noarch 
libvirt-1.2.17-13.el7_2.5.ppc64le

[7.3]
Host 2:(RHEL7.3)
OS: RHEL-7.3-20160901.1
kernel-3.10.0-495.el7.ppc64le
qemu-kvm-rhev-2.6.0-22.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
libvirt-2.0.0-6.el7.abologna.bz1357468.ppc64le


Tested the remaining 6 scenarios.

7. 7.3->7.2, ppc64le, guest os 7.2, without model       FAIL
8. 7.3->7.2, ppc64,  guest os 7.2, without model        FAIL
9. 7.3->7.2, ppc64,  guest os 6.8, without model        FAIL
10. 7.3->7.2, ppc64le,  guest os 7.2, with model         PASS
11. 7.3->7.2, ppc64,  guest os 7.2, with model           PASS
12. 7.3->7.2, ppc64,  guest os 6.8, with model           PASS

Details:

Cases 7-9: FAIL
With a USB controller that has no model set, the VM cannot start.

# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: process exited while connecting to monitor: 2016-09-07T07:15:42.174828Z qemu-kvm: -device usb-kbd,id=input0,bus=usb.0,port=1: Bus 'usb.0' not found

See the guest XML in attachment.

The VM migrated from 7.2 to 7.3 uses '-usb' on the qemu process command line.

# ps -ef|grep qemu
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x3 ***-usb***-drive file=/usr/share/avo



Case 10:  7.3->7.2, ppc64le,  guest os 7.2, with model  
02:20:33 (1/4) virsh.migrate_vm.positive_testing.p2p_migration.basic.with_ssh Result: PASS 39.01 s
02:21:14 (2/4) virsh.migrate_vm.positive_testing.cross_rhel_platform_migration.with_io_throttling.read_bytes_sec Result: PASS 38.62 s
02:21:54 (3/4) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 42.37 s
02:22:38 (4/4) virsh.migrate_vm.negative_testing.live_storage_migration.no_create_target_image.basic Result: PASS 108.01 s

Case 11:  7.3->7.2, ppc64,  guest os 7.2, with model   
02:38:11 (1/4) virsh.migrate_vm.positive_testing.p2p_migration.listen_address.with_tls Result: PASS 62.68 s
02:39:16 (2/4) virsh.migrate_vm.positive_testing.cross_rhel_platform_migration.with_io_throttling.total_iops_sec Result: PASS 38.92 s
02:39:56 (3/4) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 41.76 s
02:40:39 (4/4) virsh.migrate_vm.negative_testing.rdma_migration.no_rdma_env Result: PASS 14.34 s


Case 12:   7.3->7.2, ppc64,  guest os 6.8, with model      
02:53:14 (1/3) virsh.migrate_vm.positive_testing.live_migration.iscsi.ipv6 Result: PASS 40.22 s
02:53:55 (2/3) virsh.migrate_vm.positive_testing.migration_with_devices.attach_virtual_disk Result: PASS 117.70 s
02:55:54 (3/3) virsh.migrate_vm.negative_testing.live_migration.abort_job Result: PASS 52.06 s

Comment 17 Dan Zheng 2016-09-07 07:33:46 UTC
Additional information on the failure of cases 7-9:
The VM cannot start on a RHEL 7.3 machine when no USB controller model is configured.
See the attachment for the guest XML.

Comment 18 Dan Zheng 2016-09-07 07:39:09 UTC
Created attachment 1198542 [details]
guest can not start without usb controller model in libvirt-2.0.0-6.el7.abologna.bz1357468.ppc64le

Comment 20 Andrea Bolognani 2016-09-07 15:57:16 UTC
The proposed approach didn't survive a round of testing,
so I'm moving the bug back to CLOSED CANTFIX.