Bug 1861363 - [OSP16.1] Live migration is failing in real time deployments
Summary: [OSP16.1] Live migration is failing in real time deployments
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Stephen Finucane
QA Contact: James Parker
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-28 12:10 UTC by Vadim Khitrin
Modified: 2020-08-27 15:21 UTC
CC: 15 users

Fixed In Version: openstack-nova-20.3.1-0.20200626213434.38ee1f3.el8ost
Doc Type: Known Issue
Doc Text:
OSP 16.0 introduced full support for live migration of pinned instances. Due to a bug in this feature, instances with a real-time CPU policy and more than one real-time CPU cannot migrate successfully. As a result, live migration of real-time instances is not possible. There is currently no workaround.
Clone Of:
Environment:
Last Closed: 2020-08-27 15:21:40 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
Launchpad 1889257 (last updated 2020-07-28 14:51:51 UTC)
OpenStack gerrit 743568: MERGED, "Handle multiple 'vcpusched' elements during live migrate" (last updated 2020-12-04 16:58:07 UTC)
Red Hat Product Errata RHBA-2020:3572 (last updated 2020-08-27 15:21:43 UTC)

Description Vadim Khitrin 2020-07-28 12:10:51 UTC
Description of problem:
Overcloud deployments using the real-time kernel fail during live migration:
2020-07-24 11:38:11.848 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] About to invoke the migrate API _live_migration_operation /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8779
2020-07-24 11:38:12.056 8 ERROR nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Live Migration failure: vcpussched attributes 'vcpus' must not overlap: libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap
2020-07-24 11:38:12.058 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9135
2020-07-24 11:38:12.332 8 DEBUG nova.virt.libvirt.migration [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:405
2020-07-24 11:38:12.332 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8949
2020-07-24 11:38:12.332 8 ERROR nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Migration operation has aborted

This behavior does not occur in regular, non-real-time deployments.

Version-Release number of selected component (if applicable):
Encountered in composes: 'RHOS-16.1-RHEL-8-20200723.n.0' and 'RHOS-16.1-RHEL-8-20200701.n.0'
Real-time kernel: 4.18.0-193.rt13.51.el8.x86_64

Nova-related packages may differ between the composes.

How reproducible:
Always.

Steps to Reproduce:
1. Modify the overcloud image to use the real-time kernel, as described in the documentation.
2. Deploy overcloud.
3. Create resources and attempt live migration.

Actual results:
Live migration fails.

Expected results:
Live migration succeeds.

Additional info:
Will attach logs in comment.

Comment 2 Stephen Finucane 2020-07-28 13:41:25 UTC
I can reproduce this on master with a basic DevStack deployment, using the following:

  $ openstack flavor create --ram 1024 --disk 0 --vcpus 4 \
    --property 'hw:cpu_policy=dedicated' \
    --property 'hw:cpu_realtime=yes' \
    --property 'hw:cpu_realtime_mask=^0-1' \
    realtime

  $ openstack server create --os-compute-api-version=2.latest \
    --flavor realtime --image cirros-0.5.1-x86_64-disk --nic none \
    --boot-from-volume 1 --wait \
    test.realtime

  $ openstack server migrate --live-migration test.realtime
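
For context, 'hw:cpu_realtime_mask=^0-1' excludes vCPUs 0 and 1 from the realtime set, so vCPUs 2 and 3 get the fifo scheduler. A minimal sketch of those mask semantics (the helper below is illustrative only, not nova's actual parser):

  # Illustrative parsing of the exclusion mask '^0-1' from the flavor
  # above; this helper is a stand-in, not nova's actual parsing code.
  def realtime_vcpus(total_vcpus, realtime_mask):
      vcpus = set(range(total_vcpus))
      excluded = set()
      for part in realtime_mask.lstrip('^').split(','):
          if '-' in part:
              start, end = part.split('-')
              excluded.update(range(int(start), int(end) + 1))
          else:
              excluded.add(int(part))
      return vcpus - excluded

  assert realtime_vcpus(4, '^0-1') == {2, 3}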

Looking at the logs, I see the same failure:

  Traceback (most recent call last):
    File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
      timer()
    File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/event.py", line 175, in _do_send
      waiter.switch(result)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/greenthread.py", line 221, in main
      result = function(*args, **kwargs)
    File "/opt/stack/nova/nova/utils.py", line 670, in context_wrapper
      return func(*args, **kwargs)
    File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8966, in _live_migration_operation
      #     is still ongoing, or failed
    File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
      raise value
    File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8959, in _live_migration_operation
      #  2. src==running, dst==paused
    File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 658, in migrate
      destination, params=params, flags=flags)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit
      result = proxy_call(self._autowrap, f, *args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call
      rv = execute(f, *args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute
      six.reraise(c, e, tb)
    File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
      raise value
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker
      rv = meth(*args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 1745, in migrateToURI3
      if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
  libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap

I suspect this is a flaw in the support for live migration of pinned instances first introduced in 16.0. Continuing to investigate.

Comment 3 Stephen Finucane 2020-07-28 14:02:44 UTC
Yes, this is a flaw in the live migration code, and you can see it in the logs. From my own reproducer, here is the XML before and after it is updated for the destination host. Before:

  DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <emulatorpin cpuset="0-1"/>
      <vcpusched vcpus="2" scheduler="fifo" priority="1"/>
      <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:97}}

After:

  DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <emulatorpin cpuset="0-1"/>
      <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
      <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:131}}

The issue is the 'vcpusched' elements. We assume there is only one such element when updating the XML for the destination [1]. We have to figure out why there are multiple elements and how best to handle this (likely by deleting and recreating everything).

[1] https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/migration.py#L152-L155
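
For the record, a rough sketch of the delete-and-recreate approach (illustrative only, not the eventual patch; 'new_rt_vcpus' is a made-up name for the recalculated realtime vCPU set on the destination):

  from lxml import etree

  def _update_vcpusched(xml_doc, new_rt_vcpus, scheduler='fifo', priority=1):
      """Replace every 'vcpusched' element with a single rebuilt one.

      libvirt may have split our single element into one per vCPU, so
      editing the first element in place leaves stale siblings behind
      and produces overlapping 'vcpus' sets.
      """
      cputune = xml_doc.find('./cputune')
      # Drop all existing elements rather than editing the first one
      for sched in cputune.findall('./vcpusched'):
          cputune.remove(sched)
      # Recreate a single element covering the full realtime vCPU set
      sched = etree.SubElement(cputune, 'vcpusched')
      sched.set('vcpus', ','.join(str(c) for c in sorted(new_rt_vcpus)))
      sched.set('scheduler', scheduler)
      sched.set('priority', str(priority))
      return xml_doc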

Comment 5 Stephen Finucane 2020-07-28 14:24:49 UTC
And the reason we didn't spot this is that libvirt rewrites the XML on us. This is what nova provides to libvirt at boot:

  DEBUG nova.virt.libvirt.driver [...] [instance: ...] End _get_guest_xml xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <emulatorpin cpuset="0-1"/>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _get_guest_xml /opt/stack/nova/nova/virt/libvirt/driver.py:6331}}

but that has changed by the time we get around to recalculating things.
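
This is easy to confirm against a running realtime guest with the libvirt Python bindings (the domain name below is a placeholder for whatever your instance is called):

  import libvirt
  from lxml import etree

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('instance-00000001')  # placeholder domain name
  xml_doc = etree.fromstring(dom.XMLDesc(0))
  for sched in xml_doc.findall('./cputune/vcpusched'):
      print(sched.get('vcpus'), sched.get('scheduler'), sched.get('priority'))
  # Prints one line per realtime vCPU (e.g. '2 fifo 1' then '3 fifo 1'),
  # even though nova supplied a single element with vcpus="2-3".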

Comment 6 Vadim Khitrin 2020-07-28 15:14:44 UTC
Hey Stephen,
Thanks a lot for the in-depth debugging and explanation!

May I ask why we did not encounter this issue in non-real-time deployments?
We have a test that live migrates a pinned instance, and it works for us.

Comment 7 Stephen Finucane 2020-07-28 15:36:09 UTC
(In reply to Vadim Khitrin from comment #6)
> Hey Stephen,
> Thanks a lot for the in-depth debugging and explanation!
> 
> May I ask why we did not encounter this issue in non-real-time deployments?
> We have a test that live migrates a pinned instance, and it works for us.

The issue lies with the '<vcpusched>' element, which is only configured for realtime instances. It will never be generated for non-realtime instances, meaning we don't see this issue there.
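
Put differently, only flavors (or images) that request the realtime policy ever get the element. Roughly (an illustrative check only; nova's real extra-spec handling lives elsewhere):

  def is_realtime(extra_specs):
      # <vcpusched> is only generated for realtime instances, so pinned
      # but non-realtime guests never hit the overlap error.
      return extra_specs.get('hw:cpu_realtime', 'no').lower() in ('yes', 'true', '1')

  assert is_realtime({'hw:cpu_policy': 'dedicated'}) is False
  assert is_realtime({'hw:cpu_policy': 'dedicated', 'hw:cpu_realtime': 'yes'}) is True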

Comment 13 errata-xmlrpc 2020-08-27 15:21:40 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (openstack-nova bug fix advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3572

