Description of problem:

Overcloud deployments using the real time kernel fail during live migration:

2020-07-24 11:38:11.848 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] About to invoke the migrate API _live_migration_operation /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8779
2020-07-24 11:38:12.056 8 ERROR nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Live Migration failure: vcpussched attributes 'vcpus' must not overlap: libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap
2020-07-24 11:38:12.058 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9135
2020-07-24 11:38:12.332 8 DEBUG nova.virt.libvirt.migration [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:405
2020-07-24 11:38:12.332 8 DEBUG nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8949
2020-07-24 11:38:12.332 8 ERROR nova.virt.libvirt.driver [-] [instance: f9db64ac-8e4f-4cf7-b9c5-9ab8629466bc] Migration operation has aborted

This behavior does not occur in regular, non real time deployments.

Version-Release number of selected component (if applicable):
Encountered in composes 'RHOS-16.1-RHEL-8-20200723.n.0' and 'RHOS-16.1-RHEL-8-20200701.n.0'.
Real time kernel: 4.18.0-193.rt13.51.el8.x86_64
Nova-related packages might differ between the composes.

How reproducible:
Always.

Steps to Reproduce:
1. Modify the overcloud image to use the real time kernel according to the documentation.
2. Deploy the overcloud.
3. Create resources and attempt a live migration.

Actual results:
Live migration fails.

Expected results:
Live migration succeeds.

Additional info:
Will attach logs in a comment.
I can reproduce this on master with a basic DevStack deployment, using the following:

$ openstack flavor create --ram 1024 --disk 0 --vcpu 4 \
    --property 'hw:cpu_policy=dedicated' \
    --property 'hw:cpu_realtime=yes' \
    --property 'hw:cpu_realtime_mask=^0-1' \
    realtime
$ openstack server create --os-compute-api-version=2.latest \
    --flavor realtime --image cirros-0.5.1-x86_64-disk --nic none \
    --boot-from-volume 1 --wait \
    test.realtime
$ openstack server migrate --live-migration test.realtime

Looking at the logs, I see the same failure:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/event.py", line 175, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/utils.py", line 670, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8966, in _live_migration_operation
    # is still ongoing, or failed
  File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8959, in _live_migration_operation
    # 2. src==running, dst==paused
  File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 658, in migrate
    destination, params=params, flags=flags)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 1745, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap

I suspect this is a flaw in the live migration support for pinned instances first introduced in 16.0. Continuing investigation.
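For anyone unfamiliar with the flavor above: 'hw:cpu_realtime_mask=^0-1' excludes vCPUs 0-1 from the realtime set, so with 4 vCPUs it is vCPUs 2 and 3 that get the fifo scheduler in the guest XML. A rough illustration of the mask semantics (simplified sketch only; nova's actual parsing lives in nova/virt/hardware.py and supports richer mask syntax):

# Illustrative only: how a realtime mask like "^0-1" maps vCPUs to the
# realtime set for a 4-vCPU flavor. This mirrors the semantics, not
# nova's real implementation.

def realtime_vcpus(mask: str, vcpu_count: int) -> set:
    vcpus = set(range(vcpu_count))
    if mask.startswith("^"):
        # "^0-1" means: all vCPUs are realtime except 0 through 1
        start, end = mask[1:].split("-")
        return vcpus - set(range(int(start), int(end) + 1))
    return vcpus

# hw:cpu_realtime_mask=^0-1 with --vcpu 4 => vCPUs 2 and 3 are realtime,
# which is why the guest XML below carries <vcpusched> for vcpus 2-3.
print(realtime_vcpus("^0-1", 4))  # {2, 3}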
Yes, this is a flaw in the live migration code. You can see this in the logs. From my own reproducer, we see the XML before and after it is updated for the destination host.

Before:

DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
  ...
  <cputune>
    <shares>4096</shares>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="4"/>
    <vcpupin vcpu="3" cpuset="5"/>
    <emulatorpin cpuset="0-1"/>
    <vcpusched vcpus="2" scheduler="fifo" priority="1"/>
    <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
  </cputune>
  ...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:97}}

After:

DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
  ...
  <cputune>
    <shares>4096</shares>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="4"/>
    <vcpupin vcpu="3" cpuset="5"/>
    <emulatorpin cpuset="0-1"/>
    <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
    <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
  </cputune>
  ...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:131}}

The issue is the '<vcpusched>' elements. We're assuming there is only one of these elements when updating the XML for the destination [1]. We have to figure out why there are multiple elements and how best to handle this (likely by deleting and recreating everything).

[1] https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/migration.py#L152-L155
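To make the "deleting and recreating everything" idea concrete, here's a minimal sketch of that approach (a hypothetical helper, not the actual nova patch), using the standard library's xml.etree: strip every existing <vcpusched> element and rebuild one per realtime vCPU, so no ranges can overlap regardless of how libvirt normalized the XML:

# Sketch of the delete-and-recreate approach: instead of editing the
# first <vcpusched> in place, remove them all and emit one element per
# realtime vCPU. Single-vCPU elements can never overlap.
import xml.etree.ElementTree as ET

def rebuild_vcpusched(xml_str, realtime_vcpus, scheduler="fifo", priority="1"):
    root = ET.fromstring(xml_str)
    cputune = root.find("./cputune")
    # Remove every existing <vcpusched>, however many libvirt handed back.
    for elem in cputune.findall("vcpusched"):
        cputune.remove(elem)
    # Recreate one per realtime vCPU from the (new) destination pinning.
    for vcpu in sorted(realtime_vcpus):
        ET.SubElement(cputune, "vcpusched", vcpus=str(vcpu),
                      scheduler=scheduler, priority=priority)
    return ET.tostring(root, encoding="unicode")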
And the reason we didn't spot this is because libvirt is rewriting the XML on us. This is what nova provides to libvirt at boot:

DEBUG nova.virt.libvirt.driver [...] [instance: ...] End _get_guest_xml xml=<domain type="kvm">
  ...
  <cputune>
    <shares>4096</shares>
    <emulatorpin cpuset="0-1"/>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="4"/>
    <vcpupin vcpu="3" cpuset="5"/>
    <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
  </cputune>
  ...
</domain>
{{(pid=12600) _get_guest_xml /opt/stack/nova/nova/virt/libvirt/driver.py:6331}}

but that has changed by the time we get to recalculating things: the single <vcpusched vcpus="2-3"> element has been split into one element per vCPU.
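To illustrate why this goes unnoticed at boot, here is a small standalone check (demonstration code, not nova's) showing that libvirt's per-vCPU form and nova's ranged form describe the same vCPU set. The semantics are unchanged, so the guest runs fine; the breakage only surfaces when the migration code assumes the single-element form and edits the first one:

# Standalone demonstration: nova writes one ranged <vcpusched>, libvirt
# hands back one element per vCPU. Both cover the same vCPU set.
import xml.etree.ElementTree as ET

nova_form = '<cputune><vcpusched vcpus="2-3" scheduler="fifo" priority="1"/></cputune>'
libvirt_form = ('<cputune>'
                '<vcpusched vcpus="2" scheduler="fifo" priority="1"/>'
                '<vcpusched vcpus="3" scheduler="fifo" priority="1"/>'
                '</cputune>')

def vcpu_set(xml_str):
    covered = set()
    for elem in ET.fromstring(xml_str).findall("vcpusched"):
        spec = elem.get("vcpus")
        if "-" in spec:
            start, end = spec.split("-")
            covered.update(range(int(start), int(end) + 1))
        else:
            covered.add(int(spec))
    return covered

assert vcpu_set(nova_form) == vcpu_set(libvirt_form) == {2, 3}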
Hey Stephen,

Thanks a lot for the in-depth debugging and explanation!

May I ask why we did not encounter this issue in non real time deployments? We have a test where we attempt to live migrate a pinned instance, and it works for us.
(In reply to Vadim Khitrin from comment #6)
> Hey Stephen,
>
> Thanks a lot for the in-depth debugging and explanation!
>
> May I ask why we did not encounter this issue in non real time deployments?
> We have a test where we attempt to live migrate a pinned instance, and it
> works for us.

The issue lies with the '<vcpusched>' element, which is only configured for realtime instances. It is never generated for non-realtime instances, pinned or not, which is why we don't see this issue there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (openstack-nova bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3572