Bug 1222414

Summary: [RFE] Enable live migration for pinned instances
Product: Red Hat OpenStack Reporter: Itzik Brown <itbrown>
Component: openstack-nova Assignee: Artom Lifshitz <alifshit>
Status: CLOSED ERRATA QA Contact: James Parker <jparker>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.0 (Kilo) CC: acanan, aherr, akarlsso, alifshit, amodi, asimonel, assaf.eylath, brault, ccopello, cswanson, dasmith, djuran, dvd, ealcaniz, eelena, egallen, eglynn, fbaudin, fherrman, fsoppels, houyatao, itbrown, Jing.C.Zhang, jjoyce, joflynn, jraju, jschluet, kchamart, lyarwood, marjones, maydin, mbooth, michael.or, mmethot, myllynen, nlevinki, sbauza, sclewis, scohen, sgordon, sputhenp, srevivo, stephenfin, tamar.inbar-shelach, tvvcox, vaggarwa, vcojot, vromanso, weiyongjun
Target Milestone: beta Keywords: FutureFeature, Triaged
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-nova-20.0.1-0.20191025043858.390db63.el8ost Doc Type: Enhancement
Doc Text:
With this enhancement, support for live migration of instances with a NUMA topology has been added. Previously, this action was disabled by default. It could be enabled using the '[workarounds] enable_numa_live_migration' config option, but this defaulted to False because live migrating such instances resulted in them being moved to the destination host without updating any of the underlying NUMA guest-to-host mappings or the resource usage. With the new NUMA-aware live migration feature, if the instance cannot fit on the destination, the live migration will be attempted on an alternate destination if the request is set up to have alternates. If the instance can fit on the destination, the NUMA guest-to-host mappings will be re-calculated to reflect its new host, and its resource usage updated.
Story Points: ---
Clone Of:
: 1780366 (view as bug list) Environment:
Last Closed: 2020-02-06 14:37:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1281573, 1339866, 1414999, 1431627, 1442136, 1478186, 1500145, 1500557, 1595325, 1669579, 1732913, 1756916, 1769425, 1780366    
Attachments:
Description Flags
NUMALiveMigrationTest Whitebox Tempest results none
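
The behaviour summarised in the Doc Text above (try the requested destination, fall back to alternates if the instance does not fit, then recalculate the guest-to-host mappings and update usage) can be sketched roughly as follows. This is a simplified illustration only, not Nova's implementation: the function names, the flat "set of free host CPUs" model and the host names are all hypothetical, and real NUMA fitting also considers cells, memory and other constraints.

```python
from typing import Dict, List, Optional, Set, Tuple


def fit_instance(vcpus: int, free_pcpus: Set[int]) -> Optional[Dict[int, int]]:
    """Map guest vCPUs 0..vcpus-1 to free host CPUs, or None if it cannot fit.

    Hypothetical helper: picks the lowest-numbered free host CPUs.
    """
    if vcpus > len(free_pcpus):
        return None
    chosen = sorted(free_pcpus)[:vcpus]
    return dict(enumerate(chosen))


def live_migrate(
    vcpus: int,
    destination: str,
    alternates: List[str],
    free_by_host: Dict[str, Set[int]],
) -> Optional[Tuple[str, Dict[int, int]]]:
    """Try the destination first, then each alternate, in order."""
    for host in [destination, *alternates]:
        mapping = fit_instance(vcpus, free_by_host[host])
        if mapping is not None:
            # Recalculated mapping found: record the new usage on that host.
            free_by_host[host] -= set(mapping.values())
            return host, mapping
    return None  # no candidate host can hold the instance


# Hypothetical hosts: "dest" has too few free CPUs, "alt" has enough.
free = {"dest": {0, 1}, "alt": {4, 5, 6, 7}}
print(live_migrate(4, "dest", ["alt"], free))
```

The point of the sketch is that the pinning is recomputed per host rather than copied from the source, which is the part that was missing before this enhancement.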

Description Itzik Brown 2015-05-18 07:47:59 UTC
Description of problem:

When live migrating an instance from a hypervisor with more CPUs than the destination, the migration fails with the following error in the Nova log:
[instance: 44ab86db-e529-4a73-8a51-d3a59fdc90c5] Live Migration failure: Invalid value '0-5,12-17' for 'cpuset.cpus': Invalid argument
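
One plausible reading of this error is that the CPU list carried over from the source refers to CPU IDs that do not exist on the destination. The following minimal sketch expands a cpuset list and reports which IDs a smaller host would lack. The 12-CPU destination is an assumption made for illustration; the report does not state the actual CPU counts, and the helper names are hypothetical.

```python
from typing import Set


def parse_cpuset(spec: str) -> Set[int]:
    """Expand a Linux cpuset list such as '0-5,12-17' into a set of CPU IDs."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def missing_on_host(spec: str, host_cpu_count: int) -> Set[int]:
    """Return the CPU IDs in `spec` absent from a host with CPUs 0..host_cpu_count-1."""
    return {cpu for cpu in parse_cpuset(spec) if cpu >= host_cpu_count}


source_cpuset = "0-5,12-17"  # value taken from the log line above
# Hypothetical destination with 12 CPUs (IDs 0-11).
print(sorted(missing_on_host(source_cpuset, 12)))  # [12, 13, 14, 15, 16, 17]
```

If any IDs are reported as missing, writing the unchanged list to 'cpuset.cpus' on that host would be expected to be rejected.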


Version-Release number of selected component (if applicable):
python-nova-2015.1.0-3.el7ost.noarch
libvirt-1.2.8-16.el7_1.2.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up an environment where one machine has more CPUs than the other

2. Launch an instance on the machine with more CPUs
   # nova boot --flavor m1.small --image fedora --nic net-id=8ddecf6b-7dc9-4899-961b-da7ff778f2c8 vm1

3. Verify the host of the instance
   # nova show <instance id>

4. Live migrate the instance
   # nova live-migration --block-migrate 523becab-ed4d-413e-8bda-dc1b698bd1e9

5. Verify the instance stays on the source machine
   # nova show <instance id>
6. Look for errors in /var/log/nova/nova-compute.log

Actual results:
Migration fails

Expected results:


Additional info:

Comment 3 Nikola Dipanov 2015-06-26 14:34:52 UTC
This is a well-known issue upstream. There is a blueprint proposed (not approved for the Liberty release at this point, but likely to be) to fix this.

The fix is (as is described on https://review.openstack.org/#/c/193576/) very invasive and unlikely to be easily backportable.

We should probably add a release note for this saying that live migration is not supported for instances with CPU pinning (in addition, we might want to disable it outright).

If we decide to disable it, then it makes sense to keep this as a blocker and do it for GA; otherwise we should not block on this, but add a release note and clone the bug for the next release of RHOS, where it will get properly fixed (upstream in Liberty).

Comment 5 Jon Schlueter 2015-07-31 14:50:04 UTC
Moving out to A1 as it's not a regression and not a blocker for GA

Comment 7 Stephen Gordon 2015-11-26 17:10:50 UTC
(In reply to Nikola Dipanov from comment #3)
> This is a well-known issue upstream. There is a blueprint proposed (not
> approved for the Liberty release at this point, but likely to be) to fix this.
> 
> The fix is (as is described on https://review.openstack.org/#/c/193576/)
> very invasive and unlikely to be easily backportable.
> 
> We should probably add a release note for this saying that live migration is
> not supported for instances with CPU pinning (in addition, we might want to
> disable it outright).
> 
> If we decide to disable it, then it makes sense to keep this as a blocker
> and do it for GA; otherwise we should not block on this, but add a release
> note and clone the bug for the next release of RHOS, where it will get
> properly fixed (upstream in Liberty).

Based on the above, and on my understanding that this was not in fact fixed in Liberty, I am moving the flags to rhos-9.0 (Mitaka). Let me know if my interpretation is incorrect...

Comment 16 Eoghan Glynn 2017-01-09 20:15:45 UTC
The patch is well-developed, but dependent on review traction to land.

Comment 17 Stephen Gordon 2017-01-18 19:47:28 UTC
*** Bug 1319385 has been marked as a duplicate of this bug. ***

Comment 18 Stephen Gordon 2017-01-29 23:27:12 UTC
Hi Sahid,

Is there any chance of this being accepted in the rc-* phase given it's treated as a bug upstream, or should I move this to Pike?

Thanks,

Steve

Comment 19 Sahid Ferdjaoui 2017-01-30 09:03:27 UTC
(In reply to Stephen Gordon from comment #18)
> Hi Sahid,
> 
> Is there any chance of this being accepted in the rc-* phase given it's
> treated as a bug upstream, or should I move this to Pike?
> 
> Thanks,
> 
> Steve

Nothing is really moving upstream. I guess you should move it to Pike.

Comment 33 Stephen Finucane 2018-02-07 16:58:16 UTC
*** Bug 1360970 has been marked as a duplicate of this bug. ***

Comment 35 Stephen Finucane 2018-03-23 12:13:55 UTC
*** Bug 1559314 has been marked as a duplicate of this bug. ***

Comment 49 Artom Lifshitz 2019-01-16 16:54:08 UTC
*** Bug 1585068 has been marked as a duplicate of this bug. ***

Comment 50 michaelor 2019-01-29 07:33:10 UTC
Is there any estimate of which version this issue will be resolved in?

Comment 51 Artom Lifshitz 2019-05-17 13:48:02 UTC
*** Bug 1703734 has been marked as a duplicate of this bug. ***

Comment 61 Artom Lifshitz 2019-08-30 13:00:37 UTC
Feature freeze upstream is September 12th. The series is under active review, and has a decent chance of landing before then. If it lands, it'll be in the OSP16 release, but is not backportable to previous releases.

Comment 64 Artom Lifshitz 2019-10-15 16:01:43 UTC
*** Bug 1565129 has been marked as a duplicate of this bug. ***

Comment 66 Artom Lifshitz 2019-11-26 15:46:42 UTC
I'm going to set HasTestAutomation, since we have test cases in upstream whitebox [1]. We could probably add more, but what we currently have at least tests the happy path. We also have functional tests [2] up for review that cover the Nova-specific bits (rollback, rolling upgrade, etc).

[1] https://opendev.org/x/whitebox-tempest-plugin/src/branch/master/whitebox_tempest_plugin/api/compute/test_cpu_pinning.py#L419
[2] https://review.opendev.org/#/c/672595/

Comment 67 James Parker 2019-12-04 18:08:15 UTC
Created attachment 1642181 [details]
NUMALiveMigrationTest Whitebox Tempest results

Comment 70 errata-xmlrpc 2020-02-06 14:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283