Bug 1613242
| Summary: | Cold Migration fails with NUMATopologyFilter and CPU pinning | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Saravanan KR <skramaja> |
| Component: | openstack-nova | Assignee: | Artom Lifshitz <alifshit> |
| Status: | CLOSED ERRATA | QA Contact: | Yariv <yrachman> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | alifshit, asoni, atelang, berrange, bshephar, cfields, cswanson, dasmith, dmaley, eglynn, fherrman, jamsmith, jhakimra, jsisul, kchamart, lhh, lyarwood, marjones, mircea.vutcovici, mschuppe, sbauza, sgordon, skramaja, sputhenp, srevivo, stephenfin, supadhya, vromanso, yrachman |
| Target Milestone: | z10 | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-14.1.0-35.el7ost | Doc Type: | Release Note |
| Doc Text: |
This update introduces a new config option, cpu_pinning_migration_quick_fail.
The option applies only to instances with pinned CPUs. It controls whether the scheduler fails live migrations when the required CPUs are not available on the destination host.
When an instance with CPU pinning is live migrated, the upstream behavior is
to keep the CPU mapping on the destination identical to the source. This can
result in multiple instances pinned to the same host CPUs. OSP contains a
downstream workaround (absent from upstream OpenStack) that prevents live
migration if the required CPUs aren't available on the destination host.
The workaround's implementation places the same destination CPU availability
restrictions on all other operations that involve scheduling an instance on a
new host. These are cold migration, evacuation, resize and unshelve. For these
operations, the instance CPU pinning is recalculated to fit the new host,
making the restrictions unnecessary. For example, if the exact CPUs needed by
an instance are not available on any compute host, a cold migration would fail.
Without the workaround, a host would be found that can accept the instance by
recalculating its CPU pinning.
You can disable this workaround by setting cpu_pinning_migration_quick_fail to False. With the quick-fail workaround disabled, live migration with CPU pinning reverts to the
upstream behavior, but the restrictions are lifted from all other move
operations, allowing them to work correctly.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-16 17:09:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1611654, 1612788 | | |
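The quick-fail workaround described in the release note above is toggled in nova.conf. A minimal fragment might look like the following; the config section shown is an assumption (downstream-only options are often grouped under a workarounds-style section), so check the shipped package documentation for the exact location:

```ini
# nova.conf -- sketch only; the section this downstream-only option
# lives in is an assumption, verify against the packaged documentation.
[workarounds]
# Setting this to False reverts live migration with CPU pinning to the
# upstream behavior and lifts the destination CPU-availability
# restriction from cold migration, evacuation, resize and unshelve.
cpu_pinning_migration_quick_fail = False
```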
Description
Saravanan KR
2018-08-07 09:38:30 UTC
I think this occurs because of a downstream-only change that we're carrying. That change is intended to work around the fact that we don't currently rebuild an instance's XML (including pinning information) when live migrating. The patch checks that the destination has the required CPUs free and fails the migration otherwise. Clearly this filter should only do this for live migration, not cold migration, where the XML _is_ recalculated.

I don't see any clear way to differentiate between cold migration and live migration during the scheduling phase. I'm working on it to see what can be done.

I have a patch, Saravanan, can you give it a try?
```diff
--- a/nova/scheduler/filters/numa_topology_filter.py
+++ b/nova/scheduler/filters/numa_topology_filter.py
@@ -13,6 +13,7 @@
 from oslo_log import log as logging
 import six

+from nova import context
 from nova import objects
 from nova.objects import fields
 from nova.scheduler import filters
@@ -21,9 +22,6 @@ from nova.virt import hardware

 LOG = logging.getLogger(__name__)

-LOG = logging.getLogger(__name__)
-
-
 class NUMATopologyFilter(filters.BaseHostFilter):
     """Filter on requested NUMA topology."""
@@ -96,25 +94,32 @@ class NUMATopologyFilter(filters.BaseHostFilter):
             cpu_allocation_ratio=cpu_ratio,
             ram_allocation_ratio=ram_ratio)
-        # Computing Host CPUs already pinned on the compute host to
-        # compare them to the requested by instance topology.
-        pinned_cpus = sum(
-            [list(c.pinned_cpus) for c in host_topology.cells], [])
-        LOG.debug("Host already pinned CPUs %s", pinned_cpus)
-        for cell in requested_topology.cells:
-            # To have the attribute cpu_pinning_raw set the
-            # instance has already been scheduled on a compute
-            # node and so its pinning information againsts Host
-            # CPUs computed and stored to the database.
-            if (cell.obj_attr_is_set('cpu_pinning_raw')
-                    and cell.cpu_pinning_raw):
-                for vcpu, pcpu in six.iteritems(cell.cpu_pinning_raw):
-                    LOG.debug("vCPUs(%s) wants to be pinned on pCPUs(%s)",
-                              vcpu, pcpu)
-                    if pcpu in pinned_cpus:
-                        LOG.debug("Can't move instance on host, "
-                                  "requested CPU %s already pinned", pcpu)
-                        return False
+        # Ensure that it's a live-migration
+        if objects.MigrationList.get_in_progress_by_instance(
+                context.get_admin_context(),
+                spec_obj.instance_uuid,
+                "live-migration"):
+            # Computing Host CPUs already pinned on the compute host to
+            # compare them to the requested by instance topology.
+            pinned_cpus = sum(
+                [list(c.pinned_cpus) for c in host_topology.cells], [])
+            LOG.debug("Host already pinned CPUs %s", pinned_cpus)
+            for cell in requested_topology.cells:
+                # To have the attribute cpu_pinning_raw set the
+                # instance has already been scheduled on a compute
+                # node and so its pinning information againsts Host
+                # CPUs computed and stored to the database.
+                if (cell.obj_attr_is_set('cpu_pinning_raw')
+                        and cell.cpu_pinning_raw):
+                    for vcpu, pcpu in six.iteritems(cell.cpu_pinning_raw):
+                        LOG.debug("vCPUs(%s) wants to be pinned on "
+                                  "pCPUs(%s)",
+                                  vcpu, pcpu)
+                        if pcpu in pinned_cpus:
+                            LOG.debug("Can't move instance on host, "
+                                      "requested CPU %s already pinned",
+                                      pcpu)
+                            return False
         instance_topology = (hardware.numa_fit_instance_to_host(
             host_topology, requested_topology,
```
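The core of the guard in the patch above can be sketched as standalone Python (simplified, illustrative names; not the actual nova objects):

```python
# Simplified sketch of the pinned-CPU conflict check performed by the
# downstream NUMATopologyFilter workaround. Names here are illustrative,
# not the real nova object attributes.

def conflicts_with_host_pinning(host_cells, instance_cpu_pinning):
    """Return True if any pCPU the instance is pinned to is already
    pinned on the candidate destination host.

    host_cells: iterable of sets of already-pinned pCPU ids, one set
        per NUMA cell on the host.
    instance_cpu_pinning: dict mapping vCPU id -> pCPU id, i.e. the
        pinning computed when the instance was first scheduled.
    """
    # Flatten per-cell pinned pCPUs into one set for the whole host.
    pinned = set().union(*host_cells) if host_cells else set()
    # Reject the host if any requested pCPU is already taken.
    return any(pcpu in pinned for pcpu in instance_cpu_pinning.values())
```

For example, a host whose cells already pin pCPUs {0, 1} and {4, 5} rejects an instance pinned to pCPU 1 but accepts one pinned to pCPUs 2 and 3, which is exactly why a cold migration that *recalculates* pinning should not be subject to this check.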
Cold Migration works with NUMATopologyFilter with this patch.

On the latest passed_phase1 puddle (puddle-id=2018-12-04.1), cold migration tests have passed: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/nfv/view/nfv/job/DFG-nfv-10-director-hybrid-3cont-2comp-ipv4-vxlan-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/13/testReport/nfv_tempest_plugin.tests.scenario.test_nfv_basic/TestNfvBasic/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0074

*** Bug 1713530 has been marked as a duplicate of this bug. ***