2301525 – Sometimes instances with PCI passthrough are created with more PCI devices than requested

Keywords:
Status:	CLOSED DUPLICATE of bug 2301551
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	16.2 (Train)
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	OSP DFG:Compute
QA Contact:	OSP DFG:Compute
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-07-30 08:43 UTC by Alex Stupnikov
Modified:	2024-12-11 15:50 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-08-10 13:59:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1860555	None	None	None	2024-07-31 13:42:29 UTC
OpenStack gerrit	710848	None	NEW	Fix PCI passthrough race on reschedule (refresh)	2024-08-06 15:27:38 UTC
Red Hat Issue Tracker	OSP-32584	None	None	None	2024-07-30 08:45:10 UTC
Red Hat Knowledge Base (Solution)	7081142	None	None	None	2024-07-31 13:46:22 UTC

Description Alex Stupnikov 2024-07-30 08:43:45 UTC

Description of problem:
Two instances in the customer's deployment each have two PCI devices (GPUs) attached instead of one, which is the number prescribed by the flavor. This causes scheduling problems and looks similar to https://bugs.launchpad.net/nova/+bug/1860555.

We are looking for a workaround for this problem, ideally as soon as possible. The compute nodes in the customer's deployment are quite packed, so a solution that does not require migration is preferred.

Information about collected data will be provided privately.


Version-Release number of selected component (if applicable): RHOSP 16.2, but newer releases are affected as well.


How reproducible: In the customer's deployment, this was likely triggered by failed host evacuations, when multiple VMs were scheduled on the same compute node and then rescheduled. The upstream bug describes different reproduction steps that may be better suited to a lab environment.


Actual results:
Sometimes, instances may fail during creation, or may be created with more PCI devices than requested.


Expected results:
The instances are created successfully, and each has the expected number of PCI devices attached.
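
A minimal sketch of how affected instances could be identified, under stated assumptions: it compares the device count requested in each flavor's "pci_passthrough:alias" extra spec (a comma-separated list of "<alias>:<count>" entries) against a list of (instance UUID, PCI address) pairs. The function names, the input shapes, and the sample data are hypothetical; how the allocation pairs are collected from a deployment is left open and is not part of this report.

```python
from collections import Counter


def requested_pci_count(extra_specs):
    """Sum the counts in a flavor's "pci_passthrough:alias" extra spec.

    Assumes well-formed "<alias>:<count>" entries separated by commas.
    """
    spec = extra_specs.get("pci_passthrough:alias", "")
    total = 0
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon: everything after it is the count.
        _alias, _, count = entry.rpartition(":")
        total += int(count)
    return total


def find_over_allocated(instances, allocations):
    """Return {uuid: (requested, allocated)} for over-allocated instances.

    instances:   hypothetical mapping of instance UUID -> flavor extra specs
    allocations: hypothetical iterable of (instance UUID or None, PCI address)
    """
    # Count devices per instance, skipping devices not bound to any instance.
    allocated = Counter(uuid for uuid, _addr in allocations if uuid)
    result = {}
    for uuid, extra_specs in instances.items():
        requested = requested_pci_count(extra_specs)
        if allocated[uuid] > requested:
            result[uuid] = (requested, allocated[uuid])
    return result


# Illustrative data only; UUIDs and PCI addresses are made up.
instances = {
    "vm-a": {"pci_passthrough:alias": "gpu:1"},
    "vm-b": {"pci_passthrough:alias": "gpu:1"},
}
allocations = [
    ("vm-a", "0000:3b:00.0"),
    ("vm-a", "0000:af:00.0"),
    ("vm-b", "0000:d8:00.0"),
    (None, "0000:5e:00.0"),
]
over_allocated = find_over_allocated(instances, allocations)
print(over_allocated)
```

With this sample input, only "vm-a" is reported, because it holds two devices while its flavor requests one.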



Additional info: to be provided privately

Comment 4 Artom Lifshitz 2024-08-10 13:59:02 UTC


*** This bug has been marked as a duplicate of bug 2301551 ***