This bug was initially created as a copy of Bug #2092979.

I am copying this bug because: this BZ is to track the Tech Preview status of vGPU support in OSP 17.0 GA. Only vGPU "panda" VMs are known to work, and they should not be rebooted, shut down, or cold migrated; otherwise we leak vGPUs and eventually run out entirely. This includes the "Multiple GPU types" RFE (https://bugzilla.redhat.com/show_bug.cgi?id=1761861) and its deployment counterpart (https://bugzilla.redhat.com/show_bug.cgi?id=1793957), as well as vGPU cold migration (https://bugzilla.redhat.com/show_bug.cgi?id=1701281).

Description of problem:
After deleting multiple guests that use vGPU instances, the mdevs remain claimed and are not removed, so placement reports fewer VGPU resources than the system actually has available.

# Usage on host of one of the GPUs before creating instances
[root@computesriov-0 heat-admin]# cat /sys/bus/pci/devices/0000\:04\:00.0/mdev_supported_types/nvidia-319/available_instances
4

# Placement reporting
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 35d7ddbc-a37c-42b6-9b6f-41832865f142
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VGPU           | 1.0              | 1        | 4        | 0        | 1         | 4     |
+----------------+------------------+----------+----------+----------+-----------+-------+

(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 9ccde640-6755-451b-acb9-9b4dc2af643f
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VGPU           | 1.0              | 1        | 4        | 0        | 1         | 4     |
+----------------+------------------+----------+----------+----------+-----------+-------+

# Run whitebox to create, resize, and migrate vGPU enabled instances
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ tempest run --serial --regex whitebox_tempest_plugin.api.compute.test_vgpu | tee vgpu_smoke_tests.log
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUColdMigration.test_revert_vgpu_cold_migration [59.047790s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUColdMigration.test_vgpu_cold_migration [35.478767s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUResizeInstance.test_standard_to_vgpu_resize [43.503475s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUResizeInstance.test_vgpu_to_standard_resize [37.093947s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUSanity.test_boot_instance_with_vgpu [19.356411s] ... ok

======
Totals
======
Ran: 5 tests in 233.1257 sec.
 - Passed: 5
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 194.4804 sec.

==============
Worker Balance
==============
 - Worker 0 (5 tests) => 0:03:53.125749

# Confirm there are no more instances running
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack server list --all-projects
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$

# Check placement reporting now
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 1e3e7132-83ce-40b8-80f1-0da01d12e067
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VGPU           | 1.0              | 1        | 3        | 0        | 1         | 3     |
+----------------+------------------+----------+----------+----------+-----------+-------+

# Check previous instances
[heat-admin@computesriov-0 ~]$ sudo cat /sys/bus/pci/devices/0000\:04\:00.0/mdev_supported_types/nvidia-319/available_instances
3

# Not seeing any issues when attempting to remove via host and container
[root@computesriov-0 0000:04:00.0]# ls
5a49f19e-5dde-49cb-8a37-12734913db95 class device i2c-0 local_cpulist msi_bus rescan resource1_wc sriov_offset subsystem_device aer_dev_correctable config dma_mask_bits i2c-1 local_cpus msi_irqs reset resource3 sriov_stride subsystem_vendor aer_dev_fatal consistent_dma_mask_bits driver iommu max_link_speed numa_node reset_method resource3_wc sriov_totalvfs uevent aer_dev_nonfatal current_link_speed driver_override iommu_group max_link_width power resource revision sriov_vf_device vendor ari_enabled current_link_width enable irq mdev_supported_types power_state resource0 sriov_drivers_autoprobe sriov_vf_total_msix broken_parity_status
d3cold_allowed firmware_node link modalias remove resource1 sriov_numvfs subsystem

[root@computesriov-0 heat-admin]# echo 1 > /sys/bus/mdev/devices/5a49f19e-5dde-49cb-8a37-12734913db95/remove
[root@computesriov-0 heat-admin]# ls /sys/bus/pci/devices/0000\:04\:00.0/
aer_dev_correctable class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent aer_dev_fatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor aer_dev_nonfatal consistent_dma_mask_bits dma_mask_bits i2c-0 link mdev_supported_types power reset_method resource3 sriov_offset subsystem ari_enabled current_link_speed driver i2c-1 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor

[root@computesriov-0 heat-admin]# podman exec -it -u root nova_virtqemud /bin/bash
[root@computesriov-0 /]# ls /sys/bus/pci/devices/0000\:82\:00.0/
3514412c-0fbc-4c1d-bc36-434ceaeecfff ari_enabled current_link_speed driver i2c-3 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device aer_dev_correctable broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor aer_dev_fatal class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent aer_dev_nonfatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor af68c4f2-1630-4b3d-a5aa-88e25e72c047 consistent_dma_mask_bits dma_mask_bits i2c-2 link mdev_supported_types power reset_method resource3 sriov_offset subsystem

[root@computesriov-0 /]# echo 1 > /sys/bus/pci/devices/af68c4f2-1630-4b3d-a5aa-88e25e72c047/remove
[root@computesriov-0 /]# echo 1 > /sys/bus/mdev/devices/3514412c-0fbc-4c1d-bc36-434ceaeecfff/remove
[root@computesriov-0 /]# ls sys/bus/pci/devices/0000\:82\:00.0/
aer_dev_correctable class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent aer_dev_fatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor aer_dev_nonfatal consistent_dma_mask_bits dma_mask_bits i2c-2 link mdev_supported_types power reset_method resource3 sriov_offset subsystem ari_enabled current_link_speed driver i2c-3 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor

(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 1e3e7132-83ce-40b8-80f1-0da01d12e067
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VGPU           | 1.0              | 1        | 4        | 0        | 1         | 4     |
+----------------+------------------+----------+----------+----------+-----------+-------+

(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 9ccde640-6755-451b-acb9-9b4dc2af643f
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VGPU           | 1.0              | 1        | 4        | 0        | 1         | 4     |
+----------------+------------------+----------+----------+----------+-----------+-------+

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220526.n.0

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OSP 17 environment that supports vGPU instances
2. Create one or two vGPU instances, perform several movement actions, then delete the instances

Actual results:
The instance is deleted but the associated mdev instance remains.

Expected results:
All resources are correctly deleted and placement reports the correct availability for VGPU.

Additional info:
A test bed can be made available if necessary.
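The manual cleanup shown in the transcript generalizes to simple bookkeeping: list the mdev UUIDs present in sysfs, subtract the ones still attached to guests, and write 1 to the remove node of anything left over. The sketch below captures just that set logic; the function names are hypothetical, and on a real host the two input lists would come from /sys/bus/mdev/devices and from virsh/nova rather than being hard-coded.

```python
from pathlib import PurePosixPath


def find_orphan_mdevs(sysfs_mdevs, in_use_mdevs):
    """Return mdev UUIDs that exist in sysfs but are not attached to any guest."""
    return sorted(set(sysfs_mdevs) - set(in_use_mdevs))


def remove_node(uuid):
    """sysfs path that deletes a leaked mdev when '1' is written to it,
    mirroring the manual `echo 1 > .../remove` in the transcript."""
    return str(PurePosixPath("/sys/bus/mdev/devices") / uuid / "remove")


# Example data: the two mdev UUIDs seen in the transcript, with no
# instances left running after the tempest run.
sysfs = ["5a49f19e-5dde-49cb-8a37-12734913db95",
         "3514412c-0fbc-4c1d-bc36-434ceaeecfff"]
in_use = []

for uuid in find_orphan_mdevs(sysfs, in_use):
    print(remove_node(uuid))
```

This only illustrates the accounting; actually writing to the remove nodes requires root on the compute host (or inside the nova_virtqemud container, as shown above).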
So I filed this BZ specifically to call out vGPU support as tech preview because of the multitude of issues. I guess we need one main tech preview note, and then individual known issues for each specific bug?
Reverting flags and renaming to better indicate intent. Multiple GPU types and cold migration are new RFEs, so they don't need known issues. We have two known issues in OSP:

- https://bugzilla.redhat.com/show_bug.cgi?id=2116980 / vGPU mdev instances are not being cleaned up after guest deletion / known issue release note done
- https://bugzilla.redhat.com/show_bug.cgi?id=2120726 / [17.0 ga known issue] Nova fails to parse new libvirt mediated device name format / TODO
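For context on the second known issue (BZ 2120726): newer libvirt versions report a mediated device under a nodedev name that embeds the UUID with underscores plus the parent PCI address, rather than the bare UUID, and Nova trips over the new form. A hedged sketch of recovering the UUID from such a name; the exact layout (mdev_<uuid with '_' for '-'>_<parent PCI address>) is an assumption inferred from the bug title, so verify it against the libvirt version in use.

```python
import re


def mdev_uuid_from_nodedev_name(name):
    """Extract the mdev UUID from a libvirt nodedev name.

    Assumes the newer form mdev_<uuid with '_' instead of '-'>_<parent PCI
    address>; the PCI suffix in the example below is hypothetical.
    """
    m = re.match(
        r"^mdev_([0-9a-f]{8})_([0-9a-f]{4})_([0-9a-f]{4})_"
        r"([0-9a-f]{4})_([0-9a-f]{12})(?:_|$)", name)
    if not m:
        raise ValueError(f"not a recognized mdev nodedev name: {name}")
    return "-".join(m.groups())


# UUID taken from the transcript above, with an assumed parent PCI suffix
print(mdev_uuid_from_nodedev_name(
    "mdev_3514412c_0fbc_4c1d_bc36_434ceaeecfff_0000_82_00_0"))
```

The fix in Nova presumably has to do the equivalent mapping in both directions; this snippet only shows the name-to-UUID side.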
The release notes automation will pick up BZs with the right flags even if they're closed. With the doc text done and requires_doc_text set to +, we can close this.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days