Bug 1614452 - [Intel OSP16][RSD] Pooled FPGA over PCIe
Summary: [Intel OSP16][RSD] Pooled FPGA over PCIe
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.0 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1562173
Blocks: 1595325 1636090 epic-rsd
Reported: 2018-08-09 15:27 UTC by Krish Raghuram
Modified: 2023-03-21 18:58 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-26 17:47:18 UTC
Target Upstream Version:
Embargoed:



Description Krish Raghuram 2018-08-09 15:27:34 UTC
Description of feature:
Intel RSD platforms starting with v2.4 will support pooling of FPGA devices connected through PCIe interfaces. This request is for a tenant to be able to request FPGA resources from the pool. When a workload is instantiated with that request, the resources should be attached to a node and made available to a VM on that node.

Version-Release number of selected component (if applicable):
OpenStack Nova version in OpenStack Stein release

2. Business Justification:
  a) Why is this feature needed?
     As more applications are deployed to the cloud, performance becomes more critical. FPGA-based acceleration is now seen as a cost-effective way to give specific workloads the additional computing resources they need to meet SLAs, whether for throughput or for latency.
  b) What hardware does this enable?
     New FPGA hardware on Intel RSD platforms.
  c) Is this hardware on-board in a system (e.g., LOM) or an add-on card?
     FPGA accelerators will be available on PCIe add-on boards.
  d) Business impact?
     Cloud Service Providers (CSPs) and Communication Service Providers (CoSPs) can deploy demanding workloads more cost-effectively.
  e) Other business drivers: N/A

3. Primary contact at Partner, email, phone (chat)
   sundar.nadathur, lin.a.yang

4. Expected results:
- Pooled devices should be discovered and tracked by the Nova Resource Tracker via the Placement API.
- The RSD Pod Manager should be the source of device information.
- The Pod Manager will expose the topology (PCIe zones) of nodes and resources that can be composed together.
- An operator will mark those zones as Nova host aggregates (Ansible scripts can automate the work).
- The scheduler filter will return a list of machines capable of providing the requested FPGA function (PF or VF) to the VM.
- A conductor entity will monitor VMs and resource attachments, and detach resources that are not in use.
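
The scheduling and cleanup behaviour listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Host`, `PooledFpga`, `filter_hosts`, and `detach_unused` are hypothetical names invented here, not Nova, Cyborg, or Pod Manager APIs. A plain string stands in for the PCIe zone / host aggregate, and a device counts as free when no VM is using it.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Host:
    """Hypothetical view of a compute node; not a Nova object."""
    name: str
    pcie_zone: str  # stands in for the host aggregate an operator assigned
    attached: Set[str] = field(default_factory=set)  # device IDs on this node


@dataclass
class PooledFpga:
    """Hypothetical pooled-device record, as a Pod Manager might report it."""
    device_id: str
    pcie_zone: str
    function: str  # "PF" or "VF"
    in_use: bool = False


def filter_hosts(hosts: List[Host], pool: List[PooledFpga], function: str) -> List[Host]:
    """Return hosts whose PCIe zone holds a free device of the requested function."""
    zones = {d.pcie_zone for d in pool if d.function == function and not d.in_use}
    return [h for h in hosts if h.pcie_zone in zones]


def detach_unused(hosts: List[Host], pool: List[PooledFpga]) -> List[str]:
    """Detach attached devices that no VM uses; return the detached device IDs."""
    by_id = {d.device_id: d for d in pool}
    detached = []
    for host in hosts:
        # sorted() copies the set, so removing entries while looping is safe.
        for dev_id in sorted(host.attached):
            if not by_id[dev_id].in_use:
                host.attached.discard(dev_id)
                detached.append(dev_id)
    return detached
```

In a real deployment the zone membership would come from the Pod Manager topology and the availability data from Placement. The sketch only shows the shape of the two decisions: which hosts can satisfy a request, and which attachments can be released.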

Additional info:
- Links to blueprints and specs will be added as soon as they're done.
- This will need close interaction with the Cyborg community to ensure the Cyborg agent can act on the Nova request to attach an FPGA device.

Comment 1 smooney 2018-08-17 15:16:43 UTC

Hi Krish.
This feature request has several unresolved dependencies.

Firstly, Cyborg is not currently a supported project in OSP and is not currently targeted to be added in OSP 15.
Can you open a separate Bugzilla to track that request and add it
as a dependency for this request?

Adding Cyborg as a supported project is not trivial, as it would require
packaging the project as an RPM, adding a set of Cyborg containers to Kolla,
and then integrating the deployment of those containers with TripleO/director.

In addition to the generic Cyborg support above, OSP director would have to
be enhanced to configure the Cyborg agent with the credentials
for the PODM to enable this feature.

With that in mind, this is likely an OSP-next-next intercept, not OSP-15.

As you indicated, this feature depends on upstream changes to Nova and Cyborg
that are yet to be implemented. When you have that info available,
please update this ticket with the relevant blueprints/reviews.

Finally, from my reading of the request, we would require a specific hardware
configuration to develop and validate this feature request.

In particular a minimum of the following:
    - 1 network switch for management/provisioning.
    - 1 RSD 2.4 compatible PODM (could be deployed in a VM if reference code
      is used; otherwise this is an appliance).
    - 1 RSD 2.4 compatible PCIe switch with PSME.
    - 1+ RSD 2.4 compatible compute drawer with external PCIe backplane support.
    - 1+ RSD 2.4 compatible FPGA drawer with external PCIe backplane support.
    - 1+ FPGAs that are compatible with both RSD 2.4 and the Cyborg agent.
    - 1+ standard servers for the OSP control plane and standard compute nodes.

Can you provide a detailed description of the hardware and topology required
to deploy and test this feature, and indicate whether Intel
would be able to provide a minimal RSD system as described above,
or access to one in a lab, for the development and validation
of this feature request?

Comment 5 Krish Raghuram 2018-08-17 19:54:20 UTC
(In reply to smooney from comment #1)

Sean, the basic Cyborg request is already at https://bugzilla.redhat.com/show_bug.cgi?id=1562173 

Lin Yang will add links to the BPs or specs as they are submitted.

I will have to discuss hardware availability with the team and get back to you. I believe Red Hat has had access to an RSD rack in one of our labs in the past and probably still does. I'll investigate.

Comment 6 smooney 2018-08-19 17:21:47 UTC
Thanks, Krish.
I did not see that in my Bugzilla query.
I have added it as a dependency.

Comment 7 Pragyan Pathi 2018-09-27 21:52:37 UTC
As discussed in the Sep 27th engineering meeting:
We understand Red Hat has moved this to RH OSP16.

FYI - Intel continues to work on this. Red Hat has moved it to OSP16 (enhancement work on Cyborg). To be revisited based on upstream status and customer use cases.

Comment 8 Krish Raghuram 2019-02-21 16:41:28 UTC
We are de-prioritizing this in favor of FPGA pooling over Ethernet fabric. We will open a separate BZ for the latter.

Comment 9 Pavan Chavva 2019-02-26 17:47:18 UTC
Closed based on the feedback from Intel.

Comment 10 Pragyan Pathi 2019-02-28 00:40:24 UTC
This BZ can be closed, as this project is being changed.
