Bug 2042143 - [RFE] add nvidia firstboot yaml with support DPU ConnectX mode to tripleo-heat-template
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: OSP Team
QA Contact: Miguel Angel Nieto
Reported: 2022-01-18 20:57 UTC by Moshe Levi
Modified: 2022-06-23 10:38 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220409014848.7c89b16.el8ost
Doc Type: Enhancement
Doc Text:
In Red Hat OpenStack Platform (RHOSP) 16.2.3, you can upgrade the firmware of NVIDIA Mellanox BlueField-2 devices and configure them into ConnectX mode by using the mstflint tool. There are two known issues:
* If your RHOSP deployment uses `os-net-config-mappings.yaml` for NIC ordering, then you must use a custom first-boot.yaml file.
* The firmware update can take longer than the TripleO wait for cloud-init. Setting the TripleO cloud-init timeout through templates is tracked in BZ#2097271 (https://bugzilla.redhat.com/2097271).
Last Closed: 2022-06-23 10:38:41 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 2097271 | 1 | medium | CLOSED | Set tripleo cloud init timeout through templates | 2022-10-04 20:04:53 UTC
Red Hat Issue Tracker | NFV-2396 | 0 | None | None | None | 2022-01-24 14:56:54 UTC
Red Hat Issue Tracker | OSP-12197 | 0 | None | None | None | 2022-01-18 21:06:38 UTC

Description Moshe Levi 2022-01-18 20:57:54 UTC
Description of problem:
Extend the OSP firstboot folder to support firmware burning and NVCONFIG on NVIDIA Mellanox NICs.

The script is currently maintained in the Mellanox repo [1].
We want to upstream it to the tripleo-heat-templates firstboot folder [2] so that it is natively supported in OSP. We will also add the necessary config to transition from DPU mode to ConnectX mode.


[1]- https://github.com/Mellanox/switchdev-utils/blob/master/templates/firstboot/mellanox_fw_update.yaml
[2] - https://github.com/openstack/tripleo-heat-templates/tree/master/firstboot

Comment 2 Saravanan KR 2022-01-28 08:51:58 UTC
Moshe, tag us when the upstream patch is created; we will be able to downstream it once it is merged.

Comment 8 spower 2022-02-15 16:05:27 UTC
We don't have hardware to test this, so we will make it Other QA and validate it by code inspection.

Comment 10 Moshe Levi 2022-03-15 14:00:50 UTC
Waleed uploaded the first draft of the patch set: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/833785
Reviews are welcome :). I need to review it as well.

Comment 18 Moshe Levi 2022-04-13 06:06:33 UTC
I see the needinfo flag is assigned to me. Can you elaborate on what is missing?

Comment 28 Miguel Angel Nieto 2022-06-10 12:45:16 UTC
Hi,

I got the package from RHEL [1] and installed it without any issue in the overcloud image using virt-customize, and it updated the firmware properly [2]. Apparently everything is OK. Can you check nvidia_nic_fw_update.log to confirm it?

But the overcloud deployment failed because cloud-init took too long. I think it is related to the firmware update.

Apparently the deployment waited for 261 seconds [3], but cloud-init took 395 seconds to finish [4].

overcloud-install log [3]
2022-06-09 17:11:58.324929 | 5254006b-95aa-73db-56fd-000000000033 |    SUMMARY | computedpdksriov-tigon13 | Wait for cloud-init to finish, if enabled | 261.34s

cloud-init log [4]
2022-06-09 17:12:23,489 - util.py[DEBUG]: cloud-init mode 'modules' took 395.039 seconds (395.04)
2022-06-09 17:12:23,490 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final

I have uploaded the sos reports for the undercloud [5] and the compute node [6] to the FTP.

I retried the deployment, and the second time it was successful. This time the firmware was already updated, so it didn't take as long. This is the update log from the second run [7].

[1] http://file.mad.redhat.com/~mnietoji/bf2/mstflint-4.19.0-0.3.20220328git2b02298.el8_4.x86_64.rpm
[2] http://file.mad.redhat.com/~mnietoji/bf2/nvidia_nic_fw_update.log
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-nfv-16.2-director-3cont-2comp-ipv4-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/177/undercloud-0/home/stack/overcloud_install.log.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-nfv-16.2-director-3cont-2comp-ipv4-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/177/computedpdksriov-tigon13/var/log/cloud-init.log.gz
[5] http://file.mad.redhat.com/~mnietoji/bf2/sosreport-undercloud-0-2022-06-10-ffcqdlv.tar.xz
[6] http://file.mad.redhat.com/~mnietoji/bf2/sosreport-computedpdksriov-tigon13-2022-06-10-euqdsbe.tar.xz
[7] http://file.mad.redhat.com/~mnietoji/bf2/nvidia_nic_fw_update_2.log

Comment 29 Miguel Angel Nieto 2022-06-14 09:23:35 UTC
According to the code, cloud-init should finish in less than 250 seconds (50 retries, 5 seconds apart), so there will be a problem if it takes longer.
/usr/share/openstack-tripleo-heat-templates/common/deploy-steps-tasks-step-0.j2.yaml
- name: Wait for cloud-init to finish, if enabled
  cloud_init_data_facts:
    filter: status
  register: res
  until: >
    res.cloud_init_data_facts.status.v1.stage is defined and
    not res.cloud_init_data_facts.status.v1.stage
  retries: 50
  delay: 5
  when:
    - cloud_init_enabled.rc is defined
    - cloud_init_enabled.rc == 0
    - cloud_init_vendor_disabled.rc is not defined or cloud_init_vendor_disabled.rc != 0

The firmware update duration depends on the time each update takes (longer if the current firmware is older) and on the number of devices to update, because the devices are updated one by one, serially.
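
The arithmetic above can be sketched as follows. This is only an illustration, not TripleO code: the helper names are hypothetical, and the sketch ignores the time each poll itself takes, which is presumably why the observed wait (261 s) was slightly above 250 s. No per-device update time is given in this bug, so any value passed to `estimated_update_seconds` is an assumption.

```python
def wait_budget_seconds(retries, delay):
    """Approximate upper bound on the sleep time of a retries/delay poll."""
    return retries * delay


def estimated_update_seconds(per_device_seconds, devices):
    """Total time when devices are flashed one after another (serially)."""
    return per_device_seconds * devices


def fits_in_budget(cloud_init_seconds, retries, delay):
    """True if cloud-init finishes before the polling task gives up."""
    return cloud_init_seconds <= wait_budget_seconds(retries, delay)


if __name__ == "__main__":
    # Values from the task above: retries: 50, delay: 5
    print(wait_budget_seconds(50, 5))     # 250
    # cloud-init duration reported in comment 28
    print(fits_in_budget(395.04, 50, 5))  # False
```

With the figures from comment 28, the 395-second cloud-init run does not fit in the 250-second budget, which matches the failed first deployment.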

Comment 30 Miguel Angel Nieto 2022-06-14 09:26:37 UTC
I have found a new limitation for updating the firmware and changing the DPU mode [1]: it must be done through OS::TripleO::{{ROLE_NAME}}:NodeUserData. We already use this variable to define NIC names on the nodes through os-net-config-mappings.yaml [2].
So we cannot use both at the same time. The only solution would be to merge both files [3], which doesn't look very nice.

[1] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/11/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/bf2-update.yaml
[2] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/11/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/os-net-config-mappings.yaml
[3] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/15/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/nvidia_firstboot_os-net-config-mappings.yaml
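
One way to picture the merge: cloud-init accepts a multipart MIME archive as user data, so several first-boot parts can travel in a single payload (in Heat templates, the OS::Heat::MultipartMime resource plays this role). The sketch below builds such an archive with Python's standard `email` package. It only illustrates the idea and is not the merged template from [3]; the file names and script bodies are placeholders, not the real scripts.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(parts):
    """Combine (filename, subtype, body) tuples into one MIME archive."""
    archive = MIMEMultipart()
    for filename, subtype, body in parts:
        part = MIMEText(body, subtype)  # e.g. text/x-shellscript
        part.add_header("Content-Disposition", "attachment", filename=filename)
        archive.attach(part)
    return archive.as_string()


# Placeholder bodies: the real first-boot scripts are not reproduced here.
PARTS = [
    ("nic-mappings.sh", "x-shellscript", "#!/bin/bash\necho 'NIC mapping step'\n"),
    ("nvidia-fw.sh", "x-shellscript", "#!/bin/bash\necho 'firmware step'\n"),
]

if __name__ == "__main__":
    parsed = message_from_string(build_user_data(PARTS))
    for part in parsed.walk():
        if not part.is_multipart():
            print(part.get_filename(), part.get_content_type())
```

Both parts end up in one user-data document, which is the property needed when only one NodeUserData file can be configured per role.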

Comment 31 Miguel Angel Nieto 2022-06-14 09:34:48 UTC
So, as a summary, issues found are the following:

1. mstflint missing in overcloud image
2. cloud-init timeout
3. OS::TripleO::{{ROLE_NAME}}:NodeUserData: only one file can be configured

I also saw that older RHEL images do not recognize VFs (driver issue), but RHEL 8.4 is OK.

Comment 32 Ofer Blaut 2022-06-14 13:19:56 UTC
Hi Moshe

Can you please share how long firmware update should take ? 
Is there a way to "downgrade back" for further testing ?

Thanks

Comment 33 Moshe Levi 2022-06-14 18:34:28 UTC
@mohammedt can you please help Ofer with his questions?

Comment 34 Mohammed Taha 2022-06-15 04:44:08 UTC
(In reply to Ofer Blaut from comment #32)
> Hi Moshe
> 
> Can you please share how long firmware update should take ? 
> Is there a way to "downgrade back" for further testing ?
> 
> Thanks


It depends on the old FW that was flashed on the device, because the burn process is performed by the old FW to load the new FW.
Can you please let me know which FW was flashed on the device before it was updated via firstboot?

Regarding the second question:
Yes, you can downgrade by setting FORCE_UPDATE=True (an NVIDIA firstboot parameter in the yaml file) and passing the old FW in the BIN_DIR_URL parameter.

Comment 35 Miguel Angel Nieto 2022-06-15 07:14:05 UTC
Hi

This is the FW version. I already saw that it is possible to use the option FORCE_UPDATE=True to install an old version. The problem is that I cannot download an old version; only the latest version is available.

INFO:nvidia_nic_fw_update:Downloading file: http://file.mad.redhat.com/~mnietoji/bf2//fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin to /tmp/tmpou8ln5egtripleo_mlnx_firmware
DEBUG:oslo_concurrency.processutils:Running cmd (subprocess): mstflint -i /tmp/tmpou8ln5egtripleo_mlnx_firmware/fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin query
DEBUG:oslo_concurrency.processutils:CMD "mstflint -i /tmp/tmpou8ln5egtripleo_mlnx_firmware/fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin query" returned: 0 in 0.040s
DEBUG:nvidia_nic_fw_update:Image type:            FS4
FW Version:            24.33.1048
FW Release Date:       29.4.2022
Product Version:       rel-24_33_1048
Rom Info:              type=NVMe version=20.3.1 cpu=AMD64
                       type=UEFI Virtio net version=21.2.10 cpu=AMD64
                       type=UEFI Virtio blk version=22.2.10 cpu=AMD64
                       type=UEFI version=14.26.17 cpu=AMD64,AARCH64
                       type=PXE version=3.6.502 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             N/A                     16
Base MAC:              N/A                     16
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000765
Security Attributes:   secure-fw
Security Ver:          0

Comment 36 Mohammed Taha 2022-06-15 07:40:54 UTC
Hi
Where did you get the FW image from?
I will try to get an old image that matches your device and let you know how long it takes to burn the new FW from the old one, for your further testing.

Thanks

Comment 37 Miguel Angel Nieto 2022-06-15 07:49:29 UTC
Hi, I downloaded it from this URL using PSID MT_0000000765

https://network.nvidia.com/support/firmware/bluefield2/

Comment 38 Miguel Angel Nieto 2022-06-15 10:27:20 UTC
About the issues found:
1. mstflint missing in overcloud image 
   --> It will be available in next puddle
2. cloud-init timeout 
   --> created a bz to configure it https://bugzilla.redhat.com/show_bug.cgi?id=2097271
3. OS::TripleO::{{ROLE_NAME}}:NodeUserData: only one file can be configured
   --> it should be documented

Setting this bz as verified

Comment 45 OSP Team 2022-06-23 10:38:41 UTC
According to our records, this should be resolved by openstack-tripleo-heat-templates-11.6.1-2.20220409014852.el8ost.  This build is available now.

