Description of problem: Extend the OSP firstboot folder to support firmware burn and NVCONFIG on NVIDIA Mellanox NICs. The script is currently maintained in the Mellanox repo [1]. We want to upstream it to tripleo-heat-templates [2] so that it is natively supported in OSP. We also add the necessary config to transition from DPU mode to ConnectX mode.
[1] https://github.com/Mellanox/switchdev-utils/blob/master/templates/firstboot/mellanox_fw_update.yaml
[2] https://github.com/openstack/tripleo-heat-templates/tree/master/firstboot
Moshe, tag us when the upstream patch is created. We will be able to downstream it once it is merged.
We don't have hardware to test this, so we will mark it as OtherQA and validate it by code inspection.
Waleed uploaded the first draft of the PS: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/833785 Reviews are welcome :). I need to review it as well.
I see the needinfo flag is assigned to me. Can you elaborate on what is missing?
Hi,
I got the package from RHEL [1] and installed it without any issue in the overcloud image using virt-customize, and it updated the firmware properly [2]. Everything appears to be OK; can you check nvidia_nic_fw_update.log to confirm?
However, the overcloud deployment failed because cloud-init took too long, which I think is related to the firmware update. The deployment waited for 261 seconds [3], but cloud-init took 395 seconds to finish [4].
overcloud-install log [3]:
2022-06-09 17:11:58.324929 | 5254006b-95aa-73db-56fd-000000000033 | SUMMARY | computedpdksriov-tigon13 | Wait for cloud-init to finish, if enabled | 261.34s
cloud-init log [4]:
2022-06-09 17:12:23,489 - util.py[DEBUG]: cloud-init mode 'modules' took 395.039 seconds (395.04)
2022-06-09 17:12:23,490 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
I have uploaded the sos reports for the undercloud [5] and the compute node [6] to the FTP server.
I retried the deployment and the second attempt was successful. This time the firmware was already updated, so it didn't take as long. This is the update log from the second run [7].
[1] http://file.mad.redhat.com/~mnietoji/bf2/mstflint-4.19.0-0.3.20220328git2b02298.el8_4.x86_64.rpm
[2] http://file.mad.redhat.com/~mnietoji/bf2/nvidia_nic_fw_update.log
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-nfv-16.2-director-3cont-2comp-ipv4-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/177/undercloud-0/home/stack/overcloud_install.log.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-nfv-16.2-director-3cont-2comp-ipv4-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/177/computedpdksriov-tigon13/var/log/cloud-init.log.gz
[5] http://file.mad.redhat.com/~mnietoji/bf2/sosreport-undercloud-0-2022-06-10-ffcqdlv.tar.xz
[6] http://file.mad.redhat.com/~mnietoji/bf2/sosreport-computedpdksriov-tigon13-2022-06-10-euqdsbe.tar.xz
[7] http://file.mad.redhat.com/~mnietoji/bf2/nvidia_nic_fw_update_2.log
According to the code, cloud-init should take less than 250 seconds (50 retries, one every 5 seconds), so there will be a problem if it takes longer.
/usr/share/openstack-tripleo-heat-templates/common/deploy-steps-tasks-step-0.j2.yaml
    - name: Wait for cloud-init to finish, if enabled
      cloud_init_data_facts:
        filter: status
      register: res
      until: >
        res.cloud_init_data_facts.status.v1.stage is defined and
        not res.cloud_init_data_facts.status.v1.stage
      retries: 50
      delay: 5
      when:
        - cloud_init_enabled.rc is defined
        - cloud_init_enabled.rc == 0
        - cloud_init_vendor_disabled.rc is not defined or cloud_init_vendor_disabled.rc != 0
The firmware update duration depends on the time each update takes (longer if the current firmware is older) and on the number of devices to update, since they are updated serially, one by one.
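The retry arithmetic above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the budget counts only the delay between attempts, not the time each attempt takes to run (which is presumably why the task summary reported 261.34s rather than exactly 250s).

```python
import math


def wait_budget_seconds(retries: int, delay: int) -> int:
    """Approximate time the until-loop sleeps in total: retries * delay.

    Hypothetical helper; it ignores the execution time of each attempt.
    """
    return retries * delay


def retries_needed(duration_seconds: float, delay: int) -> int:
    """Smallest retry count whose sleep budget covers the given duration."""
    return math.ceil(duration_seconds / delay)


# Values taken from the task definition (retries: 50, delay: 5)
# and from the cloud-init log (395.039 seconds).
current_budget = wait_budget_seconds(50, 5)
needed = retries_needed(395.039, 5)

print(f"current budget: {current_budget}s")
print(f"retries needed at a 5s delay to cover 395.039s: {needed}")
```

With the values from this run, the current budget is shorter than the observed cloud-init duration, so the wait task gives up before cloud-init finishes. Since the update time grows with the number of devices and the age of the installed firmware, any fixed retry count is only a lower bound for a given setup.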
I have found a new limitation for the firmware update and DPU mode change [1]. It must be done through OS::TripleO::{{ROLE_NAME}}:NodeUserData. We are already using this resource to define NIC names on the nodes through os-net-config-mappings.yaml [2], so we are not able to use both at the same time. The only solution would be to merge both files [3], which doesn't look very nice.
[1] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/11/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/bf2-update.yaml
[2] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/11/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/os-net-config-mappings.yaml
[3] https://code.engineering.redhat.com/gerrit/c/nfv-qe/+/414302/15/tht/setups/tigon17/16.2/ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/nvidia_firstboot_os-net-config-mappings.yaml
So, as a summary, the issues found are the following:
1. mstflint missing in overcloud image
2. cloud-init timeout
3. OS::TripleO::{{ROLE_NAME}}:NodeUserData: only one file can be configured
I also saw that old versions of RHEL images do not recognize VFs (driver issue), but RHEL 8.4 works fine.
Hi Moshe,
Can you please share how long the firmware update should take? Is there a way to downgrade back for further testing?
Thanks
@mohammedt can you please help Ofer with his questions?
(In reply to Ofer Blaut from comment #32)
> Hi Moshe
>
> Can you please share how long firmware update should take ?
> Is there a way to "downgrade back" for further testing ?
>
> Thanks
It depends on the old FW that is flashed on the device, because the burn process is performed by the old FW, which loads the new FW. Can you please let me know which FW was flashed on the device before it was updated via firstboot?
Regarding the second question: yes, you can downgrade by setting FORCE_UPDATE=True (an NVIDIA firstboot parameter in the yaml file) and passing the old FW in the BIN_DIR_URL parameter.
Hi,
This is the FW version. I already saw that it is possible to use the option FORCE_UPDATE=True to install an old version. The problem is that I cannot download an old version; only the latest version is available.
INFO:nvidia_nic_fw_update:Downloading file: http://file.mad.redhat.com/~mnietoji/bf2//fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin to /tmp/tmpou8ln5egtripleo_mlnx_firmware
DEBUG:oslo_concurrency.processutils:Running cmd (subprocess): mstflint -i /tmp/tmpou8ln5egtripleo_mlnx_firmware/fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin query
DEBUG:oslo_concurrency.processutils:CMD "mstflint -i /tmp/tmpou8ln5egtripleo_mlnx_firmware/fw-BlueField-2-rel-24_33_1048-MBF2H532C-AECO_Ax-NVME-20.3.1-UEFI-21.2.10-UEFI-22.2.10-UEFI-14.26.17-FlexBoot-3.6.502.signed.bin query" returned: 0 in 0.040s
DEBUG:nvidia_nic_fw_update:Image type: FS4
FW Version: 24.33.1048
FW Release Date: 29.4.2022
Product Version: rel-24_33_1048
Rom Info: type=NVMe version=20.3.1 cpu=AMD64
          type=UEFI Virtio net version=21.2.10 cpu=AMD64
          type=UEFI Virtio blk version=22.2.10 cpu=AMD64
          type=UEFI version=14.26.17 cpu=AMD64,AARCH64
          type=PXE version=3.6.502 cpu=AMD64
Description: UID GuidsNumber
Base GUID: N/A 16
Base MAC: N/A 16
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000765
Security Attributes: secure-fw
Security Ver: 0
Hi,
Where did you get the FW image from? I will try to get an old image that matches your device and let you know how long it takes to burn the new FW over the old one, for your further testing.
Thanks
Hi, I downloaded it from this URL using PSID MT_0000000765 https://network.nvidia.com/support/firmware/bluefield2/
About the issues found:
1. mstflint missing in overcloud image --> It will be available in the next puddle
2. cloud-init timeout --> Created a bz to configure it: https://bugzilla.redhat.com/show_bug.cgi?id=2097271
3. OS::TripleO::{{ROLE_NAME}}:NodeUserData: only one file can be configured --> It should be documented
Setting this bz as verified.
According to our records, this should be resolved by openstack-tripleo-heat-templates-11.6.1-2.20220409014852.el8ost. This build is available now.