Created attachment 1832183 [details]
MCP+sriovnetworknodestate+workerNode yamls

Description of problem:
An SR-IOV policy takes around 30 minutes to take effect, and the same delay occurs every time the customer reboots the node.

Customer description: the customer configured an SR-IOV worker node with 127 VFs. With 8 VFs the configuration takes less than 5 minutes, but with 127 VFs it takes around 30 minutes for the SriovNetworkNodePolicy to take effect. Whenever the customer reboots the node, the MCP reports Ready, but the policies still take that long to apply.

-------
Use 'oc describe pod/sdn-4wjxk -n openshift-sdn' to see all of the containers in this pod.
sh-4.4# ethtool -i ens1f0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.31.1200(HP_2420110034)
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
------
NIC model: HPE Eth 10/25Gb 2p 640SFP28 Adptr
------
Timeline (127 VFs per NIC, 2 NICs, 2*127 = 254 VFs total):
4:30 pm - node rebooted
4:30 - 4:36 pm - node became Ready
4:36 - 4:54 pm - SR-IOV node policy succeeded
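To put a number on the delay, the time from reboot to `syncStatus: Succeeded` can be measured by polling the node's SriovNetworkNodeState. A minimal sketch, assuming `oc` is on the PATH and already logged in to the cluster (the node name is only an example); the polling helper itself is generic:

```python
import subprocess
import time

def get_sync_status(node, namespace="openshift-sriov-network-operator"):
    """Read .status.syncStatus from the node's SriovNetworkNodeState via `oc`."""
    out = subprocess.run(
        ["oc", "get", "sriovnetworknodestates", node, "-n", namespace,
         "-o", "jsonpath={.status.syncStatus}"],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def wait_for_sync(poll, timeout=2400, interval=15):
    """Call poll() until it returns 'Succeeded'; return the elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if poll() == "Succeeded":
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError("syncStatus never reached Succeeded")

# Example (node name is hypothetical):
# elapsed = wait_for_sync(lambda: get_sync_status("worker01.ocp47.lab.com"))
# print(f"policy applied after {elapsed:.0f}s")
```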
Due to size constraints I am unable to upload a few more datasets; please check the attachments in case #03038286.
Can you provide the log of the sriov-config-daemon pod for the node (worker01.ocp47.lab.com) from when this issue happened? The sriov-config-daemon log provided in https://access.redhat.com/support/cases/#/case/03038286/discussion?attachmentId=a092K00002l1mcgQAA doesn't cover the time when the issue happened.
Created attachment 1841166 [details] SRIOV logs(config)
Hi Peng,

The customer ran the same test yesterday, and it took 25 minutes for 127 VFs.

=> Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?
PFA.

Regards,
Eswar.
@evadla I still cannot find useful information in the logs. All of the provided sriov-config-daemon logs show 'syncStatus: Succeeded'. I need a log that covers the period from when sriov-config-daemon starts until syncStatus becomes 'Succeeded', so that we can find out what happened during those 20~30 minutes after reboot.
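One way to collect such a log is to locate the sriov-config-daemon pod scheduled on the affected node and dump its full log from container start. A sketch, assuming the label selector `app=sriov-network-config-daemon` (the daemon's label may differ by release, and the node name is only an example):

```python
import subprocess

NS = "openshift-sriov-network-operator"

def daemon_pod_cmd(node):
    """argv to find the sriov-config-daemon pod on `node`
    (the label selector is an assumption, verify against your cluster)."""
    return ["oc", "-n", NS, "get", "pods",
            "--field-selector", f"spec.nodeName={node}",
            "-l", "app=sriov-network-config-daemon",
            "-o", "name"]

def capture_log(node, outfile):
    """Resolve the pod name, then write its log from container start to `outfile`."""
    pod = subprocess.run(daemon_pod_cmd(node), capture_output=True,
                         text=True, check=True).stdout.strip()
    with open(outfile, "w") as fh:
        subprocess.run(["oc", "-n", NS, "logs", pod], stdout=fh, check=True)

# Example (node name is hypothetical):
# capture_log("worker01.ocp47.lab.com", "sriov-config-daemon.log")
```

Starting this right after the reboot (or adding `-f` to follow the log) keeps the window from daemon start through 'syncStatus: Succeeded'.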
Hi @pliu Any update? BR Eswar.
upstream PR https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/207
fix merged in 4.10. https://github.com/openshift/sriov-network-operator/commit/dc1a4fa6a6245e7c6ddb9e761beb8de1bd9ed7c3
Verified on sriov-network-operator.4.10.0-202112210953 by creating 127 VFs using the yaml file below. It took less than 3 minutes for the sriovnetworknodestates sync status to become Succeeded; removing the policy took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  resourceName: mlx277

# oc describe sriovnetworknodestates.sriovnetwork.openshift.io openshift-qe-029.lab.eng.rdu2.redhat.com | grep Sync
  Sync Status:  Succeeded

Added "isRdma: true" to the yaml file and created the SR-IOV policy again: it took about 8 minutes for the sync status to become Succeeded, and removing it took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  isRdma: true
  resourceName: mlx277

Omitting isRdma reduces the time to apply the SriovNetworkNodePolicy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056