Bug 2013199
| Summary: | post reboot of node SRIOV policy taking huge time |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | Eswar Vadla <evadla> |
| Component: | Networking |
| Assignee: | Peng Liu <pliu> |
| Networking sub component: | SR-IOV |
| QA Contact: | Ying Wang <yingwang> |
| Status: | CLOSED ERRATA |
| Docs Contact: | |
| Severity: | urgent |
| Priority: | urgent |
| CC: | dosmith, jpradhan, nm-s, pliu |
| Version: | 4.6.z |
| Target Milestone: | --- |
| Target Release: | 4.10.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | No Doc Update |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| : | 2040122 (view as bug list) |
| Environment: | |
| Last Closed: | 2022-03-10 16:18:42 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 2040122 |
| Attachments: | |
Due to size constraints I was unable to upload a few more datasets; please check the attachment in case #03038286.

Can you provide the log of the sriov-config-daemon pod for the node (worker01.ocp47.lab.com) from when this issue happened? The sriov-config-daemon log provided in https://access.redhat.com/support/cases/#/case/03038286/discussion?attachmentId=a092K00002l1mcgQAA does not contain the information from when the issue happened.

Created attachment 1841166 [details]
SRIOV logs (config)

Hi Peng, the customer ran the same test yesterday and it took 25 minutes for 127 VFs.
=> Can you provide the log of the sriov-config-daemon pod for the node (worker01.ocp47.lab.com) from when this issue happened?
PFA. Regards, Eswar.

@evadla I still cannot find useful information in the logs. All of the provided sriov-config-daemon logs show 'syncStatus: Succeeded'. I need a log that covers the period from when sriov-config-daemon starts until it reaches 'syncStatus: Succeeded', so that we can find out what happened during those 20~30 minutes after the reboot.

Hi @pliu, any update? BR, Eswar.

Fix merged in 4.10: https://github.com/openshift/sriov-network-operator/commit/dc1a4fa6a6245e7c6ddb9e761beb8de1bd9ed7c3

Verified on sriov-network-operator.4.10.0-202112210953 by creating 127 VFs using the yaml files below. It took less than 3 minutes for the sriovnetworknodestates sync status to become Succeeded. Removing the policy took about 5 minutes.
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  resourceName: mlx277
```
```
# oc describe sriovnetworknodestates.sriovnetwork.openshift.io openshift-qe-029.lab.eng.rdu2.redhat.com | grep Sync
Sync Status:  Succeeded
```
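To reproduce this kind of timing measurement, a small poll loop can record how long the node state takes to reach Succeeded after applying a policy. A minimal sketch, assuming the node name from the verification run above; `elapsed_min` is a hypothetical helper added here for illustration, and the `oc` invocation is only attempted when the CLI is available:

```shell
#!/bin/sh
# Hypothetical helper: minutes between two epoch-second timestamps.
elapsed_min() {
    echo $(( ($2 - $1) / 60 ))
}

# Poll the node state until syncStatus reports Succeeded, then print the
# elapsed time. Node name and namespace taken from this bug's verification.
NODE=openshift-qe-029.lab.eng.rdu2.redhat.com
NS=openshift-sriov-network-operator

if command -v oc >/dev/null 2>&1; then
    start=$(date +%s)
    while :; do
        status=$(oc get sriovnetworknodestates.sriovnetwork.openshift.io \
            "$NODE" -n "$NS" -o jsonpath='{.status.syncStatus}')
        [ "$status" = "Succeeded" ] && break
        sleep 10
    done
    echo "sync took $(elapsed_min "$start" "$(date +%s)") min"
fi
```

Run against a cluster, this gives a single wall-clock number per policy change, which makes before/after comparisons (3 min vs. 30 min) easy to capture in a case update.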
Added "isRdma: true" to the yaml file and created the SR-IOV policy again; this time it took about 8 minutes for the sync status to become Succeeded, and removing the policy again took about 5 minutes.
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  isRdma: true
  resourceName: mlx277
```
Omitting isRdma reduces the time needed to apply the SriovNetworkNodePolicy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
Created attachment 1832183 [details]
MCP+sriovnetworknodestate+workerNode yamls

Description of problem:

The SR-IOV policy takes around 30 minutes to take effect, and the same amount of time is needed every time the customer reboots the node.

Customer description: the customer configured an SR-IOV worker node with 127 VFs. With a VF count of 127, the node takes around 30 minutes for the SriovNetworkNodePolicy to take effect; with 8 VFs the configuration takes less than 5 minutes. Whenever the customer reboots the node, the MCP returns to the Ready state, but the policies still take that long to apply.

NIC details, collected from the sdn pod on the node (use 'oc describe pod/sdn-4wjxk -n openshift-sdn' to see all of the containers in this pod):

```
sh-4.4# ethtool -i ens1f0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.31.1200 (HP_2420110034)
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
```

NIC model: HPE Eth 10/25Gb 2p 640SFP28 Adptr

Timeline (127 VFs per NIC, 2 NICs, 2 x 127 = 254 VFs in total):
- 4:30 pm: node rebooted
- 4:30 to 4:36: node became Ready
- 4:36 to 4:54: SR-IOV node policy reached Succeeded
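To confirm how many VFs were actually created on a PF during that window, the kernel's sysfs counters can be read directly on the worker node. A minimal sketch, assuming the interface name from the ethtool output above; `show_vfs` and its optional sysfs-root parameter are hypothetical conveniences added for illustration (the root override only exists so the helper can be exercised off-node):

```shell
#!/bin/sh
# Hypothetical helper: print configured/maximum VF count for a PF.
# $1 = interface name, $2 = sysfs root (defaults to /sys; overridable
# so the helper can be tried against a fake tree off-node).
show_vfs() {
    dev="${2:-/sys}/class/net/$1/device"
    if [ ! -r "$dev/sriov_numvfs" ]; then
        echo "no SR-IOV device: $1" >&2
        return 1
    fi
    echo "$(cat "$dev/sriov_numvfs")/$(cat "$dev/sriov_totalvfs")"
}

# Example, on the worker node (or via 'oc debug node/...'):
#   show_vfs ens1f0
```

Watching this value climb toward 127 while sriovnetworknodestates still reports InProgress helps separate VF-creation time from the rest of the config-daemon's sync work.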