Bug 2013199

Summary: post reboot of node SRIOV policy taking huge time
Product: OpenShift Container Platform
Component: Networking
Networking sub component: SR-IOV
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 4.6.z
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Reporter: Eswar Vadla <evadla>
Assignee: Peng Liu <pliu>
QA Contact: Ying Wang <yingwang>
CC: dosmith, jpradhan, nm-s, pliu
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-03-10 16:18:42 UTC
Bug Blocks: 2040122
Attachments:
Description                                  Flags
MCP+sriovnetworknodestate+workerNode yamls   none
SRIOV logs(config)                           none

Description Eswar Vadla 2021-10-12 11:05:38 UTC
Created attachment 1832183 [details]
MCP+sriovnetworknodestate+workerNode yamls

Description of problem:

The SR-IOV policy takes around 30 minutes to take effect. Whenever the customer reboots the node, the policy takes the same amount of time to take effect again.

Customer description:
The customer configured an SR-IOV worker node with 127 VFs. With a VF count of 127, the node takes around 30 minutes for the SriovNetworkNodePolicy to take effect.
With 8 VFs, the same configuration takes less than 5 minutes.

** Whenever the customer reboots the node, the MCP reaches the ready state, but the policies take additional time to apply.

-------

Use 'oc describe pod/sdn-4wjxk -n openshift-sdn' to see all of the containers in this pod.
sh-4.4#
sh-4.4# ethtool -i ens1f0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.31.1200(HP_2420110034)
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

------

Nic Model :   HPE Eth 10/25Gb 2p 640SFP28 Adptr

------

Timeline (127 VFs per NIC, 2 NICs, 2 * 127 = 254 VFs in total):
4:30 pm    node rebooted
4:36       node ready
4:54       SR-IOV node policy succeeded
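
The intervals implied by the timeline above can be worked out with a short sketch. It assumes all three timestamps fall on the same afternoon (only the first is explicitly marked "pm" in the report); the variable names are illustrative and not taken from any tool.

```python
from datetime import datetime

# Timestamps from the reported timeline, written in 24-hour form.
# Assumption: all three events happened on the same afternoon.
FMT = "%H:%M"
rebooted = datetime.strptime("16:30", FMT)
node_ready = datetime.strptime("16:36", FMT)
policy_succeeded = datetime.strptime("16:54", FMT)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


reboot_to_ready = minutes_between(rebooted, node_ready)
ready_to_policy = minutes_between(node_ready, policy_succeeded)
reboot_to_policy = minutes_between(rebooted, policy_succeeded)
total_vfs = 2 * 127  # two NICs with 127 VFs each

print(f"reboot -> node ready:       {reboot_to_ready} min")
print(f"node ready -> policy ready: {ready_to_policy} min")
print(f"reboot -> policy ready:     {reboot_to_policy} min")
print(f"total VFs configured:       {total_vfs}")
```

In this run, most of the delay falls after the node is already ready, which matches the observation that the MCP is ready while the policies are still being applied.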

Comment 1 Eswar Vadla 2021-10-12 11:16:41 UTC
Due to size constraints, I was unable to upload a few more datasets. Please check the attachments in case #03038286.

Comment 2 Peng Liu 2021-10-13 03:35:51 UTC
Can you provide the log of the sriov-config-daemon pod for the node (worker01.ocp47.lab.com) from the time this issue happened?

The sriov-config-daemon log provided in https://access.redhat.com/support/cases/#/case/03038286/discussion?attachmentId=a092K00002l1mcgQAA does not cover the period when the issue happened.

Comment 3 Eswar Vadla 2021-11-11 09:29:37 UTC
Created attachment 1841166 [details]
SRIOV logs(config)

Comment 4 Eswar Vadla 2021-11-11 09:38:47 UTC
Hi Peng,

The customer ran the same test yesterday, and it took 25 minutes for 127 VFs.

=> Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?
Please find the log attached.

Regards,
Eswar.

Comment 5 Peng Liu 2021-11-15 03:29:22 UTC
@evadla I still cannot find useful information in the logs. All the provided sriov-config-daemon logs show 'syncStatus: Succeeded'. I need a log that covers the period from when the sriov-config-daemon starts until it reports 'syncStatus: Succeeded', so that we can find out what happened during those 20 to 30 minutes after the reboot.
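
One way to bracket the window requested above is to poll the node's sync status after the reboot and record when it changes. The following is a minimal sketch, not part of the operator: it assumes the `oc` CLI is logged in to the cluster and that the SriovNetworkNodeState object exposes the status under `.status.syncStatus`. The helper names `read_sync_status` and `wait_for_sync` are hypothetical.

```python
import subprocess
import time

# Namespace used by the policy objects in this report.
NAMESPACE = "openshift-sriov-network-operator"


def read_sync_status(node):
    """Read the sync status of the node's SriovNetworkNodeState through oc.

    Assumption: the status is exposed as .status.syncStatus.
    Returns an empty string if the object cannot be read yet.
    """
    result = subprocess.run(
        ["oc", "get", "sriovnetworknodestates.sriovnetwork.openshift.io", node,
         "-n", NAMESPACE, "-o", "jsonpath={.status.syncStatus}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def wait_for_sync(node, read_status=read_sync_status, interval=30.0,
                  timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is 'Succeeded'.

    Returns a list of (elapsed_seconds, status) tuples, one per status change.
    Raises TimeoutError if 'Succeeded' is not seen within the timeout.
    """
    start = clock()
    transitions = []
    last = None
    while True:
        status = read_status(node)
        elapsed = clock() - start
        if status != last:
            transitions.append((elapsed, status))
            last = status
        if status == "Succeeded":
            return transitions
        if elapsed >= timeout:
            raise TimeoutError(f"{node} did not reach Succeeded in {timeout} s")
        sleep(interval)


# Example call (requires cluster access, so it is left commented out):
# print(wait_for_sync("worker01.ocp47.lab.com"))
```

The recorded transitions give the start and end of the interval whose sriov-config-daemon log is needed. Note that a status left over from before the reboot may still read Succeeded on the first poll, so the result should be checked against the daemon start time.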

Comment 8 Eswar Vadla 2021-11-30 09:32:39 UTC
Hi @pliu 

Any update?

BR
Eswar.

Comment 18 Ying Wang 2021-12-23 06:26:29 UTC
Verified on sriov-network-operator.4.10.0-202112210953 by creating 127 VFs with the YAML below. The sriovnetworknodestates sync status reached Succeeded in less than 3 minutes.
Removing the policy took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  resourceName: mlx277

# oc describe sriovnetworknodestates.sriovnetwork.openshift.io openshift-qe-029.lab.eng.rdu2.redhat.com | grep Sync
  Sync Status:      Succeeded


Added "isRdma: true" to the YAML file and created the SR-IOV policy again. The sync status took about 8 minutes to reach Succeeded, and removing the policy took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  isRdma: true
  resourceName: mlx277


Omitting isRdma reduces the time needed to apply the SriovNetworkNodePolicy (less than 3 minutes versus about 8 minutes in this test).

Comment 26 errata-xmlrpc 2022-03-10 16:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056