Bug 2013199 - SRIOV policy takes a very long time to apply after node reboot
Summary: SRIOV policy takes a very long time to apply after node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Peng Liu
QA Contact: Ying Wang
URL:
Whiteboard:
Depends On:
Blocks: 2040122
Reported: 2021-10-12 11:05 UTC by Eswar Vadla
Modified: 2022-03-10 16:19 UTC
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2040122
Environment:
Last Closed: 2022-03-10 16:18:42 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
MCP+sriovnetworknodestate+workerNode yamls (193.67 KB, text/plain)
2021-10-12 11:05 UTC, Eswar Vadla
SRIOV logs(config) (12.02 MB, application/vnd.rar)
2021-11-11 09:29 UTC, Eswar Vadla


Links
System ID Private Priority Status Summary Last Updated
Github openshift sriov-network-operator pull 606 0 None open BUG 2013199: Update operator bundle 2021-12-21 07:04:30 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:19:09 UTC

Description Eswar Vadla 2021-10-12 11:05:38 UTC
Created attachment 1832183 [details]
MCP+sriovnetworknodestate+workerNode yamls

Description of problem:

The SR-IOV policy takes around 30 minutes to take effect, and the same delay occurs every time the customer reboots the node.

Customer description:
The customer configured an SR-IOV worker node with 127 VFs per NIC. In general, for a VF count of 127, the node takes around 30 minutes to apply the configuration.
With 8 VFs the configuration takes less than 5 minutes, but with 127 VFs it takes about 30 minutes for the SriovNetworkNodePolicy to take effect.

** Whenever the customer reboots the node, the MCP returns to Ready state quickly, but the policies still take that long to apply.
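To quantify the delay described above, a small polling helper can time how long the node state takes to reach Succeeded. This is only a sketch: the helper takes the status command as an argument, and the example invocation (node name from this report, jsonpath expression) is an assumption, not a command taken from the case.

```shell
# Sketch: time how long an SR-IOV node state takes to report "Succeeded".
# The status command is passed in, so the helper itself needs no cluster.
wait_for_sync() {
    start=$(date +%s)
    # Poll until the supplied command prints exactly "Succeeded"
    while [ "$($1)" != "Succeeded" ]; do
        sleep 10
    done
    echo "sync took $(( $(date +%s) - start ))s"
}

# On a live cluster one might run (hypothetical node name):
# wait_for_sync "oc get sriovnetworknodestates.sriovnetwork.openshift.io \
#   worker01.ocp47.lab.com -n openshift-sriov-network-operator \
#   -o jsonpath={.status.syncStatus}"
```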

-------

Use 'oc describe pod/sdn-4wjxk -n openshift-sdn' to see all of the containers in this pod.
sh-4.4#
sh-4.4# ethtool -i ens1f0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.31.1200(HP_2420110034)
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

------

Nic Model :   HPE Eth 10/25Gb 2p 640SFP28 Adptr

------

Timeline (127 VFs per NIC, 2 NICs, 2*127 = 254 VFs total):
4:30 pm  node rebooted
4:36 pm  node Ready
4:54 pm  SR-IOV node policy succeeded

Comment 1 Eswar Vadla 2021-10-12 11:16:41 UTC
Due to size constraints I was unable to upload a few more datasets; please check the attachments in case #03038286.

Comment 2 Peng Liu 2021-10-13 03:35:51 UTC
Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?

The sriov-config-daemon log provided in https://access.redhat.com/support/cases/#/case/03038286/discussion?attachmentId=a092K00002l1mcgQAA doesn't cover the time when the issue happened.

Comment 3 Eswar Vadla 2021-11-11 09:29:37 UTC
Created attachment 1841166 [details]
SRIOV logs(config)

Comment 4 Eswar Vadla 2021-11-11 09:38:47 UTC
Hi Peng,

The customer ran the same test yesterday, and it took 25 minutes for 127 VFs.

=> Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?
PFA.

Regards,
Eswar.

Comment 5 Peng Liu 2021-11-15 03:29:22 UTC
@evadla I still cannot find useful information in the logs. All the provided sriov-config-daemon logs show 'syncStatus: Succeeded'. I need a log covering the period from when sriov-config-daemon starts until it reaches 'syncStatus: Succeeded', so that we can find out what happened during those 20~30 minutes after reboot.
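To capture the window requested here, the config daemon's log for one node can be streamed from pod start until the sync completes. A sketch only: the app=sriov-network-config-daemon pod label is an assumption based on the operator's usual defaults and should be verified on the live cluster before use.

```shell
# Sketch: stream the sriov-config-daemon log for one node from pod start.
# Assumes the daemon pods carry the label app=sriov-network-config-daemon.
collect_daemon_log() {
    node=$1
    ns=openshift-sriov-network-operator
    # Find the daemon pod scheduled on the given node
    pod=$(oc get pods -n "$ns" -l app=sriov-network-config-daemon \
          --field-selector "spec.nodeName=$node" -o name)
    # Follow the log from the beginning so the whole sync window is captured
    oc logs -n "$ns" -f "$pod"
}

# Example (node name taken from this report):
# collect_daemon_log worker01.ocp47.lab.com
```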

Comment 8 Eswar Vadla 2021-11-30 09:32:39 UTC
Hi @pliu 

Any update?

BR
Eswar.

Comment 18 Ying Wang 2021-12-23 06:26:29 UTC
Verified on sriov-network-operator.4.10.0-202112210953 by creating 127 VFs with the yaml below. It took less than 3 minutes for the sriovnetworknodestates sync status to reach Succeeded.
Removing the policy took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  resourceName: mlx277

# oc describe sriovnetworknodestates.sriovnetwork.openshift.io openshift-qe-029.lab.eng.rdu2.redhat.com | grep Sync
  Sync Status:      Succeeded


Added "isRdma: true" to the yaml file and created the SR-IOV policy again; it took about 8 minutes for the sync status to reach Succeeded, and removing it took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  isRdma: true
  resourceName: mlx277


Omitting isRdma reduces the time needed to apply the SriovNetworkNodePolicy.
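As an extra check during this kind of verification, the VF count the kernel actually created can be read straight from sysfs, independently of the operator's reported status. sriov_numvfs is the standard kernel sysfs attribute for SR-IOV PFs; the interface name below comes from the report's ethtool output, and the sysfs root is parameterized only so the helper can be exercised off-box.

```shell
# Sketch: read the number of VFs the kernel created for a PF interface.
vf_count() {
    iface=$1
    root=${2:-/sys/class/net}   # overridable root, mainly for testing
    cat "$root/$iface/device/sriov_numvfs"
}

# On the node, once the policy has applied:
# vf_count ens1f0
```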

Comment 26 errata-xmlrpc 2022-03-10 16:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

