Bug 2013199 - SRIOV policy takes a very long time to apply after node reboot
Summary: SRIOV policy takes a very long time to apply after node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Peng Liu
QA Contact: Ying Wang
URL:
Whiteboard:
Depends On:
Blocks: 2040122
Reported: 2021-10-12 11:05 UTC by Eswar Vadla
Modified: 2022-03-10 16:19 UTC
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2040122
Environment:
Last Closed: 2022-03-10 16:18:42 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
MCP+sriovnetworknodestate+workerNode yamls (193.67 KB, text/plain)
2021-10-12 11:05 UTC, Eswar Vadla
SRIOV logs(config) (12.02 MB, application/vnd.rar)
2021-11-11 09:29 UTC, Eswar Vadla


Links
System ID Private Priority Status Summary Last Updated
Github openshift sriov-network-operator pull 606 0 None open BUG 2013199: Update operator bundle 2021-12-21 07:04:30 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:19:09 UTC

Description Eswar Vadla 2021-10-12 11:05:38 UTC
Created attachment 1832183 [details]
MCP+sriovnetworknodestate+workerNode yamls

Description of problem:

The SR-IOV policy takes around 30 minutes to take effect, and the same delay occurs every time the customer reboots the node.

Customer description:
The customer configured an SR-IOV worker node with 127 VFs per NIC. In general, for a VF count of 127, the node takes around 30 minutes to apply the configuration.
With 8 VFs the configuration takes less than 5 minutes, but with 127 VFs it takes about 30 minutes for the SriovNetworkNodePolicy to take effect.

** Whenever the customer reboots the node, the MCP returns to Ready state quickly, but the policies still take that long to apply.
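To quantify the delay described above, a small polling helper can time how long the node state takes to reach Succeeded. This is only a sketch: the helper takes the status command as an argument, and the example invocation (node name from this report, jsonpath expression) is an assumption, not a command taken from the case.

```shell
# Sketch: time how long an SR-IOV node state takes to report "Succeeded".
# The status command is passed in, so the helper itself needs no cluster.
wait_for_sync() {
    start=$(date +%s)
    # Poll until the supplied command prints exactly "Succeeded"
    while [ "$($1)" != "Succeeded" ]; do
        sleep 10
    done
    echo "sync took $(( $(date +%s) - start ))s"
}

# On a live cluster one might run (hypothetical node name):
# wait_for_sync "oc get sriovnetworknodestates.sriovnetwork.openshift.io \
#   worker01.ocp47.lab.com -n openshift-sriov-network-operator \
#   -o jsonpath={.status.syncStatus}"
```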

-------

Use 'oc describe pod/sdn-4wjxk -n openshift-sdn' to see all of the containers in this pod.
sh-4.4#
sh-4.4# ethtool -i ens1f0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.31.1200(HP_2420110034)
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

------

Nic Model :   HPE Eth 10/25Gb 2p 640SFP28 Adptr

------

Timeline (127 VFs per NIC, 2 NICs, 2*127 = 254 VFs total):
4:30 pm  node rebooted
4:36 pm  node Ready
4:54 pm  SR-IOV node policy succeeded

Comment 1 Eswar Vadla 2021-10-12 11:16:41 UTC
Due to size constraints I was unable to upload a few more datasets; please check the attachments in case #03038286.

Comment 2 Peng Liu 2021-10-13 03:35:51 UTC
Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?

The sriov-config-daemon log provided in https://access.redhat.com/support/cases/#/case/03038286/discussion?attachmentId=a092K00002l1mcgQAA doesn't cover the time when the issue happened.

Comment 3 Eswar Vadla 2021-11-11 09:29:37 UTC
Created attachment 1841166 [details]
SRIOV logs(config)

Comment 4 Eswar Vadla 2021-11-11 09:38:47 UTC
Hi Peng,

The customer ran the same test yesterday, and it took 25 minutes for 127 VFs.

=> Can you provide the log of sriov-config-daemon pod for the node (worker01.ocp47.lab.com) when this issue happened?
PFA.

Regards,
Eswar.

Comment 5 Peng Liu 2021-11-15 03:29:22 UTC
@evadla I still cannot find useful information in the logs. All the provided sriov-config-daemon logs show 'syncStatus: Succeeded'. I need a log covering the period from when sriov-config-daemon starts until it reaches 'syncStatus: Succeeded', so that we can find out what happened during those 20~30 minutes after reboot.
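To capture the window requested here, the config daemon's log for one node can be streamed from pod start until the sync completes. A sketch only: the app=sriov-network-config-daemon pod label is an assumption based on the operator's usual defaults and should be verified on the live cluster before use.

```shell
# Sketch: stream the sriov-config-daemon log for one node from pod start.
# Assumes the daemon pods carry the label app=sriov-network-config-daemon.
collect_daemon_log() {
    node=$1
    ns=openshift-sriov-network-operator
    # Find the daemon pod scheduled on the given node
    pod=$(oc get pods -n "$ns" -l app=sriov-network-config-daemon \
          --field-selector "spec.nodeName=$node" -o name)
    # Follow the log from the beginning so the whole sync window is captured
    oc logs -n "$ns" -f "$pod"
}

# Example (node name taken from this report):
# collect_daemon_log worker01.ocp47.lab.com
```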

Comment 8 Eswar Vadla 2021-11-30 09:32:39 UTC
Hi @pliu 

Any update?

BR
Eswar.

Comment 18 Ying Wang 2021-12-23 06:26:29 UTC
Verified on sriov-network-operator.4.10.0-202112210953 by creating 127 VFs with the yaml below. It took less than 3 minutes for the sriovnetworknodestates sync status to reach Succeeded.
Removing the policy took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  resourceName: mlx277

# oc describe sriovnetworknodestates.sriovnetwork.openshift.io openshift-qe-029.lab.eng.rdu2.redhat.com | grep Sync
  Sync Status:      Succeeded


Added "isRdma: true" to the yaml file and created the SR-IOV policy again; it took about 8 minutes for the sync status to reach Succeeded, and removing it took about 5 minutes.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    vendor: '15b3'
    deviceID: '1017'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 127
  isRdma: true
  resourceName: mlx277


Omitting isRdma reduces the time needed to apply the SriovNetworkNodePolicy.
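As an extra check during this kind of verification, the VF count the kernel actually created can be read straight from sysfs, independently of the operator's reported status. sriov_numvfs is the standard kernel sysfs attribute for SR-IOV PFs; the interface name below comes from the report's ethtool output, and the sysfs root is parameterized only so the helper can be exercised off-box.

```shell
# Sketch: read the number of VFs the kernel created for a PF interface.
vf_count() {
    iface=$1
    root=${2:-/sys/class/net}   # overridable root, mainly for testing
    cat "$root/$iface/device/sriov_numvfs"
}

# On the node, once the policy has applied:
# vf_count ens1f0
```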

Comment 26 errata-xmlrpc 2022-03-10 16:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

