Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2118416

Summary: OCP 4.10.25 - SRIOV config-daemon pods reached ~200% container CPU and workload pods failed to attach to SRIOV networks
Product: OpenShift Container Platform
Reporter: Noreen <nchhabra>
Component: Networking
Assignee: Vrinda <vpunj>
Networking sub component: SR-IOV
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX
Docs Contact:
Severity: medium
Priority: medium
CC: nchhabra, zshi
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-29 22:29:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Noreen 2022-08-15 19:21:52 UTC
Description of problem:

SNO (with DU profile) running OCP 4.10.25: the SRIOV config-daemon pods reached ~200% container CPU, and all workload pods failed to attach to SRIOV networks in spite of the network attachments having been created before the workload pods. As a result, the workload pods were stuck in Pending state.

Version-Release number of selected component (if applicable):
OCP 4.10.25
Kernel - 4.18.0-305.57.1.rt7.129.el8_4.x86_64
Crio - cri-o://1.23.3-11.rhaos4.10.gitddf4b1a.1.el8

How reproducible:
1. Create sriovnetworknodepolicies and sriovnetworks to be assigned to workload pods:

# oc get sriovnetworknodepolicies.sriovnetwork.openshift.io -A | wc -l              
63   

# oc get sriovnetworks.sriovnetwork.openshift.io -A | wc -l                         
62   

2. 61 namespaces corresponding to the target namespaces in the SRIOV config were deployed.

3. Kube-burner was used to create 61 workload pods; 60 of them were stuck in Pending state because they failed to attach to SRIOV networks.

4. The sriov config-daemon pod showed high container CPU of ~200%.

5. The issue is corrected when the workload pods and namespaces are force-deleted, the sriov config-daemon pod is deleted (and redeployed by sriov-network-operator), and all sriovnetworknodepolicies and sriovnetworks are redeployed.
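
The checks described in the steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: `count_rows` and `pending_pods` are hypothetical helper names (not part of `oc`), and `openshift-sriov-network-operator` is assumed to be the namespace where the SR-IOV operator and its config daemon run. Note that a plain `oc get ... | wc -l` also counts the table header line when resources exist, so the helper below skips it.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of oc.

# Count data rows in `oc get` table output read from stdin, skipping the header line.
count_rows() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Print namespace/name for each Pending pod in `oc get pods -A` output read from stdin.
# Columns assumed: NAMESPACE NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk 'NR > 1 && $4 == "Pending" { print $1 "/" $2 }'
}

# Query the cluster only when the oc client is installed.
if command -v oc >/dev/null 2>&1; then
  oc get sriovnetworknodepolicies.sriovnetwork.openshift.io -A | count_rows
  oc get sriovnetworks.sriovnetwork.openshift.io -A | count_rows
  oc get pods -A | pending_pods
  # Assumed namespace for the SR-IOV operator and its config daemon.
  oc adm top pods -n openshift-sriov-network-operator --containers
fi
```

The parsing helpers read from stdin, so they can be tried against saved command output without access to the cluster.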


Steps to Reproduce:
1.
2.
3.

Actual results:
Pods stuck in Pending state.

Expected results:
Expected the pods to attach to SRIOV networks without encountering issues

Additional info:
Configs attached, along with must-gather logs and a snapshot of the performance dashboard from the failed run.

Comment 3 Vrinda 2022-11-16 21:11:03 UTC
Could you update the must-gather to one taken while the issue is occurring? I believe this must-gather was collected before the system encountered the issue, so it would be great to get an updated must-gather from when the issue is actually occurring.
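
One way to capture a must-gather while the failure is in progress is to poll for Pending pods and start collection as soon as one appears. The sketch below assumes that a Pending pod is a good enough signal for this issue; `has_pending` and `collect_when_pending` are hypothetical helper names, and the destination directory is an arbitrary example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the destination directory are hypothetical.

# Succeed when `oc get pods -A` output on stdin contains at least one Pending pod.
has_pending() {
  awk 'NR > 1 && $4 == "Pending" { found = 1 } END { exit !found }'
}

# Poll until a Pending pod shows up, then collect a must-gather into the given directory.
collect_when_pending() {
  local dest="$1" interval="${2:-30}"
  until oc get pods -A 2>/dev/null | has_pending; do
    sleep "$interval"
  done
  oc adm must-gather --dest-dir="$dest"
}

# Usage (run while reproducing the issue):
#   collect_when_pending ./must-gather-during-failure 30
```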