Bug 2102722

Summary:	CRI-O taking long time for creation of pod sandbox
Product:	OpenShift Container Platform	Reporter:	Arkadeep Sen <arsen>
Component:	Node	Assignee:	Sascha Grunert <sgrunert>
Node sub component:	CRI-O	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	medium
Priority:	medium	CC:	harpatil, pehunt, sgrunert
Version:	4.11
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-10-12 13:57:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Arkadeep Sen 2022-06-30 13:34:16 UTC

Description of problem:
Pod sandbox creation gets stuck for approximately 17 minutes and then fails. A subsequent sandbox creation request from the kubelet for the same pod succeeds. During this period, other pod sandboxes are created without any issue.

This issue has led to failures in some CI jobs, in particular the following:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1518876947986255872
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1521829895582257152
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1533907006589505536

The CRI-O logs of the node on which the pod is scheduled do not provide much detail about what is causing the 17-minute delay.

In all of the above cases, host network is set to true for the affected pod, so the delay is not caused by the CNI plugin not being ready.

Version-Release number of selected component (if applicable):

How reproducible: Not able to reproduce the issue. It was found only in CI job failures.

Steps to Reproduce:
1.
2.
3.

Actual results:
Pod sandbox creation gets stuck for 17 minutes and then fails with one of the following errors:
1. Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving pod name
2. Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
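
The two error messages listed above can be used to triage similar CI failures. The following is a minimal sketch, not part of CRI-O or the kubelet, that counts how often each message appears in a set of log lines. The function names and signature keys are hypothetical, and how the log lines are collected from the node is left out as an assumption.

```python
# Error substrings copied from the two messages in "Actual results".
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "name_reservation": "error reserving pod name",
    "deadline_exceeded": "code = DeadlineExceeded desc = context deadline exceeded",
}


def classify_line(line):
    """Return the labels of all known error signatures found in one log line."""
    return [label for label, text in SIGNATURES.items() if text in line]


def count_signatures(lines):
    """Count how many lines contain each known error signature."""
    counts = {label: 0 for label in SIGNATURES}
    for line in lines:
        for label in classify_line(line):
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the two messages quoted in this report.
    sample = [
        "Kubelet may be retrying requests that are timing out in CRI-O due to "
        "system load: context deadline exceeded: error reserving pod name",
        "Failed to create pod sandbox: rpc error: code = DeadlineExceeded "
        "desc = context deadline exceeded",
    ]
    print(count_signatures(sample))
```

A line that matches neither message is simply not counted, so the sketch can be run over a complete node log without filtering it first.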

Expected results:
Pod sandbox creation should not get stuck.

Additional info:
Slack thread regarding the issue: https://coreos.slack.com/archives/CK1AE4ZCK/p1655287375267959

Comment 3 Sascha Grunert 2022-10-12 13:57:16 UTC

Thanks, I'll close this bug now since it seems to be either already fixed or not reproducible. Let's reconsider the case once we find a similar issue.

Comment 4 Arkadeep Sen 2022-10-12 14:09:19 UTC

Would adding some log messages by default, rather than through CRI-O debug logs, be of help in the future if such issues arise again?

Comment 5 Peter Hunt 2022-10-12 14:13:12 UTC

Luckily, we have added such log messages recently in 4.11, and we intend to backport them. They're available by default at the info level.

Comment 6 Arkadeep Sen 2022-10-12 14:20:37 UTC

Sounds good then.