2102722 – CRI-O taking long time for creation of pod sandbox

Summary: CRI-O taking long time for creation of pod sandbox

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Sascha Grunert
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-06-30 13:34 UTC by Arkadeep Sen
Modified:	2022-10-12 14:20 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-10-12 13:57:08 UTC
Target Upstream Version:
Embargoed:

Description Arkadeep Sen 2022-06-30 13:34:16 UTC

Description of problem:
Pod sandbox creation gets stuck for approximately 17 minutes and then fails. A subsequent sandbox creation request from the kubelet for the same pod succeeds. During this period, other pod sandboxes are created without any issue.

This issue has led to failures in several CI jobs, in particular the following:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1518876947986255872
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1521829895582257152
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn/1533907006589505536

The CRI-O logs of the node on which the pod is scheduled do not provide much detail about what is causing the 17-minute delay.

In all of the above cases, host network is set to true for the affected pod, so the delay is not caused by the CNI plugin not being ready.

Version-Release number of selected component (if applicable):

How reproducible: Not able to reproduce the issue. The occurrences were found in CI job failures.

Steps to Reproduce:
1.
2.
3.

Actual results:
Pod sandbox creation gets stuck for approximately 17 minutes and then fails with one of the following errors:
1. Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving pod name
2. Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
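To help spot further occurrences in saved node logs, the two error messages above can be searched for with a small script. This is only a sketch: the function name `find_sandbox_timeouts` and the labels are hypothetical, and it assumes the log has already been saved as plain text with one entry per line.

```python
import re

# Substrings copied from the two error messages quoted above.
# The label names are made up for this sketch.
PATTERNS = {
    "name_reservation": re.compile(
        r"context deadline exceeded: error reserving pod name"
    ),
    "sandbox_deadline": re.compile(
        r"Failed to create pod sandbox: rpc error: code = DeadlineExceeded"
    ),
}


def find_sandbox_timeouts(lines):
    """Return (line_number, label, text) for every line matching a known error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
                break  # report each line at most once
    return hits
```

For example, after saving the kubelet journal of the affected node to a file, passing the open file object to `find_sandbox_timeouts` lists the matching entries with their line numbers, which makes it easier to line them up with the 17-minute gap.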

Expected results:
Pod sandbox creation should not get stuck.

Additional info:
Slack thread regarding the issue: https://coreos.slack.com/archives/CK1AE4ZCK/p1655287375267959

Comment 3 Sascha Grunert 2022-10-12 13:57:16 UTC

Thanks, I'll close this bug now since it seems to be either already fixed or not reproducible. Let's reconsider the case once we find a similar issue.

Comment 4 Arkadeep Sen 2022-10-12 14:09:19 UTC

Would adding some log messages by default, rather than through CRI-O debug logs, be of help in the future if such issues arise again?

Comment 5 Peter Hunt 2022-10-12 14:13:12 UTC

Luckily, we recently added such log messages in 4.11, and we intend to backport them. They're available by default at the info level.

Comment 6 Arkadeep Sen 2022-10-12 14:20:37 UTC

Sounds good then.
