Bug 1794142

Summary: Pod startup time in OpenShift 4 has increased
Product: OpenShift Container Platform
Reporter: Alan Field <afield>
Component: Networking
Assignee: Douglas Smith <dosmith>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: urgent
CC: aconstan, aos-bugs, jokerman, lmohanty, mfojtik, mifiedle, scuppett, sdodson, umohnani
Version: 4.3.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-02 12:21:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: cri-o logs    Flags: none

Description Alan Field 2020-01-22 18:42:03 UTC
Description of problem:

This issue was discovered as part of the following JIRA filed by the serverless team:

https://issues.redhat.com/browse/SRVKS-379

Details of the testing are there, but in short: a bare pod takes almost 7 seconds to become ready, and a pod in a deployment takes more like 8 or 9 seconds. Scale from zero on GKE happens in about 2 seconds, so this is a big problem for OpenShift Serverless.
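For illustration only (this is not the serverless team's actual test from SRVKS-379), a minimal client-go sketch that times how long a bare pod takes to report Ready. The namespace, pod name, and pause image used here are assumptions:

// startup-latency.go: a rough sketch (not the original SRVKS-379 test) that times
// how long a bare pod takes to become Ready. Pod name, image, and namespace are
// illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "startup-latency-test", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.2"}},
		},
	}

	ctx := context.Background()
	start := time.Now()
	if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Poll until the PodReady condition is True, then report the elapsed time.
	for {
		p, err := client.CoreV1().Pods("default").Get(ctx, "startup-latency-test", metav1.GetOptions{})
		if err == nil {
			for _, c := range p.Status.Conditions {
				if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
					fmt.Printf("pod Ready after %v\n", time.Since(start))
					return
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
}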


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Maciej Szulik 2020-01-27 17:29:32 UTC
From the log snippet in the jira:

Dec 17 03:28:30 ip-10-0-174-45 hyperkube[2277]: I1217 03:28:30.132130    2277 manager.go:1011] Added container: "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b71d398_207d_11ea_8028_0a1263433394.slice/crio-a40b6ac715335fcd3227fb2e5c84f14ae52cd319d703f8718873cecfba2f81a8.scope" (aliases: [k8s_POD_test-selector-74fc74c8d8-xl4b7_bbrowning-sandbox_4b71d398-207d-11ea-8028-0a1263433394_0 a40b6ac715335fcd3227fb2e5c84f14ae52cd319d703f8718873cecfba2f81a8], namespace: "crio")
Dec 17 03:28:37 ip-10-0-174-45 crio[1889]: 2019-12-17T03:28:37Z [verbose] Add: bbrowning-sandbox:test-selector-74fc74c8d8-xl4b7:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/3285126/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.128.2.78/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}

There is a roughly 7 second gap between the kubelet adding the container (03:28:30) and the CNI Add completing (03:28:37), so this looks like an issue with either the kubelet or the runtime. Moving to the node team for investigation.

Comment 2 Ryan Phillips 2020-01-27 17:32:04 UTC
Which release are you testing with? RHCOS version?

Comment 3 Alan Field 2020-01-27 18:49:52 UTC
SRVKS-379 was filed against OCP 4.2.10, but I do not know the exact version of RHCOS.

Comment 5 Urvashi Mohnani 2020-02-07 20:56:12 UTC
Created attachment 1661789 [details]
cri-o logs

Comment 6 Douglas Smith 2020-02-10 19:00:32 UTC
I've identified the root cause of the introduced delay and have created a number of PRs to address it.

The core issue is that the handling of the "readinessindicatorfile" option introduces a delay on every request, even when the option is not in use. I believe this option was used in previous OpenShift versions and removed in a recent one. I'll continue to assess whether other previous versions will need a fix in the z-stream.
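
For illustration only (this is not the code from the PRs below), the general shape of the fix is to skip the readiness-file wait entirely when no readinessindicatorfile is configured, instead of paying a poll delay on every CNI request. The function and parameter names in this sketch are assumptions:

// readiness.go: an illustrative sketch (not the actual multus-cni change) of gating
// the readiness-indicator wait on whether the option is configured at all.
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForReadinessIndicator blocks until the given file exists or the timeout
// expires. With an empty path it returns immediately, so configurations that do
// not set readinessindicatorfile pay no delay. Names here are hypothetical.
func waitForReadinessIndicator(path string, timeout time.Duration) error {
	if path == "" {
		// Option not configured: nothing to wait for.
		return nil
	}
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for readiness indicator %q", path)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	// Example: no readinessindicatorfile configured, so this returns immediately.
	if err := waitForReadinessIndicator("", 10*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("proceeding with CNI ADD")
}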

Upstream PR @ https://github.com/intel/multus-cni/pull/439

Downstream master PR @ https://github.com/openshift/multus-cni/pull/43
Downstream 4.4 PR @ https://github.com/openshift/multus-cni/pull/44
Downstream 4.3 PR @ https://github.com/openshift/multus-cni/pull/45

Comment 9 W. Trevor King 2021-04-05 17:47:56 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475