Bug 1794142

Summary: Pod startup time in OpenShift 4 has increased
Product: OpenShift Container Platform
Reporter: Alan Field <afield>
Component: Networking
Assignee: Douglas Smith <dosmith>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: urgent
CC: aconstan, aos-bugs, jokerman, lmohanty, mfojtik, mifiedle, scuppett, sdodson, umohnani
Version: 4.3.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-02 12:21:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: cri-o logs    Flags: none

Description Alan Field 2020-01-22 18:42:03 UTC
Description of problem:

This issue was discovered as part of the following JIRA filed by the serverless team:

https://issues.redhat.com/browse/SRVKS-379

Details of the testing are there, but in short: a bare pod takes almost 7 seconds to become ready, and a pod in a deployment takes more like 8 or 9 seconds. Scale from zero on GKE happens in about 2 seconds, so this is a big problem for OpenShift Serverless.
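For illustration only (this is not the serverless team's actual test from SRVKS-379), a minimal client-go sketch that times how long a bare pod takes to report Ready. The namespace, pod name, and pause image used here are assumptions:

// startup-latency.go: a rough sketch (not the original SRVKS-379 test) that times
// how long a bare pod takes to become Ready. Pod name, image, and namespace are
// illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "startup-latency-test", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.2"}},
		},
	}

	ctx := context.Background()
	start := time.Now()
	if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Poll until the PodReady condition is True, then report the elapsed time.
	for {
		p, err := client.CoreV1().Pods("default").Get(ctx, "startup-latency-test", metav1.GetOptions{})
		if err == nil {
			for _, c := range p.Status.Conditions {
				if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
					fmt.Printf("pod Ready after %v\n", time.Since(start))
					return
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
}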


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Maciej Szulik 2020-01-27 17:29:32 UTC
From the log snippet in the jira:

Dec 17 03:28:30 ip-10-0-174-45 hyperkube[2277]: I1217 03:28:30.132130    2277 manager.go:1011] Added container: "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b71d398_207d_11ea_8028_0a1263433394.slice/crio-a40b6ac715335fcd3227fb2e5c84f14ae52cd319d703f8718873cecfba2f81a8.scope" (aliases: [k8s_POD_test-selector-74fc74c8d8-xl4b7_bbrowning-sandbox_4b71d398-207d-11ea-8028-0a1263433394_0 a40b6ac715335fcd3227fb2e5c84f14ae52cd319d703f8718873cecfba2f81a8], namespace: "crio")
Dec 17 03:28:37 ip-10-0-174-45 crio[1889]: 2019-12-17T03:28:37Z [verbose] Add: bbrowning-sandbox:test-selector-74fc74c8d8-xl4b7:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/3285126/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.128.2.78/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}

There is a roughly 7 second gap between the kubelet adding the container (03:28:30) and the CNI Add completing (03:28:37), so this looks like an issue with either the kubelet or the runtime. Moving to the node team for investigation.

Comment 2 Ryan Phillips 2020-01-27 17:32:04 UTC
Which release are you testing with? RHCOS version?

Comment 3 Alan Field 2020-01-27 18:49:52 UTC
SRVKS-379 was filed against OCP 4.2.10, but I do not know the exact version of RHCOS.

Comment 5 Urvashi Mohnani 2020-02-07 20:56:12 UTC
Created attachment 1661789 [details]
cri-o logs

Comment 6 Douglas Smith 2020-02-10 19:00:32 UTC
I've identified the root cause of the introduced delay and have created a number of PRs to address it.

The core issue is that the handling of the "readinessindicatorfile" option introduces a delay on every request, even when the option is not in use. I believe this option was used in previous OpenShift versions and removed in a recent one. I'll continue to assess whether other previous versions will need a fix in the z-stream.
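
For illustration only (this is not the code from the PRs below), the general shape of the fix is to skip the readiness-file wait entirely when no readinessindicatorfile is configured, instead of paying a poll delay on every CNI request. The function and parameter names in this sketch are assumptions:

// readiness.go: an illustrative sketch (not the actual multus-cni change) of gating
// the readiness-indicator wait on whether the option is configured at all.
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForReadinessIndicator blocks until the given file exists or the timeout
// expires. With an empty path it returns immediately, so configurations that do
// not set readinessindicatorfile pay no delay. Names here are hypothetical.
func waitForReadinessIndicator(path string, timeout time.Duration) error {
	if path == "" {
		// Option not configured: nothing to wait for.
		return nil
	}
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for readiness indicator %q", path)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	// Example: no readinessindicatorfile configured, so this returns immediately.
	if err := waitForReadinessIndicator("", 10*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("proceeding with CNI ADD")
}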

Upstream PR @ https://github.com/intel/multus-cni/pull/439

Downstream master PR @ https://github.com/openshift/multus-cni/pull/43
Downstream 4.4 PR @ https://github.com/openshift/multus-cni/pull/44
Downstream 4.3 PR @ https://github.com/openshift/multus-cni/pull/45

Comment 9 W. Trevor King 2021-04-05 17:47:56 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475