Bug 1927263 - kubelet service takes around 43 secs to start container when started from stopped state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks: 1955231
 
Reported: 2021-02-10 12:15 UTC by Praveen Kumar
Modified: 2021-07-27 22:44 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1955231
Environment:
Last Closed: 2021-07-27 22:43:43 UTC
Target Upstream Version:
Embargoed:


Attachments
kubelet journal logs for 4.6.15 (228.80 KB, text/plain), 2021-02-10 12:15 UTC, Praveen Kumar
kubelet journal logs for 4.7.0-rc.0 (280.50 KB, text/plain), 2021-02-10 12:16 UTC, Praveen Kumar


Links
Github kubernetes/kubernetes pull 99336 (closed): pkg/kubelet: improve the node informer sync check. Last updated 2021-04-22 17:22:33 UTC
Github openshift/kubernetes pull 704 (open): Bug 1927263: UPSTREAM: 99336: kubelet: improve the node informer sync check. Last updated 2021-04-22 17:21:50 UTC
Red Hat Product Errata RHSA-2021:2438. Last updated 2021-07-27 22:44:17 UTC

Description Praveen Kumar 2021-02-10 12:15:15 UTC
Created attachment 1756195
kubelet journal logs for 4.6.15

Description of problem:
For CRC we stop the kubelet service before generating the disk image, and then start it again so the environment comes up as expected. With 4.7.0 we are seeing that the time between starting the kubelet service and the first container starting is now around 43 seconds, whereas it used to be around 2 seconds with 4.6. This increase is causing issues for CRC.

Version-Release number of selected component (if applicable):
4.7.0-rc.0

How reproducible:
Log in to the master node and execute the following steps (a consolidated sketch of these steps follows the list):

- [root@crc-ctj2r-master-0 core]# systemctl stop kubelet
- [root@crc-ctj2r-master-0 core]# pods=$(crictl pods -o json | jq '.items[] | select(.metadata.namespace != "openshift-sdn")' | jq -r .id)
- [root@crc-ctj2r-master-0 core]# crictl stopp $pods
- [root@crc-ctj2r-master-0 core]# crictl rmp -f $pods
- [root@crc-ctj2r-master-0 core]# systemctl restart cri-o
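
For reference, the steps above and the timing measurement from the Actual results section can be combined into one script. This is a minimal sketch rather than part of the original report: it assumes it is run as root directly on the master node with crictl and jq available there; the openshift-sdn namespace filter and the `watch -g` timing trick are taken from this bug.

```
#!/usr/bin/env bash
# Sketch: reproduce the kubelet restart and time how long the first
# container takes to appear after the service starts.
set -euo pipefail

systemctl stop kubelet

# Remove all pod sandboxes except openshift-sdn, as in the steps above.
pods=$(crictl pods -o json \
  | jq -r '.items[] | select(.metadata.namespace != "openshift-sdn") | .id')

if [ -n "$pods" ]; then
  crictl stopp $pods
  crictl rmp -f $pods
fi

systemctl restart cri-o   # as written in the report; the unit is usually crio.service

# Start kubelet and time until `crictl ps -a` output changes, i.e. the
# first container shows up (watch -g exits when the output changes).
systemctl start kubelet
time watch -g crictl ps -a
```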

Actual results:

```
# systemctl start kubelet && time watch -g crictl ps -a

real	0m43.007s
user	0m0.441s
sys	0m0.570s
```

Expected results:
Should be the same as on a 4.6.x cluster:
```
systemctl start kubelet && time watch -g crictl ps -a

real	0m2.093s
user	0m0.025s
sys	0m0.032s
```


Additional info:
I am attaching the kubelet journal logs for both the 4.6 and 4.7 sides.

Comment 1 Praveen Kumar 2021-02-10 12:16:20 UTC
Created attachment 1756196
kubelet journal logs for 4.7.0-rc.0

Comment 2 Christophe Fergeau 2021-02-10 13:19:42 UTC
The "kubelet nodes not sync" messages come from https://github.com/kubernetes/kubernetes/commit/7521352 but I don't know if these messages are related to the delay we are seeing.

Comment 4 Stefan Schimanski 2021-02-16 08:37:34 UTC
Wrong component. Unsetting.

Comment 5 Elana Hashman 2021-02-24 00:20:25 UTC
I suspect this is https://github.com/kubernetes/kubeadm/issues/2395

Comment 6 Elana Hashman 2021-02-24 17:55:27 UTC
We discussed this as a team: this is not typically how the kubelet would be used in OpenShift/Kubernetes, so the change would not cause any major regressions for normal operations; the linked issue is limited to kubeadm and other consumers. Hence, this is not a blocker for 4.7.

The delay in the kubelet reaching Ready status is because it needs to sync with the API server at least once after going offline before admitting workloads. Otherwise, we run into correctness issues (bug 1930960).
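
One rough way to observe this wait from the outside, once the API server is reachable again, is to compare the kubelet unit's start time with the node's Ready transition. This is an illustrative sketch, not from the bug: the node name is a placeholder taken from the shell prompt in the reproduction steps, systemctl runs on the node itself, and oc runs wherever a kubeconfig is available.

```
NODE=crc-ctj2r-master-0   # placeholder node name from the report's prompt

# When did the kubelet unit last become active? (run on the node)
systemctl show kubelet --property=ActiveEnterTimestamp

# When did the node's Ready condition last flip? The gap between the two
# roughly corresponds to the initial node informer sync described above.
oc get node "$NODE" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
```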

There may be additional workarounds/fixes for this in other consumers beyond increasing the timeout; Praveen, what API server is this kubelet talking to?

Comment 7 Praveen Kumar 2021-02-25 04:48:17 UTC
> There may be additional workarounds/fixes for this in other consumers beyond increasing the timeout; Praveen, what API server is this kubelet talking to?

@Elana For the single-node use case, if you shut down the node and then start it, there is no API server until the kubelet starts the apiserver static pods, and to start those pods it always checks for node sync first, which is what causes the delay.
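
To make the chicken-and-egg situation concrete: on a single-node/CRC control plane the API server itself is one of the static pods the kubelet has to admit. A hedged check, assuming the usual OpenShift static pod directory of /etc/kubernetes/manifests (confirm against staticPodPath in the kubelet config if it differs):

```
# Static pod manifests the kubelet starts on its own, including kube-apiserver:
ls /etc/kubernetes/manifests/

# The apiserver container only appears once the kubelet admits its static pods:
crictl ps --name kube-apiserver
```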

Comment 8 Elana Hashman 2021-03-10 21:47:06 UTC
Still awaiting upstream fix. May not land this release due to Kubernetes code freeze deadline.

https://github.com/kubernetes/kubernetes/pull/99336

Comment 9 Elana Hashman 2021-03-26 23:03:48 UTC
Once the upstream patch merges, I plan to backport it; that should resolve this.

Comment 10 Elana Hashman 2021-04-28 22:53:08 UTC
PR approved, waiting on merge.

Comment 15 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

