Bug 1927263 - kubelet service takes around 43 secs to start container when started from stopped state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks: 1955231
 
Reported: 2021-02-10 12:15 UTC by Praveen Kumar
Modified: 2021-07-27 22:44 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1955231
Environment:
Last Closed: 2021-07-27 22:43:43 UTC
Target Upstream Version:
Embargoed:


Attachments
kubelet journal logs for 4.6.15 (228.80 KB, text/plain), 2021-02-10 12:15 UTC, Praveen Kumar
kubelet journal logs for 4.7.0-rc.0 (280.50 KB, text/plain), 2021-02-10 12:16 UTC, Praveen Kumar


Links
Github kubernetes/kubernetes pull 99336 (closed): pkg/kubelet: improve the node informer sync check. Last updated 2021-04-22 17:22:33 UTC
Github openshift/kubernetes pull 704 (open): Bug 1927263: UPSTREAM: 99336: kubelet: improve the node informer sync check. Last updated 2021-04-22 17:21:50 UTC
Red Hat Product Errata RHSA-2021:2438. Last updated 2021-07-27 22:44:17 UTC

Description Praveen Kumar 2021-02-10 12:15:15 UTC
Created attachment 1756195
kubelet journal logs for 4.6.15

Description of problem:
For CRC we stop the kubelet service before generating the disk image, and then start it again so the environment comes up as expected. With 4.7.0 we are seeing that the time between starting the kubelet service and the first container starting is now around 43 seconds, whereas it used to be around 2 seconds with 4.6. This increase is causing issues for CRC.

Version-Release number of selected component (if applicable):
4.7.0-rc.0

How reproducible:
Log in to the master node and execute the following steps (a consolidated sketch of these steps follows the list):

- [root@crc-ctj2r-master-0 core]# systemctl stop kubelet
- [root@crc-ctj2r-master-0 core]# pods=$(crictl pods -o json | jq '.items[] | select(.metadata.namespace != "openshift-sdn")' | jq -r .id)
- [root@crc-ctj2r-master-0 core]# crictl stopp $pods
- [root@crc-ctj2r-master-0 core]# crictl rmp -f $pods
- [root@crc-ctj2r-master-0 core]# systemctl restart cri-o
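
For reference, the steps above and the timing measurement from the Actual results section can be combined into one script. This is a minimal sketch rather than part of the original report: it assumes it is run as root directly on the master node with crictl and jq available there; the openshift-sdn namespace filter and the `watch -g` timing trick are taken from this bug.

```
#!/usr/bin/env bash
# Sketch: reproduce the kubelet restart and time how long the first
# container takes to appear after the service starts.
set -euo pipefail

systemctl stop kubelet

# Remove all pod sandboxes except openshift-sdn, as in the steps above.
pods=$(crictl pods -o json \
  | jq -r '.items[] | select(.metadata.namespace != "openshift-sdn") | .id')

if [ -n "$pods" ]; then
  crictl stopp $pods
  crictl rmp -f $pods
fi

systemctl restart cri-o   # as written in the report; the unit is usually crio.service

# Start kubelet and time until `crictl ps -a` output changes, i.e. the
# first container shows up (watch -g exits when the output changes).
systemctl start kubelet
time watch -g crictl ps -a
```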

Actual results:

```
# systemctl start kubelet && time watch -g crictl ps -a

real	0m43.007s
user	0m0.441s
sys	0m0.570s
```

Expected results:
Should be the same as on a 4.6.x cluster:
```
systemctl start kubelet && time watch -g crictl ps -a

real	0m2.093s
user	0m0.025s
sys	0m0.032s
```


Additional info:
I am attaching the kubelet journal logs for both the 4.6 and 4.7 sides.

Comment 1 Praveen Kumar 2021-02-10 12:16:20 UTC
Created attachment 1756196
kubelet journal logs for 4.7.0-rc.0

Comment 2 Christophe Fergeau 2021-02-10 13:19:42 UTC
The "kubelet nodes not sync" messages come from https://github.com/kubernetes/kubernetes/commit/7521352 but I don't know if these messages are related to the delay we are seeing.

Comment 4 Stefan Schimanski 2021-02-16 08:37:34 UTC
Wrong component. Unsetting.

Comment 5 Elana Hashman 2021-02-24 00:20:25 UTC
I suspect this is https://github.com/kubernetes/kubeadm/issues/2395

Comment 6 Elana Hashman 2021-02-24 17:55:27 UTC
We discussed this as a team: this is not typically how the kubelet would be used in OpenShift/Kubernetes, so the change would not cause any major regressions for normal operations; the linked issue is limited to kubeadm and other consumers. Hence, this is not a blocker for 4.7.

The delay in the kubelet reaching Ready status is because it needs to sync with the API server at least once after going offline before admitting workloads. Otherwise, we run into correctness issues (bug 1930960).
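
One rough way to observe this wait from the outside, once the API server is reachable again, is to compare the kubelet unit's start time with the node's Ready transition. This is an illustrative sketch, not from the bug: the node name is a placeholder taken from the shell prompt in the reproduction steps, systemctl runs on the node itself, and oc runs wherever a kubeconfig is available.

```
NODE=crc-ctj2r-master-0   # placeholder node name from the report's prompt

# When did the kubelet unit last become active? (run on the node)
systemctl show kubelet --property=ActiveEnterTimestamp

# When did the node's Ready condition last flip? The gap between the two
# roughly corresponds to the initial node informer sync described above.
oc get node "$NODE" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
```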

There may be additional workarounds/fixes for this in other consumers beyond increasing the timeout; Praveen, what API server is this kubelet talking to?

Comment 7 Praveen Kumar 2021-02-25 04:48:17 UTC
> There may be additional workarounds/fixes for this in other consumers beyond increasing the timeout; Praveen, what API server is this kubelet talking to?

@Elana For the single-node use case, if you shut down the node and then start it, there is no API server until the kubelet starts the apiserver static pods, and to start those pods it always checks for node sync first, which is what causes the delay.
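
To make the chicken-and-egg situation concrete: on a single-node/CRC control plane the API server itself is one of the static pods the kubelet has to admit. A hedged check, assuming the usual OpenShift static pod directory of /etc/kubernetes/manifests (confirm against staticPodPath in the kubelet config if it differs):

```
# Static pod manifests the kubelet starts on its own, including kube-apiserver:
ls /etc/kubernetes/manifests/

# The apiserver container only appears once the kubelet admits its static pods:
crictl ps --name kube-apiserver
```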

Comment 8 Elana Hashman 2021-03-10 21:47:06 UTC
Still awaiting upstream fix. May not land this release due to Kubernetes code freeze deadline.

https://github.com/kubernetes/kubernetes/pull/99336

Comment 9 Elana Hashman 2021-03-26 23:03:48 UTC
Once the upstream patch merges, I plan to backport it; that should resolve this.

Comment 10 Elana Hashman 2021-04-28 22:53:08 UTC
PR approved, waiting on merge.

Comment 15 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

