Bug 1927263

Summary: kubelet service takes around 43 secs to start container when started from stopped state
Product: OpenShift Container Platform
Component: Node
Node sub component: Kubelet
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 22:43:43 UTC

Reporter: Praveen Kumar <prkumar>
Assignee: Elana Hashman <ehashman>
QA Contact: Weinan Liu <weinliu>
CC: aos-bugs, cfergeau, chewi, ehashman, eparis, harpatil, jokerman, mfojtik, sspeiche, xxia

Clones: 1955231
Bug Blocks: 1955231

Attachments:
- kubelet journal logs for 4.6.15
- kubelet journal logs for 4.7.0-rc.0

Description Praveen Kumar 2021-02-10 12:15:15 UTC
Created attachment 1756195 [details]
kubelet journal logs for 4.6.15

Description of problem:
For CRC we stop the kubelet service before generating the disk image and then start it again so that the environment comes up as expected. With 4.7.0, the time between starting the kubelet service and the first container starting is now around 43 seconds, whereas it used to be around 2 seconds with 4.6. This increase is causing problems for CRC.

Version-Release number of selected component (if applicable):
4.7.0-rc.0

How reproducible:
Log in to the master node and run the following steps (a consolidated sketch of these steps appears after the list).

- [root@crc-ctj2r-master-0 core]# systemctl stop kubelet
- [root@crc-ctj2r-master-0 core]# pods=$(crictl pods -o json | jq '.items[] | select(.metadata.namespace != "openshift-sdn")' | jq -r .id)
- [root@crc-ctj2r-master-0 core]# crictl stopp $pods
- [root@crc-ctj2r-master-0 core]# crictl rmp -f $pods
- [root@crc-ctj2r-master-0 core]# systemctl restart cri-o
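A consolidated version of the steps above, for convenience. This is only a sketch (it assumes root access on the node and that crictl and jq are available), not the exact tooling CRC uses; the single jq filter is equivalent to the two chained jq calls in the original steps.

```
# Stop the kubelet so it does not immediately recreate the pods we remove.
systemctl stop kubelet

# Collect the IDs of all pod sandboxes except those in the openshift-sdn namespace.
pods=$(crictl pods -o json \
  | jq -r '.items[] | select(.metadata.namespace != "openshift-sdn") | .id')

# Stop and force-remove the selected pod sandboxes.
crictl stopp $pods
crictl rmp -f $pods

# Restart CRI-O so it starts from a clean state.
systemctl restart cri-o
```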

Actual results:

```
# systemctl start kubelet && time watch -g crictl ps -a

real	0m43.007s
user	0m0.441s
sys	0m0.570s
```

Expected results:
Should be the same as on a 4.6.x cluster:
```
systemctl start kubelet && time watch -g crictl ps -a

real	0m2.093s
user	0m0.025s
sys	0m0.032s
```
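Note that `watch -g` (`--chgexit`) exits as soon as the command's output changes, so `time` here roughly measures how long it takes for the first container to appear after the kubelet is started. Below is a sketch of an equivalent measurement that polls crictl directly instead of relying on watch's refresh interval; it assumes root on the node and is not part of the original report.

```
# Start the kubelet and poll crictl until at least one container (in any
# state) shows up, then report the elapsed time in whole seconds.
systemctl start kubelet
SECONDS=0
until [ -n "$(crictl ps -a -q 2>/dev/null)" ]; do
    sleep 0.5
done
echo "first container appeared after ~${SECONDS}s"
```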


Additional info:
I am attaching the kubelet journal logs for both the 4.6 and 4.7 sides.

Comment 1 Praveen Kumar 2021-02-10 12:16:20 UTC
Created attachment 1756196 [details]
kubelet journal logs for 4.7.0-rc.0

Comment 2 Christophe Fergeau 2021-02-10 13:19:42 UTC
The "kubelet nodes not sync" messages come from https://github.com/kubernetes/kubernetes/commit/7521352 but I don't know if these messages are related to the delay we are seeing.

Comment 4 Stefan Schimanski 2021-02-16 08:37:34 UTC
Wrong component. Unsetting.

Comment 5 Elana Hashman 2021-02-24 00:20:25 UTC
I suspect this is https://github.com/kubernetes/kubeadm/issues/2395

Comment 6 Elana Hashman 2021-02-24 17:55:27 UTC
Discussed as a team, this is not typically how the kubelet would be used in OpenShift/Kubernetes so the change would not cause any major regressions for normal operations; the linked issue is limited to kubeadm/other consumers. Hence, not a blocker for 4.7.

The delay in kubelet Ready status is because it needs to sync with the API Server at least once after going offline before admitting workloads. Otherwise, we run into correctness issues (bug 1930960).

There may be additional workarounds/fixes for this in other consumers besides increasing the timeout; Praveen, what API server is this kubelet talking to?

Comment 7 Praveen Kumar 2021-02-25 04:48:17 UTC
> There may be additional workarounds/fixes for this in other consumers besides increasing the timeout; Praveen, what API server is this kubelet talking to?

@Elana For the single-node use case, if you shut down the node and then start it again, there is no API server until the kubelet starts the apiserver static pods, and before it starts those pods it always checks for node sync, which is what causes the delay.
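For context, on a single-node cluster the kube-apiserver the kubelet is waiting to sync with is itself one of the kubelet's static pods. On an OpenShift control-plane node the static pod manifests normally live under the kubelet's staticPodPath; the path below is the typical location and is stated here as an assumption, so confirm it on your node.

```
# List the static pod manifests the kubelet starts without needing the API
# server; kube-apiserver, etcd, kube-controller-manager and kube-scheduler
# are normally defined here on an OpenShift master.
ls /etc/kubernetes/manifests/
```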

Comment 8 Elana Hashman 2021-03-10 21:47:06 UTC
Still awaiting upstream fix. May not land this release due to Kubernetes code freeze deadline.

https://github.com/kubernetes/kubernetes/pull/99336

Comment 9 Elana Hashman 2021-03-26 23:03:48 UTC
Once the upstream patch merges, I plan to backport it. That should resolve this.

Comment 10 Elana Hashman 2021-04-28 22:53:08 UTC
PR approved, waiting on merge.

Comment 15 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438