Created attachment 1756195 [details]
kubelet journal logs for 4.6.15
Description of problem:
For CRC we stop the kubelet service before generating the disk image and then start it again to bring the environment back to the expected state. With 4.7.0, however, the time between starting the kubelet service and the start of the first container is now around 43 seconds, up from around 2 seconds on 4.6. This increase is causing issues for CRC.
Version-Release number of selected component (if applicable):
Log in to the master node and execute the following steps.
- [root@crc-ctj2r-master-0 core]# systemctl stop kubelet
- [root@crc-ctj2r-master-0 core]# pods=$(crictl pods -o json | jq '.items[] | select(.metadata.namespace != "openshift-sdn")' | jq -r .id)
- [root@crc-ctj2r-master-0 core]# crictl stopp $pods
- [root@crc-ctj2r-master-0 core]# crictl rmp -f $pods
- [root@crc-ctj2r-master-0 core]# systemctl restart cri-o
# systemctl start kubelet && time watch -g crictl ps -a
Should be the same as on a 4.6.x cluster.
systemctl start kubelet && time watch -g crictl ps -a
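To get the delay as a plain number that can be compared between 4.6 and 4.7, the `time watch -g` step can be approximated with a small polling helper. This is only a sketch: `seconds_until_change` is a hypothetical helper name (not part of crictl or the kubelet), and it assumes that a change in the output of `crictl ps -a` corresponds to the first container being created.

```sh
#!/bin/sh
# Sketch: print the number of seconds until the output of a command first
# changes, similar in spirit to `time watch -g <command>`.
# seconds_until_change is a hypothetical helper, not an existing tool.
# Usage: seconds_until_change INTERVAL TIMEOUT COMMAND [ARGS...]
seconds_until_change() {
    suc_interval="$1"
    suc_timeout="$2"
    shift 2
    suc_start=$(date +%s)
    # Baseline output, taken once before polling starts.
    suc_baseline=$("$@" 2>/dev/null)
    while :; do
        suc_current=$("$@" 2>/dev/null)
        suc_elapsed=$(( $(date +%s) - suc_start ))
        if [ "$suc_current" != "$suc_baseline" ]; then
            echo "$suc_elapsed"
            return 0
        fi
        # Give up once the timeout is reached so the loop cannot run forever.
        if [ "$suc_elapsed" -ge "$suc_timeout" ]; then
            echo "no change after ${suc_elapsed}s" >&2
            return 1
        fi
        sleep "$suc_interval"
    done
}

# On the node, as root, after the stop/cleanup steps above (not run here):
#   systemctl start kubelet && seconds_until_change 1 120 crictl ps -a
```

The result has one-second resolution, which is enough to tell a delay of about 2 seconds from one of about 43 seconds.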
I am attaching the kubelet journal logs for both 4.6 and 4.7.
Created attachment 1756196 [details]
kubelet journal logs for 4.7.0-rc.0
The "kubelet nodes not sync" messages come from https://github.com/kubernetes/kubernetes/commit/7521352 but I don't know if these messages are related to the delay we are seeing.
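One way to check whether these messages line up with the delay is to count them in a saved copy of the journal log and compare the first and last occurrence with the time the first container starts. A minimal sketch, assuming the log was saved to a local file and that the message text appears verbatim as quoted above; the helper names and the file name `kubelet-4.7.log` are hypothetical.

```sh
#!/bin/sh
# Sketch: inspect a saved kubelet journal log for the "not sync" messages.
# count_not_sync and not_sync_span are hypothetical helper names.

# Print how many lines in the log contain the message.
count_not_sync() {
    grep -c 'kubelet nodes not sync' "$1"
}

# Print the first and the last matching line, so their timestamps can be
# compared with the kubelet start time and the first container start.
# With a single match, the same line is printed twice.
not_sync_span() {
    grep 'kubelet nodes not sync' "$1" | head -n 1
    grep 'kubelet nodes not sync' "$1" | tail -n 1
}

# Example (file name is an assumption, not run here):
#   count_not_sync kubelet-4.7.log
#   not_sync_span kubelet-4.7.log
```

If the last occurrence falls just before the first container starts on 4.7 and the message is absent from the 4.6 log, that would support the connection, but it would not prove it.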
Wrong component. Unsetting.
I suspect this is https://github.com/kubernetes/kubeadm/issues/2395
Discussed as a team: this is not typically how the kubelet is used in OpenShift/Kubernetes, so the change would not cause any major regressions in normal operation; the linked issue is limited to kubeadm and other consumers. Hence, not a blocker for 4.7.
The delay in kubelet Ready status is because it needs to sync with the API Server at least once after going offline before admitting workloads. Otherwise, we run into correctness issues (bug 1930960).
There may be additional workarounds/fixes for this in other consumers other than increasing the timeout; Praveen, what API server is this kubelet talking to?
> There may be additional workarounds/fixes for this in other consumers other than increasing the timeout; Praveen, what API server is this kubelet talking to?
@Elana For the single-node use case, if you shut down the node and then start it again, there is no API server until the kubelet starts the API server static pods, and before starting those pods it always checks for node sync, which causes the delay.
Still awaiting upstream fix. May not land this release due to Kubernetes code freeze deadline.
Once upstream patch merges, I plan to backport. It should resolve this.
PR approved, waiting on merge.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.