Description of problem:

During the upgrade of worker nodes (4.7.0-fc.2 to 4.7.0-fc.3), starting Pods can take 15-30 minutes due to errors like:

  Warning  InspectFailed  7m58s (x4 over 14m)  kubelet, ip-10-0-164-20.ec2.internal  Failed to inspect image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:876095045e7d1ce1fc6eef3e4eeeef2ec8905d38008469de70490bbe76f29863": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  Failed         7m58s (x4 over 14m)  kubelet, ip-10-0-164-20.ec2.internal  Error: ImageInspectError
  Warning  Failed         4m29s (x2 over 16m)  kubelet, ip-10-0-164-20.ec2.internal  Error: context deadline exceeded

Version-Release number of selected component (if applicable):

How reproducible:
We saw this during different upgrades of build01 and build02

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Will provide must-gather later
Comment 2 suggests the issue is throttling in the CRI-O image pull. Do we have any idea what part of the inspection call was slow? ~5 minutes to construct an ImageStatusResponse that adds nothing to the incoming by-digest pullspec beyond the image size seems... surprisingly slow. Is that just a single auth request followed by a single manifest request? Or is it pulling down the image and all referenced layers to local storage before returning the size?
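One way to narrow this down on an affected node would be to time the image status call directly, e.g. with `crictl inspecti <pullspec>`, which goes through the same CRI-O endpoint the kubelet uses. A minimal timing sketch follows; the `sleep 2` is a stand-in for the `crictl` call (named in the comment) so the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch: measure how long a single CRI image-status call takes.
# On a real node, replace the stand-in below with something like:
#   crictl inspecti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:876095045e7d1ce1fc6eef3e4eeeef2ec8905d38008469de70490bbe76f29863
start=$(date +%s)
sleep 2   # stand-in for the crictl call, so this sketch is self-contained
end=$(date +%s)
elapsed=$((end - start))
echo "image status call took ${elapsed}s"
```

If that single call is what takes ~5 minutes, it would point at the CRI-O side rather than the kubelet.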
The CI cluster is not even what I would call overloaded and is well within the bounds of normal OpenShift use. If it is "overloaded", we have a performance issue causing us to violate SLOs. It should never take 30m to start any pod, ever. In this cluster, based on what we know, 1-2m is the upper bound excluding pull times, and all pull times I've seen have been 30-60s (we have massive disks).

Moving this back to 4.7. I consider this a serious regression of pod startup latency in medium to high load environments, and it is likely we would degrade customer environments if we shipped with this (just based on what we have). If this is a quay issue we need to isolate it. If we can pin this on another bug (like BFQ scheduling?) then I'd consider deferring.
I think we should try to reproduce the problem and, when it happens, monitor a few things:

- Capture journal logs. If processes are blocking for too long in the kernel, there is generally a warning about a task waiting too long (with a backtrace). If that happens, it can give us some idea.

- Run vmstat and capture stats at that time to get a sense of the CPU, memory, and I/O state. Say, run "vmstat -t 2". This will print stats every 2 seconds, and "-t" will also append a timestamp.

- Run iostat. This will give some idea of whether I/O on the root disk was still progressing at that time. Run "iostat -t 2". This will capture stats for all disks on the node.
Also capture "lsblk" and "df -h" output on the node in question when this problem happens again.
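The collection steps above could be wrapped in a small script left running on the suspect node. This is a sketch only: the output directory, file names, and short sample window are arbitrary choices, not from this bug.

```shell
#!/bin/sh
# Sketch of a node-stats collector for the data requested above.
# OUTDIR and the sample window are arbitrary; adjust as needed.
OUTDIR="${OUTDIR:-/tmp/node-debug}"
mkdir -p "$OUTDIR"

# One-shot snapshots of disk layout, filesystem usage, and recent journal.
lsblk > "$OUTDIR/lsblk.out" 2>&1 || true
df -h > "$OUTDIR/df.out" 2>&1 || true
journalctl -n 1000 --no-pager > "$OUTDIR/journal.out" 2>&1 || true

# Rolling CPU/memory/I-O stats every 2s with timestamps (-t), if the
# tools are installed; they run until stopped below.
if command -v vmstat >/dev/null 2>&1; then
    vmstat -t 2 > "$OUTDIR/vmstat.out" 2>&1 &
    VMSTAT_PID=$!
fi
if command -v iostat >/dev/null 2>&1; then
    iostat -t 2 > "$OUTDIR/iostat.out" 2>&1 &
    IOSTAT_PID=$!
fi

sleep 6   # short window for the sketch; run much longer in practice
[ -n "${VMSTAT_PID:-}" ] && kill "$VMSTAT_PID" 2>/dev/null
[ -n "${IOSTAT_PID:-}" ] && kill "$IOSTAT_PID" 2>/dev/null
echo "stats written to $OUTDIR"
```

Leaving it running over the whole upgrade window would give timestamped data to correlate against the kubelet's deadline-exceeded messages.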
Hi Sunil,

Can you please take a look and see if you can reproduce this upgrade issue on a cluster? It looks like the cluster has to be under a good amount of load. I haven't been able to reproduce it so far with a basic cluster.
Spoke with Sunil and he tried to upgrade a loaded cluster that was running multiple pods and PDBs but was unable to reproduce the timeout issue. Have asked him to try again with an increased load and multiple builds going on at the same time. Have shared a daemonset with him that will gather the data Vivek mentioned above in the event that the upgrade process hits the timeout issues.
Tried to reproduce this issue yesterday by creating an app with around 100 pods and a PodDisruptionBudget with maxUnavailable 20, then started an upgrade from 4.7-fc.2 to 4.7-fc.3, but did not notice any context deadline exceeded errors on the nodes.

Tried again today on 2 separate clusters with builds running during the upgrade and have captured the data with help from Urvashi from comment #9. Uploaded the data at https://drive.google.com/drive/folders/120l6KzwFPYJTf9AJ80oBjsF7ObqF2OEf
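For reference, a PodDisruptionBudget along the lines described could look like the sketch below. Only maxUnavailable: 20 comes from the comment; the name and selector label are made up, and the exact manifest used in the reproduction attempt was not shared.

```shell
#!/bin/sh
# Write out a PDB like the one described in the reproduction attempt.
# "test-app" name/selector are illustrative; maxUnavailable matches the comment.
cat <<'EOF' > pdb.yaml
apiVersion: policy/v1beta1   # policy/v1 on OpenShift 4.8+/Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: test-app-pdb
spec:
  maxUnavailable: 20
  selector:
    matchLabels:
      app: test-app
EOF
# Apply on the cluster with:
#   oc apply -f pdb.yaml
echo "wrote pdb.yaml"
```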
(In reply to Sunil Choudhary from comment #10)
> Tried to reproduce this issue yesterday by creating an app with around 100
> pods and a PodDisruptionBudget with maxUnavailable 20 and started an upgrade
> from 4.7-fc.2 to 4.7-fc.3, but did not notice any context deadline exceeded
> errors on nodes.
>
> Tried again today on 2 separate clusters with builds running during the
> upgrade and have captured data with help from Urvashi from comment #9.

So did you see the issue happening in these two separate clusters? I looked at the cluster2 data and I don't see many deadline exceeded messages.

When you capture the data, please also put some notes describing the logs and where to look. One file name is "worker-journal.logs". What is that file? Is it the journal logs of a master node, or the journal logs of one of the worker nodes?

Please also put some information on which node experienced the issue so that I can dive directly into the logs of that node.

I looked at the vmstat data for the node with ip ip-10-0-176-10. vmstat ran for roughly 13 minutes (from 13:36:41 UTC to 13:49:20 UTC), and the node looks fine:

- Runnable processes are pretty low, and even when the number goes up, it comes down pretty quickly.

- There seems to be a good amount of free memory.

- In the initial part of the logs "id" is not high. That suggests not much CPU idle time is available, so some CPU-intensive jobs are probably running.

- I/O wait time "wa" is pretty low. That suggests not many jobs have to wait long for I/O to finish.

IOW, the node seems fine to me. The only deadline exceeded message I noticed in the logs in this window is below.
One timeout error I noticed for the node with ip ending in "10" is this:

Feb 05 13:32:12.578923 ip-10-0-176-10 hyperkube[1399]: I0205 13:32:12.577122    1399 prober.go:117] Readiness probe for "image-registry-66849968b9-wdvrx_openshift-image-registry(b5affc4b-71da-433e-97a1-7449eef04335):registry" failed (failure): Get "https://10.131.0.100:5000/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Not sure what this is. Is it due to some networking issue, or something else? The node is not showing any signs of stress around this time.
> So did you see the issue happening in these two separate clusters? I looked
> at the cluster2 data and I don't see many deadline exceeded messages.
>
> When you capture the data, please also put some notes describing the logs
> and where to look. One file name is "worker-journal.logs". What is that
> file? Is it the journal logs of a master node, or the journal logs of one
> of the worker nodes?
>
> Please also put some information on which node experienced the issue so
> that I can dive directly into the logs of that node.
>
> I looked at the vmstat data for the node with ip ip-10-0-176-10. vmstat ran
> for roughly 13 minutes (from 13:36:41 UTC to 13:49:20 UTC), and the node
> looks fine:
>
> - Runnable processes are pretty low, and even when the number goes up, it
>   comes down pretty quickly.
>
> - There seems to be a good amount of free memory.
>
> - In the initial part of the logs "id" is not high. That suggests not much
>   CPU idle time is available, so some CPU-intensive jobs are probably
>   running.
>
> - I/O wait time "wa" is pretty low. That suggests not many jobs have to
>   wait long for I/O to finish.
>
> IOW, the node seems fine to me. The only deadline exceeded message I
> noticed in the logs in this window is below.

Sorry, I did not describe it, as I was syncing with Urvashi on Slack and captured the data as she requested and uploaded it to the bug.

Yes, I did not see any image service failures with context deadline exceeded. I do see a few for createPodSandbox & runPodSandbox on cluster2, but I guess they are not related to the issue in this bug.
We will add the debug patch during the 4.9 time frame to continue working on this.
*** Bug 2025857 has been marked as a duplicate of this bug. ***
Ping @umohnani please see comment https://bugzilla.redhat.com/show_bug.cgi?id=1921182#c57
Hello Team,

It looks like customers are still facing issues. I will join a remote call and update the status.

BR,
Dr VY
verified!

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-15-161625   True        False         4h54m   Cluster version is 4.11.0-0.nightly-2022-06-15-161625

% oc label mcp worker custom-kubelet=test-timeout

% oc create -f - <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: runtime-timeout
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: test-timeout
      #aro.openshift.io/limits: ""
  kubeletConfig:
    runtimeRequestTimeout: 3m
EOF

Wait some time for the mcp worker rollout to finish...

% oc debug node/ip-10-0-154-112.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# cat /etc/kubernetes/kubelet.conf
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
  ...
  "runtimeRequestTimeout": "3m0s",
  ...
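The last verification step above (checking the rendered value in kubelet.conf) can be scripted instead of eyeballing the whole file. This is a sketch: the heredoc below stands in for the real /etc/kubernetes/kubelet.conf (read via `oc debug node/...` + `chroot /host`) so that it is self-contained.

```shell
#!/bin/sh
# Sketch: extract runtimeRequestTimeout from kubelet.conf-style JSON.
# The heredoc is a stand-in for /etc/kubernetes/kubelet.conf on the node.
cat <<'EOF' > kubelet.conf.sample
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "runtimeRequestTimeout": "3m0s"
}
EOF
# jq would be cleaner, but sed avoids extra dependencies on the node.
timeout=$(sed -n 's/.*"runtimeRequestTimeout": *"\([^"]*\)".*/\1/p' kubelet.conf.sample)
echo "runtimeRequestTimeout=$timeout"
```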
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069