Bug 1921182 - InspectFailed: kubelet Failed to inspect image: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Summary: InspectFailed: kubelet Failed to inspect image: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Urvashi Mohnani
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 2025857
Depends On:
Blocks:
 
Reported: 2021-01-27 17:12 UTC by Hongkai Liu
Modified: 2022-09-02 11:34 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:35:38 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                   ID               Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata   RHSA-2022:5069   0        None      None    None     2022-08-10 10:36:00 UTC

Description Hongkai Liu 2021-01-27 17:12:55 UTC
Description of problem:
During the upgrade of worker nodes (4.7.0-fc.2 to 4.7.0-fc.3), starting Pods can take 15-30 minutes due to errors like

 Warning  InspectFailed   7m58s (x4 over 14m)  kubelet, ip-10-0-164-20.ec2.internal  Failed to inspect image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:876095045e7d1ce1fc6eef3e4eeeef2ec8905d38008469de70490bbe76f29863": rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Warning  Failed          7m58s (x4 over 14m)  kubelet, ip-10-0-164-20.ec2.internal  Error: ImageInspectError
 Warning  Failed          4m29s (x2 over 16m)  kubelet, ip-10-0-164-20.ec2.internal  Error: context deadline exceeded
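
For anyone triaging a node in this state, the same CRI image-status path the kubelet is timing out on can be exercised by hand. A minimal sketch (node name and digest taken from the events above; crictl is normally available on the node):

# From the affected node, e.g. `oc debug node/ip-10-0-164-20.ec2.internal` then `chroot /host`.
# Time CRI-O's ImageStatus call directly; this is the call behind the kubelet's "Failed to inspect image".
time crictl inspecti \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:876095045e7d1ce1fc6eef3e4eeeef2ec8905d38008469de70490bbe76f29863

# Recent timeout messages from CRI-O and the kubelet
journalctl -u crio -u kubelet --since "1 hour ago" | grep -i "deadline exceeded"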

Version-Release number of selected component (if applicable):


How reproducible:
We saw this during different upgrades of build01 and build02

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Will provide must-gather later

Comment 3 W. Trevor King 2021-02-03 03:34:59 UTC
Comment 2 suggests the issue is throttling in the CRI-O image pull.  Do we have any idea what part of the inspection call was slow?  ~5 minutes to construct an ImageStatusResponse that adds nothing to the incoming by-digest pullspec beyond the image size seems... surprisingly slow.  Is that just a single auth request followed by a single manifest request?  Or is it pulling down the image and all referenced layers to local storage before returning the size?
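
One way to sanity-check the registry side of that question is to time a bare manifest fetch from the node itself. A sketch, assuming the node's pull secret lives at /var/lib/kubelet/config.json (the usual location on OCP 4 nodes):

# A plain inspect is roughly one token round-trip plus a manifest (and config blob)
# fetch from quay.io; no layers are pulled to local storage.
time skopeo inspect \
  --authfile /var/lib/kubelet/config.json \
  docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:876095045e7d1ce1fc6eef3e4eeeef2ec8905d38008469de70490bbe76f29863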

Comment 4 Clayton Coleman 2021-02-03 15:59:15 UTC
The CI cluster is not even what I would call overloaded and is well within the bounds of normal OpenShift use.  If it is "overloaded" we have a performance issue causing us to violate SLOs.  It should never take 30m to start any pod ever.  In this cluster, based on what we know, 1-2m is the upper bound excluding pull times, and all pull times I've seen have been 30-60s (we have massive disks).

Moving this back to 4.7, I consider this a serious regression of pod startup latency in medium to high load environments and it is likely we would degrade customer environments if we shipped with this (just based on what we have).  If this is a quay issue we need to isolate it.

If we can pin this on another bug (like BFQ scheduling?) then I'd consider deferring.

Comment 6 Vivek Goyal 2021-02-03 22:24:57 UTC
I think we should try to reproduce the problem and, when it happens, monitor a few things.

- Capture journal logs. If processes block for too long in the kernel, there is generally a warning about a task waiting too long (with a backtrace). If that happens, it can give us some idea.
- Run vmstat and capture some stats at that time to get a sense of the CPU, memory, and I/O state. Say run "vmstat -t 2". This prints stats every 2 seconds, and "-t" also appends a timestamp.
- Run iostat. This gives some idea of whether I/O on the root disk was still progressing at the time. Run "iostat -t 2". This captures stats for all disks on the node.

Comment 7 Vivek Goyal 2021-02-03 22:35:09 UTC
Also capture "lsblk" and "df -h" output on the node in question when this problem happens again.
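
A minimal capture script combining the commands from comments 6 and 7 could look like the following (the output directory is an arbitrary choice, not something prescribed in this bug):

#!/bin/bash
# Run on the affected worker node (e.g. via `oc debug node/<node>` + `chroot /host`).
OUT=/var/tmp/bz1921182-data     # arbitrary output location
mkdir -p "$OUT"

# One-shot snapshots (comment 7)
lsblk > "$OUT/lsblk.txt"
df -h > "$OUT/df.txt"

# Continuous stats (comment 6); -t appends a timestamp to every sample
vmstat -t 2 > "$OUT/vmstat.log" &
iostat -t 2 > "$OUT/iostat.log" &

# Follow the kubelet and CRI-O journals for hung-task warnings and timeout messages
journalctl -f -u kubelet -u crio > "$OUT/journal.log" &

wait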

Comment 8 Urvashi Mohnani 2021-02-04 00:03:10 UTC
Hi Sunil,

Can you please take a look and see if you can reproduce this upgrade issue on a cluster? It looks like the cluster has to be under a good amount of load. I haven't been able to reproduce it so far with a basic cluster.

Comment 9 Urvashi Mohnani 2021-02-04 21:52:11 UTC
Spoke with Sunil; he tried to upgrade a loaded cluster that was running multiple pods and PDBs, but he was unable to reproduce the timeout issue. I have asked him to try again with increased load and multiple builds going on at the same time. I have shared a daemonset with him that will gather the data Vivek mentioned above in case the upgrade process hits the timeout issue.
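
The daemonset itself is not attached to this bug; a rough sketch of what such a collector could look like follows (the name, image, and host path are all assumptions, and the pod's service account needs access to a privileged SCC, e.g. `oc adm policy add-scc-to-user privileged -z default`):

% oc create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-stats-collector            # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-stats-collector
  template:
    metadata:
      labels:
        app: node-stats-collector
    spec:
      tolerations:
      - operator: Exists                # also schedule on nodes being drained/updated
      containers:
      - name: collector
        image: registry.access.redhat.com/ubi8/ubi   # any image with bash/dnf would do
        securityContext:
          privileged: true
        command: ["/bin/bash", "-c"]
        args:
        - |
          dnf install -y sysstat procps-ng >/dev/null 2>&1 || true
          lsblk > /data/lsblk.txt
          df -h > /data/df.txt
          vmstat -t 2 > /data/vmstat.log &
          iostat -t 2 > /data/iostat.log &
          sleep infinity
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        hostPath:
          path: /var/tmp/node-stats     # results land on each node's filesystem
          type: DirectoryOrCreate
EOF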

Comment 10 Sunil Choudhary 2021-02-05 14:00:17 UTC
Tried to reproduce this issue yesterday by creating an app with around 100 pods and a PodDisruptionBudget with maxUnavailable of 20, then started an upgrade from 4.7-fc.2 to 4.7-fc.3, but did not notice any context deadline exceeded errors on the nodes.

Tried again today on 2 separate clusters with builds running during the upgrade, and captured the data described in comment #9 with help from Urvashi.

Uploaded the data at https://drive.google.com/drive/folders/120l6KzwFPYJTf9AJ80oBjsF7ObqF2OEf

Comment 11 Vivek Goyal 2021-02-05 15:56:59 UTC
(In reply to Sunil Choudhary from comment #10)
> Tried to reproduce this issue yesterday by creating an app with around 100
> pods and PodDisruptionBudget of MaxUnavailable 20 and started upgrade from
> 4.7-fc.2 to 4.7-fc.3, but did not notice any context deadline exceeded
> errors on nodes.
> 
> Tried again today on 2 separate clusters with running builds while upgrade
> and have captured data with help from Uravashi from comment #9.

So did you see the issue happening in these two separate clusters? I looked at cluster2
data and I don't see many deadline exceeded messages.

When you capture the data, please also put some notes describing the logs and
where to look. One file name is "worker-journal.logs". What is that
file? Is it the journal log of a master node, or the journal log of one of the
worker nodes?

Please also note which node you experienced the issue on, so that I can
dive directly into that node's logs.

I looked at the vmstat data for the node ip-10-0-176-10. vmstat ran for roughly
13 minutes (from 13:36:41 UTC to 13:49:20 UTC), and the node looks fine.

- The number of runnable processes is pretty low, and even when it goes up,
  it comes down quickly.

- There seems to be a good amount of free memory.

- In the initial part of the logs "id" is not high. That suggests not much
  CPU idle time is available, so some CPU-intensive jobs are probably running.

- I/O wait time "wa" is pretty low. That suggests not many jobs have to
  wait long for I/O to finish.

IOW, the node seems fine to me. Only one deadline exceeded message stood out
in the logs in this window.
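
For readers mapping the observations above onto vmstat output, the relevant columns of the command from comment 6 are annotated below (the annotations just restate the analysis above; no captured data is reproduced here):

# vmstat -t 2 columns referenced in the analysis:
#   r    (procs)  - runnable processes
#   free (memory) - amount of free memory
#   id   (cpu)    - percentage of idle CPU time
#   wa   (cpu)    - percentage of CPU time spent waiting on I/O
vmstat -t 2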

Comment 12 Vivek Goyal 2021-02-05 17:13:07 UTC
One timeout error I noticed for the node with the IP ending in "10" is this:


Feb 05 13:32:12.578923 ip-10-0-176-10 hyperkube[1399]: I0205 13:32:12.577122    1399 prober.go:117] Readiness probe for "image-registry-66849968b9-wdvrx_openshift-image-registry(b5affc4b-71da-433e-97a1-7449eef04335):registry" failed (failure): Get "https://10.131.0.100:5000/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Not sure what this is. Is it due to some networking issue, or something else? The node is not showing any signs of stress around this time.

Comment 14 Sunil Choudhary 2021-02-08 12:37:11 UTC
> So did you see the issue happening in these two separate clusters? I looked
> at cluster2
> data and I don't see many deadline exceeded messages.
> 
> When you capture the data, please also put some notes describing the logs and
> where to look at. One file name is "worker-journal.logs". What is that
> file? Is it journal logs of master node, or journal logs of one of the
> slave nodes?
> 
> Please also put some information on which node did you experience the
> issue so that I can directly dive into the logs of that node.
> 
> I looked at vmstat data for node with ip ip-10-0-176-10. vmstat ran roughly
> for 13
> minutes (from 13:36:41 UTC to 13:49:20 UTC). And node looks fine. 
> 
> - Runnable processes are pretty low and even if number goes up, it comes
>   down pretty quickly. 
> 
> - There seem to be good amount of free memory.
> 
> - In the initial part of logs "id" is not high. That seems to suggest that
>   not much of cpu idle time is available. So some cpu intesive jobs are
>   probably running.
> 
> - IO wait time "wa" is pretty low. That suggests not many jobs have
>   to wait for long for I/O to finish.
> 
> IOW, node seems fine to me. Only deadline exceeded message in logs
> I noticed in this window is.

Sorry, I did not describe it; I was syncing with Urvashi on Slack and captured the data she requested, then uploaded it to the bug.

Yes, I did not see any image service calls fail with context deadline exceeded.
I see a few for createPodSandbox & runPodSandbox on cluster2, but I guess they are not related to the issue in this bug.
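
For reference, the two classes of messages can be separated in the captured journal with something like the following (file name from comment 11; the patterns are just the strings quoted in this bug):

grep -iE "Failed to inspect image|ImageInspectError" worker-journal.logs
grep -iE "(CreatePodSandbox|RunPodSandbox).*deadline exceeded" worker-journal.logs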

Comment 19 Urvashi Mohnani 2021-06-07 13:54:25 UTC
We will add the debug patch during the 4.9 time frame to continue working on this.

Comment 24 Harshal Patil 2021-11-24 08:39:44 UTC
*** Bug 2025857 has been marked as a duplicate of this bug. ***

Comment 58 Tom Sweeney 2022-03-22 15:17:57 UTC
Ping @umohnani, please see comment https://bugzilla.redhat.com/show_bug.cgi?id=1921182#c57

Comment 60 vyoganan 2022-05-13 08:50:40 UTC
Hello Team,

It looks like customers are still facing issues. I will get on a remote call and update the status.

BR.
Dr VY

Comment 69 MinLi 2022-06-16 06:56:07 UTC
verified!

% oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-15-161625   True        False         4h54m   Cluster version is 4.11.0-0.nightly-2022-06-15-161625

% oc label mcp worker custom-kubelet=test-timeout

% oc create -f - <<EOF    
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: runtime-timeout
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: test-timeout
      #aro.openshift.io/limits: ""
  kubeletConfig:
    runtimeRequestTimeout: 3m
EOF

Wait some time for the worker MCP to finish rolling out...
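
One way to confirm the rollout has finished is to watch the pool until UPDATED flips to True (for reference; the column names are those of `oc get mcp` output):

% oc get mcp worker -w      # wait until UPDATED is True and UPDATING is False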

% oc debug node/ip-10-0-154-112.us-east-2.compute.internal
sh-4.4# chroot /host 
sh-4.4# cat /etc/kubernetes/kubelet.conf 
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
...
  "runtimeRequestTimeout": "3m0s",
...

Comment 71 errata-xmlrpc 2022-08-10 10:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

