Description of problem:
One of the NooBaa endpoint pods (deployed via ODF 4.9.0) is stuck in CreateContainerError with the error "Error: context deadline exceeded".

Version-Release number of selected component (if applicable):
OCP 4.8.9

How reproducible:
Seen twice on recent runs.

Steps to Reproduce:
1. Upgraded from OCP 4.7 to OCP 4.8.9
2. Deployed ODF, version 4.9.0
3. While running IO operations against the NooBaa endpoint pod, the pod went into the OOMKilled state, recorded in https://github.com/noobaa/noobaa-core/issues/6782. After this, the noobaa endpoint pod remained stuck in CreateContainerError.

```
NAME                                               READY   STATUS                 RESTARTS   AGE
noobaa-core-0                                      1/1     Running                0          26h
noobaa-db-pg-0                                     1/1     Running                0          26h
noobaa-default-backing-store-noobaa-pod-781468f4   1/1     Running                0          26h
noobaa-endpoint-866d5fc4b4-nnvqq                   0/1     CreateContainerError   2          26h
noobaa-operator-784bb6685b-f9c52                   1/1     Running                8          26h
ocs-metrics-exporter-6cbf9c6bcb-7pn8q              1/1     Running                0          26h
ocs-operator-6c75d9cdb6-4k5cx                      1/1     Running                6          26h
odf-console-77dc4875d4-z6mkq                       1/1     Running                0          26h
odf-operator-controller-manager-568f657687-562g7   2/2     Running                7          26h
rook-ceph-operator-768c66d885-q77m7                1/1     Running                0          26h
```

Describe pod output (events):

```
Events:
  Type     Reason  Age                       From     Message
  ----     ------  ----                      ----     -------
  Warning  Failed  2m                        kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_endpoint_noobaa-endpoint-866d5fc4b4-nnvqq_openshift-storage_97f02639-c33c-4afb-b048-42f607006e60_3 for id 95a5b1a9c0d78be869d2c1c02609ea5399342507174db07f01e15b0ec0cf208a: name is reserved
  Warning  Failed  0s                        kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_endpoint_noobaa-endpoint-866d5fc4b4-nnvqq_openshift-storage_97f02639-c33c-4afb-b048-42f607006e60_3 for id 1e81b0e7248eac14101c78b80674d73cebd361fc70902bae0448d4fa40e22bb0: name is reserved
  Warning  Failed  <invalid> (x7 over 23h)   kubelet  Error: context deadline exceeded
  Normal   Pulled  <invalid> (x15 over 26h)  kubelet  Container image "quay.io/rhceph-dev/mcg-core@sha256:ff043dde04a8b83f10be1a2437c88b3cfd0c7e691868ed418b191a02fb8129c8" already present on machine
```

Actual results:
The pod remained stuck with this error. The only way out was to clean the deployment and install a new one, which would be unacceptable to a customer.

Expected results:
The endpoint pod should not remain stuck with this error.

Additional info:
The original bug was raised at https://github.com/noobaa/noobaa-core/issues/6786. liranmauda, who is part of the NooBaa development team, directed me to file this bug in Bugzilla. (nbecker)
Created attachment 1838236 [details] Must gather logs collected for this error
You seem to have attached the Ceph must-gather rather than the OpenShift one. Can you get me the resulting tar from
```
oc adm must-gather --node-name $node
```
where $node is the node this deployment is stuck on?
Another instance of this error today, on deleting the NooBaa endpoint pod. The pod had been running fine for a few days.

```
[root@ocp-akshat-1-inf ~]# oc get pod -o wide
NAME                                               READY   STATUS                 RESTARTS   AGE    IP             NODE                                   NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running                0          5d3h   10.254.5.77    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running                0          5d3h   10.254.5.74    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-f0ff5410   1/1     Running                0          5d3h   10.254.5.76    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-64bc4dffb6-wrw9x                   0/1     CreateContainerError   0          15m    10.254.5.163   master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-9bcc845cb-4r22x                    1/1     Running                32         5d3h   10.254.8.87    master2.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-f97b6c966-2ctp9               1/1     Running                0          5d3h   10.254.5.71    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
ocs-operator-88f9d4c99-md28g                       1/1     Running                35         5d3h   10.254.5.69    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
odf-console-77dc4875d4-sv5f6                       1/1     Running                0          5d3h   10.254.5.72    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-6dbb67c6f9-w5mq6   2/2     Running                40         5d3h   10.254.8.86    master2.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-76ff6c5b9b-54j5l                1/1     Running                0          5d3h   10.254.5.70    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
```

I have collected logs from the master0 node and uploaded them to Box, https://ibm.ent.box.com/folder/145794528783 (as the files are quite big).
Can you attach it to Google Drive or somewhere else? I am not able to access Box without an IBM account.
Hi, I have uploaded the file to Google Drive: https://drive.google.com/file/d/1zZDNBmcgW0eRmr1sEMPO90deladG2V_Q/view?usp=sharing
If I were to guess, this container has a very large directory attached as a volume. Is that the case? If so, following the steps for SELinux relabeling in https://hackmd.io/7heLp_noQmqU_Ef7VaiCKg (eventually to be published at https://access.redhat.com/node/6221251) may help. Can you upgrade and try that out?
Hi @pehunt and @liranmauda, this problem didn't go away with the fix from @liran.mauda.

```
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       17m                    default-scheduler  Successfully assigned openshift-storage/noobaa-endpoint-7cb76c78c6-vt8k5 to worker1.ocp-akshat-2.cp.fyre.ibm.com
  Warning  FailedMount     6m37s (x396 over 16m)  kubelet            MountVolume.SetUp failed for volume "pvc-f266e7f9-da62-41bf-aed8-527f34ccd341" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/spectrumscale.csi.ibm.com/csi.sock: connect: connection refused"
  Normal   AddedInterface  4m26s                  multus             Add eth0 [10.254.16.50/22] from openshift-sdn
  Warning  Failed          25s (x2 over 2m26s)    kubelet            Error: context deadline exceeded
```

```
[root@api ~]# podo
NAME                                               READY   STATUS                 RESTARTS        AGE
noobaa-core-0                                      1/1     Running                0               22h
noobaa-db-pg-0                                     1/1     Running                0               23h
noobaa-default-backing-store-noobaa-pod-1bfc596f   1/1     Running                0               23h
noobaa-endpoint-7cb76c78c6-947n5                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-hctm7                   0/1     CreateContainerError   0               17m
noobaa-endpoint-7cb76c78c6-hglz4                   1/1     Running                0               17m
noobaa-endpoint-7cb76c78c6-kgnm8                   1/1     Running                0               17m
noobaa-endpoint-7cb76c78c6-mc4mq                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-qmr9c                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-tpdh4                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-vt8k5                   0/1     CreateContainerError   0               17m
noobaa-operator-6c567cfcdd-wvlcn                   1/1     Running                8 (5h10m ago)   23h
ocs-metrics-exporter-5c87b7c77-fpk8s               1/1     Running                0               23h
ocs-operator-c494fbdf5-gq9zw                       1/1     Running                4 (15h ago)     23h
odf-console-67c5878d75-4zl7n                       1/1     Running                0               23h
odf-operator-controller-manager-65c98b8b55-mc7cg   2/2     Running                3 (15h ago)     23h
rook-ceph-operator-8585fd44df-f7vzd                1/1     Running                0               23h
```
The fix applied was:

1. `kubectl edit scc` and change the `seLinuxContext` type to `RunAsAny`.
2. Edit the noobaa-endpoint deployment, adding under `securityContext`:
```
fsGroupChangePolicy: "OnRootMismatch"
seLinuxOptions:
  type: "spc_t"
```
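For context, a minimal sketch of where those two fields land in the endpoint Deployment spec — the surrounding structure is the standard Kubernetes Deployment layout, not copied from this cluster's actual yaml, so treat the excerpt as an assumption apart from the two fields quoted above:

```yaml
# Hypothetical excerpt of the noobaa-endpoint Deployment (pod-level securityContext).
spec:
  template:
    spec:
      securityContext:
        # Skip the recursive ownership change when the volume root already matches,
        # avoiding the long walk over a very large attached directory.
        fsGroupChangePolicy: "OnRootMismatch"
        seLinuxOptions:
          # Super-privileged container type; sidesteps per-file SELinux relabeling.
          type: "spc_t"
```

Setting `fsGroupChangePolicy` and `seLinuxOptions` at the pod level (rather than per container) is what lets the kubelet skip the volume-wide relabel that was exceeding the CRI-O context deadline.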
@akmithal.com Looking at your yamls (on DM), it looks like they were not edited. The NooBaa operator reconciles those yamls, so scaling the noobaa operator to 0 replicas and then manually editing the yaml should work. Please update us.
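The procedure above, sketched as commands against a live cluster (the `openshift-storage` namespace and the deployment names are taken from the pod listings earlier in this bug; adjust if your cluster differs):

```shell
# Stop the operator so it cannot reconcile the endpoint deployment back to its original spec.
oc -n openshift-storage scale deployment noobaa-operator --replicas=0

# Now edit the endpoint deployment manually; the change will no longer be reverted.
oc -n openshift-storage edit deployment noobaa-endpoint

# Once the fix is verified, the operator can presumably be scaled back up:
# oc -n openshift-storage scale deployment noobaa-operator --replicas=1
```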
Hello akmithal, could you help check whether the issue is fixed now that the PR is merged?
I see the PR is merged in the noobaa operator, and based on comment #12, I am marking this VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056