Description of problem:
One of the NooBaa endpoint pods (deployed via ODF 4.9.0) is stuck in CreateContainerError with the error "Error: context deadline exceeded".

Version-Release number of selected component (if applicable):
OCP 4.8.9

How reproducible:
Seen twice on recent runs.

Steps to Reproduce:
1. Upgraded from OCP 4.7 to OCP 4.8.9
2. Deployed ODF, version 4.9.0
3. While running IO operations against the NooBaa endpoint pod, the pod went into the OOMKilled state, recorded in https://github.com/noobaa/noobaa-core/issues/6782. After this, the noobaa endpoint pod remained stuck in CreateContainerError.

```
NAME                                               READY   STATUS                 RESTARTS   AGE
noobaa-core-0                                      1/1     Running                0          26h
noobaa-db-pg-0                                     1/1     Running                0          26h
noobaa-default-backing-store-noobaa-pod-781468f4   1/1     Running                0          26h
noobaa-endpoint-866d5fc4b4-nnvqq                   0/1     CreateContainerError   2          26h
noobaa-operator-784bb6685b-f9c52                   1/1     Running                8          26h
ocs-metrics-exporter-6cbf9c6bcb-7pn8q              1/1     Running                0          26h
ocs-operator-6c75d9cdb6-4k5cx                      1/1     Running                6          26h
odf-console-77dc4875d4-z6mkq                       1/1     Running                0          26h
odf-operator-controller-manager-568f657687-562g7   2/2     Running                7          26h
rook-ceph-operator-768c66d885-q77m7                1/1     Running                0          26h
```

Describe pod output (events):

```
Events:
  Type     Reason  Age                       From     Message
  ----     ------  ----                      ----     -------
  Warning  Failed  2m                        kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_endpoint_noobaa-endpoint-866d5fc4b4-nnvqq_openshift-storage_97f02639-c33c-4afb-b048-42f607006e60_3 for id 95a5b1a9c0d78be869d2c1c02609ea5399342507174db07f01e15b0ec0cf208a: name is reserved
  Warning  Failed  0s                        kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_endpoint_noobaa-endpoint-866d5fc4b4-nnvqq_openshift-storage_97f02639-c33c-4afb-b048-42f607006e60_3 for id 1e81b0e7248eac14101c78b80674d73cebd361fc70902bae0448d4fa40e22bb0: name is reserved
  Warning  Failed  <invalid> (x7 over 23h)   kubelet  Error: context deadline exceeded
  Normal   Pulled  <invalid> (x15 over 26h)  kubelet  Container image "quay.io/rhceph-dev/mcg-core@sha256:ff043dde04a8b83f10be1a2437c88b3cfd0c7e691868ed418b191a02fb8129c8" already present on machine
```

Actual results:
The pod remained stuck with this error. The only way out was to clean the deployment and install a new one, which would be unacceptable to a customer.

Expected results:
The endpoint pod should not remain stuck with this error.

Additional info:
The original bug was raised at https://github.com/noobaa/noobaa-core/issues/6786. liranmauda, who is part of the NooBaa development team, directed me to file this bug in Bugzilla. (nbecker)
Created attachment 1838236 [details] Must gather logs collected for this error
You seem to have attached the Ceph must-gather rather than the OpenShift one. Can you get me the resulting tar from
```
oc adm must-gather --node-name $node
```
where $node is the node this deployment is stuck on?
Another instance of this error today, on deleting the NooBaa endpoint pod. The pod had been running fine for a few days.

```
[root@ocp-akshat-1-inf ~]# oc get pod -o wide
NAME                                               READY   STATUS                 RESTARTS   AGE    IP             NODE                                   NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running                0          5d3h   10.254.5.77    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running                0          5d3h   10.254.5.74    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-f0ff5410   1/1     Running                0          5d3h   10.254.5.76    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-64bc4dffb6-wrw9x                   0/1     CreateContainerError   0          15m    10.254.5.163   master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-9bcc845cb-4r22x                    1/1     Running                32         5d3h   10.254.8.87    master2.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-f97b6c966-2ctp9               1/1     Running                0          5d3h   10.254.5.71    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
ocs-operator-88f9d4c99-md28g                       1/1     Running                35         5d3h   10.254.5.69    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
odf-console-77dc4875d4-sv5f6                       1/1     Running                0          5d3h   10.254.5.72    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-6dbb67c6f9-w5mq6   2/2     Running                40         5d3h   10.254.8.86    master2.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-76ff6c5b9b-54j5l                1/1     Running                0          5d3h   10.254.5.70    master0.ocp-akshat-1.cp.fyre.ibm.com   <none>           <none>
```

I have collected logs from the master0 node and uploaded them to Box, https://ibm.ent.box.com/folder/145794528783 (as the files are quite big).
Can you attach it to Google Drive or somewhere else? I am not able to access Box without an IBM account.
Hi, I have uploaded the file to Google Drive: https://drive.google.com/file/d/1zZDNBmcgW0eRmr1sEMPO90deladG2V_Q/view?usp=sharing
If I were to guess, this container has a very large directory attached as a volume. Is that the case? If so, following the steps for SELinux relabeling in https://hackmd.io/7heLp_noQmqU_Ef7VaiCKg (eventually to be published at https://access.redhat.com/node/6221251) may help. Can you upgrade and try that out?
Hi @pehunt and @liranmauda, this problem didn't go away with the fix from @liran.mauda.

```
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       17m                    default-scheduler  Successfully assigned openshift-storage/noobaa-endpoint-7cb76c78c6-vt8k5 to worker1.ocp-akshat-2.cp.fyre.ibm.com
  Warning  FailedMount     6m37s (x396 over 16m)  kubelet            MountVolume.SetUp failed for volume "pvc-f266e7f9-da62-41bf-aed8-527f34ccd341" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/spectrumscale.csi.ibm.com/csi.sock: connect: connection refused"
  Normal   AddedInterface  4m26s                  multus             Add eth0 [10.254.16.50/22] from openshift-sdn
  Warning  Failed          25s (x2 over 2m26s)    kubelet            Error: context deadline exceeded
```

```
[root@api ~]# podo
NAME                                               READY   STATUS                 RESTARTS        AGE
noobaa-core-0                                      1/1     Running                0               22h
noobaa-db-pg-0                                     1/1     Running                0               23h
noobaa-default-backing-store-noobaa-pod-1bfc596f   1/1     Running                0               23h
noobaa-endpoint-7cb76c78c6-947n5                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-hctm7                   0/1     CreateContainerError   0               17m
noobaa-endpoint-7cb76c78c6-hglz4                   1/1     Running                0               17m
noobaa-endpoint-7cb76c78c6-kgnm8                   1/1     Running                0               17m
noobaa-endpoint-7cb76c78c6-mc4mq                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-qmr9c                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-tpdh4                   0/1     ContainerCreating      0               17m
noobaa-endpoint-7cb76c78c6-vt8k5                   0/1     CreateContainerError   0               17m
noobaa-operator-6c567cfcdd-wvlcn                   1/1     Running                8 (5h10m ago)   23h
ocs-metrics-exporter-5c87b7c77-fpk8s               1/1     Running                0               23h
ocs-operator-c494fbdf5-gq9zw                       1/1     Running                4 (15h ago)     23h
odf-console-67c5878d75-4zl7n                       1/1     Running                0               23h
odf-operator-controller-manager-65c98b8b55-mc7cg   2/2     Running                3 (15h ago)     23h
rook-ceph-operator-8585fd44df-f7vzd                1/1     Running                0               23h
```
The fix applied was:

1. `kubectl edit scc` and change the `seLinuxContext` type to `RunAsAny`.
2. Edit the noobaa-endpoint deployment, adding under `securityContext`:
```
fsGroupChangePolicy: "OnRootMismatch"
seLinuxOptions:
  type: "spc_t"
```
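For context, a minimal sketch of where those two fields land in the endpoint Deployment spec — the surrounding structure is the standard Kubernetes Deployment layout, not copied from this cluster's actual yaml, so treat the excerpt as an assumption apart from the two fields quoted above:

```yaml
# Hypothetical excerpt of the noobaa-endpoint Deployment (pod-level securityContext).
spec:
  template:
    spec:
      securityContext:
        # Skip the recursive ownership change when the volume root already matches,
        # avoiding the long walk over a very large attached directory.
        fsGroupChangePolicy: "OnRootMismatch"
        seLinuxOptions:
          # Super-privileged container type; sidesteps per-file SELinux relabeling.
          type: "spc_t"
```

Setting `fsGroupChangePolicy` and `seLinuxOptions` at the pod level (rather than per container) is what lets the kubelet skip the volume-wide relabel that was exceeding the CRI-O context deadline.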
@akmithal.com Looking at your yamls (on DM), it looks like they were not edited. The NooBaa operator reconciles those yamls, so scaling the noobaa operator to 0 replicas and then manually editing the yaml should work. Please update us.
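The procedure above, sketched as commands against a live cluster (the `openshift-storage` namespace and the deployment names are taken from the pod listings earlier in this bug; adjust if your cluster differs):

```shell
# Stop the operator so it cannot reconcile the endpoint deployment back to its original spec.
oc -n openshift-storage scale deployment noobaa-operator --replicas=0

# Now edit the endpoint deployment manually; the change will no longer be reverted.
oc -n openshift-storage edit deployment noobaa-endpoint

# Once the fix is verified, the operator can presumably be scaled back up:
# oc -n openshift-storage scale deployment noobaa-operator --replicas=1
```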
Hello akmithal, could you help check whether the issue is fixed now that the PR is merged?
I see the PR is merged in the noobaa operator, and based on comment #12, I am marking this VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056