Bug 2094320

Summary: Pods are stuck in CreateContainerError because of blocklisting
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: csi-driver
Assignee: yati padia <ypadia>
Status: CLOSED CURRENTRELEASE
QA Contact: Pratik Surve <prsurve>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: amagrawa, dkamboj, ebenahar, edonnell, ekuric, idryomov, jdurgin, kramdoss, mmuench, muagarwa, nberry, odf-bz-bot, olakra, srangana, uchapaga, ypadia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-25 06:04:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2107226, 2138210

Description Pratik Surve 2022-06-07 11:38:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Pods are stuck in CreateContainerError with msg Error: relabel failed /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount: lsetxattr /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount/#ib_16384_0.dblwr: read-only file system

Version of all relevant components (if applicable):

OCP version:- 4.10.0-0.nightly-2022-05-26-102501
ODF version:- 4.10.3-4
CEPH version:- ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy the RDR cluster
2. Run IO workloads continuously for an extended period
3. After some days, pods enter the CreateContainerError state
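As a diagnostic sketch (not part of the original report), one way to confirm the blocklisting suspected above is to query Ceph directly. This assumes the standard rook-ceph toolbox deployment in the openshift-storage namespace; the blocklist address in the removal command is a placeholder:

```shell
# List client addresses currently blocklisted by Ceph
# (run via the rook-ceph toolbox; names are the usual defaults).
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
    ceph osd blocklist ls

# If the affected node's RBD client address appears in the list,
# the entry can be removed (use with care; address is illustrative):
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
    ceph osd blocklist rm 10.0.0.1:0/1234567890
```

Note that on Ceph Pacific (16.x, as in this cluster) the subcommand is `blocklist`; older releases used `blacklist`. Removing a blocklist entry does not by itself remount a filesystem read-write; the affected pods typically need to be restarted.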
 

Actual results:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       93m                    default-scheduler  Successfully assigned busybox-workloads-1/mysql-68f8bf78fd-s5gmz to compute-1
  Normal   AddedInterface  93m                    multus             Add eth0 [10.128.2.142/23] from openshift-sdn
  Normal   Pulled          93m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.237571543s
  Normal   Pulled          92m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.515721369s
  Warning  Failed          92m                    kubelet            Error: relabel failed /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount: lsetxattr /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount/ibtmp1: read-only file system
  Normal   Pulled          92m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.319901196s
  Normal   Pulled          92m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.49214362s
  Normal   Pulled          92m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.274508132s
  Normal   Pulled          92m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.412132002s
  Normal   Pulled          91m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.294925949s
  Warning  Failed          91m (x7 over 93m)      kubelet            Error: relabel failed /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount: lsetxattr /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount/#ib_16384_0.dblwr: read-only file system
  Normal   Pulled          91m                    kubelet            Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.472227825s
  Normal   Pulled          18m (x302 over 91m)    kubelet            (combined from similar events): Successfully pulled image "quay.io/prsurve/mysql:latest" in 1.436491104s
  Normal   Pulling         2m59s (x372 over 93m)  kubelet            Pulling image "quay.io/prsurve/mysql:latest"


Expected results:
Pods should remain in the Running state; the mounted RBD volumes should stay writable and pods should not enter CreateContainerError.

Additional info:

Comment 4 Mudit Agarwal 2022-06-21 13:27:05 UTC
Mostly a Ceph fix; will wait for Yati's update.
Not a 4.11 blocker, moving it out.

Comment 23 Shyamsundar 2023-01-20 14:00:51 UTC
*** Bug 2136416 has been marked as a duplicate of this bug. ***

Comment 37 Elad 2023-06-19 06:02:59 UTC
Moving to 4.13.z for verification purposes

Comment 39 Divyansh Kamboj 2023-06-20 06:44:46 UTC
The ODFRBDClientBlocked alert is triggered when an RBD client gets blocked by Ceph on a specific node within the Kubernetes cluster. This occurs when the metric ocs_rbd_client_blocklisted reports a value of 1 for the node, indicating that it has been blocklisted. Additionally, the alert requires pods in a CreateContainerError state on the same node.
From the ODF side we cannot identify whether the blocklisted client is the krbd client or some other client, so the alert also checks for the CreateContainerError condition before firing.
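A minimal sketch of how the logic described above could be expressed as a PrometheusRule. This is not the rule ODF actually ships; the expression, rule name, durations, and the join through kube_pod_info (to attach a node label to the CreateContainerError series) are assumptions for illustration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: odf-rbd-client-blocked-sketch   # illustrative name
  namespace: openshift-storage
spec:
  groups:
    - name: odf-rbd-client.rules
      rules:
        - alert: ODFRBDClientBlocked
          # Fire only when the node is blocklisted AND pods on that same
          # node are stuck in CreateContainerError. The group_left join
          # borrows the node label from kube_pod_info.
          expr: |
            ocs_rbd_client_blocklisted == 1
            and on (node)
            (kube_pod_container_status_waiting_reason{reason="CreateContainerError"}
              * on (namespace, pod) group_left (node) kube_pod_info) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            description: >-
              An RBD client on node {{ $labels.node }} is blocklisted by
              Ceph and pods on that node are in CreateContainerError.
```

The two-condition expression matches the comment above: blocklisting alone is not enough, because ODF cannot tell which client was blocklisted, so the pod-level symptom is used as corroboration.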

https://issues.redhat.com/browse/OCSDOCS-1112 might be tracking the documentation effort.