Bug 2314239

Summary: [MDS] "HEALTH_WARN" with "1 clients failing to respond to capability release" on Ceph Version 17.2.6-246.el9cp
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Craig Wayman <crwayman>
Component: ceph Assignee: Venky Shankar <vshankar>
ceph sub component: CephFS QA Contact: Elad <ebenahar>
Status: NEW --- Docs Contact:
Severity: high    
Priority: unspecified CC: bniver, muagarwa, sheggodu, sostapov
Version: 4.14   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Craig Wayman 2024-09-23 15:25:35 UTC
Description of problem (please be detailed as possible and provide log snippets):

  This case was originally opened because the customer was hitting slow ops as a result of SELinux relabeling. That was resolved, and we were monitoring before closing the case. The customer reported everything was working well, but just as we were about to close the case, they hit the "HEALTH_WARN" with "1 clients failing to respond to capability release" issue.

  Upon further research, the customer should not be hitting this issue: they are on Ceph version 17.2.6-246.el9cp (RHCS 6.1z7 / 6.1.7), and from what I am tracking the fix was implemented in 6.1z4.

  The good news is that the customer states storage is working fine, and we have tracked this down to one specific workload causing the issue (a very high file count workload):

Name:            pvc-d272cca3-9f94-423c-ad37-f4993caf2949
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-cephfs-provisioner
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    vivo-cephfs-selinux-relabel
Status:          Bound
Claim:           osb-fscustomer-esteira2/osb-fscustomer-esteira2-pvc <------ WORKLOAD
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        30Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-aab3519a-5c94-4874-abb1-6cb90ececb6a
    ReadOnly:          false
    VolumeAttributes:      clusterID=openshift-storage
                           fsName=ocs-storagecluster-cephfilesystem
                           kernelMountOptions=context="system_u:object_r:container_file_t:s0"
                           storage.kubernetes.io/csiProvisionerIdentity=1726082784973-9392-openshift-storage.cephfs.csi.ceph.com
                           subvolumeName=csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a
                           subvolumePath=/volumes/csi/csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a/05ed12dd-b991-4bf0-9b9e-35611555e642
Events:                <none>
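
  For reference, the PV output above corresponds to a standard describe of that PV, e.g.:

$ oc describe pv pvc-d272cca3-9f94-423c-ad37-f4993caf2949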



  Two specific operations trigger this issue:


A. A PVC clone operation, or copying PVC content from a pod to a local folder on the node (a hedged clone manifest sketch follows below).


B. Creating 5 pods (osb-fscustomer-server1, osb-fscustomer-server2, osb-fscustomer-server3, osb-fscustomer-server4, osb-fscustomer-server5), all mounting osb-fscustomer-esteira2-pvc.
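
  For illustration only, a minimal clone manifest for operation A might look like the sketch below. The clone name and request size are hypothetical; the source PVC, namespace, and StorageClass are taken from the PV description above.

~~~
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: osb-fscustomer-esteira2-pvc-clone    # hypothetical clone name
  namespace: osb-fscustomer-esteira2
spec:
  storageClassName: vivo-cephfs-selinux-relabel
  dataSource:
    kind: PersistentVolumeClaim
    name: osb-fscustomer-esteira2-pvc        # source PVC from the PV description above
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi                          # matches the source PVC capacity
EOF
~~~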


  Since we're able to successfully reproduce this issue, the customer has been given a data collection process that should give Engineering all the logs/data needed. 

 

Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.33   True        False         49d     Cluster version is 4.14.33


ODF:
NAME                                     DISPLAY                       VERSION         REPLACES                                PHASE
argocd-operator.v0.12.0                  Argo CD                       0.12.0          argocd-operator.v0.11.0                 Succeeded
cluster-logging.v5.9.6                   Red Hat OpenShift Logging     5.9.6           cluster-logging.v5.9.5                  Succeeded
loki-operator.v5.9.6                     Loki Operator                 5.9.6           loki-operator.v5.9.5                    Succeeded
mcg-operator.v4.14.10-rhodf              NooBaa Operator               4.14.10-rhodf   mcg-operator.v4.14.9-rhodf              Succeeded
ocs-operator.v4.14.10-rhodf              OpenShift Container Storage   4.14.10-rhodf   ocs-operator.v4.14.9-rhodf              Succeeded
odf-csi-addons-operator.v4.14.10-rhodf   CSI Addons                    4.14.10-rhodf   odf-csi-addons-operator.v4.14.9-rhodf   Succeeded
odf-operator.v4.14.10-rhodf              OpenShift Data Foundation     4.14.10-rhodf   odf-operator.v4.14.9-rhodf              Succeeded


Ceph:

{
    "mon": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3
    },
    "mds": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 10
    }
}
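
For reference, the outputs above can be gathered with commands along these lines (operator CSVs may live in namespaces other than openshift-storage, depending on how they were installed):

$ oc get clusterversion
$ oc get csv -n openshift-storage
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph versions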



Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes. The customer wants to roll this out to production and has raised this issue as crucial to fix given their current timelines. The customer's statement is below:

"This issue is impacting a major project with Telefonica Brazil. If this is a bug, we urgently need a Bugzilla report created as soon as possible to facilitate a timely resolution."


Is there any workaround available to the best of your knowledge?

Yes, the following process clears the warning, but operations A and B still do not succeed:


1. Run the following command to capture the client ID and the name of the MDS holding on to the caps:

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail

2. Run the following command to capture the session list from the active MDS:

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls > active-mds-session-ls.txt


3. Search for the "<client.ID>" in the active-mds-session-ls.txt file. When you find that client session, scroll down until you see the csi-vol information. Copy the csi-vol-xxx-xxx identifier, everything up to the /. Then run the following command to dump all PV descriptions:


$ oc get pv | awk 'NR>1 {print $1}' | while read it; do oc describe pv ${it}; echo " "; done > pv.out

4. Search pv.out for the csi-vol you found in that client session. The PV whose description contains it points to the problematic workload; scale that workload down. A rough sketch of this lookup and scale-down follows below.
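
As a rough sketch of steps 3-4, assuming the client ID and csi-vol identifier have already been noted (the values and workload name below are placeholders):

~~~
# Locate the client session and its subvolume path (placeholder client ID)
$ grep -A 40 '<client.ID>' active-mds-session-ls.txt | grep csi-vol

# Find the PV/claim that uses that subvolume (placeholder csi-vol identifier)
$ grep -B 30 'csi-vol-<id>' pv.out | grep -E 'Name:|Claim:'

# Scale down the owning workload (deployment name is hypothetical;
# the workload could also be a StatefulSet or other controller)
$ oc scale deployment <workload-deployment> -n osb-fscustomer-esteira2 --replicas=0
~~~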

5. Once all pods for that workload have terminated, delete the `rook-ceph-mds-ocs-storagecluster-cephfilesystem-a` pod:

~~~
$ oc delete pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-<pod-name>
~~~

6. After a few minutes, Ceph returns to a healthy state.
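
Recovery can be confirmed from the toolbox, e.g.:

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph -s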


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. A PVC clone operation, or copying PVC content from a pod to a local folder on the node.

or

2. Creating 5 pods (osb-fscustomer-server1, osb-fscustomer-server2, osb-fscustomer-server3, osb-fscustomer-server4, osb-fscustomer-server5), all mounting osb-fscustomer-esteira2-pvc (a hedged pod manifest sketch follows below).
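
A minimal sketch of one of the five pods for step 2; the image, command, and mount path are hypothetical, while the namespace and PVC name are taken from the PV description above:

~~~
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: osb-fscustomer-server1               # repeat for server2..server5
  namespace: osb-fscustomer-esteira2
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi9/ubi-minimal   # hypothetical image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data                   # hypothetical mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: osb-fscustomer-esteira2-pvc
EOF
~~~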

Additional info:

  I will put the data collection steps given to the customer in the private comment. Once collected, we'll set needinfo.