Bug 2314239 - [MDS] "HEALTH_WARN" with "1 clients failing to respond to capability release" on Ceph Version 17.2.6-246.el9cp
Summary: [MDS] "HEALTH_WARN" with "1 clients failing to respond to capability release"...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-09-23 15:25 UTC by Craig Wayman
Modified: 2024-11-04 06:22 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links:
Red Hat Issue Tracker OCSBZM-9472 (Last Updated: 2024-11-04 06:22:37 UTC)

Description Craig Wayman 2024-09-23 15:25:35 UTC
Description of problem (please be as detailed as possible and provide log snippets):

  This case was originally opened because the customer was hitting slow ops as a result of SELinux relabeling. We fixed that and were monitoring before closing the case. The customer reported everything was working great, but just as we were about to close the case they hit the "HEALTH_WARN" with "1 clients failing to respond to capability release" issue.

  Upon further research, the customer should not be hitting this issue: they are on Ceph version 17.2.6-246.el9cp (6.1z7 / 6.1.7), and from what I am tracking the fix was implemented in 6.1z4.

  The good news is that the customer reports that storage is working fine, and we have tracked the issue down to one specific workload (a very high file count workload):

Name:            pvc-d272cca3-9f94-423c-ad37-f4993caf2949
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-cephfs-provisioner
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    vivo-cephfs-selinux-relabel
Status:          Bound
Claim:           osb-fscustomer-esteira2/osb-fscustomer-esteira2-pvc <------ WORKLOAD
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        30Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-aab3519a-5c94-4874-abb1-6cb90ececb6a
    ReadOnly:          false
    VolumeAttributes:      clusterID=openshift-storage
                           fsName=ocs-storagecluster-cephfilesystem
                           kernelMountOptions=context="system_u:object_r:container_file_t:s0"
                           storage.kubernetes.io/csiProvisionerIdentity=1726082784973-9392-openshift-storage.cephfs.csi.ceph.com
                           subvolumeName=csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a
                           subvolumePath=/volumes/csi/csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a/05ed12dd-b991-4bf0-9b9e-35611555e642
Events:                <none>
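
  For reference, the PV bound to a given claim (and the CephFS subvolume behind it) can also be pulled directly; a minimal sketch, assuming jq is available on the workstation:

~~~
# Hedged sketch: locate the PV and CephFS subvolume backing the workload's claim.
# The claim namespace/name come from this case; jq on the workstation is an assumption.
$ oc get pv -o json | jq -r '
    .items[]
    | select(.spec.claimRef.namespace == "osb-fscustomer-esteira2"
             and .spec.claimRef.name == "osb-fscustomer-esteira2-pvc")
    | "\(.metadata.name)  \(.spec.csi.volumeAttributes.subvolumeName)"'
~~~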



  Specifically, two operations trigger this issue:

A. A PVC clone operation, or copying PVC content from a pod to a local folder on the node (see the clone sketch below).

B. Creating 5 Pods (osb-fscustomer-server1, osb-fscustomer-server2, osb-fscustomer-server3, osb-fscustomer-server4, osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc.
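
  A minimal sketch of operation A as a CSI volume clone (the clone PVC name is hypothetical; the source claim, namespace, storage class, size, and access mode come from the PV above):

~~~
# Hypothetical clone manifest for operation A; only the source PVC details come from this case.
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: osb-fscustomer-esteira2-pvc-clone    # hypothetical name
  namespace: osb-fscustomer-esteira2
spec:
  storageClassName: vivo-cephfs-selinux-relabel
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: osb-fscustomer-esteira2-pvc        # source PVC from this case
EOF
~~~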


  Since we're able to successfully reproduce this issue, the customer has been given a data collection process that should give Engineering all the logs/data needed. 

 

Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.33   True        False         49d     Cluster version is 4.14.33


ODF:
NAME                                     DISPLAY                       VERSION         REPLACES                                PHASE
argocd-operator.v0.12.0                  Argo CD                       0.12.0          argocd-operator.v0.11.0                 Succeeded
cluster-logging.v5.9.6                   Red Hat OpenShift Logging     5.9.6           cluster-logging.v5.9.5                  Succeeded
loki-operator.v5.9.6                     Loki Operator                 5.9.6           loki-operator.v5.9.5                    Succeeded
mcg-operator.v4.14.10-rhodf              NooBaa Operator               4.14.10-rhodf   mcg-operator.v4.14.9-rhodf              Succeeded
ocs-operator.v4.14.10-rhodf              OpenShift Container Storage   4.14.10-rhodf   ocs-operator.v4.14.9-rhodf              Succeeded
odf-csi-addons-operator.v4.14.10-rhodf   CSI Addons                    4.14.10-rhodf   odf-csi-addons-operator.v4.14.9-rhodf   Succeeded
odf-operator.v4.14.10-rhodf              OpenShift Data Foundation     4.14.10-rhodf   odf-operator.v4.14.9-rhodf              Succeeded


Ceph:

{
    "mon": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3
    },
    "mds": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 10
    }
}
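
  For reference, the breakdown above can be re-gathered at any time with `ceph versions` via the toolbox deployment (the same toolbox used in the workaround below):

~~~
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph versions
~~~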



Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes. The customer wants to roll this out to production and has raised this issue as crucial to fix given their current timelines. Statement from the customer:

"This issue is impacting a major project with Telefonica Brazil. If this is a bug, we urgently need a Bugzilla report created as soon as possible to facilitate a timely resolution."


Is there any workaround available to the best of your knowledge?

Yes. The following process clears the warning, but operations A and B still do not succeed:


1. Run the following command to capture the client ID that is failing to release caps and the MDS reporting the warning:

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail

2. Run the following command to capture the session list from the active MDS:

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls > active-mds-session-ls.txt


3. Search for the "<client.ID>" from step 1 in active-mds-session-ls.txt. When you find that client's session, scroll down until you see the csi-vol information and copy the csi-vol-xxx-xxx name up to (but not including) the "/". Then run the following command:


$ oc get pv | awk 'NR>1 {print $1}' | while read it; do oc describe pv ${it}; echo " "; done > pv.out

4. Search pv.out for the csi-vol name you found in that client session. The PV containing it belongs to the problematic workload; scale that workload down. (A sketch automating steps 1-4 follows this list.)

5. Once all pods for that workload have terminated, delete the `rook-ceph-mds-ocs-storagecluster-cephfilesystem-a` pod:

~~~
$ oc delete pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-<pod-name>
~~~

6. After a few minutes, Ceph health returns to normal.
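
  For reference, steps 1 through 4 above can be scripted; a minimal sketch, assuming jq is available and that the MDS session entries expose the client's mount path under client_metadata.root (as they typically do for CSI-provisioned volumes):

~~~
# Hedged sketch: map the client flagged by the capability-release warning to its PV/workload.
TOOLS="oc exec -n openshift-storage deployment/rook-ceph-tools --"

# Client ID reported by 'ceph health detail' (step 1); replace the placeholder.
CLIENT_ID=<client.ID>

# Pull that client's mount root from the active MDS session list and keep the csi-vol name (steps 2-3).
CSI_VOL=$($TOOLS ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls \
          | jq -r ".[] | select(.id == $CLIENT_ID) | .client_metadata.root" \
          | awk -F/ '{print $4}')    # e.g. csi-vol-xxxxxxxx-...

# Find the PV and claim using that subvolume; that claim's workload is the one to scale down (step 4).
oc get pv -o json | jq -r ".items[]
  | select(.spec.csi.volumeAttributes.subvolumeName == \"$CSI_VOL\")
  | \"\(.metadata.name)  \(.spec.claimRef.namespace)/\(.spec.claimRef.name)\""
~~~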


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Perform a PVC clone operation, or copy PVC content from a pod to a local folder on the node (operation A above).

or

2. Create 5 Pods (osb-fscustomer-server1 through osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc (see the sketch below).
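
  A minimal sketch of reproduction step 2 (the pod names and the target claim come from this case; the image, mount path, and command are placeholders):

~~~
# Hypothetical Pod manifests for reproduction step 2; image, mount path, and command are placeholders.
$ for i in 1 2 3 4 5; do
cat <<EOF | oc apply -n osb-fscustomer-esteira2 -f -
apiVersion: v1
kind: Pod
metadata:
  name: osb-fscustomer-server${i}
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi9/ubi   # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data                          # placeholder mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: osb-fscustomer-esteira2-pvc
EOF
done
~~~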

Additional info:

  I will put the data collection steps given to the customer in the private comment. Once collected, we'll set needinfo.

