Bug 2120633
| Summary: | [IBM Z] Ceph-fs PVC (RWX) mount involving cephfs client eviction fails to mount the PV on the new pod | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sravika <sbalusu> |
| Component: | csi-driver | Assignee: | Humble Chirammal <hchiramm> |
| Status: | CLOSED DUPLICATE | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.11 | CC: | khiremat, madam, mrajanna, muagarwa, nberry, ocs-bugs, odf-bz-bot, paarora, rar, sapillai, tnielsen |
| Target Milestone: | --- | Flags: | khiremat: needinfo-, khiremat: needinfo- |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-10-12 14:42:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Similar issue: https://github.com/rook/rook/issues/9782#issuecomment-1149514194

The problem mainly looks related to eviction; in the pod description I see `PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)`. While looking into the eviction steps, I see you probably missed removing the metadata for that cephfs client:

```
3. Evict the cephfs client with the following commands
ceph tell mds.0 client ls
ceph tell mds.0 client evict id=<id>
```

You should also run `ceph tell mds.0 client evict client_metadata.=<id>`.

Reference: https://docs.ceph.com/en/quincy/cephfs/eviction/#manual-client-eviction

I would request QE and the reporter to verify the above point of evicting the metadata and check whether it works fine using that.
https://bugzilla.redhat.com/show_bug.cgi?id=2120633#c3

Thanks!

@paarora: That command reports an invalid key:

```
sh-4.4$ ceph tell mds.0 client evict client_metadata.=303212
2022-08-24T08:32:39.050+0000 3ff717fa910  0 client.303440 ms_handle_reset on v2:10.129.2.28:6800/612719934
2022-08-24T08:32:39.080+0000 3ff717fa910  0 client.303446 ms_handle_reset on v2:10.129.2.28:6800/612719934
Error EINVAL: Invalid filter key 'client_metadata.'
sh-4.4$
```

Looks like the `Error EINVAL: Invalid filter key 'client_metadata.'` error is with the command in the docs.

@khiremat can you please see what's the problem with this cephfs command?

CSI team PTAL

(In reply to Parth Arora from comment #6)
> Looks like the `Error EINVAL: Invalid filter key 'client_metadata.'` error is with the command in the docs.
> @khiremat can you please see what's the problem with this cephfs command?

Here is the sample client_metadata from the 'client ls' output:

```
"client_metadata": {
    "client_features": {
        "feature_bits": "0x000000000003ffff"
    },
    "metric_spec": {
        "metric_flags": {
            "feature_bits": "0x000000000000ffff"
        }
    },
    "ceph_sha1": "no_version",
    "ceph_version": "ceph version Development (no_version) quincy (dev)",
    "entity_id": "admin",
    "hostname": "kotresh-T490s",
    "mount_point": "/mnt",
    "pid": "244420",
    "root": "/"
}
```

So I think the command should be as below:

```
ceph tell mds.0 client evict client_metadata.pid=244420
```

Thanks and Regards,
Kotresh HR

Sravika, can you please re-run the correct command and check, as mentioned in the previous comment.

@khiremat, @muagarwa: Thanks, the new command works fine.
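The corrected filter above can also be derived programmatically from the `ceph tell mds.0 client ls` JSON output. A minimal sketch (the helper function and the sample data are hypothetical, not part of the CSI driver or ocs-ci; the metadata-filter syntax follows the Ceph eviction docs referenced earlier):

```python
import json

# Sample entry shaped like one element of the `ceph tell mds.0 client ls`
# JSON output quoted in the comment above (values are illustrative).
SAMPLE_CLIENT_LS = json.loads("""
[
  {
    "id": 303212,
    "client_metadata": {
      "entity_id": "admin",
      "hostname": "kotresh-T490s",
      "mount_point": "/mnt",
      "pid": "244420",
      "root": "/"
    }
  }
]
""")

def evict_command(clients, pid):
    """Return the `ceph tell` evict command for the client whose
    client_metadata.pid matches the given pid, or None if no client matches."""
    for client in clients:
        if client.get("client_metadata", {}).get("pid") == str(pid):
            return f"ceph tell mds.0 client evict client_metadata.pid={pid}"
    return None

print(evict_command(SAMPLE_CLIENT_LS, 244420))
```

This only builds the command string from parsed output; running it still requires access to the toolbox pod and the MDS.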
However, the ceph client eviction does not seem to be successful:

```
sh-4.4$ ceph tell mds.0 client evict id=35263
2022-10-07T14:11:54.969+0000 3ff797fa910  0 client.39521 ms_handle_reset on v2:10.128.2.70:6800/2660727071
2022-10-07T14:11:55.869+0000 3ff797fa910  0 client.39527 ms_handle_reset on v2:10.128.2.70:6800/2660727071
sh-4.4$ ceph tell mds.0 client evict client_metadata.pid=35263
2022-10-07T14:12:32.639+0000 3ff70ff9910  0 client.39542 ms_handle_reset on v2:10.128.2.70:6800/2660727071
2022-10-07T14:12:32.659+0000 3ff70ff9910  0 client.39548 ms_handle_reset on v2:10.128.2.70:6800/2660727071
```

```
# ls -lrt /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/34c1afd3fe28812bc7428029c7bb7229495aaddccb9dadf420ec680152ee4639/
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/34c1afd3fe28812bc7428029c7bb7229495aaddccb9dadf420ec680152ee4639/globalmount': Permission denied
total 4
d?????????? ? ?    ?      ?           ? globalmount
-rw-r--r--. 1 root root 154 Oct  7 14:06 vol_data.json
```

```
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  15s (x10 over 4m25s)  kubelet  MountVolume.SetUp failed for volume "pvc-273720ab-605f-4b1f-92e8-18d40fa4222e" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/34c1afd3fe28812bc7428029c7bb7229495aaddccb9dadf420ec680152ee4639/globalmount: permission denied
  Warning  FailedMount  8s (x2 over 2m23s)    kubelet  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[kube-api-access-ppvbl mypvc]: timed out waiting for the condition
```

> 1. Create a cephfs pvc
> 2. Create a pod on one of the worker nodes mounting this pvc
> 3. Evict the cephfs client with the following commands
>    ceph tell mds.0 client ls
>    ceph tell mds.0 client evict id=<id>
> 4. Create a new pod on the same worker nodes mounting the pvc
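For reference, steps 1, 2, and 4 above can be captured as Kubernetes manifests. This is a minimal sketch, not the ocs-ci test itself: the resource names and the pinned node name are hypothetical, and the storage class is assumed to be the default ODF cephfs class.

```yaml
# Hypothetical names; storageClassName assumes the default ODF cephfs class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-evict-test
spec:
  accessModes: ["ReadWriteMany"]      # RWX, as in the bug title
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-evict-pod-1
spec:
  nodeName: worker-0                  # pin both pods to the same worker node
  containers:
    - name: web-server
      image: quay.io/ocsci/nginx:latest
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: cephfs-evict-test
```

Step 4 would create a second pod with the same `nodeName` and `claimName` after the eviction in step 3.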
I am not sure this is a valid test case. Why are you manually evicting the connected client and then creating a new pod on the same node?
Was it working earlier and not working recently?
A single cephfs subvolume is mounted only once per node. If the client is evicted, either the node needs to be rebooted to recover, or all the applications using that PVC on that node need to be deleted and recreated.
> Steps to Reproduce:
> 1. Create a cephfs pvc
> 2. Create a pod on one of the worker nodes mounting this pvc
> 3. Evict the cephfs client with the following commands
>    ceph tell mds.0 client ls
>    ceph tell mds.0 client evict id=<id>
> 4. Create a new pod on the same worker nodes mounting the pvc

Was the first pod in step 2 deleted successfully? Evicting the client at the ceph level leaves behind a mount on which a permission denied error occurs on the node where the pods are mounted.

- All the pods using that PVC need to be deleted/unmounted successfully on each node first, before creating a new pod using that PVC.
- Or a node reboot must be performed.

What is the goal of these steps?

I see a needinfo flag on me, but not the comment. Could you please make the comment public?

@mrajanna, @rar: I executed these steps in order to verify the ocs-ci test case, which I think was created based on bug 1901499; unfortunately I cannot access that bug.
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_pvc_evict_ceph_clients.py

The test case has two tests:

1. When two pods try to mount the same PVC and are scheduled on the same node. This will be fixed when BZ 1901499 is fixed; that bug is still in the assigned state.
2. When two pods try to mount the same PVC and are scheduled on different nodes. If you want, you can verify this test.

> tests/manage/pv_services/test_pvc_evict_ceph_clients.py::TestPvcEvictCephClients::test_pvc_evict_ceph_clients[same]

Based on the above test this is the same as 1901499, so I am marking it as a duplicate and closing it now. Please feel free to reopen it.

*** This bug has been marked as a duplicate of bug 1901499 ***

@mrajanna: The test which mounts the same pvc on different nodes has already been verified and works fine. In order to follow the updates on the same-node pvc mount test, can you please make bug 1901499 public? Thanks.
BZ 1901499 cannot be marked as public as it contains critical information. I will update this BZ once 1901499 is resolved.
Description of problem (please be detailed as possible and provide log snippets):

Ceph-fs PVC (RWX) mount involving cephfs client eviction fails when two pods try to mount the same PVC and are scheduled on the same node. This test case is part of ocs-ci tier2:
tests/manage/pv_services/test_pvc_evict_ceph_clients.py::TestPvcEvictCephClients::test_pvc_evict_ceph_clients[same]

Version of all relevant components (if applicable):
ODF: 4.11.0-137

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a cephfs pvc
2. Create a pod on one of the worker nodes mounting this pvc
3. Evict the cephfs client with the following commands
   ceph tell mds.0 client ls
   ceph tell mds.0 client evict id=<id>
4. Create a new pod on the same worker nodes mounting the pvc

Actual results:
Cephfs client eviction fails to mount the PV on the new pod

```
[root@worker-0 ~]# ls -lrt /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/be3249c31d42b7152b74179a6c5ccd6b5f36973e3de6b4dffc3077b2a42cc020/
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/be3249c31d42b7152b74179a6c5ccd6b5f36973e3de6b4dffc3077b2a42cc020/globalmount': Permission denied
total 4
d?????????? ? ?    ?      ?           ? globalmount
```

```
# oc -n namespace-test-5b10376709214c58a93ffa9c1 describe po pod-test-cephfs-ff79f59afd8540079630aa81
Name:         pod-test-cephfs-ff79f59afd8540079630aa81
Namespace:    namespace-test-5b10376709214c58a93ffa9c1
Priority:     0
Node:         worker-0.ocsm4205001.lnxero1.boe/172.23.233.148
Start Time:   Tue, 23 Aug 2022 12:22:29 +0200
Labels:       <none>
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:
IPs:          <none>
Containers:
  web-server:
    Container ID:
    Image:          quay.io/ocsci/nginx:latest
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-swswb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-test-aceda8180eb94271938808fe200a63d
    ReadOnly:   false
  kube-api-access-swswb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                From     Message
  ----     ------       ----               ----     -------
  Warning  FailedMount  15s (x6 over 31s)  kubelet  MountVolume.SetUp failed for volume "pvc-a35a5b05-221c-4166-8c63-0c283444e650" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/be3249c31d42b7152b74179a6c5ccd6b5f36973e3de6b4dffc3077b2a42cc020/globalmount: permission denied
```

Expected results:
Mounting of the PV on the new pod should work fine

Additional info:
Must-gather results:
https://drive.google.com/file/d/1YWGPvDnVGbu28fGNfUSIeDGU198szgD3/view?usp=sharing