Bug 2367347
| Summary: | [GSS] rook-ceph-nfs-ocs-storagecluster-cephnfs-a pod in CrashLoopBackOff after ODF upgraded to 4.18.3 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | kelwhite |
| Component: | NFS-Ganesha | Assignee: | Sachin Punadikar <spunadik> |
| NFS-Ganesha sub component: | Ceph | QA Contact: | Manish Singh <manising> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bkunal, cephqe-warriors, gjose, jcaratza, khover, kjosy, kkeithle, lema, mduasope, mori, msaini, ngangadh, nravinas, paarora, smitra, spunadik |
| Version: | 6.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 9.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-20.1.0-13 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2026-01-29 06:49:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
kelwhite
2025-05-19 18:31:49 UTC
Hi,

The data we have is located on supportshell; please let us know if you cannot access it. NFS is configured on ODF as described here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html/managing_and_allocating_storage_resources/creating-exports-using-nfs_rhodf#creating-exports-using-nfs_rhodf
I have a fresh ODF 4.18.3 cluster running, enabled NFS using [1], and reproduced the problem:
Did this:
$ oc --namespace openshift-storage patch storageclusters.ocs.openshift.io ocs-storagecluster --type merge --patch '{"spec": {"nfs":{"enable": true}}}'
and the result:
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-69954f66fd-rk6sq 1/2 CrashLoopBackOff 3 (27s ago) 82s
with the log:
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 6.5
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :domainname in NFSv4 config section will soon be deprecated, define it under DIRECTORY_SERVICES section
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :Use idmapped_user_time_validity under DIRECTORY_SERVICES section to configure time validity of idmapped users
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :Use idmapped_group_time_validity under DIRECTORY_SERVICES section to configure time validity of idmapped groups
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] rados_load_config_from_parse :NFS STARTUP :CRIT :Error while parsing RadosKV specific configuration
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :NFS STARTUP :CRIT :Error setting parameters from configuration file.
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:25): Expected a number, got a option name or number
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:22): 1 errors while processing parameters for RADOS_KV
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:22): Errors processing block (RADOS_KV)
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:36): 1 (invalid param value) errors found block RADOS_KV
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :NFS STARTUP :FATAL :Fatal errors. Server exiting...
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(+0x9b9a1) [0x7fcd542ac9a1]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(+0x99ed5) [0x7fcd542aaed5]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(DisplayLogComponentLevel+0x8b) [0x7fcd542ab2bb]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :ganesha.nfsd(main+0x71a) [0x563f7f208eaa]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libc.so.6(+0x295d0) [0x7fcd540135d0]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libc.so.6(__libc_start_main+0x80) [0x7fcd54013680]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :ganesha.nfsd(_start+0x25) [0x563f7f209635]
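For reference, the crash log and the rendered ganesha.conf can also be pulled without a node debug session. A sketch, assuming the openshift-storage namespace and the nfs-ganesha container name that appear later in this bug; the pod name and ConfigMap name are placeholders and should be confirmed first:

$ oc logs -n openshift-storage <rook-ceph-nfs-pod-name> -c nfs-ganesha --previous
$ oc get cm -n openshift-storage | grep ganesha     # locate the ganesha-config ConfigMap name
$ oc get cm -n openshift-storage <ganesha-configmap-name> -o yaml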
[1] https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html/managing_and_allocating_storage_resources/creating-exports-using-nfs_rhodf#creating-exports-using-nfs_rhodf
I have a workaround for 4.18.3 upgrades, but not for fresh installs... See:

web-console here: https://console-openshift-console.apps.kelwhite-04146883.nasa.aws.cee.support
user: "kubeadmin"
password: "SZjXn-N6HYk-aMh7a-MZufk"

$ oc get csv
NAME                                    DISPLAY                            VERSION        REPLACES                                PHASE
cephcsi-operator.v4.17.6-rhodf          CephCSI operator                   4.17.6-rhodf                                           Succeeded
mcg-operator.v4.17.6-rhodf              NooBaa Operator                    4.17.6-rhodf   mcg-operator.v4.17.5-rhodf              Succeeded
ocs-client-operator.v4.17.6-rhodf       OpenShift Data Foundation Client   4.17.6-rhodf   ocs-client-operator.v4.17.5-rhodf       Succeeded
ocs-operator.v4.17.6-rhodf              OpenShift Container Storage        4.17.6-rhodf   ocs-operator.v4.17.5-rhodf              Succeeded
odf-csi-addons-operator.v4.17.6-rhodf   CSI Addons                         4.17.6-rhodf   odf-csi-addons-operator.v4.17.5-rhodf   Succeeded
odf-operator.v4.17.6-rhodf              OpenShift Data Foundation          4.17.6-rhodf   odf-operator.v4.17.5-rhodf              Succeeded
odf-prometheus-operator.v4.17.6-rhodf   Prometheus Operator                4.17.6-rhodf   odf-prometheus-operator.v4.17.5-rhodf   Succeeded
recipe.v4.17.6-rhodf                    Recipe                             4.17.6-rhodf   recipe.v4.17.5-rhodf                    Succeeded
rook-ceph-operator.v4.17.6-rhodf        Rook-Ceph                          4.17.6-rhodf   rook-ceph-operator.v4.17.5-rhodf        Succeeded

$ oc rsh rook-ceph-tools-5dd94ccb95-mzbq6
sh-5.1$ ceph versions
{
    "mon": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 3
    },
    "mds": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 2
    },
    "overall": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 10
    }
}

$ oc get pods -o wide|grep nfs
csi-nfsplugin-4zrnz                                           3/3   Running   0   47s   10.0.93.62    ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-lmghn                                           3/3   Running   0   47s   10.0.58.211   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-provisioner-68ffc87b6c-6tfm2                    6/6   Running   0   47s   10.131.0.56   ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-68ffc87b6c-hbvdx                    6/6   Running   0   46s   10.128.2.26   ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-wj4jt                                           3/3   Running   0   47s   10.0.7.186    ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f   2/2   Running   0   29s   10.131.0.59   ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>

$ oc debug node/ip-10-0-93-62.us-east-2.compute.internal
Temporary namespace openshift-debug-t24tn is created for debugging node...
Starting pod/ip-10-0-93-62us-east-2computeinternal-debug-pbb4n ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.93.62
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# find / -name ganesha.conf
...
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/..2025_05_19_22_38_09.3236416130/ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf

sh-5.1# cat /sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
NFS_CORE_PARAM {
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
MDCACHE {
    Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
    Attr_Expiration_Time = 0;
}
NFSv4 {
    Delegations = false;
    RecoveryBackend = 'rados_cluster';
    Minor_Versions = 1, 2;
}
RADOS_KV {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    nodeid = ocs-storagecluster-cephnfs.a;
    pool = ".nfs";
    namespace = "ocs-storagecluster-cephnfs";
}
RADOS_URLS {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    watch_url = 'rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs';
}
RGW {
    name = "client.nfs-ganesha.ocs-storagecluster-cephnfs.a";
}
%url rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs

$ oc describe pod rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f
Name:                 rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f
Namespace:            openshift-storage
Priority:             1000000000
Priority Class Name:  openshift-user-critical
Service Account:      rook-ceph-default
Node:                 ip-10-0-93-62.us-east-2.compute.internal/10.0.93.62
Start Time:           Mon, 19 May 2025 16:38:08 -0600
Labels:               app=rook-ceph-nfs
...
Containers:
  nfs-ganesha:
    Container ID:  cri-o://bfb2748199ef8b17239b490d9ca01dbd1b088d26e2db5e778e0ff20ef9125a0e
    Image:         registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
    Image ID:      registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
    Port:          <none>
    Host Port:     <none>
    Command:
      ganesha.nfsd
    Args:
      -F
      -L
      STDERR
      -p
      /var/run/ganesha/ganesha.pid
      -N
      NIV_INFO
    State:          Running
      Started:      Mon, 19 May 2025 16:38:09 -0600
    Ready:          True
    Restart Count:  0

###################################################
Upgraded to ODF 4.18.3:

$ oc get csv
NAME                                    DISPLAY                            VERSION        REPLACES                                PHASE
cephcsi-operator.v4.18.3-rhodf          CephCSI operator                   4.18.3-rhodf   cephcsi-operator.v4.17.6-rhodf          Succeeded
mcg-operator.v4.18.3-rhodf              NooBaa Operator                    4.18.3-rhodf   mcg-operator.v4.17.6-rhodf              Succeeded
ocs-client-operator.v4.18.3-rhodf       OpenShift Data Foundation Client   4.18.3-rhodf   ocs-client-operator.v4.17.6-rhodf       Succeeded
ocs-operator.v4.18.3-rhodf              OpenShift Container Storage        4.18.3-rhodf   ocs-operator.v4.17.6-rhodf              Succeeded
odf-csi-addons-operator.v4.18.3-rhodf   CSI Addons                         4.18.3-rhodf   odf-csi-addons-operator.v4.17.6-rhodf   Succeeded
odf-dependencies.v4.18.3-rhodf          Data Foundation Dependencies       4.18.3-rhodf   odf-dependencies.v4.18.2-rhodf          Succeeded
odf-operator.v4.18.3-rhodf              OpenShift Data Foundation          4.18.3-rhodf   odf-operator.v4.17.6-rhodf              Succeeded
odf-prometheus-operator.v4.18.3-rhodf   Prometheus Operator                4.18.3-rhodf   odf-prometheus-operator.v4.17.6-rhodf   Succeeded
recipe.v4.18.3-rhodf                    Recipe                             4.18.3-rhodf   recipe.v4.17.6-rhodf                    Succeeded
rook-ceph-operator.v4.18.3-rhodf        Rook-Ceph                          4.18.3-rhodf   rook-ceph-operator.v4.17.6-rhodf        Succeeded

$ oc get pods -owide|grep nfs
csi-nfsplugin-b2ckw                                           3/3   Running            0             4m3s    10.0.58.211   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-c8hh4                                           3/3   Running            0             4m35s   10.0.7.186    ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx                    6/6   Running            0             4m29s   10.128.2.35   ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-7fdb64b9fb-q9rlr                    6/6   Running            0             4m29s   10.129.2.55   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-rbblx                                           3/3   Running            0             3m32s   10.0.93.62    ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-69954f66fd-sddkz   1/2   CrashLoopBackOff   4 (48s ago)   2m33s   10.129.2.66   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>

$ oc rsh rook-ceph-tools-588f6dcf49-pvmgb ceph versions
{
    "mon": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 3
    },
    "mgr": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 2
    },
    "osd": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 3
    },
    "mds": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 2
    },
    "overall": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 10
    }
}

$ oc debug node/ip-10-0-93-62.us-east-2.compute.internal
Temporary namespace openshift-debug-8z69t is created for debugging node...
Starting pod/ip-10-0-93-62us-east-2computeinternal-debug-gdz5t ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.93.62
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# find / -name ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/..2025_05_19_23_03_38.3796501208/ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
...

sh-5.1# cat /sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
NFS_CORE_PARAM {
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
MDCACHE {
    Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
    Attr_Expiration_Time = 0;
}
NFSv4 {
    Delegations = false;
    RecoveryBackend = "rados_cluster";
    Minor_Versions = 1, 2;
}
RADOS_KV {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    nodeid = ocs-storagecluster-cephnfs.a;
    pool = ".nfs";
    namespace = "ocs-storagecluster-cephnfs";
}
RADOS_URLS {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    watch_url = "rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs";
}
RGW {
    name = "client.nfs-ganesha.ocs-storagecluster-cephnfs.a";
}
%url rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs

Containers:
  nfs-ganesha:
    Container ID:  cri-o://fb96de81215a595228f74eb128a8ad71eb2fa02aabeb78c6ae3e67ec4c483faf
    Image:         registry.redhat.io/rhceph/rhceph-8-rhel9@sha256:73e2715e3f5b8d98459a3be1f8ffb23f8eddda32c646f9812599a5ab277f2da7
    Image ID:      registry.redhat.io/rhceph/rhceph-8-rhel9@sha256:1669308564815f995bf62c40c780cd5357f5fb0d426e712ae6477a5ba2a983b0
    Port:          <none>
    Host Port:     <none>
    Command:
      ganesha.nfsd
    Args:
      -F
      -L
      STDERR
      -p
      /var/run/ganesha/ganesha.pid
      -N
      NIV_INFO
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2

##################################
Workaround:

$ oc scale deployment rook-ceph-operator rook-ceph-nfs-ocs-storagecluster-cephnfs-a --replicas 0
deployment.apps/rook-ceph-operator scaled
deployment.apps/rook-ceph-nfs-ocs-storagecluster-cephnfs-a scaled

$ oc edit deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a
deployment.apps/rook-ceph-nfs-ocs-storagecluster-cephnfs-a edited
^ change all the 'image' references to the old working image: registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
save and quit

$ oc get pods|grep nfs
csi-nfsplugin-b2ckw                                          3/3   Running   0   9m21s
csi-nfsplugin-c8hh4                                          3/3   Running   0   9m53s
csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx                   6/6   Running   0   9m47s
csi-nfsplugin-provisioner-7fdb64b9fb-q9rlr                   6/6   Running   0   9m47s
csi-nfsplugin-rbblx                                          3/3   Running   0   8m50s
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-b6dfcd957-frglr   2/2   Running   0   17s

19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:2)
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------

Hello,
On the same cluster where the workaround is in place, i.e., with the rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployment modified to use the old working image registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882,
the rook-ceph-nfs-ocs-storagecluster-cephnfs-a-bxxxxx7-fxxxr pod came back to Running from its CLBO state, and PVC creation with the 'ocs-storagecluster-ceph-nfs' storageclass was successful:

$ oc describe pvc testnfs
Name: testnfs
Namespace: openshift-storage
StorageClass: ocs-storagecluster-ceph-nfs
Status: Bound
Volume: pvc-0e8e5688-2279-4a79-acf2-1d3bd2833427
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: openshift-storage.nfs.csi.ceph.com
volume.kubernetes.io/storage-provisioner: openshift-storage.nfs.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 31s persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'openshift-storage.nfs.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 31s openshift-storage.nfs.csi.ceph.com_csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx_22694f4e-f540-488a-bcd3-2934c9e0adf2 External provisioner is provisioning volume for claim "openshift-storage/testnfs"
Normal ProvisioningSucceeded 31s openshift-storage.nfs.csi.ceph.com_csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx_22694f4e-f540-488a-bcd3-2934c9e0adf2 Successfully provisioned volume pvc-0e8e5688-2279-4a79-acf2-1d3bd2833427
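To repeat this verification, a minimal claim along the following lines should exercise the same provisioning path. This is a sketch: the claim name and size are arbitrary, matching the 1Gi/RWO values shown in the describe output above:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testnfs
  namespace: openshift-storage
spec:
  storageClassName: ocs-storagecluster-ceph-nfs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF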
Workaround steps in brief:
a] Scale down rook-ceph-operator and rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployments
$ oc scale deployment rook-ceph-operator rook-ceph-nfs-ocs-storagecluster-cephnfs-a --replicas 0
b] Change all the 'image' references to the old working image registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882 (a non-interactive alternative is sketched after these steps)
$ oc edit deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a
c] Scale up rook-ceph-operator and rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployments
d] Check that the rook-ceph-nfs-ocs-storagecluster-cephnfs-a-6xxxxxxd-sxxxz pod is back to Running from CLBO
$ oc get pods -o wide|grep nfs
e] Verify pvc creation with storageclass 'ocs-storagecluster-ceph-nfs' is successful
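For step b], the image swap can also be applied non-interactively. A sketch, assuming the nfs-ganesha container name from the pod description above; if the deployment reports additional containers or init containers, each image reference needs the same change, as step b] says:

$ oc set image -n openshift-storage deployment/rook-ceph-nfs-ocs-storagecluster-cephnfs-a nfs-ganesha=registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882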
The steps are yet to be executed in the customer's environment.
Regards,
Soumi
@Sachin Punadikar For the workaround outlined in https://bugzilla.redhat.com/show_bug.cgi?id=2367347#c8 I have a few questions:

1) Where is this run from? The rook-ceph-nfs-ocs-storagecluster-cephnfs-a pod will be in CLBO, but the 'ganesha-rados-grace' binary is in the rook-ceph-tools pod.
2) What options is this ganesha-rados-grace command run with?

Here are some tests:

$ oc rsh rook-ceph-tools-588f6dcf49-pvmgb
sh-5.1$ ceph -s
  cluster:
    id:     a4faeada-c8b1-441d-91d8-b06b8ba9e157
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 17h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 145 pgs
    objects: 140 objects, 265 MiB
    usage:   984 MiB used, 12 TiB / 12 TiB avail
    pgs:     145 active+clean
  io:
    client: 938 B/s rd, 1.7 KiB/s wr, 1 op/s rd, 0 op/s wr

sh-5.1$ ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    12 TiB  12 TiB  990 MiB  990 MiB   0
TOTAL  12 TiB  12 TiB  990 MiB  990 MiB   0

--- POOLS ---
POOL                                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
ocs-storagecluster-cephblockpool             1   64  179 MiB      109  536 MiB      0    3.4 TiB
.mgr                                         2    1  577 KiB        2  1.7 MiB      0    3.4 TiB
ocs-storagecluster-cephfilesystem-metadata   3   16  112 KiB       24  421 KiB      0    3.4 TiB
ocs-storagecluster-cephfilesystem-data0      4   32    161 B        1   12 KiB      0    3.4 TiB
.nfs                                         5   32  2.7 KiB        4   42 KiB      0    3.4 TiB

sh-5.1$ ganesha-rados-grace
rados_ioctx_create: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace --help
ganesha-rados-grace: unrecognized option '--help'
Usage:
ganesha-rados-grace [ --userid ceph_user ] [ --cephconf /path/to/ceph.conf ] [ --ns namespace ] [ --oid obj_id ] [ --pool pool_id ] dump|add|start|join|lift|remove|enforce|noenforce|member [ nodeid ... ]
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf
rados_ioctx_create: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf --userid client.admin
rados_connect: -13
Can't connect to cluster: -13
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf --userid client.admin --pool .nfs
rados_connect: -13
Can't connect to cluster: -13

ref: https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/doc/man/ganesha-rados-grace.rst

Nvm...:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs
cur=4 rec=0
======================================================
nodeocs-storagecluster-cephnfs.a E ocs-storagecluster-cephnfs.a

@Sachin Punadikar So, I got the correct output when I ran this on my 'fixed' 4.18.3 cluster that's using the older image:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs dump
cur=4 rec=0
======================================================
nodeocs-storagecluster-cephnfs.a E ocs-storagecluster-cephnfs.a
sh-5.1$

However, trying to run any of the commands in c#8 will not work on a newly deployed ODF 4.18.3 cluster:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs
rados_conf_read_file: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs dump
rados_conf_read_file: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace -p ".nfs" --ns nfsganesha add node0
rados_conf_read_file: -2
Can't connect to cluster: -2

// ceph df
--- POOLS ---
POOL                                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.nfs                                         1   32     16 B        2   12 KiB      0    435 GiB
.mgr                                         2   32  577 KiB        2  1.7 MiB      0    435 GiB
ocs-storagecluster-cephblockpool             3   32   81 MiB       74  242 MiB   0.02    435 GiB
ocs-storagecluster-cephfilesystem-metadata   4   16   25 KiB       22  156 KiB      0    435 GiB
ocs-storagecluster-cephfilesystem-data0      5   32      0 B        0      0 B      0    435 GiB
...

Do note, on fresh installs of 4.18.3, the nfs container doesn't fully finish its reconcile as ganesha never fully runs, so I don't think the grace db would be populated like it was in an upgrade. Hence the commands don't work on new 4.18.3 deployments.

We have 2 clusters if you want to play around on them:

// Upgraded 4.17 -> 4.18.3 cluster with the image swap workaround:
web-console here: https://console-openshift-console.apps.kelwhite-04146883.nasa.aws.cee.support
user: "kubeadmin"
password: "SZjXn-N6HYk-aMh7a-MZufk"

// Fresh odf 4.18.3 deployment:
https://console-openshift-console.apps.cluster-dfkpn.dfkpn.sandbox967.opentlc.com
User: kubeadmin
Password: qNRWh-bSZhR-2CHW2-wMETb
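A quick way to tell whether the grace db was ever written on a given cluster is to list the objects in the RADOS_KV pool/namespace directly. A sketch, run from the rook-ceph-tools pod, with the pool and namespace taken from the RADOS_KV block earlier in this bug:

sh-5.1$ rados -p .nfs -N ocs-storagecluster-cephnfs ls    # on a healthy/upgraded cluster, expect the grace object and conf-nfs.* objects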
@ Miguel

> NOTE. If we scale up rook-ceph-operator, it reconciles the deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a with the 4.18.3 image and the pod goes back into CLBO again. So we need to keep rook-ceph-operator scaled down for this workaround to work.
You can try to block reconcile at the deployment layer.
$ oc patch deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage --type merge --patch '{"metadata": {"labels": {"ceph.rook.io/do-not-reconcile": "true"}}}'
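If that label works as intended, it can presumably be removed later to hand the deployment back to the operator. A sketch, assuming the label from the patch above:

$ oc label deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage ceph.rook.io/do-not-reconcile-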
(In reply to khover from comment #19)
> @ Miguel
>
> > NOTE. If we scale up rook-ceph-operator , it reconciles the deployment
> > rook-ceph-nfs-ocs-storagecluster-cephnfs-a with the image of 4.18.3 and pod
> > goes back into CLBO again. So we need to keep rook-ceph-operator scale down
> > for this workaround to work.
>
> You can try to block reconcile at the deployment layer.
>
> $ oc patch deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage --type merge --patch '{"metadata": {"labels": {"ceph.rook.io/do-not-reconcile": "true"}}}'

Disregard ^^ seems it was tested and I was unaware

Two more clarifications need to be answered:

1. Is it correct to understand that the workaround requires continuing operation with the rook-ceph-operator stopped? If so, what is the operational impact of not having the operator running, and what precautions should we take during this state?
2. Is there a method to exclude only the rook-ceph-nfs component from being managed by the rook-ceph-operator (in order to prevent reconcile from reverting manual image changes to rook-ceph-nfs)?

Cheers
Ray

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536