Bug 2367347
| Summary: | [GSS] rook-ceph-nfs-ocs-storagecluster-cephnfs-a pod in CrashLoopBackOff after ODF upgraded to 4.18.3 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | kelwhite |
| Component: | NFS-Ganesha | Assignee: | Sachin Punadikar <spunadik> |
| NFS-Ganesha sub component: | Ceph | QA Contact: | Manish Singh <manising> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bkunal, cephqe-warriors, gjose, jcaratza, khover, kjosy, kkeithle, lema, mduasope, mori, msaini, ngangadh, nravinas, paarora, smitra, spunadik |
| Version: | 6.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 9.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-20.1.0-13 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2026-01-29 06:49:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
kelwhite
2025-05-19 18:31:49 UTC
Hi,

The data we have is located on supportshell; please let us know if you cannot access it. NFS is configured on ODF as described here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html/managing_and_allocating_storage_resources/creating-exports-using-nfs_rhodf#creating-exports-using-nfs_rhodf
I have a fresh ODF 4.18.3 cluster running, enabled NFS using [1], and reproduced the problem:
Did this:
$ oc --namespace openshift-storage patch storageclusters.ocs.openshift.io ocs-storagecluster --type merge --patch '{"spec": {"nfs":{"enable": true}}}'
and the result:
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-69954f66fd-rk6sq 1/2 CrashLoopBackOff 3 (27s ago) 82s
with the log:
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 6.5
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :domainname in NFSv4 config section will soon be deprecated, define it under DIRECTORY_SERVICES section
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :Use idmapped_user_time_validity under DIRECTORY_SERVICES section to configure time validity of idmapped users
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :WARN :Use idmapped_group_time_validity under DIRECTORY_SERVICES section to configure time validity of idmapped groups
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] rados_load_config_from_parse :NFS STARTUP :CRIT :Error while parsing RadosKV specific configuration
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :NFS STARTUP :CRIT :Error setting parameters from configuration file.
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:25): Expected a number, got a option name or number
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:22): 1 errors while processing parameters for RADOS_KV
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:22): Errors processing block (RADOS_KV)
19/05/2025 21:56:23 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:36): 1 (invalid param value) errors found block RADOS_KV
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] main :NFS STARTUP :FATAL :Fatal errors. Server exiting...
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(+0x9b9a1) [0x7fcd542ac9a1]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(+0x99ed5) [0x7fcd542aaed5]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libganesha_nfsd.so.6.5(DisplayLogComponentLevel+0x8b) [0x7fcd542ab2bb]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :ganesha.nfsd(main+0x71a) [0x563f7f208eaa]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libc.so.6(+0x295d0) [0x7fcd540135d0]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :/lib64/libc.so.6(__libc_start_main+0x80) [0x7fcd54013680]
19/05/2025 21:56:24 : epoch 682ba907 : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gsh_backtrace :NFS STARTUP :MAJ :ganesha.nfsd(_start+0x25) [0x563f7f209635]
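For reference, the crash log and the rendered ganesha.conf can also be pulled without a node debug session. A sketch, assuming the openshift-storage namespace and the nfs-ganesha container name that appear later in this bug; the pod name and ConfigMap name are placeholders and should be confirmed first:

$ oc logs -n openshift-storage <rook-ceph-nfs-pod-name> -c nfs-ganesha --previous
$ oc get cm -n openshift-storage | grep ganesha     # locate the ganesha-config ConfigMap name
$ oc get cm -n openshift-storage <ganesha-configmap-name> -o yaml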
[1] https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html/managing_and_allocating_storage_resources/creating-exports-using-nfs_rhodf#creating-exports-using-nfs_rhodf
I have a workaround for 4.18.3 upgrades, but not for fresh installs... See:

web-console here: https://console-openshift-console.apps.kelwhite-04146883.nasa.aws.cee.support
user: "kubeadmin"
password: "SZjXn-N6HYk-aMh7a-MZufk"

$ oc get csv
NAME                                    DISPLAY                            VERSION        REPLACES                                PHASE
cephcsi-operator.v4.17.6-rhodf          CephCSI operator                   4.17.6-rhodf                                           Succeeded
mcg-operator.v4.17.6-rhodf              NooBaa Operator                    4.17.6-rhodf   mcg-operator.v4.17.5-rhodf              Succeeded
ocs-client-operator.v4.17.6-rhodf       OpenShift Data Foundation Client   4.17.6-rhodf   ocs-client-operator.v4.17.5-rhodf       Succeeded
ocs-operator.v4.17.6-rhodf              OpenShift Container Storage        4.17.6-rhodf   ocs-operator.v4.17.5-rhodf              Succeeded
odf-csi-addons-operator.v4.17.6-rhodf   CSI Addons                         4.17.6-rhodf   odf-csi-addons-operator.v4.17.5-rhodf   Succeeded
odf-operator.v4.17.6-rhodf              OpenShift Data Foundation          4.17.6-rhodf   odf-operator.v4.17.5-rhodf              Succeeded
odf-prometheus-operator.v4.17.6-rhodf   Prometheus Operator                4.17.6-rhodf   odf-prometheus-operator.v4.17.5-rhodf   Succeeded
recipe.v4.17.6-rhodf                    Recipe                             4.17.6-rhodf   recipe.v4.17.5-rhodf                    Succeeded
rook-ceph-operator.v4.17.6-rhodf        Rook-Ceph                          4.17.6-rhodf   rook-ceph-operator.v4.17.5-rhodf        Succeeded

$ oc rsh rook-ceph-tools-5dd94ccb95-mzbq6
sh-5.1$ ceph versions
{
    "mon": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 3
    },
    "mds": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 2
    },
    "overall": {
        "ceph version 18.2.1-298.el9cp (4c924202966766ad9b6798f699fd2a0d2ddf0b28) reef (stable)": 10
    }
}

$ oc get pods -o wide|grep nfs
csi-nfsplugin-4zrnz                                           3/3   Running   0   47s   10.0.93.62    ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-lmghn                                           3/3   Running   0   47s   10.0.58.211   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-provisioner-68ffc87b6c-6tfm2                    6/6   Running   0   47s   10.131.0.56   ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-68ffc87b6c-hbvdx                    6/6   Running   0   46s   10.128.2.26   ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-wj4jt                                           3/3   Running   0   47s   10.0.7.186    ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f   2/2   Running   0   29s   10.131.0.59   ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>

$ oc debug node/ip-10-0-93-62.us-east-2.compute.internal
Temporary namespace openshift-debug-t24tn is created for debugging node...
Starting pod/ip-10-0-93-62us-east-2computeinternal-debug-pbb4n ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.93.62
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# find / -name ganesha.conf
...
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/..2025_05_19_22_38_09.3236416130/ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf

sh-5.1# cat /sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/e8f538cd-3d76-450a-a1f6-33a0667e2187/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
NFS_CORE_PARAM {
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
MDCACHE {
    Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
    Attr_Expiration_Time = 0;
}
NFSv4 {
    Delegations = false;
    RecoveryBackend = 'rados_cluster';
    Minor_Versions = 1, 2;
}
RADOS_KV {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    nodeid = ocs-storagecluster-cephnfs.a;
    pool = ".nfs";
    namespace = "ocs-storagecluster-cephnfs";
}
RADOS_URLS {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    watch_url = 'rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs';
}
RGW {
    name = "client.nfs-ganesha.ocs-storagecluster-cephnfs.a";
}
%url rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs

$ oc describe pod rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f
Name:                 rook-ceph-nfs-ocs-storagecluster-cephnfs-a-645455d649-9hf7f
Namespace:            openshift-storage
Priority:             1000000000
Priority Class Name:  openshift-user-critical
Service Account:      rook-ceph-default
Node:                 ip-10-0-93-62.us-east-2.compute.internal/10.0.93.62
Start Time:           Mon, 19 May 2025 16:38:08 -0600
Labels:               app=rook-ceph-nfs
...
Containers:
  nfs-ganesha:
    Container ID:  cri-o://bfb2748199ef8b17239b490d9ca01dbd1b088d26e2db5e778e0ff20ef9125a0e
    Image:         registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
    Image ID:      registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
    Port:          <none>
    Host Port:     <none>
    Command:
      ganesha.nfsd
    Args:
      -F
      -L
      STDERR
      -p
      /var/run/ganesha/ganesha.pid
      -N
      NIV_INFO
    State:          Running
      Started:      Mon, 19 May 2025 16:38:09 -0600
    Ready:          True
    Restart Count:  0

###################################################
Upgraded to ODF 4.18.3:

$ oc get csv
NAME                                    DISPLAY                            VERSION        REPLACES                                PHASE
cephcsi-operator.v4.18.3-rhodf          CephCSI operator                   4.18.3-rhodf   cephcsi-operator.v4.17.6-rhodf          Succeeded
mcg-operator.v4.18.3-rhodf              NooBaa Operator                    4.18.3-rhodf   mcg-operator.v4.17.6-rhodf              Succeeded
ocs-client-operator.v4.18.3-rhodf       OpenShift Data Foundation Client   4.18.3-rhodf   ocs-client-operator.v4.17.6-rhodf       Succeeded
ocs-operator.v4.18.3-rhodf              OpenShift Container Storage        4.18.3-rhodf   ocs-operator.v4.17.6-rhodf              Succeeded
odf-csi-addons-operator.v4.18.3-rhodf   CSI Addons                         4.18.3-rhodf   odf-csi-addons-operator.v4.17.6-rhodf   Succeeded
odf-dependencies.v4.18.3-rhodf          Data Foundation Dependencies       4.18.3-rhodf   odf-dependencies.v4.18.2-rhodf          Succeeded
odf-operator.v4.18.3-rhodf              OpenShift Data Foundation          4.18.3-rhodf   odf-operator.v4.17.6-rhodf              Succeeded
odf-prometheus-operator.v4.18.3-rhodf   Prometheus Operator                4.18.3-rhodf   odf-prometheus-operator.v4.17.6-rhodf   Succeeded
recipe.v4.18.3-rhodf                    Recipe                             4.18.3-rhodf   recipe.v4.17.6-rhodf                    Succeeded
rook-ceph-operator.v4.18.3-rhodf        Rook-Ceph                          4.18.3-rhodf   rook-ceph-operator.v4.17.6-rhodf        Succeeded

$ oc get pods -owide|grep nfs
csi-nfsplugin-b2ckw                                           3/3   Running            0             4m3s    10.0.58.211   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-c8hh4                                           3/3   Running            0             4m35s   10.0.7.186    ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx                    6/6   Running            0             4m29s   10.128.2.35   ip-10-0-7-186.us-east-2.compute.internal    <none>   <none>
csi-nfsplugin-provisioner-7fdb64b9fb-q9rlr                    6/6   Running            0             4m29s   10.129.2.55   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>
csi-nfsplugin-rbblx                                           3/3   Running            0             3m32s   10.0.93.62    ip-10-0-93-62.us-east-2.compute.internal    <none>   <none>
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-69954f66fd-sddkz   1/2   CrashLoopBackOff   4 (48s ago)   2m33s   10.129.2.66   ip-10-0-58-211.us-east-2.compute.internal   <none>   <none>

$ oc rsh rook-ceph-tools-588f6dcf49-pvmgb ceph versions
{
    "mon": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 3
    },
    "mgr": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 2
    },
    "osd": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 3
    },
    "mds": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 2
    },
    "overall": {
        "ceph version 19.2.0-124.el9cp (11e266251d845c59d6f46fc852124e45e0a772bd) squid (stable)": 10
    }
}

$ oc debug node/ip-10-0-93-62.us-east-2.compute.internal
Temporary namespace openshift-debug-8z69t is created for debugging node...
Starting pod/ip-10-0-93-62us-east-2computeinternal-debug-gdz5t ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.93.62
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# find / -name ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/..2025_05_19_23_03_38.3796501208/ganesha.conf
/sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
...

sh-5.1# cat /sysroot/ostree/deploy/rhcos/var/lib/kubelet/pods/92ce5aa4-7bcf-4976-b93f-aa2e678ebae7/volumes/kubernetes.io~configmap/ganesha-config/ganesha.conf
NFS_CORE_PARAM {
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}
MDCACHE {
    Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
    Attr_Expiration_Time = 0;
}
NFSv4 {
    Delegations = false;
    RecoveryBackend = "rados_cluster";
    Minor_Versions = 1, 2;
}
RADOS_KV {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    nodeid = ocs-storagecluster-cephnfs.a;
    pool = ".nfs";
    namespace = "ocs-storagecluster-cephnfs";
}
RADOS_URLS {
    ceph_conf = "/etc/ceph/ceph.conf";
    userid = nfs-ganesha.ocs-storagecluster-cephnfs.a;
    watch_url = "rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs";
}
RGW {
    name = "client.nfs-ganesha.ocs-storagecluster-cephnfs.a";
}
%url rados://.nfs/ocs-storagecluster-cephnfs/conf-nfs.ocs-storagecluster-cephnfs

Containers:
  nfs-ganesha:
    Container ID:  cri-o://fb96de81215a595228f74eb128a8ad71eb2fa02aabeb78c6ae3e67ec4c483faf
    Image:         registry.redhat.io/rhceph/rhceph-8-rhel9@sha256:73e2715e3f5b8d98459a3be1f8ffb23f8eddda32c646f9812599a5ab277f2da7
    Image ID:      registry.redhat.io/rhceph/rhceph-8-rhel9@sha256:1669308564815f995bf62c40c780cd5357f5fb0d426e712ae6477a5ba2a983b0
    Port:          <none>
    Host Port:     <none>
    Command:
      ganesha.nfsd
    Args:
      -F
      -L
      STDERR
      -p
      /var/run/ganesha/ganesha.pid
      -N
      NIV_INFO
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2

##################################
Workaround:

$ oc scale deployment rook-ceph-operator rook-ceph-nfs-ocs-storagecluster-cephnfs-a --replicas 0
deployment.apps/rook-ceph-operator scaled
deployment.apps/rook-ceph-nfs-ocs-storagecluster-cephnfs-a scaled

$ oc edit deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a
deployment.apps/rook-ceph-nfs-ocs-storagecluster-cephnfs-a edited
^ change all the 'image' references to the old working image: registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882
save and quit

$ oc get pods|grep nfs
csi-nfsplugin-b2ckw                                          3/3   Running   0   9m21s
csi-nfsplugin-c8hh4                                          3/3   Running   0   9m53s
csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx                   6/6   Running   0   9m47s
csi-nfsplugin-provisioner-7fdb64b9fb-q9rlr                   6/6   Running   0   9m47s
csi-nfsplugin-rbblx                                          3/3   Running   0   8m50s
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-b6dfcd957-frglr   2/2   Running   0   17s

19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:2)
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
19/05/2025 23:06:57 : epoch 682bb98e : openshift-storage-ocs-storagecluster-cephnfs : nfs-ganesha-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------

Hello,
On the same cluster where the workaround is in place, i.e., with the rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployment modified to use the old working image registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882,
the rook-ceph-nfs-ocs-storagecluster-cephnfs-a-bxxxxx7-fxxxr pod came back to Running from its CLBO state, and PVC creation with the 'ocs-storagecluster-ceph-nfs' storageclass was successful:

$ oc describe pvc testnfs
Name: testnfs
Namespace: openshift-storage
StorageClass: ocs-storagecluster-ceph-nfs
Status: Bound
Volume: pvc-0e8e5688-2279-4a79-acf2-1d3bd2833427
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: openshift-storage.nfs.csi.ceph.com
volume.kubernetes.io/storage-provisioner: openshift-storage.nfs.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 31s persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'openshift-storage.nfs.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 31s openshift-storage.nfs.csi.ceph.com_csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx_22694f4e-f540-488a-bcd3-2934c9e0adf2 External provisioner is provisioning volume for claim "openshift-storage/testnfs"
Normal ProvisioningSucceeded 31s openshift-storage.nfs.csi.ceph.com_csi-nfsplugin-provisioner-7fdb64b9fb-lsrwx_22694f4e-f540-488a-bcd3-2934c9e0adf2 Successfully provisioned volume pvc-0e8e5688-2279-4a79-acf2-1d3bd2833427
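To repeat this verification, a minimal claim along the following lines should exercise the same provisioning path. This is a sketch: the claim name and size are arbitrary, matching the 1Gi/RWO values shown in the describe output above:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testnfs
  namespace: openshift-storage
spec:
  storageClassName: ocs-storagecluster-ceph-nfs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF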
Workaround steps in brief:
a] Scale down rook-ceph-operator and rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployments
$ oc scale deployment rook-ceph-operator rook-ceph-nfs-ocs-storagecluster-cephnfs-a --replicas 0
b] Change all the 'image' references to the old working image registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882 (a non-interactive alternative is sketched after these steps)
$ oc edit deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a
c] Scale up rook-ceph-operator and rook-ceph-nfs-ocs-storagecluster-cephnfs-a deployments
d] Check that the rook-ceph-nfs-ocs-storagecluster-cephnfs-a-6xxxxxxd-sxxxz pod is back to Running from CLBO
$ oc get pods -o wide|grep nfs
e] Verify pvc creation with storageclass 'ocs-storagecluster-ceph-nfs' is successful
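For step b], the image swap can also be applied non-interactively. A sketch, assuming the nfs-ganesha container name from the pod description above; if the deployment reports additional containers or init containers, each image reference needs the same change, as step b] says:

$ oc set image -n openshift-storage deployment/rook-ceph-nfs-ocs-storagecluster-cephnfs-a nfs-ganesha=registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:74f12deed91db0e478d5801c08959e451e0dbef427497badef7a2d8829631882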
The steps are yet to be executed in the customer's environment.
Regards,
Soumi
@Sachin Punadikar For the workaround outlined in https://bugzilla.redhat.com/show_bug.cgi?id=2367347#c8 I have a few questions:

1) Where is this run from? The rook-ceph-nfs-ocs-storagecluster-cephnfs-a pod will be in CLBO, but the 'ganesha-rados-grace' binary is in the rook-ceph-tools pod.
2) What options is this ganesha-rados-grace command run with?

Here are some tests:

$ oc rsh rook-ceph-tools-588f6dcf49-pvmgb
sh-5.1$ ceph -s
  cluster:
    id:     a4faeada-c8b1-441d-91d8-b06b8ba9e157
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c (age 18h)
    mgr: a(active, since 17h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 18h), 3 in (since 18h)
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 145 pgs
    objects: 140 objects, 265 MiB
    usage:   984 MiB used, 12 TiB / 12 TiB avail
    pgs:     145 active+clean
  io:
    client: 938 B/s rd, 1.7 KiB/s wr, 1 op/s rd, 0 op/s wr

sh-5.1$ ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    12 TiB  12 TiB  990 MiB  990 MiB   0
TOTAL  12 TiB  12 TiB  990 MiB  990 MiB   0

--- POOLS ---
POOL                                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
ocs-storagecluster-cephblockpool             1   64  179 MiB      109  536 MiB      0    3.4 TiB
.mgr                                         2    1  577 KiB        2  1.7 MiB      0    3.4 TiB
ocs-storagecluster-cephfilesystem-metadata   3   16  112 KiB       24  421 KiB      0    3.4 TiB
ocs-storagecluster-cephfilesystem-data0      4   32    161 B        1   12 KiB      0    3.4 TiB
.nfs                                         5   32  2.7 KiB        4   42 KiB      0    3.4 TiB

sh-5.1$ ganesha-rados-grace
rados_ioctx_create: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace --help
ganesha-rados-grace: unrecognized option '--help'
Usage:
ganesha-rados-grace [ --userid ceph_user ] [ --cephconf /path/to/ceph.conf ] [ --ns namespace ] [ --oid obj_id ] [ --pool pool_id ] dump|add|start|join|lift|remove|enforce|noenforce|member [ nodeid ... ]
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf
rados_ioctx_create: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf --userid client.admin
rados_connect: -13
Can't connect to cluster: -13
sh-5.1$ ganesha-rados-grace --cephconf /etc/ceph/ceph.conf --userid client.admin --pool .nfs
rados_connect: -13
Can't connect to cluster: -13

ref: https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/doc/man/ganesha-rados-grace.rst

Nvm...:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs
cur=4 rec=0
======================================================
nodeocs-storagecluster-cephnfs.a E ocs-storagecluster-cephnfs.a

@Sachin Punadikar So, I got the correct output when I ran this on my 'fixed' 4.18.3 cluster that's using the older image:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs dump
cur=4 rec=0
======================================================
nodeocs-storagecluster-cephnfs.a E ocs-storagecluster-cephnfs.a
sh-5.1$

However, trying to run any of the commands in c#8 will not work on a newly deployed ODF 4.18.3 cluster:

sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs
rados_conf_read_file: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace -p ".nfs" --ns ocs-storagecluster-cephnfs dump
rados_conf_read_file: -2
Can't connect to cluster: -2
sh-5.1$ ganesha-rados-grace -p ".nfs" --ns nfsganesha add node0
rados_conf_read_file: -2
Can't connect to cluster: -2

// ceph df
--- POOLS ---
POOL                                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.nfs                                         1   32     16 B        2   12 KiB      0    435 GiB
.mgr                                         2   32  577 KiB        2  1.7 MiB      0    435 GiB
ocs-storagecluster-cephblockpool             3   32   81 MiB       74  242 MiB   0.02    435 GiB
ocs-storagecluster-cephfilesystem-metadata   4   16   25 KiB       22  156 KiB      0    435 GiB
ocs-storagecluster-cephfilesystem-data0      5   32      0 B        0      0 B      0    435 GiB
...

Do note, on fresh installs of 4.18.3, the nfs container doesn't fully finish its reconcile as ganesha never fully runs, so I don't think the grace db would be populated like it was in an upgrade. Hence the commands don't work on new 4.18.3 deployments.

We have 2 clusters if you want to play around on them:

// Upgraded 4.17 -> 4.18.3 cluster with the image swap workaround:
web-console here: https://console-openshift-console.apps.kelwhite-04146883.nasa.aws.cee.support
user: "kubeadmin"
password: "SZjXn-N6HYk-aMh7a-MZufk"

// Fresh odf 4.18.3 deployment:
https://console-openshift-console.apps.cluster-dfkpn.dfkpn.sandbox967.opentlc.com
User: kubeadmin
Password: qNRWh-bSZhR-2CHW2-wMETb
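A quick way to tell whether the grace db was ever written on a given cluster is to list the objects in the RADOS_KV pool/namespace directly. A sketch, run from the rook-ceph-tools pod, with the pool and namespace taken from the RADOS_KV block earlier in this bug:

sh-5.1$ rados -p .nfs -N ocs-storagecluster-cephnfs ls    # on a healthy/upgraded cluster, expect the grace object and conf-nfs.* objects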
@ Miguel

> NOTE. If we scale up rook-ceph-operator, it reconciles the deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a with the 4.18.3 image and the pod goes back into CLBO again. So we need to keep rook-ceph-operator scaled down for this workaround to work.
You can try to block reconcile at the deployment layer.
$ oc patch deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage --type merge --patch '{"metadata": {"labels": {"ceph.rook.io/do-not-reconcile": "true"}}}'
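If that label works as intended, it can presumably be removed later to hand the deployment back to the operator. A sketch, assuming the label from the patch above:

$ oc label deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage ceph.rook.io/do-not-reconcile-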
(In reply to khover from comment #19)
> @ Miguel
>
> > NOTE. If we scale up rook-ceph-operator , it reconciles the deployment
> > rook-ceph-nfs-ocs-storagecluster-cephnfs-a with the image of 4.18.3 and pod
> > goes back into CLBO again. So we need to keep rook-ceph-operator scale down
> > for this workaround to work.
>
> You can try to block reconcile at the deployment layer.
>
> $ oc patch deployment rook-ceph-nfs-ocs-storagecluster-cephnfs-a -n openshift-storage --type merge --patch '{"metadata": {"labels": {"ceph.rook.io/do-not-reconcile": "true"}}}'

Disregard ^^ seems it was tested and I was unaware

Two more clarifications need to be answered:

1. Is it correct to understand that the workaround requires continuing operation with the rook-ceph-operator stopped? If so, what is the operational impact of not having the operator running, and what precautions should we take during this state?
2. Is there a method to exclude only the rook-ceph-nfs component from being managed by the rook-ceph-operator (in order to prevent reconcile from reverting manual image changes to rook-ceph-nfs)?

Cheers
Ray

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536