Bug 2092220 - [Tracker for Ceph BZ #2096882] CephNFS is not reaching to Ready state on ODF on IBM Power (ppc64le)
Summary: [Tracker for Ceph BZ #2096882] CephNFS is not reaching to Ready state on ODF ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: tserlin
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 2096882
Blocks:
 
Reported: 2022-06-01 06:38 UTC by Aaruni Aggarwal
Modified: 2023-08-09 16:37 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned As: 2096882
Environment:
Last Closed: 2022-08-24 13:54:12 UTC
Embargoed:


Attachments
minimal reproducer for "char c" vs "int c" (1.32 KB, text/plain)
2022-06-14 13:10 UTC, Niels de Vos


Links
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:54:26 UTC)

Description Aaruni Aggarwal 2022-06-01 06:38:33 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
CephNFS is not reaching the Ready state on IBM Power, and the NFS Ganesha server is not running.

Version of all relevant components (if applicable):
OCP Version : 4.11.0-0.nightly-ppc64le-2022-05-23-232055
ODF version: 4.11.0-80
ceph version 16.2.8-5.el8cp (0974c9ff5a69f17f3843a7c1a568daa2b4559e2d) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF 4.11 on OCP 4.11 on IBM Power.
2. Patch the StorageCluster to enable the NFS feature using the following command:
oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge

3. Wait for the `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod to reach the Running state.


Actual results:
The `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod is in the CrashLoopBackOff state, and the CephNFS status shows Failed.

Expected results:
The `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod should be in the Running state, and the ocs-storagecluster-ceph-nfs StorageClass should be created.

Additional info:

Comment 2 Aaruni Aggarwal 2022-06-01 06:44:11 UTC
[root@rdr-aar411-sao01-bastion-0 ~]# oc get pods |grep nfs
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f        1/2     CrashLoopBackOff   30 (3m15s ago)   132m


Describe output for the pod:

[root@rdr-aar411-sao01-bastion-0 ~]# oc describe pod rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f
Name:                 rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f
Namespace:            openshift-storage
Priority:             1000000000
Priority Class Name:  openshift-user-critical
Node:                 sao01-worker-0.rdr-aar411.ibm.com/192.168.0.157
Start Time:           Tue, 31 May 2022 05:07:39 -0400
Labels:               app=rook-ceph-nfs
                      app.kubernetes.io/component=cephnfses.ceph.rook.io
                      app.kubernetes.io/created-by=rook-ceph-operator
                      app.kubernetes.io/instance=ocs-storagecluster-cephnfs-a
                      app.kubernetes.io/managed-by=rook-ceph-operator
                      app.kubernetes.io/name=ceph-nfs
                      app.kubernetes.io/part-of=ocs-storagecluster-cephnfs
                      ceph_daemon_id=ocs-storagecluster-cephnfs-a
                      ceph_daemon_type=nfs
                      ceph_nfs=ocs-storagecluster-cephnfs
                      instance=a
                      nfs=ocs-storagecluster-cephnfs-a
                      pod-template-hash=d97c557c8
                      rook.io/operator-namespace=openshift-storage
                      rook_cluster=openshift-storage
Annotations:          config-hash: 5e0395e0b68b7bbbf8c216b865198a03
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.129.2.54"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.129.2.54"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: rook-ceph
Status:               Running
IP:                   10.129.2.54
IPs:
  IP:           10.129.2.54
Controlled By:  ReplicaSet/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8
Init Containers:
  generate-minimal-ceph-conf:
    Container ID:  cri-o://85040c6564def25f5d93407572e3a99b1141e330ce90130dfbe50d01fe625be6
    Image:         quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      
      set -xEeuo pipefail
      
      cat << EOF > /etc/ceph/ceph.conf
      [global]
      mon_host = $(ROOK_CEPH_MON_HOST)
      
      [client.nfs-ganesha.ocs-storagecluster-cephnfs.a]
      keyring = /etc/ceph/keyring-store/keyring
      EOF
      
      chmod 444 /etc/ceph/ceph.conf
      
      cat /etc/ceph/ceph.conf
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 31 May 2022 05:07:41 -0400
      Finished:     Tue, 31 May 2022 05:07:41 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  8Gi
    Requests:
      cpu:     3
      memory:  8Gi
    Environment:
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from etc-ceph (rw)
      /etc/ceph/keyring-store/ from rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
Containers:
  nfs-ganesha:
    Container ID:  cri-o://2562e74073374b4774f2c3d82506549e34b03829a9e698ff575d67c9fcb560dd
    Image:         quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
    Port:          <none>
    Host Port:     <none>
    Command:
      ganesha.nfsd
    Args:
      -F
      -L
      STDERR
      -p
      /var/run/ganesha/ganesha.pid
      -N
      NIV_INFO
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 31 May 2022 07:16:36 -0400
      Finished:     Tue, 31 May 2022 07:16:37 -0400
    Ready:          False
    Restart Count:  30
    Limits:
      cpu:     3
      memory:  8Gi
    Requests:
      cpu:     3
      memory:  8Gi
    Environment:
      CONTAINER_IMAGE:                quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
      POD_NAME:                       rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               8589934592 (limits.memory)
      POD_MEMORY_REQUEST:             8589934592 (requests.memory)
      POD_CPU_LIMIT:                  3 (limits.cpu)
      POD_CPU_REQUEST:                3 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from etc-ceph (rw)
      /etc/ceph/keyring-store/ from rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring (ro)
      /etc/ganesha from ganesha-config (rw)
      /run/dbus from run-dbus (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
  dbus-daemon:
    Container ID:  cri-o://324a2881fcbda852ef7960ec05f4b5eb49e471d52c76529d1d090c7356a7ed4f
    Image:         quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
    Image ID:      quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
    Port:          <none>
    Host Port:     <none>
    Command:
      dbus-daemon
    Args:
      --nofork
      --system
      --nopidfile
    State:          Running
      Started:      Tue, 31 May 2022 05:07:43 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  8Gi
    Requests:
      cpu:     3
      memory:  8Gi
    Environment:
      CONTAINER_IMAGE:     quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
      POD_NAME:            rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f (v1:metadata.name)
      POD_NAMESPACE:       openshift-storage (v1:metadata.namespace)
      NODE_NAME:            (v1:spec.nodeName)
      POD_MEMORY_LIMIT:    8589934592 (limits.memory)
      POD_MEMORY_REQUEST:  8589934592 (requests.memory)
      POD_CPU_LIMIT:       3 (limits.cpu)
      POD_CPU_REQUEST:     3 (requests.cpu)
    Mounts:
      /run/dbus from run-dbus (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  etc-ceph:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring
    Optional:    false
  ganesha-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rook-ceph-nfs-ocs-storagecluster-cephnfs-a
    Optional:  false
  run-dbus:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-6qfwc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       132m                    default-scheduler  Successfully assigned openshift-storage/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f to sao01-worker-0.rdr-aar411.ibm.com by sao01-master-1.rdr-aar411.ibm.com
  Normal   AddedInterface  132m                    multus             Add eth0 [10.129.2.54/23] from openshift-sdn
  Normal   Pulled          132m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
  Normal   Created         132m                    kubelet            Created container generate-minimal-ceph-conf
  Normal   Started         132m                    kubelet            Started container generate-minimal-ceph-conf
  Normal   Pulled          132m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
  Normal   Created         132m                    kubelet            Created container dbus-daemon
  Normal   Started         132m                    kubelet            Started container dbus-daemon
  Normal   Pulled          132m (x4 over 132m)     kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
  Normal   Created         132m (x4 over 132m)     kubelet            Created container nfs-ganesha
  Normal   Started         132m (x4 over 132m)     kubelet            Started container nfs-ganesha
  Warning  BackOff         2m49s (x602 over 132m)  kubelet            Back-off restarting failed container


logs of the pod: 

[root@rdr-aar411-sao01-bastion-0 ~]# oc logs pod/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f -c nfs-ganesha
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 3.5
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_fds_limit :INODE LRU :INFO :Setting the system-imposed limit on FDs to 1048576.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :INFO :State lock layer successfully initialized
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :INFO :IP/name cache successfully initialized
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] nfs4_recovery_init :CLIENT ID :INFO :Recovery Backend Init for rados_cluster
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] rados_cluster_init :CLIENT ID :EVENT :Cluster membership check failed: -2
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :NFS STARTUP :CRIT :Recovery backend initialization failed!
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :NFS STARTUP :FATAL :Fatal errors.  Server exiting...


Describe output for the CephNFS:

[root@rdr-aar411-sao01-bastion-0 ~]# oc describe cephnfs ocs-storagecluster-cephnfs
Name:         ocs-storagecluster-cephnfs
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephNFS
Metadata:
  Creation Timestamp:  2022-05-31T09:07:30Z
  Finalizers:
    cephnfs.ceph.rook.io
  Generation:  1
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"d96624a6-f3d8-41ca-bb02-7301d78abb03"}:
      f:spec:
        .:
        f:rados:
        f:server:
          .:
          f:active:
          f:placement:
            .:
            f:nodeAffinity:
              .:
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .:
                f:nodeSelectorTerms:
            f:podAntiAffinity:
              .:
              f:requiredDuringSchedulingIgnoredDuringExecution:
            f:tolerations:
          f:priorityClassName:
          f:resources:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2022-05-31T09:07:30Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"cephnfs.ceph.rook.io":
    Manager:      rook
    Operation:    Update
    Time:         2022-05-31T09:07:30Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:phase:
    Manager:      rook
    Operation:    Update
    Subresource:  status
    Time:         2022-05-31T09:07:39Z
  Owner References:
    API Version:           ocs.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  StorageCluster
    Name:                  ocs-storagecluster
    UID:                   d96624a6-f3d8-41ca-bb02-7301d78abb03
  Resource Version:        12370216
  UID:                     1d21b144-ab15-4d67-81ee-fddb08fb7ace
Spec:
  Rados:
  Server:
    Active:  1
    Placement:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Pod Anti Affinity:
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-nfs
          Topology Key:  kubernetes.io/hostname
      Tolerations:
        Effect:           NoSchedule
        Key:              node.ocs.openshift.io/storage
        Operator:         Equal
        Value:            true
    Priority Class Name:  openshift-user-critical
    Resources:
      Limits:
        Cpu:     3
        Memory:  8Gi
      Requests:
        Cpu:     3
        Memory:  8Gi
Status:
  Phase:  Failed
Events:
  Type     Reason           Age                  From                      Message
  ----     ------           ----                 ----                      -------
  Warning  ReconcileFailed  4s (x14 over 2m29s)  rook-ceph-nfs-controller  failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1



Rook-ceph-operator logs:

2022-05-31 11:29:24.670365 I | ceph-spec: parsing mon endpoints: a=172.30.245.151:6789,b=172.30.25.76:6789,c=172.30.64.206:6789
2022-05-31 11:29:24.670439 I | ceph-spec: detecting the ceph image version for image quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91...
2022-05-31 11:29:29.380569 I | ceph-spec: detected ceph image version: "16.2.8-5 pacific"
2022-05-31 11:29:30.402032 I | ceph-nfs-controller: configuring pool ".nfs" for nfs
2022-05-31 11:29:32.993018 I | ceph-nfs-controller: set pool ".nfs" for the application nfs
2022-05-31 11:29:33.005783 I | ceph-nfs-controller: updating ceph nfs "ocs-storagecluster-cephnfs"
2022-05-31 11:29:33.094323 I | cephclient: getting or creating ceph auth key "client.nfs-ganesha.ocs-storagecluster-cephnfs.a"
2022-05-31 11:29:34.036599 I | ceph-nfs-controller: ceph nfs deployment "rook-ceph-nfs-ocs-storagecluster-cephnfs-a" already exists. updating if needed
2022-05-31 11:29:34.049749 I | op-k8sutil: deployment "rook-ceph-nfs-ocs-storagecluster-cephnfs-a" did not change, nothing to update
2022-05-31 11:29:34.067508 I | ceph-nfs-controller: ceph nfs service already created
2022-05-31 11:29:34.067523 I | ceph-nfs-controller: adding ganesha "a" to grace db
2022-05-31 11:29:34.092006 E | ceph-nfs-controller: failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1

Comment 5 Aaruni Aggarwal 2022-06-06 17:27:12 UTC
Attaching must-gather logs:
https://drive.google.com/file/d/1khBOLMFEcsXRYp3LfuyIgY6hzTCrZpLA/view?usp=sharing

Comment 6 Blaine Gardner 2022-06-06 17:38:51 UTC
I got SSH access to the cluster and enabled debug mode for the rook operator. I see some additional information about the command Rook is executing that is failing to add the NFS server to the grace database (copied at bottom). Ceph is responding as though the command has invalid usage. 

The command is not erroring due to an unknown flag. Otherwise, the error message would include something like "ganesha-rados-grace: unrecognized option '--poool'" (from a deliberate attempt to cause the error). This means that the flags themselves are not bad.

When I execute the command manually from the operator pod, the return code is 1. The return code is the same if I remove both the --pool and --ns flags. The error code doesn't seem to be useful for identifying where in the ganesha-rados-grace utility the error is occurring.

Both "ganesha-rados-grace --pool '.nfs' --ns ocs-storagecluster-cephnfs dump" and "ganesha-rados-grace --pool '.nfs' --ns ocs-storagecluster-cephnfs dump ocs-storagecluster-cephnfs.a" result in the same error. This suggests to me that it is not merely the "add" subcommand that is buggy.

Adding the --cephconf option to any of the commands I have tried doesn't change the behavior.

Based on my debugging so far, I am inclined to think this is a bug in the ganesha-rados-grace utility. This is possibly a build issue limited to ppc architectures, or it could affect any non-x86 architecture more generally. I don't know enough about the ganesha-rados-grace utility or how it's built to provide better feedback.

-----

2022-06-06 17:14:38.207620 I | ceph-nfs-controller: adding ganesha "a" to grace db
2022-06-06 17:14:38.207641 D | exec: Running command: ganesha-rados-grace --pool .nfs --ns ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a
2022-06-06 17:14:38.230578 D | exec: Usage:
2022-06-06 17:14:38.230611 D | exec: ganesha-rados-grace [ --userid ceph_user ] [ --cephconf /path/to/ceph.conf ] [ --ns namespace ] [ --oid obj_id ] [ --pool pool_id ] dump|add|start|join|lift|remove|enforce|noenforce|member [ nodeid ... ]
2022-06-06 17:14:38.240908 D | ceph-nfs-controller: nfs "openshift-storage/ocs-storagecluster-cephnfs" status updated to "Failed"
2022-06-06 17:14:38.240958 E | ceph-nfs-controller: failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1

Comment 7 Niels de Vos 2022-06-14 11:30:51 UTC
It seems we need https://github.com/nfs-ganesha/nfs-ganesha/commit/3db6bc0cb75fa85ffcebeda1276d195915b84579 to have this working on ppc64le too.

Comment 8 Niels de Vos 2022-06-14 13:10:15 UTC
Created attachment 1889712 [details]
minimal reproducer for "char c" vs "int c"

This is a small C program based on tools/ganesha-rados-grace.c in the NFS-Ganesha sources.

On ppc64le getopt_long() returns c=255 instead of c=-1:

[root@ibm-p9b-25 ~]# gdb --args ganesha-rados-grace -p .nfs -n ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a                                                                                                                  
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-18.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ganesha-rados-grace...Reading symbols from /usr/lib/debug/usr/bin/ganesha-rados-grace-3.5-1.el8cp.ppc64le.debug...done.                                                                                                
done.
(gdb) b usage
Breakpoint 1 at 0x1718: file /usr/include/bits/stdio2.h, line 100.
(gdb) r
Starting program: /usr/bin/ganesha-rados-grace -p .nfs -n ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/glibc-hwcaps/power9/libthread_db-1.0.so".

Breakpoint 1, usage (argv=0x7fffffffee48) at /usr/include/bits/stdio2.h:100
warning: Source file is more recent than executable.
100       return __fprintf_chk (__stream, __USE_FORTIFY_LEVEL - 1, __fmt,
Missing separate debuginfos, use: yum debuginfo-install libntirpc-3.4-1.el8cp.ppc64le librados2-16.2.8-43.el8cp.ppc64le                                                                                                                     
(gdb) bt
#0  usage (argv=0x7fffffffee48) at /usr/include/bits/stdio2.h:100
#1  main (argc=<optimized out>, argv=0x7fffffffee48) at /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c:134                                                                                                  
(gdb) f 1
#1  main (argc=<optimized out>, argv=0x7fffffffee48) at /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c:134                                                                                                  
134     /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c: No such file or directory.
(gdb) p c
$1 = 255 '\377'
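
The attached reproducer is not inlined in this report, so the following is only an illustrative sketch (an assumption based on the attachment title, the gdb output above, and the upstream commit linked in comment 7), not the attachment itself. getopt_long() is declared to return int and returns -1 once all options are consumed; storing that result in a plain char truncates it to 255 on platforms where char is unsigned, such as ppc64le, so the usual "!= -1" end-of-options check never matches and the unhandled 255 would fall into the tool's usage()/error path, which would explain the "Usage:" output and exit status 1 seen in comment 6.

/* reproducer-sketch.c -- hypothetical illustration, not the attached file */
#include <getopt.h>
#include <stdio.h>

static const struct option long_opts[] = {
        {"pool", required_argument, NULL, 'p'},
        {"ns",   required_argument, NULL, 'n'},
        {NULL, 0, NULL, 0},
};

int main(int argc, char **argv)
{
        char buggy;     /* the pattern the upstream commit removes */
        int  fixed;     /* the pattern the upstream commit introduces */

        /* Consume any options, e.g. "-p .nfs -n ocs-storagecluster-cephnfs". */
        while ((fixed = getopt_long(argc, argv, "p:n:", long_opts, NULL)) != -1)
                ;

        buggy = (char)fixed;    /* what a "char c = getopt_long(...)" would hold */

        printf("int  result: %d\n", fixed);        /* -1 on every architecture */
        printf("char result: %d\n", (int)buggy);   /* 255 where char is unsigned (ppc64le), -1 on x86_64 */
        return 0;
}

Built and run on x86_64 this prints -1 twice (char is signed there), while on ppc64le the char value comes out as 255, matching the c = 255 shown in the gdb session above.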

Comment 9 Niels de Vos 2022-06-14 13:28:08 UTC
Bug 2096882 has been reported against RHCS. Once that is addressed, the rook-ceph container image needs to be rebuilt so that it uses the fixed `ganesha-rados-grace` command.

Comment 10 Mudit Agarwal 2022-06-20 14:04:07 UTC
Ceph BZ is ON_QA

Comment 14 Aaruni Aggarwal 2022-06-22 13:47:15 UTC
With the latest ODF version, 4.11.0-101, the NFS pod reached the Running state and the CephNFS reached the Ready state.

[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get csv -n openshift-storage -o json ocs-operator.v4.11.0 | jq '.metadata.labels["full_version"]'
"4.11.0-101"

[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get pods |grep rook-ceph-nfs
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-768c45bb78-w6zvh       2/2     Running     0               7m17s

[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get sc |grep nfs
ocs-storagecluster-ceph-nfs   openshift-storage.nfs.csi.ceph.com      Delete          Immediate              false                  7m56s

[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get cephnfs
NAME                         AGE
ocs-storagecluster-cephnfs   2m39s
[root@rdr-odf-nfs-sao01-bastion-0 ~]# 
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get cephnfs -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephNFS
  metadata:
    creationTimestamp: "2022-06-22T13:37:23Z"
    finalizers:
    - cephnfs.ceph.rook.io
    generation: 1
    name: ocs-storagecluster-cephnfs
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: 4ebdb866-56d5-4145-b4c5-e73d003ffa50
    resourceVersion: "695162"
    uid: ad5d9eb8-1973-4ff3-a544-61f707ea2063
  spec:
    rados: {}
    server:
      active: 1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - rook-ceph-nfs
            topologyKey: kubernetes.io/hostname
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
      priorityClassName: openshift-user-critical
      resources:
        limits:
          cpu: "3"
          memory: 8Gi
        requests:
          cpu: "3"
          memory: 8Gi
  status:
    observedGeneration: 1
    phase: Ready
kind: List
metadata:
  resourceVersion: ""

Comment 16 Elad 2022-06-26 12:40:24 UTC
Moving to VERIFIED based on Aaruni's comment #14

Comment 18 errata-xmlrpc 2022-08-24 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

