Bug 2092220
| Summary: | [Tracker for Ceph BZ #2096882] CephNFS is not reaching to Ready state on ODF on IBM Power (ppc64le) |
|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component: | ceph |
| Ceph sub component: | CephFS |
| Status: | CLOSED ERRATA |
| Severity: | unspecified |
| Priority: | unspecified |
| Version: | 4.11 |
| Target Release: | ODF 4.11.0 |
| Hardware: | ppc64le |
| OS: | Linux |
| Reporter: | Aaruni Aggarwal <aaaggarw> |
| Assignee: | tserlin |
| QA Contact: | Neha Berry <nberry> |
| CC: | bniver, brgardne, ebenahar, hyelloji, kkeithle, madam, muagarwa, ndevos, ocs-bugs, odf-bz-bot, rar, tnielsen |
| Keywords: | Tracking |
| Doc Type: | No Doc Update |
| Clones: | 2096882 (view as bug list) |
| Bug Depends On: | 2096882 |
| Type: | Bug |
| Last Closed: | 2022-08-24 13:54:12 UTC |
[root@rdr-aar411-sao01-bastion-0 ~]# oc get pods |grep nfs
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f 1/2 CrashLoopBackOff 30 (3m15s ago) 132m
Describe output of the pod:
[root@rdr-aar411-sao01-bastion-0 ~]# oc describe pod rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f
Name: rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f
Namespace: openshift-storage
Priority: 1000000000
Priority Class Name: openshift-user-critical
Node: sao01-worker-0.rdr-aar411.ibm.com/192.168.0.157
Start Time: Tue, 31 May 2022 05:07:39 -0400
Labels: app=rook-ceph-nfs
app.kubernetes.io/component=cephnfses.ceph.rook.io
app.kubernetes.io/created-by=rook-ceph-operator
app.kubernetes.io/instance=ocs-storagecluster-cephnfs-a
app.kubernetes.io/managed-by=rook-ceph-operator
app.kubernetes.io/name=ceph-nfs
app.kubernetes.io/part-of=ocs-storagecluster-cephnfs
ceph_daemon_id=ocs-storagecluster-cephnfs-a
ceph_daemon_type=nfs
ceph_nfs=ocs-storagecluster-cephnfs
instance=a
nfs=ocs-storagecluster-cephnfs-a
pod-template-hash=d97c557c8
rook.io/operator-namespace=openshift-storage
rook_cluster=openshift-storage
Annotations: config-hash: 5e0395e0b68b7bbbf8c216b865198a03
k8s.v1.cni.cncf.io/network-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.2.54"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.2.54"
],
"default": true,
"dns": {}
}]
openshift.io/scc: rook-ceph
Status: Running
IP: 10.129.2.54
IPs:
IP: 10.129.2.54
Controlled By: ReplicaSet/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8
Init Containers:
generate-minimal-ceph-conf:
Container ID: cri-o://85040c6564def25f5d93407572e3a99b1141e330ce90130dfbe50d01fe625be6
Image: quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
Image ID: quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
set -xEeuo pipefail
cat << EOF > /etc/ceph/ceph.conf
[global]
mon_host = $(ROOK_CEPH_MON_HOST)
[client.nfs-ganesha.ocs-storagecluster-cephnfs.a]
keyring = /etc/ceph/keyring-store/keyring
EOF
chmod 444 /etc/ceph/ceph.conf
cat /etc/ceph/ceph.conf
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 31 May 2022 05:07:41 -0400
Finished: Tue, 31 May 2022 05:07:41 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 3
memory: 8Gi
Requests:
cpu: 3
memory: 8Gi
Environment:
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false
Mounts:
/etc/ceph from etc-ceph (rw)
/etc/ceph/keyring-store/ from rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
Containers:
nfs-ganesha:
Container ID: cri-o://2562e74073374b4774f2c3d82506549e34b03829a9e698ff575d67c9fcb560dd
Image: quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
Image ID: quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
Port: <none>
Host Port: <none>
Command:
ganesha.nfsd
Args:
-F
-L
STDERR
-p
/var/run/ganesha/ganesha.pid
-N
NIV_INFO
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 31 May 2022 07:16:36 -0400
Finished: Tue, 31 May 2022 07:16:37 -0400
Ready: False
Restart Count: 30
Limits:
cpu: 3
memory: 8Gi
Requests:
cpu: 3
memory: 8Gi
Environment:
CONTAINER_IMAGE: quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
POD_NAME: rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f (v1:metadata.name)
POD_NAMESPACE: openshift-storage (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
POD_MEMORY_LIMIT: 8589934592 (limits.memory)
POD_MEMORY_REQUEST: 8589934592 (requests.memory)
POD_CPU_LIMIT: 3 (limits.cpu)
POD_CPU_REQUEST: 3 (requests.cpu)
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false
Mounts:
/etc/ceph from etc-ceph (rw)
/etc/ceph/keyring-store/ from rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring (ro)
/etc/ganesha from ganesha-config (rw)
/run/dbus from run-dbus (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
dbus-daemon:
Container ID: cri-o://324a2881fcbda852ef7960ec05f4b5eb49e471d52c76529d1d090c7356a7ed4f
Image: quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
Image ID: quay.io/rhceph-dev/rhceph@sha256:917ac58f7f5dd3c78c30ff19cab7dbc2d31545a2e7f6f109f2c1bbb6d00a4dd6
Port: <none>
Host Port: <none>
Command:
dbus-daemon
Args:
--nofork
--system
--nopidfile
State: Running
Started: Tue, 31 May 2022 05:07:43 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 3
memory: 8Gi
Requests:
cpu: 3
memory: 8Gi
Environment:
CONTAINER_IMAGE: quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91
POD_NAME: rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f (v1:metadata.name)
POD_NAMESPACE: openshift-storage (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
POD_MEMORY_LIMIT: 8589934592 (limits.memory)
POD_MEMORY_REQUEST: 8589934592 (requests.memory)
POD_CPU_LIMIT: 3 (limits.cpu)
POD_CPU_REQUEST: 3 (requests.cpu)
Mounts:
/run/dbus from run-dbus (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6qfwc (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
etc-ceph:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-nfs-ocs-storagecluster-cephnfs-a-keyring
Optional: false
ganesha-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rook-ceph-nfs-ocs-storagecluster-cephnfs-a
Optional: false
run-dbus:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-6qfwc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
node.ocs.openshift.io/storage=true:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 132m default-scheduler Successfully assigned openshift-storage/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f to sao01-worker-0.rdr-aar411.ibm.com by sao01-master-1.rdr-aar411.ibm.com
Normal AddedInterface 132m multus Add eth0 [10.129.2.54/23] from openshift-sdn
Normal Pulled 132m kubelet Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
Normal Created 132m kubelet Created container generate-minimal-ceph-conf
Normal Started 132m kubelet Started container generate-minimal-ceph-conf
Normal Pulled 132m kubelet Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
Normal Created 132m kubelet Created container dbus-daemon
Normal Started 132m kubelet Started container dbus-daemon
Normal Pulled 132m (x4 over 132m) kubelet Container image "quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91" already present on machine
Normal Created 132m (x4 over 132m) kubelet Created container nfs-ganesha
Normal Started 132m (x4 over 132m) kubelet Started container nfs-ganesha
Warning BackOff 2m49s (x602 over 132m) kubelet Back-off restarting failed container
Logs of the pod:
[root@rdr-aar411-sao01-bastion-0 ~]# oc logs pod/rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f -c nfs-ganesha
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 3.5
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_fds_limit :INODE LRU :INFO :Setting the system-imposed limit on FDs to 1048576.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :INFO :State lock layer successfully initialized
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :INFO :IP/name cache successfully initialized
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] nfs4_recovery_init :CLIENT ID :INFO :Recovery Backend Init for rados_cluster
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] rados_cluster_init :CLIENT ID :EVENT :Cluster membership check failed: -2
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :NFS STARTUP :CRIT :Recovery backend initialization failed!
31/05/2022 09:13:36 : epoch 6295dc40 : rook-ceph-nfs-ocs-storagecluster-cephnfs-a-d97c557c8-sdw4f : nfs-ganesha-1[main] main :NFS STARTUP :FATAL :Fatal errors. Server exiting...
Described CephNFS:
[root@rdr-aar411-sao01-bastion-0 ~]# oc describe cephnfs ocs-storagecluster-cephnfs
Name: ocs-storagecluster-cephnfs
Namespace: openshift-storage
Labels: <none>
Annotations: <none>
API Version: ceph.rook.io/v1
Kind: CephNFS
Metadata:
Creation Timestamp: 2022-05-31T09:07:30Z
Finalizers:
cephnfs.ceph.rook.io
Generation: 1
Managed Fields:
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.:
k:{"uid":"d96624a6-f3d8-41ca-bb02-7301d78abb03"}:
f:spec:
.:
f:rados:
f:server:
.:
f:active:
f:placement:
.:
f:nodeAffinity:
.:
f:requiredDuringSchedulingIgnoredDuringExecution:
.:
f:nodeSelectorTerms:
f:podAntiAffinity:
.:
f:requiredDuringSchedulingIgnoredDuringExecution:
f:tolerations:
f:priorityClassName:
f:resources:
.:
f:limits:
.:
f:cpu:
f:memory:
f:requests:
.:
f:cpu:
f:memory:
Manager: ocs-operator
Operation: Update
Time: 2022-05-31T09:07:30Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"cephnfs.ceph.rook.io":
Manager: rook
Operation: Update
Time: 2022-05-31T09:07:30Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:phase:
Manager: rook
Operation: Update
Subresource: status
Time: 2022-05-31T09:07:39Z
Owner References:
API Version: ocs.openshift.io/v1
Block Owner Deletion: true
Controller: true
Kind: StorageCluster
Name: ocs-storagecluster
UID: d96624a6-f3d8-41ca-bb02-7301d78abb03
Resource Version: 12370216
UID: 1d21b144-ab15-4d67-81ee-fddb08fb7ace
Spec:
Rados:
Server:
Active: 1
Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: cluster.ocs.openshift.io/openshift-storage
Operator: Exists
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app
Operator: In
Values:
rook-ceph-nfs
Topology Key: kubernetes.io/hostname
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Priority Class Name: openshift-user-critical
Resources:
Limits:
Cpu: 3
Memory: 8Gi
Requests:
Cpu: 3
Memory: 8Gi
Status:
Phase: Failed
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReconcileFailed 4s (x14 over 2m29s) rook-ceph-nfs-controller failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1
Rook-ceph-operator logs:
2022-05-31 11:29:24.670365 I | ceph-spec: parsing mon endpoints: a=172.30.245.151:6789,b=172.30.25.76:6789,c=172.30.64.206:6789
2022-05-31 11:29:24.670439 I | ceph-spec: detecting the ceph image version for image quay.io/rhceph-dev/rhceph@sha256:abc274cd8cbaaf4abc213c1b7949a2063b63d501329e0f692d93b2e546ae8b91...
2022-05-31 11:29:29.380569 I | ceph-spec: detected ceph image version: "16.2.8-5 pacific"
2022-05-31 11:29:30.402032 I | ceph-nfs-controller: configuring pool ".nfs" for nfs
2022-05-31 11:29:32.993018 I | ceph-nfs-controller: set pool ".nfs" for the application nfs
2022-05-31 11:29:33.005783 I | ceph-nfs-controller: updating ceph nfs "ocs-storagecluster-cephnfs"
2022-05-31 11:29:33.094323 I | cephclient: getting or creating ceph auth key "client.nfs-ganesha.ocs-storagecluster-cephnfs.a"
2022-05-31 11:29:34.036599 I | ceph-nfs-controller: ceph nfs deployment "rook-ceph-nfs-ocs-storagecluster-cephnfs-a" already exists. updating if needed
2022-05-31 11:29:34.049749 I | op-k8sutil: deployment "rook-ceph-nfs-ocs-storagecluster-cephnfs-a" did not change, nothing to update
2022-05-31 11:29:34.067508 I | ceph-nfs-controller: ceph nfs service already created
2022-05-31 11:29:34.067523 I | ceph-nfs-controller: adding ganesha "a" to grace db
2022-05-31 11:29:34.092006 E | ceph-nfs-controller: failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1
The `ganesha-rados-grace` add command is failing:
https://github.com/rook/rook/blob/f7930455c4af01fdb4962a06af6e4ee49178d6e5/pkg/operator/ceph/nfs/nfs.go#L172-L179

Attaching must-gather logs: https://drive.google.com/file/d/1khBOLMFEcsXRYp3LfuyIgY6hzTCrZpLA/view?usp=sharing

I got SSH access to the cluster and enabled debug mode for the Rook operator. I see some additional information about the command Rook is executing that is failing to add the NFS server to the grace database (copied at the bottom). Ceph is responding as though the command has invalid usage.

The command is not erroring out due to an unknown flag; otherwise, the error message would include something like "ganesha-rados-grace: unrecognized option '--poool'" (from a deliberate attempt to cause that error). This means that the flags themselves are not bad.

When I execute the command manually from the operator pod, the return code is 1. The return code is the same if I remove both the --pool and --ns flags, so the exit code does not seem useful for identifying where in the ganesha-rados-grace utility the error is occurring. Both "ganesha-rados-grace --pool '.nfs' --ns ocs-storagecluster-cephnfs dump" and "ganesha-rados-grace --pool '.nfs' --ns ocs-storagecluster-cephnfs dump ocs-storagecluster-cephnfs.a" result in the same error, which suggests that it is not merely the "add" subcommand that is buggy. Adding the --cephconf option to any of the commands I have tried doesn't change the behavior.

Based on my debugging so far, I am inclined to think this is a bug in the ganesha-rados-grace utility, possibly a build issue limited to ppc architectures, or generically any non-x86 architecture. I don't know enough about the ganesha-rados-grace utility or how it's built to provide better feedback.

-----

2022-06-06 17:14:38.207620 I | ceph-nfs-controller: adding ganesha "a" to grace db
2022-06-06 17:14:38.207641 D | exec: Running command: ganesha-rados-grace --pool .nfs --ns ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a
2022-06-06 17:14:38.230578 D | exec: Usage:
2022-06-06 17:14:38.230611 D | exec: ganesha-rados-grace [ --userid ceph_user ] [ --cephconf /path/to/ceph.conf ] [ --ns namespace ] [ --oid obj_id ] [ --pool pool_id ] dump|add|start|join|lift|remove|enforce|noenforce|member [ nodeid ... ]
2022-06-06 17:14:38.240908 D | ceph-nfs-controller: nfs "openshift-storage/ocs-storagecluster-cephnfs" status updated to "Failed"
2022-06-06 17:14:38.240958 E | ceph-nfs-controller: failed to reconcile CephNFS "openshift-storage/ocs-storagecluster-cephnfs". failed to create ceph nfs deployments: failed to update ceph nfs "ocs-storagecluster-cephnfs": failed to add server "a" to database: failed to add "a" to grace db: exit status 1

It seems we need https://github.com/nfs-ganesha/nfs-ganesha/commit/3db6bc0cb75fa85ffcebeda1276d195915b84579 to have this working on ppc64le too.

Created attachment 1889712 [details]
minimal reproducer for "char c" vs "int c"

This is a small C program based on tools/ganesha-rados-grace.c in the NFS-Ganesha sources. On ppc64le, getopt_long() returns c=255 instead of c=-1:

[root@ibm-p9b-25 ~]# gdb --args ganesha-rados-grace -p .nfs -n ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-18.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ganesha-rados-grace...Reading symbols from /usr/lib/debug/usr/bin/ganesha-rados-grace-3.5-1.el8cp.ppc64le.debug...done.
done.
(gdb) b usage
Breakpoint 1 at 0x1718: file /usr/include/bits/stdio2.h, line 100.
(gdb) r
Starting program: /usr/bin/ganesha-rados-grace -p .nfs -n ocs-storagecluster-cephnfs add ocs-storagecluster-cephnfs.a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/glibc-hwcaps/power9/libthread_db-1.0.so".

Breakpoint 1, usage (argv=0x7fffffffee48) at /usr/include/bits/stdio2.h:100
warning: Source file is more recent than executable.
100         return __fprintf_chk (__stream, __USE_FORTIFY_LEVEL - 1, __fmt,
Missing separate debuginfos, use: yum debuginfo-install libntirpc-3.4-1.el8cp.ppc64le librados2-16.2.8-43.el8cp.ppc64le
(gdb) bt
#0  usage (argv=0x7fffffffee48) at /usr/include/bits/stdio2.h:100
#1  main (argc=<optimized out>, argv=0x7fffffffee48) at /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c:134
(gdb) f 1
#1  main (argc=<optimized out>, argv=0x7fffffffee48) at /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c:134
134  /usr/src/debug/nfs-ganesha-3.5-1.el8cp.ppc64le/src/tools/ganesha-rados-grace.c: No such file or directory.
(gdb) p c
$1 = 255 '\377'
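The attached reproducer itself is not reproduced in this report; below is a minimal sketch along the same lines (illustrative only: the option table, messages, and control flow are simplified placeholders, not the actual attachment 1889712 or the real ganesha-rados-grace.c). It shows the failure mode implied by the gdb session: getopt_long() returns an int and signals end-of-options with -1, so storing the return value in a plain char on an unsigned-char ABI such as ppc64le turns -1 into 255, the loop never terminates normally, and the parser falls into the usage/error path with exit status 1.

/*
 * Minimal sketch of the "char c" vs "int c" pitfall (illustrative only).
 * Build: gcc -Wall -o grace-getopt-demo grace-getopt-demo.c
 */
#include <getopt.h>
#include <stdio.h>

static const struct option long_opts[] = {
	{ "pool", required_argument, NULL, 'p' },
	{ "ns",   required_argument, NULL, 'n' },
	{ NULL, 0, NULL, 0 },
};

int main(int argc, char **argv)
{
	char c;  /* buggy on unsigned-char ABIs such as ppc64le; the fix is `int c;` */

	while ((c = getopt_long(argc, argv, "p:n:", long_opts, NULL)) != -1) {
		switch (c) {
		case 'p':
			printf("pool = %s\n", optarg);
			break;
		case 'n':
			printf("namespace = %s\n", optarg);
			break;
		default:
			/* on ppc64le we end up here with c == 255 ('\377') */
			fprintf(stderr, "Usage: %s [--pool pool] [--ns namespace] cmd\n",
				argv[0]);
			return 1;
		}
	}

	printf("option parsing completed normally\n");
	return 0;
}

With this source, an x86_64 build (signed char) parses the options and exits cleanly, while a ppc64le build prints the usage text and exits 1, consistent with the "exit status 1" Rook reports for every ganesha-rados-grace invocation. The referenced nfs-ganesha commit presumably changes the variable type to int.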
Bug 2096882 has been reported against RHCS. Once that is addressed, the rook-ceph container image needs to be rebuilt so that it picks up the fixed `ganesha-rados-grace` command.

The Ceph BZ is ON_QA.

With the latest ODF version, 4.11.0-101, the nfs pod reached Running state and CephNFS reached Ready state:
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get csv -n openshift-storage -o json ocs-operator.v4.11.0 | jq '.metadata.labels["full_version"]'
"4.11.0-101"
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get pods |grep rook-ceph-nfs
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-768c45bb78-w6zvh 2/2 Running 0 7m17s
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get sc |grep nfs
ocs-storagecluster-ceph-nfs openshift-storage.nfs.csi.ceph.com Delete Immediate false 7m56s
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get cephnfs
NAME AGE
ocs-storagecluster-cephnfs 2m39s
[root@rdr-odf-nfs-sao01-bastion-0 ~]#
[root@rdr-odf-nfs-sao01-bastion-0 ~]# oc get cephnfs -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
creationTimestamp: "2022-06-22T13:37:23Z"
finalizers:
- cephnfs.ceph.rook.io
generation: 1
name: ocs-storagecluster-cephnfs
namespace: openshift-storage
ownerReferences:
- apiVersion: ocs.openshift.io/v1
blockOwnerDeletion: true
controller: true
kind: StorageCluster
name: ocs-storagecluster
uid: 4ebdb866-56d5-4145-b4c5-e73d003ffa50
resourceVersion: "695162"
uid: ad5d9eb8-1973-4ff3-a544-61f707ea2063
spec:
rados: {}
server:
active: 1
placement:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cluster.ocs.openshift.io/openshift-storage
operator: Exists
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- rook-ceph-nfs
topologyKey: kubernetes.io/hostname
tolerations:
- effect: NoSchedule
key: node.ocs.openshift.io/storage
operator: Equal
value: "true"
priorityClassName: openshift-user-critical
resources:
limits:
cpu: "3"
memory: 8Gi
requests:
cpu: "3"
memory: 8Gi
status:
observedGeneration: 1
phase: Ready
kind: List
metadata:
resourceVersion: ""
Moving to VERIFIED based on Aaruni's comment #14.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156
Description of problem (please be detailed as possible and provide log snippets):
CephNFS is not reaching the Ready state on IBM Power, and the NFS-Ganesha server is not running either.

Version of all relevant components (if applicable):
OCP version: 4.11.0-0.nightly-ppc64le-2022-05-23-232055
ODF version: 4.11.0-80
ceph version 16.2.8-5.el8cp (0974c9ff5a69f17f3843a7c1a568daa2b4559e2d) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF 4.11 on OCP 4.11 on IBM Power.
2. Patch the storagecluster to enable the nfs feature, using the following command:
   oc patch -n openshift-storage storageclusters.ocs.openshift.io ocs-storagecluster --patch '{"spec": {"nfs":{"enable": true}}}' --type merge
3. Wait for the `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod to reach Running state.

Actual results:
The `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod is in CrashLoopBackOff state and the CephNFS status shows Failed.

Expected results:
The `rook-ceph-nfs-ocs-storagecluster-cephnfs-a-*` pod should be in Running state and the ocs-storagecluster-ceph-nfs StorageClass should be created.

Additional info: