Description of problem (please be as detailed as possible and provide log snippets):

CephObjectStore is stuck in the Progressing state because of "failed to create or retrieve rgw admin ops user".

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-05-22-181752
ODF Version: 4.13.0-203.stable
Platform: vSphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Install OCP 4.13 + ODF 4.13 with multus [public-net + cluster-net, bridge mode]. NAD configuration:

---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.30.0/24"
    }
  }'

2. Check CephObjectStore status:

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Progressing

Warning  ReconcileFailed  11m (x90 over 21h)  rook-ceph-object-controller  failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt

3. Check the RGW pod:

$ oc get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k
NAME                                                              READY   STATUS    RESTARTS        AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k   1/2     Running   288 (35s ago)   21h

Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     4h47m (x225 over 21h)    kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f" already present on machine
  Warning  BackOff    7m14s (x2226 over 21h)   kubelet  Back-off restarting failed container rgw in pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k_openshift-storage(64db8bd3-2327-4aa1-9c01-3f70476a475e)
  Warning  Unhealthy  2m13s (x4741 over 21h)   kubelet  Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients

For more info on the deployment process: https://docs.google.com/document/d/1q2OzxGGUgM9R8TUWEbb2ulRUw_GtHcJs_SmxfSE5iXM/edit

Actual results:
CephObjectStore stays in the Progressing state.

Expected results:
CephObjectStore reaches the Ready state.

Additional info:
OCS must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2209643.tar.gz
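For triage on a similar setup, a minimal sketch of manual checks (assuming the default openshift-storage namespace, the RGW pod name from the output above, a running rook-ceph-tools deployment, and Rook's default naming of the store's realm/zonegroup/zone after the object store; these are standard oc/radosgw-admin invocations, not the operator's exact calls):

# Confirm the RGW pod actually received its multus interfaces (public-net / cluster-net)
$ oc -n openshift-storage get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k -o yaml | grep -A 30 'k8s.v1.cni.cncf.io/network-status'

# Re-run the admin-ops user creation that the reconcile reports as failing
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin user create \
    --uid=rgw-admin-ops-user --display-name="RGW Admin Ops user" \
    --rgw-realm=ocs-storagecluster-cephobjectstore \
    --rgw-zonegroup=ocs-storagecluster-cephobjectstore \
    --rgw-zone=ocs-storagecluster-cephobjectstore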
Moving to post since Jiffin has a fix.
Bug fixed.

Setup:
OCP Version: 4.13.0-0.nightly-2023-06-05-164816
ODF Version: odf-operator.v4.13.0-rhodf [4.13.0-214]
$ oc describe csv odf-operator.v4.13.0-rhodf -n openshift-storage | grep full_version
Labels:       full_version=4.13.0-214
Platform: vSphere

Test process:

1. Install OCP + ODF and create the NADs:

$ cat nad.yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.30.0/24"
    }
  }'

oviner:ClusterPath$ oc create -f nad.yaml
networkattachmentdefinition.k8s.cni.cncf.io/public-net created
networkattachmentdefinition.k8s.cni.cncf.io/cluster-net created

2. Create a StorageCluster.

3. Check StorageCluster and CephObjectStore status:

$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   6m30s   Ready              2023-06-06T11:34:42Z   4.13.0

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Ready

4. Check Ceph status:

sh-5.1$ ceph -s
  cluster:
    id:     d937bfcd-92a1-4935-8233-41acdebb362f
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 8664.143ms)
            Slow OSD heartbeats on front (longest 8605.254ms)

  services:
    mon: 3 daemons, quorum a,b,c (age 13m)
    mgr: a(active, since 13m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 217 pgs
    objects: 320 objects, 126 MiB
    usage:   417 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     217 active+clean

  io:
    client:   938 B/s rd, 18 KiB/s wr, 1 op/s rd, 1 op/s wr

$ oc logs rook-ceph-osd-1-78cd65659c-57jds | grep 192.168
debug 2023-06-06T11:39:49.140+0000 7f2ef6eee640 -1 osd.1 38 heartbeat_check: no reply from 192.168.20.22:6802 osd.0 ever on either front or back, first ping sent 2023-06-06T11:38:51.339026+0000 (oldest deadline 2023-06-06T11:39:11.339026+0000)

(Follow-up checks for the slow OSD heartbeat warning are sketched after the doc link below.)
For more info: https://docs.google.com/document/d/1IOucHpxCucBFWcSCfKiVUeDFOtkMF-0V_W0J3RU5qm8/edit
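The HEALTH_WARN above points at OSD heartbeat traffic on the multus networks rather than at the object store itself. A minimal sketch of follow-up checks, assuming the default openshift-storage namespace and a running rook-ceph-tools deployment (pod names and 192.168.x.x addresses are taken from the output above and will differ per cluster):

# Show which multus addresses (192.168.20.x public / 192.168.30.x cluster) each OSD registered
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd dump | grep 192.168

# List the slow-heartbeat details reported by the monitors
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail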
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742