Bug 2209643
| Summary: | Multus, Cephobjectstore stuck on Progressing state because "failed to create or retrieve rgw admin ops user" | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Jiffin <jthottan> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | brgardne, ebenahar, hnallurv, jthottan, muagarwa, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.13.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.13.0-214 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-21 15:25:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Moving to POST since Jiffin has a fix.

Bug fixed.

Setup:
OCP Version: 4.13.0-0.nightly-2023-06-05-164816
ODF Version: odf-operator.v4.13.0-rhodf [4.13.0-214]
$ oc describe csv odf-operator.v4.13.0-rhodf -n openshift-storage | grep full_version
Labels: full_version=4.13.0-214
Platform: Vsphere
Test Process:
1. Install OCP 4.13 + ODF 4.13 with Multus [public-net + cluster-net, bridge mode]. NAD configuration:
$ cat nad.yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'

$ oc create -f nad.yaml
networkattachmentdefinition.k8s.cni.cncf.io/public-net created
networkattachmentdefinition.k8s.cni.cncf.io/cluster-net created
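(Verification aid, not part of the original comment.) Before creating the StorageCluster, both NADs can be confirmed to exist in the default namespace:

$ oc get network-attachment-definitions -n default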
2. Create a StorageCluster:
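The StorageCluster manifest is not included in this comment; the snippet below is a minimal sketch of how the two NADs above are typically wired into the StorageCluster network section (network fields per the ocs-operator StorageCluster CRD; storageDeviceSets and other required spec fields are omitted, so treat it as illustrative rather than a complete manifest):

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  network:
    provider: multus
    selectors:
      public: default/public-net    # NAD carrying client/public Ceph traffic
      cluster: default/cluster-net  # NAD carrying OSD replication traffic
  # storageDeviceSets, resources, etc. omitted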
3. Check StorageCluster status:
$ oc get storagecluster
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 6m30s Ready 2023-06-06T11:34:42Z 4.13.0
$ oc get CephObjectStore
NAME PHASE
ocs-storagecluster-cephobjectstore Ready
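(Verification aid, not part of the original comment.) Since the original failure was the operator being unable to create the admin ops user, an extra check here, assuming the secret name used by upstream Rook for those credentials, is that the rgw-admin-ops-user secret now exists:

$ oc -n openshift-storage get secret rgw-admin-ops-user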
4. Check ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     d937bfcd-92a1-4935-8233-41acdebb362f
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 8664.143ms)
            Slow OSD heartbeats on front (longest 8605.254ms)

  services:
    mon: 3 daemons, quorum a,b,c (age 13m)
    mgr: a(active, since 13m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 217 pgs
    objects: 320 objects, 126 MiB
    usage:   417 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     217 active+clean

  io:
    client: 938 B/s rd, 18 KiB/s wr, 1 op/s rd, 1 op/s wr
$ oc logs rook-ceph-osd-1-78cd65659c-57jds | grep 192.168
debug 2023-06-06T11:39:49.140+0000 7f2ef6eee640 -1 osd.1 38 heartbeat_check: no reply from 192.168.20.22:6802 osd.0 ever on either front or back, first ping sent 2023-06-06T11:38:51.339026+0000 (oldest deadline 2023-06-06T11:39:11.339026+0000)
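(Debugging aid, not part of the original comment.) The 192.168.20.x addresses in the heartbeat errors belong to the Multus public network; the IPs each OSD pod actually received from whereabouts can be read from the network-status annotation that Multus sets on the pod, for example for the OSD quoted above:

$ oc -n openshift-storage get pod rook-ceph-osd-1-78cd65659c-57jds \
    -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'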
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742
Description of problem (please be detailed as possible and provide log snippets):

CephObjectStore stuck in Progressing state because "failed to create or retrieve rgw admin ops user"

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-05-22-181752
ODF Version: 4.13.0-203.stable
Platform: Vsphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP 4.13 + ODF 4.13 with Multus [public-net + cluster-net, bridge mode]

NAD configuration:
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'

2. Check CephObjectStore status:

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Progressing

Warning  ReconcileFailed  11m (x90 over 21h)  rook-ceph-object-controller  failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt

3. Check RGW pod:

$ oc get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k
NAME                                                              READY   STATUS    RESTARTS        AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k   1/2     Running   288 (35s ago)   21h

Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     4h47m (x225 over 21h)    kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f" already present on machine
  Warning  BackOff    7m14s (x2226 over 21h)   kubelet  Back-off restarting failed container rgw in pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k_openshift-storage(64db8bd3-2327-4aa1-9c01-3f70476a475e)
  Warning  Unhealthy  2m13s (x4741 over 21h)   kubelet  Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients

For more info on the deployment process: https://docs.google.com/document/d/1q2OzxGGUgM9R8TUWEbb2ulRUw_GtHcJs_SmxfSE5iXM/edit

Actual results:
CephObjectStore in Progressing state

Expected results:
CephObjectStore in Ready state

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2209643.tar.gz
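(Debugging aid, not part of the original report.) The user that the operator fails to create can also be probed manually from the rook-ceph toolbox, assuming the toolbox deployment is enabled; the uid and the realm/zonegroup/zone names below are taken from the reconcile error above (Rook names them after the object store by default), and a hang or non-zero exit here would match the "signal: interrupt" seen in the error:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-5.1$ radosgw-admin user info --uid=rgw-admin-ops-user \
    --rgw-realm=ocs-storagecluster-cephobjectstore \
    --rgw-zonegroup=ocs-storagecluster-cephobjectstore \
    --rgw-zone=ocs-storagecluster-cephobjectstore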