Description of problem (please be as detailed as possible and provide log snippets):

CephObjectStore is stuck in the Progressing state because of "failed to create or retrieve rgw admin ops user".

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-05-22-181752
ODF Version: 4.13.0-203.stable
Platform: vSphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Install OCP 4.13 + ODF 4.13 with multus [public-net + cluster-net, bridge mode]. NAD configuration:

---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.30.0/24"
    }
  }'

2. Check CephObjectStore status:

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Progressing

Warning  ReconcileFailed  11m (x90 over 21h)  rook-ceph-object-controller  failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt

3. Check the RGW pod:

$ oc get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k
NAME                                                              READY   STATUS    RESTARTS        AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k   1/2     Running   288 (35s ago)   21h

Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     4h47m (x225 over 21h)    kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f" already present on machine
  Warning  BackOff    7m14s (x2226 over 21h)   kubelet  Back-off restarting failed container rgw in pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k_openshift-storage(64db8bd3-2327-4aa1-9c01-3f70476a475e)
  Warning  Unhealthy  2m13s (x4741 over 21h)   kubelet  Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients

For more info on the deployment process: https://docs.google.com/document/d/1q2OzxGGUgM9R8TUWEbb2ulRUw_GtHcJs_SmxfSE5iXM/edit

Actual results:
CephObjectStore stays in the Progressing state.

Expected results:
CephObjectStore reaches the Ready state.

Additional info:
OCS must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2209643.tar.gz
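For triage on a similar setup, a minimal sketch of manual checks (assuming the default openshift-storage namespace, the RGW pod name from the output above, a running rook-ceph-tools deployment, and Rook's default naming of the store's realm/zonegroup/zone after the object store; these are standard oc/radosgw-admin invocations, not the operator's exact calls):

# Confirm the RGW pod actually received its multus interfaces (public-net / cluster-net)
$ oc -n openshift-storage get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k -o yaml | grep -A 30 'k8s.v1.cni.cncf.io/network-status'

# Re-run the admin-ops user creation that the reconcile reports as failing
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin user create \
    --uid=rgw-admin-ops-user --display-name="RGW Admin Ops user" \
    --rgw-realm=ocs-storagecluster-cephobjectstore \
    --rgw-zonegroup=ocs-storagecluster-cephobjectstore \
    --rgw-zone=ocs-storagecluster-cephobjectstore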
Moving to post since Jiffin has a fix.
Bug fixed.

Setup:
OCP Version: 4.13.0-0.nightly-2023-06-05-164816
ODF Version: odf-operator.v4.13.0-rhodf [4.13.0-214]
$ oc describe csv odf-operator.v4.13.0-rhodf -n openshift-storage | grep full_version
Labels:       full_version=4.13.0-214
Platform: vSphere

Test process:

1. Install OCP + ODF and create the NADs:

$ cat nad.yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "br-ex",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.30.0/24"
    }
  }'

oviner:ClusterPath$ oc create -f nad.yaml
networkattachmentdefinition.k8s.cni.cncf.io/public-net created
networkattachmentdefinition.k8s.cni.cncf.io/cluster-net created

2. Create a StorageCluster.

3. Check StorageCluster and CephObjectStore status:

$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   6m30s   Ready              2023-06-06T11:34:42Z   4.13.0

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Ready

4. Check Ceph status:

sh-5.1$ ceph -s
  cluster:
    id:     d937bfcd-92a1-4935-8233-41acdebb362f
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 8664.143ms)
            Slow OSD heartbeats on front (longest 8605.254ms)

  services:
    mon: 3 daemons, quorum a,b,c (age 13m)
    mgr: a(active, since 13m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 217 pgs
    objects: 320 objects, 126 MiB
    usage:   417 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     217 active+clean

  io:
    client:   938 B/s rd, 18 KiB/s wr, 1 op/s rd, 1 op/s wr

$ oc logs rook-ceph-osd-1-78cd65659c-57jds | grep 192.168
debug 2023-06-06T11:39:49.140+0000 7f2ef6eee640 -1 osd.1 38 heartbeat_check: no reply from 192.168.20.22:6802 osd.0 ever on either front or back, first ping sent 2023-06-06T11:38:51.339026+0000 (oldest deadline 2023-06-06T11:39:11.339026+0000)

(Follow-up checks for the slow OSD heartbeat warning are sketched after the doc link below.)
For more info: https://docs.google.com/document/d/1IOucHpxCucBFWcSCfKiVUeDFOtkMF-0V_W0J3RU5qm8/edit
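The HEALTH_WARN above points at OSD heartbeat traffic on the multus networks rather than at the object store itself. A minimal sketch of follow-up checks, assuming the default openshift-storage namespace and a running rook-ceph-tools deployment (pod names and 192.168.x.x addresses are taken from the output above and will differ per cluster):

# Show which multus addresses (192.168.20.x public / 192.168.30.x cluster) each OSD registered
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd dump | grep 192.168

# List the slow-heartbeat details reported by the monitors
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail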
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742