Bug 2209643

Summary: Multus: CephObjectStore stuck in Progressing state because "failed to create or retrieve rgw admin ops user"
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: rook
Assignee: Jiffin <jthottan>
Status: CLOSED ERRATA
QA Contact: Oded <oviner>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.13
CC: brgardne, ebenahar, hnallurv, jthottan, muagarwa, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ODF 4.13.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.13.0-214
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Embargoed:

Description Oded 2023-05-24 10:55:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
CephObjectStore is stuck in the Progressing state because of "failed to create or retrieve rgw admin ops user".

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-05-22-181752
ODF Version: 4.13.0-203.stable
Platform: Vsphere

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.13 + ODF 4.13 with Multus [public-net + cluster-net NADs, macvlan in bridge mode]

NAD configuration:
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: public-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: cluster-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'


2. Check CephObjectStore status:
$ oc get CephObjectStore 
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Progressing
  
Warning  ReconcileFailed  11m (x90 over 21h)  rook-ceph-object-controller  failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt
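
The failing operation can be checked by hand from the rook-ceph-tools pod. A hedged example, assuming the toolbox is deployed (and, with Multus, attached to the public network) and that the realm/zonegroup/zone follow Rook's default of being named after the object store:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-5.1$ radosgw-admin user info --uid rgw-admin-ops-user \
    --rgw-realm ocs-storagecluster-cephobjectstore \
    --rgw-zonegroup ocs-storagecluster-cephobjectstore \
    --rgw-zone ocs-storagecluster-cephobjectstore

Here the operator's own radosgw-admin call was killed ("signal: interrupt"), so the user was most likely never created.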

3. Check RGW pod:
$ oc get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k
NAME                                                              READY   STATUS    RESTARTS        AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k   1/2     Running   288 (35s ago)   21h

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulled     4h47m (x225 over 21h)   kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:fa6d01cdef17bc32d2b95b8121b02f4d41adccc5ba8a9b95f38c97797ff6621f" already present on machine
  Warning  BackOff    7m14s (x2226 over 21h)  kubelet  Back-off restarting failed container rgw in pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k_openshift-storage(64db8bd3-2327-4aa1-9c01-3f70476a475e)
  Warning  Unhealthy  2m13s (x4741 over 21h)  kubelet  Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients
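
Error code 7 in the startup probe message is most likely curl's "failed to connect" exit code (the probe curls the local RGW endpoint), i.e. the RGW daemon never starts listening. A hedged way to dig further is to look at the previous rgw container log and the service endpoints:

$ oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9d4d788l9h8k -c rgw --previous | tail -n 50
$ oc -n openshift-storage get svc,endpoints | grep rgw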

For more info on the deployment process:
https://docs.google.com/document/d/1q2OzxGGUgM9R8TUWEbb2ulRUw_GtHcJs_SmxfSE5iXM/edit

Actual results:
CephObjectStore in Progressing state

Expected results:
CephObjectStore in Ready state

Additional info:
OCS must-gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2209643.tar.gz

Comment 3 Blaine Gardner 2023-05-24 18:43:12 UTC
Moving to post since Jiffin has a fix.

Comment 7 Oded 2023-06-06 11:58:33 UTC
Bug Fixed.


Setup:
OCP Version: 4.13.0-0.nightly-2023-06-05-164816
ODF Version: odf-operator.v4.13.0-rhodf [4.13.0-214]
$ oc describe csv odf-operator.v4.13.0-rhodf  -n openshift-storage | grep full_version
Labels:       full_version=4.13.0-214
Platform: Vsphere 

Test Process:

1. Install OCP, then create the NetworkAttachmentDefinitions:
$ cat nad.yaml 
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: public-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: cluster-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'

oviner:ClusterPath$ oc create -f nad.yaml 
networkattachmentdefinition.k8s.cni.cncf.io/public-net created
networkattachmentdefinition.k8s.cni.cncf.io/cluster-net created

2. Create a StorageCluster:
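
A minimal sketch of the relevant part of the StorageCluster spec, assuming the two NADs above in the default namespace (storageDeviceSets and the other required fields are omitted):

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  network:
    provider: multus
    selectors:
      public: default/public-net
      cluster: default/cluster-net
  # storageDeviceSets and other required fields omitted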

3. Check StorageCluster status:
$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   6m30s   Ready              2023-06-06T11:34:42Z   4.13.0

$ oc get CephObjectStore
NAME                                 PHASE
ocs-storagecluster-cephobjectstore   Ready


4. Check ceph status:

sh-5.1$ ceph -s
  cluster:
    id:     d937bfcd-92a1-4935-8233-41acdebb362f
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 8664.143ms)
            Slow OSD heartbeats on front (longest 8605.254ms)
 
  services:
    mon: 3 daemons, quorum a,b,c (age 13m)
    mgr: a(active, since 13m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 12m), 3 in (since 12m)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 217 pgs
    objects: 320 objects, 126 MiB
    usage:   417 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     217 active+clean
 
  io:
    client:   938 B/s rd, 18 KiB/s wr, 1 op/s rd, 1 op/s wr


$ oc logs rook-ceph-osd-1-78cd65659c-57jds | grep 192.168
debug 2023-06-06T11:39:49.140+0000 7f2ef6eee640 -1 osd.1 38 heartbeat_check: no reply from 192.168.20.22:6802 osd.0 ever on either front or back, first ping sent 2023-06-06T11:38:51.339026+0000 (oldest deadline 2023-06-06T11:39:11.339026+0000)
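
The address in that log line is in the public-net range (192.168.20.0/24), so the slow heartbeats reported by ceph -s are happening on the Multus networks rather than the pod network. A hedged way to see which public and cluster addresses each OSD registered, from the toolbox pod:

sh-5.1$ ceph osd dump | grep 'osd\.'

Each osd.<N> line should show one address on 192.168.20.0/24 (public) and one on 192.168.30.0/24 (cluster) if the Multus attachment worked.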

Comment 10 errata-xmlrpc 2023-06-21 15:25:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742