Description of problem (please be as detailed as possible and provide log snippets):

When deploying OCS internal mode with Multus, the deployment is unsuccessful. The CSV gets stuck in the Installing phase. NooBaa gets stuck in the Configuring phase with this message in its status: 'Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready', and the RGW pod is not created.

Also observed this error message in the rook-ceph-operator pod:

```
E | ceph-object-controller: failed to reconcile failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store "ocs-storagecluster-cephobjectstore": radosgw-admin realm get failed with code -1, for reason "": signal: interrupt
```

Version of all relevant components (if applicable):
OCP: 4.8.0-0.nightly-2021-07-01-043852
OCS: ocs-operator.v4.8.0-436.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the CSV is stuck in the Installing phase.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes, 2/2

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a NAD for the public network (see the example NAD after the pod listing below)
2. Install the OCS operator
3. Create the Storage Cluster with Multus

Actual results:
The CSV is stuck in Installing and NooBaa is stuck in the Configuring phase.

Expected results:
Deployment should succeed with all components Ready.

Additional info:

$ oc get csv
NAME   DISPLAY   VERSION   REPLACES   PHASE
ocs-operator.v4.8.0-436.ci   OpenShift Container Storage   4.8.0-436.ci      Installing

$ oc get storagecluster
NAME   AGE   PHASE   EXTERNAL   CREATED AT   VERSION
ocs-storagecluster   117m   Progressing      2021-07-02T13:14:02Z   4.8.0

$ oc get noobaa
NAME   MGMT-ENDPOINTS   S3-ENDPOINTS   IMAGE   PHASE   AGE
noobaa   ["https://10.70.45.73:32551"]   ["https://10.70.45.76:30455"]   quay.io/rhceph-dev/mcg-core@sha256:1f7082c55c9d7caee72e3387cc82667bb559abbc756bd3dbddc56d956d2ba87d   Configuring   113m

>> Observe that the RGW pod is not present

$ oc get pod
NAME   READY   STATUS   RESTARTS   AGE
csi-cephfsplugin-b768r   3/3   Running   0   117m
csi-cephfsplugin-f62rr   3/3   Running   0   117m
csi-cephfsplugin-provisioner-65c868d8cc-825mx   6/6   Running   0   117m
csi-cephfsplugin-provisioner-65c868d8cc-fsp2c   6/6   Running   0   117m
csi-cephfsplugin-xkntl   3/3   Running   0   117m
csi-rbdplugin-7bplb   3/3   Running   0   117m
csi-rbdplugin-89gh2   3/3   Running   0   117m
csi-rbdplugin-8k9rh   3/3   Running   0   117m
csi-rbdplugin-provisioner-5ddfbdbc46-bmwzm   6/6   Running   0   117m
csi-rbdplugin-provisioner-5ddfbdbc46-z6hfj   6/6   Running   0   117m
noobaa-core-0   1/1   Running   0   113m
noobaa-db-pg-0   1/1   Running   0   113m
noobaa-endpoint-7b99999d89-456cr   1/1   Running   0   111m
noobaa-operator-6fb786d997-l6lpv   1/1   Running   0   119m
ocs-metrics-exporter-db6d58f84-fx97f   1/1   Running   0   119m
ocs-operator-77449fc484-9jc6z   0/1   Running   0   119m
rook-ceph-crashcollector-compute-0-7975b59bfb-8xn9h   1/1   Running   0   115m
rook-ceph-crashcollector-compute-1-5444489bdc-dfhtq   1/1   Running   0   115m
rook-ceph-crashcollector-compute-2-6474c5f74-2d7h4   1/1   Running   0   115m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-57444f97kt8l2   2/2   Running   0   113m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dfb569ctrgdw   2/2   Running   0   113m
rook-ceph-mgr-a-54b99d54-wv4jw   2/2   Running   0   115m
rook-ceph-mon-a-5944bfb767-jtqv7   2/2   Running   0   116m
rook-ceph-mon-b-6f69686b99-s4w4m   2/2   Running   0   116m
rook-ceph-mon-c-74cb649cf-lzc9b   2/2   Running   0   116m
rook-ceph-operator-66ccdccd88-wr88v   1/1   Running   0   119m
rook-ceph-osd-0-7d8877586-zrn4j   2/2   Running   0   115m
rook-ceph-osd-1-5df9fdcdb9-pmvvp   2/2   Running   0   115m
rook-ceph-osd-2-7858b6b9db-wrw5f   2/2   Running   0   113m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0mz2pm-qhx9r   0/1   Completed   0   115m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ftdz4-fdg9p   0/1   Completed   0   115m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0rfvtd-jx7fk   0/1   Completed   0   115m
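For reference, step 1 above ("Create a NAD for the public network") can be done with a NetworkAttachmentDefinition of roughly the following shape. This is only a minimal sketch: the NAD name (ocs-public), the host interface (ens192), and the IP range are placeholders and were not taken from this cluster.

```
cat <<EOF | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public            # placeholder name, referenced later from the StorageCluster/CephCluster spec
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens192",      # placeholder host interface
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"   # placeholder range for the Ceph public network
      }
    }
EOF
```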
RGW is failing to be created because the realm cannot be queried or configured [1]:

2021-07-02T13:19:24.000347222Z 2021-07-02 13:19:24.000293 I | exec: timeout waiting for process radosgw-admin to return. Sending interrupt signal to the process
2021-07-02T13:19:24.035556071Z 2021-07-02 13:19:24.035504 E | ceph-object-controller: failed to reconcile failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store "ocs-storagecluster-cephobjectstore": radosgw-admin realm get failed with code -1, for reason "": signal: interrupt

Typically this is caused by the PGs not being clean or the OSDs not being balanced in the CRUSH tree. However, according to the ceph status [2], the PGs are clean, Ceph is HEALTH_OK, the three OSDs are spread as expected across different racks and hosts (the PVC name is the host), and everything else looks as expected from what I can find.

This sounds related to the recent changes for using the RGW API, or possibly Rook needs to change this instance of radosgw-admin to call the RGW API. @Seb can you take a look?

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1978722/ocs-must-gather/must-gather.local.8652376216988780581/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/pods/rook-ceph-operator-66ccdccd88-wr88v/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1978722/ocs-must-gather/must-gather.local.8652376216988780581/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/ceph/must_gather_commands_json_output/
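For context, these are roughly the checks behind that assessment; here they were read from the must-gather [2], but they can also be run from the toolbox pod:

```
# PGs should all be active+clean and overall health HEALTH_OK
$ ceph status

# The three OSDs should sit under distinct hosts/racks in the CRUSH tree
$ ceph osd tree
```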
I don't know why the commands are timing out; if it were a networking issue, the whole cluster would have been down. Also, this is not related to the latest changes around the API, since we don't use it for realms/zonegroups/zones.

sagrawal, is it possible to get access to the cluster for further investigation? Thanks
OK, so the rook-ceph-operator cannot access the Multus network and thus cannot query the current realm with:

radosgw-admin realm get --rgw-realm=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring

Currently, the rook-ceph-operator pod does not have the Multus network annotations; adding the annotation fixes the issue. Engineering is currently evaluating different possibilities and will come back with a solution ASAP.
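For illustration only, the annotation mentioned above could be added to the operator's pod template along these lines. The NAD reference (openshift-storage/ocs-public) is a placeholder matching the earlier sketch, and, as explained in the next comment, patching the deployment restarts the operator, which is exactly what the eventual fix tries to avoid:

```
# Attach the operator pod to the Multus public network (placeholder NAD name);
# note this rolls the rook-ceph-operator pod.
$ oc -n openshift-storage patch deployment rook-ceph-operator --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"openshift-storage/ocs-public"}}}}}'
```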
Proposed fix is under review.
From Sebastien:

When the CephCluster is configured with Multus and multiple networks are used to deploy Ceph, some commands fail when executed from the operator. These commands, in particular the radosgw-admin ones, need access to the Ceph public network to talk to the OSDs. Unfortunately, the Rook-Ceph operator does not have the network annotations, thus does not have those networks available, and cannot reach the OSDs. So the commands end up hanging and eventually time out.

Applying the annotations to the operator pod is possible, but it would also restart the operator, and this should be avoided at all costs. Applying the annotations beforehand is not possible either, since the Multus declaration is in the CephCluster specification, so we would have no idea what to apply.

The current approach therefore runs a new sidecar container in the mgr pod to act as a proxy for "some" Ceph commands, only the radosgw-admin ones needed for the multisite setup. This is a small container with admin access that runs idle, waiting for commands to be executed. In a sense it is similar to the toolbox, but we did not want to expose it explicitly, so running it as a sidecar is quite nice.

Proxying commands is obviously not always recommended, since it adds an extra hop in the network path: each request now goes from the operator pod to the API server to the remote pod to Ceph, whereas previously the command went straight from the operator to Ceph. It's worth noting that external mode is not impacted since no RGW pod is configured there.

This scenario is flexible and scales well, since any CephCluster with Multus will get its mgr sidecar deployed and can then talk to Ceph. We are not limited.
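To illustrate the proxy path described above, a radosgw-admin call would effectively be executed inside the mgr pod's extra container (visible as the 3/3 mgr pod in the later pod listing) rather than in the operator. The container name below ("cmd-proxy") is an assumption for illustration, not taken from the Rook change:

```
# Run the realm query through the mgr sidecar, which is attached to the Multus networks
# ("cmd-proxy" is a hypothetical container name; the pod name is from the verification cluster)
$ oc -n openshift-storage exec rook-ceph-mgr-a-5c4b5589b-6c4fs -c cmd-proxy -- \
    radosgw-admin realm get --rgw-realm=ocs-storagecluster-cephobjectstore
```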
Version Used:
OCP: 4.8.0-0.nightly-2021-07-09-181248
OCS: ocs-operator.v4.8.0-450.ci

Although the RGW pod is now coming up in the cluster, the installation is still getting stuck. The rook-ceph-operator pod log is filled with these messages:

```
2021-07-12 11:33:55.084447 I | exec: timeout waiting for process radosgw-admin to return. Sending interrupt signal to the process
2021-07-12 11:33:55.101651 I | op-mon: parsing mon endpoints: a=172.30.5.253:6789,b=172.30.195.39:6789,c=172.30.165.102:6789
2021-07-12 11:33:55.101738 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-07-12 11:33:55.101958 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
```

must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1978722/C12/

>> CSV, StorageCluster, Pods and NooBaa after deployment:
-----
NAME   DISPLAY   VERSION   REPLACES   PHASE
clusterserviceversion.operators.coreos.com/ocs-operator.v4.8.0-450.ci   OpenShift Container Storage   4.8.0-450.ci      Installing

NAME   AGE   PHASE   EXTERNAL   CREATED AT   VERSION
storagecluster.ocs.openshift.io/ocs-storagecluster   27m   Progressing      2021-07-12T11:09:04Z   4.8.0

NAME   READY   STATUS   RESTARTS   AGE
pod/csi-cephfsplugin-pc9zc   3/3   Running   0   27m
pod/csi-cephfsplugin-provisioner-6bd865fddf-6j5g5   6/6   Running   0   27m
pod/csi-cephfsplugin-provisioner-6bd865fddf-x6465   6/6   Running   0   27m
pod/csi-cephfsplugin-s5jdj   3/3   Running   0   27m
pod/csi-cephfsplugin-wqvz4   3/3   Running   0   27m
pod/csi-rbdplugin-bq2qk   3/3   Running   0   27m
pod/csi-rbdplugin-nl4mq   3/3   Running   0   27m
pod/csi-rbdplugin-provisioner-645747dd64-blsz8   6/6   Running   0   27m
pod/csi-rbdplugin-provisioner-645747dd64-l4qc4   6/6   Running   0   27m
pod/csi-rbdplugin-vpmd2   3/3   Running   0   27m
pod/noobaa-core-0   1/1   Running   0   24m
pod/noobaa-db-pg-0   1/1   Running   0   24m
pod/noobaa-endpoint-56f5b67ddf-mzwvq   1/1   Running   0   22m
pod/noobaa-operator-6b99f9dc8b-qvptq   1/1   Running   0   31m
pod/ocs-metrics-exporter-7fb6c95ff7-nrk4r   1/1   Running   0   31m
pod/ocs-operator-5b55cd4c64-gn8d6   0/1   Running   0   31m
pod/rook-ceph-crashcollector-compute-0-78bff5fb4b-js9b8   1/1   Running   0   24m
pod/rook-ceph-crashcollector-compute-1-5df648854f-ch98t   1/1   Running   0   24m
pod/rook-ceph-crashcollector-compute-2-76c9c664bd-dvcv6   1/1   Running   0   24m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-84dc6f8892wf8   2/2   Running   0   23m
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-64cd74bdffrqs   2/2   Running   0   23m
pod/rook-ceph-mgr-a-5c4b5589b-6c4fs   3/3   Running   0   24m
pod/rook-ceph-mon-a-669666856c-5wm9q   2/2   Running   0   26m
pod/rook-ceph-mon-b-849d79f8fc-bfnn4   2/2   Running   0   25m
pod/rook-ceph-mon-c-577b4575-hvq9c   2/2   Running   0   25m
pod/rook-ceph-operator-665db59857-f6hdb   1/1   Running   0   31m
pod/rook-ceph-osd-0-777d9c9566-xrqgf   2/2   Running   0   24m
pod/rook-ceph-osd-1-7d7cdbf799-qhw7v   2/2   Running   0   24m
pod/rook-ceph-osd-2-f7846bc78-bq24h   2/2   Running   0   24m
pod/rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0mrjld-7d82d   0/1   Completed   0   24m
pod/rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0vc4bt-cswzn   0/1   Completed   0   24m
pod/rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0kwh4w-mrh2r   0/1   Completed   0   24m
pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c56fd56jvwnc   2/2   Running   0   23m

NAME   MGMT-ENDPOINTS   S3-ENDPOINTS   IMAGE   PHASE   AGE
noobaa.noobaa.io/noobaa   ["https://10.1.160.184:31368"]   ["https://10.1.160.181:32699"]   quay.io/rhceph-dev/mcg-core@sha256:db64242abde8be6a18b06157953133d117a7cd7b86d3048f317c5427417391e5   Configuring   24m

>> From the Ceph status, all OSDs are up and the RGW daemon is active:

$ ceph -s
  cluster:
    id:     7bc05757-63b7-4034-9c05-8f46760ca321
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 23m)
    mgr: a(active, since 23m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 22m), 3 in (since 22m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 334 objects, 132 MiB
    usage:   3.3 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     176 active+clean

  io:
    client: 938 B/s rd, 2.8 KiB/s wr, 1 op/s rd, 0 op/s wr
Small code nit. Fix posted.