Description of problem (please be as detailed as possible and provide log snippets):

csv ocs-registry:4.9.0-102.ci is in Installing phase on vSphere platform

Version of all relevant components (if applicable):
ocs-registry:4.9.0-102.ci
openshift installer (4.9.0-0.nightly-2021-08-19-184748)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Not able to install OCS

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install OCS using ocs-ci
2. Verify the csv is in Succeeded phase

Actual results:
ocs-registry:4.9.0-102.ci is in Installing phase

Expected results:
csv should be in Succeeded phase

Additional info:

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.9.0-102.ci   OpenShift Container Storage   4.9.0-102.ci              Installing

$ oc describe csv ocs-operator.v4.9.0-102.ci
Name:         ocs-operator.v4.9.0-102.ci
Namespace:    openshift-storage
Labels:       olm.api.1cf66995ee5bab83=provided
Events:
  Type     Reason              Age                     From                        Message
  ----     ------              ----                    ----                        -------
  Normal   InstallSucceeded    122m (x78 over 5h42m)   operator-lifecycle-manager  waiting for install components to report healthy
  Normal   NeedsReinstall      117m (x83 over 5h41m)   operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed  67s (x118 over 5h36m)   operator-lifecycle-manager  install timeout

> pods

$ oc get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-79dnw                                            3/3     Running     0          5h42m
csi-cephfsplugin-d4wbd                                            3/3     Running     0          5h42m
csi-cephfsplugin-mb5ks                                            3/3     Running     0          5h42m
csi-cephfsplugin-provisioner-54fbb98c8f-b5v4l                     6/6     Running     0          5h42m
csi-cephfsplugin-provisioner-54fbb98c8f-pcvgq                     6/6     Running     0          5h42m
csi-rbdplugin-27sm6                                               3/3     Running     0          5h42m
csi-rbdplugin-94xn7                                               3/3     Running     0          5h42m
csi-rbdplugin-lm4qv                                               3/3     Running     0          5h42m
csi-rbdplugin-provisioner-84ccc64b48-5cfvw                        6/6     Running     0          5h42m
csi-rbdplugin-provisioner-84ccc64b48-nd8dl                        6/6     Running     0          5h42m
noobaa-core-0                                                     1/1     Running     0          5h38m
noobaa-db-pg-0                                                    1/1     Running     0          5h38m
noobaa-endpoint-54c66b6b88-cg5f6                                  1/1     Running     0          4h57m
noobaa-operator-68998c44dc-78pb6                                  1/1     Running     0          5h43m
ocs-metrics-exporter-7455f88587-fm6df                             1/1     Running     0          5h43m
ocs-operator-7d8bb7577d-4sffr                                     0/1     Running     0          5h43m
rook-ceph-crashcollector-compute-0-7bf548c9fc-5vpjj               1/1     Running     0          5h38m
rook-ceph-crashcollector-compute-1-5b55b94666-hqczc               1/1     Running     0          5h38m
rook-ceph-crashcollector-compute-2-58b844dbff-n86sw               1/1     Running     0          5h38m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-57b54b46vstmc   2/2     Running     0          5h38m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7c8f8d55k67tb   2/2     Running     0          5h38m
rook-ceph-mgr-a-666787bf5-rc2xz                                   2/2     Running     0          5h39m
rook-ceph-mon-a-78f768bdb4-66sm9                                  2/2     Running     0          5h42m
rook-ceph-mon-b-8886f46f4-45htn                                   2/2     Running     0          5h40m
rook-ceph-mon-c-cb4695b4d-q6kzs                                   2/2     Running     0          5h40m
rook-ceph-operator-5c6c56b95-djt88                                1/1     Running     0          5h43m
rook-ceph-osd-0-6d4d98d9c4-nhqqn                                  2/2     Running     0          5h38m
rook-ceph-osd-1-547dd69cfb-87zg2                                  2/2     Running     0          5h38m
rook-ceph-osd-2-84df9467c-xkgd6                                   2/2     Running     0          5h38m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0jl5s5--1-jjdzp        0/1     Completed   0          5h39m
rook-ceph-osd-prepare-ocs-deviceset-1-data-08bzbq--1-s88cj        0/1     Completed   0          5h39m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0nwpsn--1-4pkrl        0/1     Completed   0          5h39m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75b567567r5r   2/2     Running     0          5h38m
rook-ceph-tools-cdd8d5c65-7vkg2                                   1/1     Running     0          5h36m

> rook-ceph-operator-5c6c56b95-djt88 had the error below while creating noobaa-ceph-objectstore-user, but it was eventually created successfully

2021-08-20 05:38:04.196041 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "noobaa-ceph-objectstore-user": failed to get details from ceph object user "noobaa-ceph-objectstore-user": Get "https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443/admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user": dial tcp 172.30.110.155:443: connect: no route to host
2021-08-20 05:38:04.202684 I | op-mon: parsing mon endpoints: a=172.30.151.77:6789,b=172.30.127.181:6789,c=172.30.37.191:6789
2021-08-20 05:38:04.202763 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-08-20 05:38:04.202886 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-08-20 05:38:04.481775 I | ceph-object-store-user-controller: creating ceph object user "ocs-storagecluster-cephobjectstoreuser" in namespace "openshift-storage"
2021-08-20 05:38:04.576778 I | ceph-object-store-user-controller: created ceph object user "ocs-storagecluster-cephobjectstoreuser"
2021-08-20 05:38:04.583982 I | ceph-spec: created ceph *v1.Secret object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstoreuser"
2021-08-20 05:38:04.597441 I | op-mon: parsing mon endpoints: a=172.30.151.77:6789,b=172.30.127.181:6789,c=172.30.37.191:6789
2021-08-20 05:38:04.597548 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-08-20 05:38:04.597691 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-08-20 05:38:04.893648 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-08-20 05:38:04.989766 I | ceph-object-store-user-controller: created ceph object user "noobaa-ceph-objectstore-user"

> $ oc describe pod ocs-operator-7d8bb7577d-4sffr

Name:         ocs-operator-7d8bb7577d-4sffr
Namespace:    openshift-storage
Priority:     0
Node:         compute-1/10.1.161.31
Start Time:   Fri, 20 Aug 2021 11:02:48 +0530
Labels:       name=ocs-operator
              pod-template-hash=7d8bb7577d
Annotations:  alm-examples:
Events:
  Type     Reason      Age                      From     Message
  ----     ------      ----                     ----     -------
  Warning  ProbeError  104s (x2341 over 5h45m)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 500
           body: [-]readyz failed: reason withheld
           healthz check failed

> ocs-operator-7d8bb7577d-4sffr log

{"level":"error","ts":1629458239.4290745,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:302\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}

> Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/5379/console

> must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu6-ocs49/vavuthu6-ocs49_20210820T044517/logs/failed_testcase_ocs_logs_1629435690/test_deployment_ocs_logs/
Here's my observation:

This is the StorageCluster status:
```
status:
  conditions:
  - lastHeartbeatTime: "2021-08-20T05:33:41Z"
    lastTransitionTime: "2021-08-20T05:33:40Z"
    message: CephCluster resource is not reporting status
    reason: CephClusterStatus
    status: "False"
    type: Available
  - lastHeartbeatTime: "2021-08-20T06:38:34Z"
    lastTransitionTime: "2021-08-20T05:33:40Z"
    message: Waiting on Nooba instance to finish initialization
    reason: NoobaaInitializing
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2021-08-20T05:37:38Z"
    lastTransitionTime: "2021-08-20T05:33:41Z"
    message: 'CephCluster is creating: Processing OSD 2 on PVC "ocs-deviceset-0-data-0jl5s5"'
    reason: ClusterStateCreating
    status: "False"
    type: Upgradeable
```
It says the CephCluster is NotReady because it is still processing OSD 2, and NooBaa is still not initialized. The issue with the CephCluster status is most probably a failure in the status update, because Ceph health looks fine with all OSDs accounted for. If it were actually failing, in my experience the NooBaa failure would call it out explicitly (which it didn't in this case).

This is the CephCluster state:
```
status:
  ceph:
    capacity:
      bytesAvailable: 319240495104
      bytesTotal: 322122547200
      bytesUsed: 2882052096
      lastUpdated: "2021-08-20T06:41:31Z"
    health: HEALTH_OK
    lastChanged: "2021-08-20T05:38:40Z"
    lastChecked: "2021-08-20T06:41:31Z"
    previousHealth: HEALTH_WARN
  message: Cluster created successfully
  phase: Ready
  state: Created
```

Now, this is the NooBaa status:
```
status:
  conditions:
  - lastHeartbeatTime: "2021-08-20T05:37:38Z"
    lastTransitionTime: "2021-08-20T05:37:38Z"
    message: |-
      RequestError: send request failed
      caused by: Put "https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443/nb.1629441716656.apps.vavuthu6-ocs49.qe.rh-ocs.com": x509: certificate signed by unknown authority
    reason: TemporaryError
    status: "False"
    type: Available
```
It says it's hitting a certificate issue, which is probably where the NooBaa operator is stuck. That keeps the StorageCluster marked as Progressing, which means the OCS operator pod is NotReady (i.e. it's still hitting errors), so the CSV is also stuck in Installing.

We need to figure out two things here:
1. Why are we hitting the certs issue? Did anything change here?
2. Why is the status mixed up? It might not be reflecting the actual state, or maybe that was the state at log-collection time.
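To make the readiness part of that chain concrete: the ProbeError events above ("[-]readyz failed") come from a readiness endpoint answering 500. Below is a minimal, hypothetical controller-runtime sketch of how such a check could be wired; this is not the actual ocs-operator code, and the check name and condition plumbing are assumptions for illustration only.

```go
// Hypothetical readiness wiring: /readyz fails while the watched
// StorageCluster is not reporting Available, which is what turns a
// reconcile-level problem (NooBaa stuck on the cert error) into a pod-level
// NotReady and, in turn, the OLM "install timeout" on the CSV.
package readiness

import (
	"errors"
	"net/http"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// storageClusterAvailable would be flipped by the reconcile loop once it sees
// the StorageCluster's Available condition go to "True".
var storageClusterAvailable atomic.Bool

// readyzCheck is a healthz.Checker: any non-nil error makes /readyz answer 500,
// which is exactly the "[-]readyz failed: reason withheld" that the kubelet
// reports in the ProbeError events.
func readyzCheck(_ *http.Request) error {
	if !storageClusterAvailable.Load() {
		return errors.New("StorageCluster is not reporting Available yet")
	}
	return nil
}

// In the operator's main() this would be registered on the manager, e.g.:
//   if err := mgr.AddReadyzCheck("readyz", readyzCheck); err != nil { ... }
var _ healthz.Checker = readyzCheck
```

With a check like this, anything that keeps the StorageCluster from reporting Available (here, NooBaa stuck on the certificate error) keeps the pod NotReady, the Deployment below minimum availability, and the CSV in Installing.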
On the IBM Power platform (ppc64le), we are also hitting the same issue. Tried both with the UI and with ocs-ci; the issue is the same in both cases.

CSV:

[root@rdr-aar49-sao01-bastion-0 ~]# oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.9.0-102.ci   OpenShift Container Storage   4.9.0-102.ci              Installing

Pods:

[root@rdr-aar49-sao01-bastion-0 ~]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-6cdhn                                            3/3     Running     0          3d3h
csi-cephfsplugin-provisioner-54fbb98c8f-9jf76                     6/6     Running     0          3d3h
csi-cephfsplugin-provisioner-54fbb98c8f-wz9gc                     6/6     Running     0          3d3h
csi-cephfsplugin-scdzd                                            3/3     Running     0          3d3h
csi-cephfsplugin-wdn92                                            3/3     Running     0          3d3h
csi-rbdplugin-5xrpd                                               3/3     Running     0          3d3h
csi-rbdplugin-8gqjn                                               3/3     Running     0          3d3h
csi-rbdplugin-provisioner-84ccc64b48-7xqzw                        6/6     Running     0          3d3h
csi-rbdplugin-provisioner-84ccc64b48-pmh9h                        6/6     Running     0          3d3h
csi-rbdplugin-pwvxx                                               3/3     Running     0          3d3h
noobaa-core-0                                                     1/1     Running     0          3d3h
noobaa-db-pg-0                                                    1/1     Running     0          3d3h
noobaa-endpoint-6f854c9848-9c9s7                                  1/1     Running     0          3d3h
noobaa-operator-68998c44dc-mtqcf                                  1/1     Running     0          3d3h
ocs-metrics-exporter-7455f88587-tq56g                             1/1     Running     0          3d3h
ocs-operator-7d8bb7577d-tnf8d                                     0/1     Running     0          3d3h
rook-ceph-crashcollector-rdr-aar49-sao01-worker-0-59ffb784crr7k   1/1     Running     0          3d3h
rook-ceph-crashcollector-rdr-aar49-sao01-worker-1-5bd7bdbbnktmk   1/1     Running     0          3d3h
rook-ceph-crashcollector-rdr-aar49-sao01-worker-2-89fd967c6db9h   1/1     Running     0          3d3h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-577fd545kl7k7   2/2     Running     0          3d3h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-695f47b4gr2pl   2/2     Running     0          3d3h
rook-ceph-mgr-a-668c4fc9b6-75l7m                                  2/2     Running     0          3d3h
rook-ceph-mon-a-55f545b46d-z42np                                  2/2     Running     0          3d3h
rook-ceph-mon-b-689568457b-s59xc                                  2/2     Running     0          3d3h
rook-ceph-mon-c-796f6d449c-cpvfj                                  2/2     Running     0          3d3h
rook-ceph-operator-5c6c56b95-kbtlc                                1/1     Running     0          3d3h
rook-ceph-osd-0-7ddc744477-bxqcb                                  2/2     Running     0          3d3h
rook-ceph-osd-1-b6b98d8b-k8dkm                                    2/2     Running     0          3d3h
rook-ceph-osd-2-5487f7c89f-ps97v                                  2/2     Running     0          3d3h
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-d2ql7   0/1     Completed   0          3d3h
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-qsgld   0/1     Completed   0          3d3h
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-txqww   0/1     Completed   0          3d3h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-595cd6blt94s   2/2     Running     0          3d3h
rook-ceph-tools-cdd8d5c65-lfrq5                                   1/1     Running     0          3d3h

Events of the ocs-operator pod:

Events:
  Type     Reason      Age                       From     Message
  ----     ------      ----                      ----     -------
  Warning  ProbeError  2m48s (x30681 over 3d3h)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 500
           body: [-]readyz failed: reason withheld
           healthz check failed

Events of the ocs-operator CSV:

Events:
  Type     Reason              Age                    From                        Message
  ----     ------              ----                   ----                        -------
  Warning  InstallCheckFailed  42s (x1761 over 3d3h)  operator-lifecycle-manager  install timeout
must-gather logs for the ODF 4.9 installation on ppc64le: https://drive.google.com/file/d/1d8ZAn-vtY8j4YI6wJIFJDbnc2AExqQYa/view?usp=sharing
Is there any live setup available? In 4.9, for RGW, we opened up the SSL port with the help of the serving cert feature (until now only the insecure port 80 was enabled).
In noobaa-operator we are trying to create a bucket in RGW here: https://github.com/noobaa/noobaa-operator/blob/e9400e06fbb438b52ba0946e17e76a13f16b0b1f/pkg/system/phase4_configuring.go#L980-L998. We use the endpoint provided in the object-store-user secret.
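For illustration, a minimal sketch of that kind of bucket-create call (this is not the noobaa-operator code; the endpoint value, credentials, bucket name, and CA file path are placeholders/assumptions). It shows why an S3 client that does not trust the OpenShift service CA fails against the new HTTPS RGW endpoint with "x509: certificate signed by unknown authority", and how adding the CA to the client's root pool avoids it:

```go
// Sketch only: bucket creation against the RGW HTTPS endpoint, with the
// OpenShift service CA added to the client trust pool.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Endpoint as it would come from the object-store-user secret (placeholder).
	endpoint := "https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443"

	// The RGW cert is issued by the OpenShift service CA, which is not in the
	// system trust store; without adding it, the TLS handshake fails with
	// "x509: certificate signed by unknown authority".
	caPool := x509.NewCertPool()
	// Assumed mount path for the service CA bundle inside the pod.
	if pem, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"); err == nil {
		caPool.AppendCertsFromPEM(pem)
	}

	httpClient := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: caPool},
		},
	}

	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String(endpoint),
		Region:           aws.String("us-east-1"),
		S3ForcePathStyle: aws.Bool(true),
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		HTTPClient:       httpClient,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Bucket name mirrors the "nb.<timestamp>.<apps domain>" pattern seen in the
	// NooBaa condition message; placeholder value here.
	if _, err := s3.New(sess).CreateBucket(&s3.CreateBucketInput{
		Bucket: aws.String("nb-example-bucket"),
	}); err != nil {
		log.Fatalf("bucket creation failed: %v", err)
	}
}
```

Against the pre-4.9 insecure port 80 endpoint no TLS handshake happens at all, which would explain why this only started showing up once the secure port was enabled.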
Cluster is in the same state.

Kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu6-ocs49/vavuthu6-ocs49_20210820T044517/openshift-cluster-dir/auth/kubeconfig
(In reply to Vijay Avuthu from comment #8)
> Cluster is in the same state.
>
> Kubeconfig:
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu6-ocs49/vavuthu6-ocs49_20210820T044517/openshift-cluster-dir/auth/kubeconfig

Thanks for sharing the cluster. I checked the state of the cephobjectstore and cephobjectstoreuser, and it looks good to me.

As Nimrod mentioned in https://chat.google.com/room/AAAAREGEba8/PJsC-wuS_98, the secure port was opened for RGW from 4.9 onwards, so we need changes in NooBaa to accommodate that. For internal mode, the OpenShift serving cert feature is used, so the cert should be available in each pod AFAIR. For external mode, since the Ceph cluster is already configured and most probably running outside of OpenShift, users are expected to provide the TLS certs as secrets for RGW.
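To make the two certificate sources concrete, here is a hypothetical sketch of how an operator could pick up the CA bundle to trust for RGW in each mode. The ConfigMap/Secret names and keys below are illustrative assumptions, not necessarily what the operators actually use:

```go
// Sketch of where the trusted CA could come from in internal vs. external mode.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// loadRGWCABundle returns the PEM bundle to trust for the RGW endpoint.
func loadRGWCABundle(ctx context.Context, cs kubernetes.Interface, ns string, external bool) ([]byte, error) {
	if external {
		// External mode: the user supplies the RGW TLS cert/CA as a secret.
		sec, err := cs.CoreV1().Secrets(ns).Get(ctx, "rgw-tls-cert", metav1.GetOptions{}) // assumed name
		if err != nil {
			return nil, err
		}
		return sec.Data["ca.crt"], nil // assumed key
	}
	// Internal mode: the RGW serving cert is issued by the OpenShift service CA,
	// whose bundle can be injected into a ConfigMap annotated with
	// service.beta.openshift.io/inject-cabundle: "true".
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, "openshift-service-ca.crt", metav1.GetOptions{}) // assumed name
	if err != nil {
		return nil, err
	}
	return []byte(cm.Data["service-ca.crt"]), nil // assumed key
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	pem, err := loadRGWCABundle(context.Background(), cs, "openshift-storage", false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("loaded %d bytes of CA bundle\n", len(pem))
}
```

The returned PEM bundle would then feed the RootCAs pool of the HTTP client used for the S3 and RGW admin calls.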
*** Bug 1998065 has been marked as a duplicate of this bug. ***
We (IBM Power platform) are still seeing the problem: ocs-operator.v4.9.0-117.ci is not in the expected phase (Succeeded); the CSV continues to be in the Installing state.
Mudit asked me to collect a must-gather, but I found another issue (https://bugzilla.redhat.com/show_bug.cgi?id=1999267): ocs-must-gather is available in quay.io/rhcephdev only for the x86 platform.
The root cause of the ODF 4.9 deploy failure on System P is the missing ppc64le image for ocs-operator (and for ocs-must-gather, as noted above). Both ocs-operator and ocs-must-gather are being built only for x86.
Please don't overload the same BZ with different issues.
For the issue Sridhar mentioned, we have https://bugzilla.redhat.com/show_bug.cgi?id=1999267. For the original issue, there is another workaround in place, and we will get it with the next build.
I can confirm that this issue has been solved in ODF operator build 4.9.0-120.ci on the s390x platform. There is however a new issue that prevents the ODF operator from installing properly on s390x. I will create a new ticket for that one.
@muagarwa, is this already fixed? If so, this bug should be moved to ON_QA, please. Thanks.
AFAIK, it's not fixed completely. A workaround is in place from the NooBaa team. Nimrod can confirm.
Elad, the workaround (it's a code fix) is working and we are no longer blocked. I can't remove the blocker flag because the bug has the 'Regression' keyword. Is it fine to move this BZ to ON_QA and track the complete fix with a new BZ?
Verified with build 4.9.0-194.ci

> All operators are in Succeeded state

NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6911/consoleFull
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5085