Created attachment 1934597 [details]
ocs-osd-controller-manager-logs.txt

Description of problem:
ocs-provider-qe addon stuck in Failed state.

Version-Release number of selected component (if applicable):
ODF Version: 4.10.9-7
OCP Version: 4.10.45

How reproducible:

Steps to Reproduce:
1. Install Managed Service cluster
2. Install ocs-provider-qe addon [Failed]

$ rosa list addon -c oviner-pr
ID                        NAME                                                                STATE
cluster-logging-operator  Cluster Logging Operator                                            not installed
dbaas-operator            Red Hat OpenShift Database Access                                   not installed
ocm-addon-test-operator   OCM Add-On Test Operator                                            not installed
ocs-consumer              Red Hat OpenShift Data Foundation Managed Service Consumer          not installed
ocs-consumer-dev          Red Hat OpenShift Data Foundation Managed Service Consumer (dev)    not installed
ocs-consumer-qe           Red Hat OpenShift Data Foundation Managed Service Consumer (QE)     not installed
ocs-converged             Red Hat OpenShift Data Foundation Managed Service (converged)       not installed
ocs-converged-dev         Red Hat OpenShift Data Foundation Managed Service (converged, dev)  not installed
ocs-converged-qe          Red Hat OpenShift Data Foundation Managed Service (converged, QE)   not installed
ocs-provider              Red Hat OpenShift Data Foundation Managed Service Provider          not installed
ocs-provider-dev          Red Hat OpenShift Data Foundation Managed Service Provider (dev)    not installed
ocs-provider-qe           Red Hat OpenShift Data Foundation Managed Service Provider (QE)     failed

$ oc get pods
NAME                                               READY   STATUS              RESTARTS   AGE
addon-ocs-provider-qe-catalog-xxs76                1/1     Running             0          4h24m
alertmanager-managed-ocs-alertmanager-0            2/2     Running             0          4h24m
csi-addons-controller-manager-759b488df-k4g6g      2/2     Running             0          4h20m
ocs-metrics-exporter-5dd96c885b-8mlkj              1/1     Running             0          4h20m
ocs-operator-6888799d6b-8qzmf                      1/1     Running             0          4h20m
ocs-osd-aws-data-gather-74c5bbdf75-295bj           1/1     Running             0          4h24m
ocs-osd-controller-manager-7cc9f965bb-22d5c        2/3     Running             0          4h24m
ocs-provider-server-86dc8f67c9-jpt8w               1/1     Running             0          4h20m
odf-console-57b8476cd4-k9jkq                       1/1     Running             0          4h24m
odf-operator-controller-manager-6f44676f4f-ngwr2   2/2     Running             0          4h20m
prometheus-managed-ocs-prometheus-0                3/3     Running             0          4h20m
prometheus-operator-8547cc9f89-btzt7               1/1     Running             0          4h20m
rook-ceph-detect-version-w8rzr                     0/1     ImagePullBackOff    0          16m
rook-ceph-operator-548b87d44b-p8hbk                1/1     Running             0          4h24m
rook-ceph-tools-7c8c77bd96-rvpnf                   0/1     ContainerCreating   0          4h24m

$ oc describe pod rook-ceph-detect-version-w8rzr
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       17m                  default-scheduler  Successfully assigned openshift-storage/rook-ceph-detect-version-w8rzr to ip-10-0-139-85.ec2.internal
  Normal   AddedInterface  17m                  multus             Add eth0 [10.128.2.44/23] from openshift-sdn
  Normal   Pulled          17m                  kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel8-operator@sha256:c93bb8cc668009b606606d1ff345d5b4e4ef38cf30e45641ea7e1b5735d68f82" already present on machine
  Normal   Created         17m                  kubelet            Created container init-copy-binaries
  Normal   Started         17m                  kubelet            Started container init-copy-binaries
  Warning  Failed          16m (x5 over 17m)    kubelet            Error: ImagePullBackOff
  Normal   Pulling         15m (x4 over 17m)    kubelet            Pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"
  Warning  Failed          15m (x4 over 17m)    kubelet            Failed to pull image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6": rpc error: code = Unknown desc = reading manifest sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown: manifest unknown
  Warning  Failed          15m (x4 over 17m)    kubelet            Error: ErrImagePull
  Normal   BackOff         2m10s (x64 over 17m) kubelet            Back-off pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"

$ oc get storagecluster
NAME                 AGE     PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4h39m   Progressing              2022-12-27T10:51:50Z

$ oc describe storagecluster
Status:
  Conditions:
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:52:32Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:51:50Z
    Message:               CephCluster error: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap
    Reason:                ClusterStateError
    Status:                False
    Type:                  Available

$ oc get managedocs -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2022-12-27T10:51:49Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "101985"
    uid: 9d835281-1449-46c4-ab52-c0fa4f051477
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Pending
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""

$ rosa list cluster | grep ov
20s7dec937r9d1vt9iktariuq990ouec  oviner-pr  ready

I tried to pull the image locally and it works as expected:

$ docker pull registry.redhat.io/rhceph/rhceph-5-rhel8
Using default tag: latest
latest: Pulling from rhceph/rhceph-5-rhel8
db0f4cd41250: Pull complete
7e3624512448: Pull complete
e603b871a132: Pull complete
Digest: sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274
Status: Downloaded newer image for registry.redhat.io/rhceph/rhceph-5-rhel8:latest
registry.redhat.io/rhceph/rhceph-5-rhel8:latest

$ docker images
REPOSITORY                                 TAG      IMAGE ID       CREATED        SIZE
registry.redhat.io/rhceph/rhceph-5-rhel8   latest   b2c997ff1898   4 months ago   1.02GB

Actual results:
ocs-provider-qe addon in Failed state

Expected results:
ocs-provider-qe addon in Installed state

Additional info:
OCP MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2156559.tar.gz
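Note on the local pull check: the kubelet is pulling a digest-pinned reference (...rhceph-5-rhel8@sha256:fc25524...), while the local docker pull resolved the floating latest tag to a different digest (sha256:31fbe18...). A successful tag pull therefore does not prove the pinned digest still exists in the registry; those are two independent manifest lookups. A minimal sketch of the distinction (parse_image_ref is a hypothetical helper for illustration, not part of any tooling here):

```python
# Hypothetical helper: classify an image reference as tag-based or
# digest-pinned. A digest-pinned ref requires that exact manifest to still
# exist in the registry; a tag ref resolves to whatever the tag points at now.
def parse_image_ref(ref: str):
    """Return (repository, kind, value), where kind is 'digest' for
    repo@sha256:... references and 'tag' otherwise ('latest' if omitted)."""
    if "@" in ref:
        repo, digest = ref.split("@", 1)
        return repo, "digest", digest
    # A tag can only appear in the last path component (after the final "/"),
    # so "localhost:5000/img" is not mistaken for a tagged reference.
    prefix, _, last = ref.rpartition("/")
    if ":" in last:
        name, _, tag = last.rpartition(":")
        repo = f"{prefix}/{name}" if prefix else name
        return repo, "tag", tag
    return ref, "tag", "latest"

# The reference the kubelet fails on (digest-pinned):
failing = ("registry.redhat.io/rhceph/rhceph-5-rhel8"
           "@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6")
# The reference docker pull used locally (implicit 'latest' tag):
local = "registry.redhat.io/rhceph/rhceph-5-rhel8"

print(parse_image_ref(failing)[1])  # digest
print(parse_image_ref(local)[1:])   # ('tag', 'latest')
```

So "manifest unknown" here means the specific sha256:fc25524... manifest is gone from (or was never in) registry.redhat.io/rhceph/rhceph-5-rhel8, even though the repository itself is reachable and latest pulls fine.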
- The provided must-gather is not very useful, as I can't see any resources in the openshift-storage namespace - correct me if I'm missing anything:

-> ls must-gather.local.4491499808335695455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e0a09035b08ec4e978cc9a381fa396c41e9f8f7b4b3cace70c623e02e5c76797/namespaces/openshift-storage
monitoring.coreos.com  operators.coreos.com

-> omg -nopenshift-storage get pods
No resources found

(vs.) in some random namespace:

-> omg -nopenshift-machine-api get pods
NAME                                           AGE
cluster-autoscaler-operator-78dc46cd7d-d7ftk   4h40m
cluster-baremetal-operator-7bc746fdb7-dtp6s    4h40m
machine-api-controllers-7999795dc-nxbp2        4h45m
machine-api-operator-767b658f98-8rsc7          4h45m

- We can only ask for a live cluster once the issue is reproduced, as the error doesn't seem to be related to ODF MS.

Thanks, Leela.