Bug 2156559

Summary: ODF Managed Service, ocs-provider-qe addon stuck on Failed State
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: odf-managed-service
Assignee: Leela Venkaiah Gangavarapu <lgangava>
Status: CLOSED WORKSFORME
QA Contact: Neha Berry <nberry>
Severity: unspecified
Priority: unspecified
Version: 4.10
CC: aeyal, assingh, dbindra, lgangava, ocs-bugs, odf-bz-bot
Target Milestone: ---
Flags: lgangava: needinfo? (oviner)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-20 09:50:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Embargoed:

Attachments:
ocs-osd-controller-manager-logs.txt

Description Oded 2022-12-27 15:40:01 UTC
Created attachment 1934597 [details]
ocs-osd-controller-manager-logs.txt

Description of problem:
ocs-provider-qe addon stuck in Failed state

Version-Release number of selected component (if applicable):
ODF Version: 4.10.9-7
OCP Version: 4.10.45

How reproducible:


Steps to Reproduce:
1. Install Managed Service cluster
2. Install ocs-provider-qe addon [Failed]
$  rosa list addon -c oviner-pr
ID                          NAME                                                                  STATE
cluster-logging-operator    Cluster Logging Operator                                              not installed
dbaas-operator              Red Hat OpenShift Database Access                                     not installed
ocm-addon-test-operator     OCM Add-On Test Operator                                              not installed
ocs-consumer                Red Hat OpenShift Data Foundation Managed Service Consumer            not installed
ocs-consumer-dev            Red Hat OpenShift Data Foundation Managed Service Consumer (dev)      not installed
ocs-consumer-qe             Red Hat OpenShift Data Foundation Managed Service Consumer (QE)       not installed
ocs-converged               Red Hat OpenShift Data Foundation Managed Service (converged)         not installed
ocs-converged-dev           Red Hat OpenShift Data Foundation Managed Service (converged, dev)    not installed
ocs-converged-qe            Red Hat OpenShift Data Foundation Managed Service (converged, QE)     not installed
ocs-provider                Red Hat OpenShift Data Foundation Managed Service Provider            not installed
ocs-provider-dev            Red Hat OpenShift Data Foundation Managed Service Provider (dev)      not installed
ocs-provider-qe             Red Hat OpenShift Data Foundation Managed Service Provider (QE)       failed

$ oc get pods
NAME                                               READY   STATUS              RESTARTS   AGE
addon-ocs-provider-qe-catalog-xxs76                1/1     Running             0          4h24m
alertmanager-managed-ocs-alertmanager-0            2/2     Running             0          4h24m
csi-addons-controller-manager-759b488df-k4g6g      2/2     Running             0          4h20m
ocs-metrics-exporter-5dd96c885b-8mlkj              1/1     Running             0          4h20m
ocs-operator-6888799d6b-8qzmf                      1/1     Running             0          4h20m
ocs-osd-aws-data-gather-74c5bbdf75-295bj           1/1     Running             0          4h24m
ocs-osd-controller-manager-7cc9f965bb-22d5c        2/3     Running             0          4h24m
ocs-provider-server-86dc8f67c9-jpt8w               1/1     Running             0          4h20m
odf-console-57b8476cd4-k9jkq                       1/1     Running             0          4h24m
odf-operator-controller-manager-6f44676f4f-ngwr2   2/2     Running             0          4h20m
prometheus-managed-ocs-prometheus-0                3/3     Running             0          4h20m
prometheus-operator-8547cc9f89-btzt7               1/1     Running             0          4h20m
rook-ceph-detect-version-w8rzr                     0/1     ImagePullBackOff    0          16m
rook-ceph-operator-548b87d44b-p8hbk                1/1     Running             0          4h24m
rook-ceph-tools-7c8c77bd96-rvpnf                   0/1     ContainerCreating   0          4h24m
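
The ocs-osd-controller-manager pod is the only one not fully ready (2/3); its logs (the attached ocs-osd-controller-manager-logs.txt) can be collected with something along these lines:
$ oc -n openshift-storage logs deployment/ocs-osd-controller-manager --all-containers > ocs-osd-controller-manager-logs.txt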

$ oc describe pod rook-ceph-detect-version-w8rzr
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       17m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-detect-version-w8rzr to ip-10-0-139-85.ec2.internal
  Normal   AddedInterface  17m                   multus             Add eth0 [10.128.2.44/23] from openshift-sdn
  Normal   Pulled          17m                   kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel8-operator@sha256:c93bb8cc668009b606606d1ff345d5b4e4ef38cf30e45641ea7e1b5735d68f82" already present on machine
  Normal   Created         17m                   kubelet            Created container init-copy-binaries
  Normal   Started         17m                   kubelet            Started container init-copy-binaries
  Warning  Failed          16m (x5 over 17m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         15m (x4 over 17m)     kubelet            Pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"
  Warning  Failed          15m (x4 over 17m)     kubelet            Failed to pull image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6": rpc error: code = Unknown desc = reading manifest sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown: manifest unknown
  Warning  Failed          15m (x4 over 17m)     kubelet            Error: ErrImagePull
  Normal   BackOff         2m10s (x64 over 17m)  kubelet            Back-off pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"
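
The pull fails with "manifest unknown" for a pinned digest, so a quick check (a sketch; requires being logged in to registry.redhat.io, e.g. via skopeo login) is whether that exact digest is still published:
$ skopeo inspect docker://registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
If the registry also returns "manifest unknown" here, the digest the operator is pinned to no longer exists in that repository.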

$ oc get storagecluster
NAME                 AGE     PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4h39m   Progressing              2022-12-27T10:51:50Z   

$ oc describe storagecluster
Status:
  Conditions:
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:52:32Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:51:50Z
    Message:               CephCluster error: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap
    Reason:                ClusterStateError
    Status:                False
    Type:                  Available
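
The condition points at the rook-ceph-detect-version job, which pulls the Ceph image pinned in the CephCluster spec. Assuming the default resource layout, the pinned image can be read with:
$ oc -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.cephVersion.image}{"\n"}'
This should print the same registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc255... reference the kubelet is failing to pull.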


$ oc get managedocs -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2022-12-27T10:51:49Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "101985"
    uid: 9d835281-1449-46c4-ab52-c0fa4f051477
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Pending
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""
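
Once the image pull is sorted out, the storageCluster component reported by managedocs can be polled to confirm it leaves Pending, for example:
$ oc -n openshift-storage get managedocs managedocs -o jsonpath='{.status.components.storageCluster.state}{"\n"}'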

$ rosa list cluster | grep ov
20s7dec937r9d1vt9iktariuq990ouec  oviner-pr      ready

I tried to pull the image locally and it works as expected:
$ docker pull registry.redhat.io/rhceph/rhceph-5-rhel8
Using default tag: latest
latest: Pulling from rhceph/rhceph-5-rhel8
db0f4cd41250: Pull complete 
7e3624512448: Pull complete 
e603b871a132: Pull complete 
Digest: sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274
Status: Downloaded newer image for registry.redhat.io/rhceph/rhceph-5-rhel8:latest
registry.redhat.io/rhceph/rhceph-5-rhel8:latest

$ docker images 
REPOSITORY                                 TAG       IMAGE ID       CREATED        SIZE
registry.redhat.io/rhceph/rhceph-5-rhel8   latest    b2c997ff1898   4 months ago   1.02GB
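
Note that the local pull above used the latest tag, which resolved to a different digest (sha256:31fbe18...) than the one the cluster is failing on (sha256:fc25524...). A closer reproduction of the cluster-side failure is pulling by that exact digest, e.g.:
$ docker pull registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6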

Actual results:
ocs-provider-qe addon in Failed state

Expected results:
ocs-provider-qe addon in Installed state

Additional info:
OCP MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2156559.tar.gz

Comment 1 Leela Venkaiah Gangavarapu 2022-12-28 02:43:07 UTC
- the provided must-gather is not of much use as I can't see any resources in the openshift-storage ns
- correct me if I'm missing anything

-> ls must-gather.local.4491499808335695455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e0a09035b08ec4e978cc9a381fa396c41e9f8f7b4b3cace70c623e02e5c76797/namespaces/openshift-storage
monitoring.coreos.com  operators.coreos.com

-> omg -nopenshift-storage get pods
No resources found

(vs) in some random ns

-> omg -nopenshift-machine-api get pods
NAME                                          AGE
cluster-autoscaler-operator-78dc46cd7d-d7ftk  4h40m
cluster-baremetal-operator-7bc746fdb7-dtp6s   4h40m
machine-api-controllers-7999795dc-nxbp2       4h45m
machine-api-operator-767b658f98-8rsc7         4h45m

- can only ask for a live cluster when the issue gets reproduced, as the error doesn't seem to relate to ODF MS
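- if it does reproduce, a must-gather collected with the ODF-specific image should also capture the openshift-storage ns (the image/tag below is a guess for 4.10):
-> oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10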

Thanks,
Leela.