Bug 2156559 - ODF Managed Service, ocs-provider-qe addon stuck in Failed state [NEEDINFO]
Summary: ODF Managed Service, ocs-provider-qe addon stuck in Failed state
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-27 15:40 UTC by Oded
Modified: 2023-08-09 17:00 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-20 09:50:14 UTC
Embargoed:
lgangava: needinfo? (oviner)


Attachments
ocs-osd-controller-manager-logs.txt (87.05 KB, text/plain), 2022-12-27 15:40 UTC, Oded

Description Oded 2022-12-27 15:40:01 UTC
Created attachment 1934597 [details]
ocs-osd-controller-manager-logs.txt

Description of problem:
The ocs-provider-qe addon is stuck in the Failed state.

Version-Release number of selected component (if applicable):
ODF Version: 4.10.9-7
OCP Version: 4.10.45

How reproducible:


Steps to Reproduce:
1. Install a Managed Service cluster.
2. Install the ocs-provider-qe addon [Failed]; a sketch of the install command follows the listing below.
$  rosa list addon -c oviner-pr
ID                          NAME                                                                  STATE
cluster-logging-operator    Cluster Logging Operator                                              not installed
dbaas-operator              Red Hat OpenShift Database Access                                     not installed
ocm-addon-test-operator     OCM Add-On Test Operator                                              not installed
ocs-consumer                Red Hat OpenShift Data Foundation Managed Service Consumer            not installed
ocs-consumer-dev            Red Hat OpenShift Data Foundation Managed Service Consumer (dev)      not installed
ocs-consumer-qe             Red Hat OpenShift Data Foundation Managed Service Consumer (QE)       not installed
ocs-converged               Red Hat OpenShift Data Foundation Managed Service (converged)         not installed
ocs-converged-dev           Red Hat OpenShift Data Foundation Managed Service (converged, dev)    not installed
ocs-converged-qe            Red Hat OpenShift Data Foundation Managed Service (converged, QE)     not installed
ocs-provider                Red Hat OpenShift Data Foundation Managed Service Provider            not installed
ocs-provider-dev            Red Hat OpenShift Data Foundation Managed Service Provider (dev)      not installed
ocs-provider-qe             Red Hat OpenShift Data Foundation Managed Service Provider (QE)       failed
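
For reference, a minimal sketch of the install step in 2. above, assuming the rosa CLI and the cluster name used in this report (any required addon parameters, e.g. size, are omitted here):
$ rosa install addon ocs-provider-qe -c oviner-pr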

$ oc get pods
NAME                                               READY   STATUS              RESTARTS   AGE
addon-ocs-provider-qe-catalog-xxs76                1/1     Running             0          4h24m
alertmanager-managed-ocs-alertmanager-0            2/2     Running             0          4h24m
csi-addons-controller-manager-759b488df-k4g6g      2/2     Running             0          4h20m
ocs-metrics-exporter-5dd96c885b-8mlkj              1/1     Running             0          4h20m
ocs-operator-6888799d6b-8qzmf                      1/1     Running             0          4h20m
ocs-osd-aws-data-gather-74c5bbdf75-295bj           1/1     Running             0          4h24m
ocs-osd-controller-manager-7cc9f965bb-22d5c        2/3     Running             0          4h24m
ocs-provider-server-86dc8f67c9-jpt8w               1/1     Running             0          4h20m
odf-console-57b8476cd4-k9jkq                       1/1     Running             0          4h24m
odf-operator-controller-manager-6f44676f4f-ngwr2   2/2     Running             0          4h20m
prometheus-managed-ocs-prometheus-0                3/3     Running             0          4h20m
prometheus-operator-8547cc9f89-btzt7               1/1     Running             0          4h20m
rook-ceph-detect-version-w8rzr                     0/1     ImagePullBackOff    0          16m
rook-ceph-operator-548b87d44b-p8hbk                1/1     Running             0          4h24m
rook-ceph-tools-7c8c77bd96-rvpnf                   0/1     ContainerCreating   0          4h24m

$ oc describe pod rook-ceph-detect-version-w8rzr
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       17m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-detect-version-w8rzr to ip-10-0-139-85.ec2.internal
  Normal   AddedInterface  17m                   multus             Add eth0 [10.128.2.44/23] from openshift-sdn
  Normal   Pulled          17m                   kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel8-operator@sha256:c93bb8cc668009b606606d1ff345d5b4e4ef38cf30e45641ea7e1b5735d68f82" already present on machine
  Normal   Created         17m                   kubelet            Created container init-copy-binaries
  Normal   Started         17m                   kubelet            Started container init-copy-binaries
  Warning  Failed          16m (x5 over 17m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         15m (x4 over 17m)     kubelet            Pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"
  Warning  Failed          15m (x4 over 17m)     kubelet            Failed to pull image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6": rpc error: code = Unknown desc = reading manifest sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown: manifest unknown
  Warning  Failed          15m (x4 over 17m)     kubelet            Error: ErrImagePull
  Normal   BackOff         2m10s (x64 over 17m)  kubelet            Back-off pulling image "registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6"
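
The "manifest unknown" error means the registry no longer serves a manifest for that exact digest, even though the repository itself exists. As a sketch, this can be confirmed registry-side with skopeo (assuming skopeo is installed and you are logged in to registry.redhat.io):
$ skopeo inspect docker://registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6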

$ oc get storagecluster
NAME                 AGE     PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4h39m   Progressing              2022-12-27T10:51:50Z   

$ oc describe storagecluster
Status:
  Conditions:
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:52:32Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-12-27T15:27:36Z
    Last Transition Time:  2022-12-27T10:51:50Z
    Message:               CephCluster error: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap
    Reason:                ClusterStateError
    Status:                False
    Type:                  Available

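The failing digest is the Ceph image pinned in the CephCluster spec. A quick sketch for checking which image the cluster is pinned to (field path per the Rook CephCluster API):
$ oc -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.cephVersion.image}'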

$ oc get managedocs -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2022-12-27T10:51:49Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "101985"
    uid: 9d835281-1449-46c4-ab52-c0fa4f051477
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Pending
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""

$ rosa list cluster | grep ov
20s7dec937r9d1vt9iktariuq990ouec  oviner-pr      ready

I tried to pull the image locally, and it works as expected:
$ docker pull registry.redhat.io/rhceph/rhceph-5-rhel8
Using default tag: latest
latest: Pulling from rhceph/rhceph-5-rhel8
db0f4cd41250: Pull complete 
7e3624512448: Pull complete 
e603b871a132: Pull complete 
Digest: sha256:31fbe18b6f81c53d21053a4a0897bc3875e8ee8ec424393e4d5c3c3afd388274
Status: Downloaded newer image for registry.redhat.io/rhceph/rhceph-5-rhel8:latest
registry.redhat.io/rhceph/rhceph-5-rhel8:latest

$ docker images 
REPOSITORY                                 TAG       IMAGE ID       CREATED        SIZE
registry.redhat.io/rhceph/rhceph-5-rhel8   latest    b2c997ff1898   4 months ago   1.02GB
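
Note, however, that the local pull used the latest tag, which resolved to digest sha256:31fbe18b..., whereas the cluster is pinned to sha256:fc25524c..., so this test does not exercise the failing reference. A closer reproduction (sketch) would pull by the exact digest:
$ docker pull registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6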

Actual results:
The ocs-provider-qe addon is in the Failed state.

Expected results:
The ocs-provider-qe addon is in the Installed state.

Additional info:
OCP must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2156559.tar.gz

Comment 1 Leela Venkaiah Gangavarapu 2022-12-28 02:43:07 UTC
- the provided must-gather is not very useful, as I can't see any resources in the openshift-storage namespace
- correct me if I'm missing anything

-> ls must-gather.local.4491499808335695455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e0a09035b08ec4e978cc9a381fa396c41e9f8f7b4b3cace70c623e02e5c76797/namespaces/openshift-storage
monitoring.coreos.com  operators.coreos.com

-> omg -nopenshift-storage get pods
No resources found

(vs.) in some random namespace:

-> omg -nopenshift-machine-api get pods
NAME                                          AGE
cluster-autoscaler-operator-78dc46cd7d-d7ftk  4h40m
cluster-baremetal-operator-7bc746fdb7-dtp6s   4h40m
machine-api-controllers-7999795dc-nxbp2       4h45m
machine-api-operator-767b658f98-8rsc7         4h45m
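
- side note: the directory name in the listing above suggests the default OCP must-gather image was used, which does not collect openshift-storage resources; capturing them typically requires the ODF-specific must-gather image, e.g. (a sketch; the exact image name for this release is an assumption to verify):

-> oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10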

- I can only ask for a live cluster once the issue is reproduced, as the error doesn't seem to be related to ODF MS

Thanks,
Leela.

