Bug 2170859

Summary: [ODFMS] osd-deployer.v2.0.10 pods stuck in installing state intermittently during rosa upgrade from 4.10.47 to rosa 4.11.25 on consumer cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: suchita <sgatfane>
Component: odf-managed-service
Assignee: Ohad <omitrani>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.10
CC: ocs-bugs, odf-bz-bot, resoni
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-04 05:54:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description suchita 2023-02-17 12:58:57 UTC
Description of problem:
During a ROSA upgrade of the consumer cluster from 4.10.47 to 4.11.25,
the ocs-osd-deployer.v2.0.10 CSV gets stuck in the Installing phase. The issue is intermittent and is not observed on all upgraded clusters.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy a consumer cluster with ROSA 4.10.47 and the ocs-consumer addon
2. Upgrade ROSA from 4.10.47 to 4.11.25
3.

Actual results:
The ocs-osd-deployer CSV is intermittently stuck in the Installing phase.

Expected results:
All CSVs should reach the Succeeded phase.

Additional info:

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.10                  OCS OSD Deployer              2.0.10            ocs-osd-deployer.v2.0.9                   Installing
odf-csi-addons-operator.v4.10.5           CSI Addons                    4.10.5            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.10.5                      OpenShift Data Foundation     4.10.5            odf-operator.v4.10.4                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded
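
To spot a stuck CSV without reading the table by eye, the listing above can be filtered for any phase other than Succeeded. This is only a minimal sketch: the helper name `non_succeeded_csvs` is hypothetical, and the default namespace `openshift-storage` is an assumption taken from the managedocs metadata quoted later in this report.

```shell
# Hypothetical helper: print "NAME PHASE" for every CSV whose phase is not Succeeded.
# Assumption: the addon lives in openshift-storage; pass another namespace as $1 if not.
non_succeeded_csvs() {
  local ns="${1:-openshift-storage}"
  # custom-columns keeps the output to two whitespace-free fields per row,
  # so awk can compare the phase column directly.
  oc get csv -n "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 != "Succeeded" { print $1, $2 }'
}
```

On a cluster in the state shown above, calling `non_succeeded_csvs` would be expected to list only the ocs-osd-deployer CSV; an empty result means every CSV reports Succeeded.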

The managedocs YAML shows all three components as Ready:
status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready

The status message in the deployer CSV is:

`installing: waiting for deployment ocs-osd-controller-manager to become
      ready: deployment "ocs-osd-controller-manager" not available: Deployment does
      not have minimum availability.`
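
Since the CSV message points at the deployment rather than at the CSV itself, it can help to read the deployment's Available condition directly. The following is a sketch under assumptions: the helper name `deployment_available` is hypothetical, and `openshift-storage` is assumed as the namespace.

```shell
# Hypothetical helper: succeeds (exit 0) only if the deployment reports Available=True.
# Assumption: the deployment is in openshift-storage; override with $2 if needed.
deployment_available() {
  local deploy="${1:-ocs-osd-controller-manager}"
  local ns="${2:-openshift-storage}"
  local status
  # Select the status of the condition whose type is "Available".
  status="$(oc get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
  [ "$status" = "True" ]
}
```

For example, `deployment_available || echo "ocs-osd-controller-manager is not available"` prints the message while the deployment lacks minimum availability.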

Error shown when describing the CSV:

Events:
  Type     Reason              Age                   From                        Message
  ----     ------              ----                  ----                        -------
  Warning  InstallCheckFailed  84s (x80 over 3h48m)  operator-lifecycle-manager  install timeout


Workaround:
Respin the ocs-osd-controller-manager pod.
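
One way to carry out the workaround is a rolling restart of the deployment, sketched below. The report does not say how the respin was done, so `oc rollout restart` is an assumption here (deleting the pod would be another option); the helper name `respin_controller_manager`, the `openshift-storage` namespace, and the 5-minute timeout are likewise illustrative.

```shell
# Hypothetical helper: restart ocs-osd-controller-manager and wait for the rollout.
# Assumptions: namespace openshift-storage (override with $1) and a 5m timeout.
respin_controller_manager() {
  local ns="${1:-openshift-storage}"
  # Trigger a rolling restart so the deployment recreates its pod.
  oc rollout restart deployment/ocs-osd-controller-manager -n "$ns" || return 1
  # Block until the new pod is available or the timeout expires.
  oc rollout status deployment/ocs-osd-controller-manager -n "$ns" --timeout=5m
}
```

After the rollout completes, `oc get csv` can be rerun to check whether the deployer CSV has left the Installing phase.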

Comment 2 suchita 2023-04-10 18:54:04 UTC
Verified in v2.0.12 qualification:

$ oc get pods
NAME                                               READY   STATUS    RESTARTS      AGE
addon-ocs-consumer-qe-catalog-rhxb2                1/1     Running   0             69m
alertmanager-managed-ocs-alertmanager-0            2/2     Running   0             45m
csi-addons-controller-manager-759b488df-5n6fq      2/2     Running   0             48m
csi-cephfsplugin-7m59s                             2/2     Running   4             3h18m
csi-cephfsplugin-fg2jg                             2/2     Running   2             3h18m
csi-cephfsplugin-nvqfx                             2/2     Running   2             3h18m
csi-cephfsplugin-provisioner-5d6b768994-8962l      5/5     Running   0             59m
csi-cephfsplugin-provisioner-5d6b768994-jbn8l      5/5     Running   0             45m
csi-rbdplugin-96h84                                3/3     Running   3             3h18m
csi-rbdplugin-provisioner-65477c4f5-54vdg          6/6     Running   0             55m
csi-rbdplugin-provisioner-65477c4f5-zg9r8          6/6     Running   0             45m
csi-rbdplugin-qdjl7                                3/3     Running   6             3h18m
csi-rbdplugin-tfnfb                                3/3     Running   3             3h18m
ocs-metrics-exporter-5dd96c885b-l8z8t              1/1     Running   0             48m
ocs-operator-6888799d6b-2jj65                      1/1     Running   0             45m
ocs-osd-aws-data-gather-5bd59fb6c8-ph82z           1/1     Running   0             48m
ocs-osd-controller-manager-5d9694754c-swzx6        3/3     Running   1 (54s ago)   45m
odf-console-57b8476cd4-6pcqz                       1/1     Running   0             59m
odf-operator-controller-manager-6f44676f4f-bqpzw   2/2     Running   0             69m
prometheus-managed-ocs-prometheus-0                3/3     Running   0             59m
prometheus-operator-8547cc9f89-c6dm9               1/1     Running   0             48m
redhat-operators-kkbcz                             1/1     Running   0             45m
rook-ceph-operator-548b87d44b-98ph5                1/1     Running   0             55m
rook-ceph-tools-7c8c77bd96-gtwp9                   1/1     Running   0             55m

$ oc get managedocs
NAME         AGE
managedocs   3h20m

$ oc get managedocs -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2023-04-10T15:28:39Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "422419"
    uid: 324ec872-b91f-4770-99c9-aa16987e2e30
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   3h20m   Ready   true       2023-04-10T15:28:41Z   

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.11                     NooBaa Operator               4.10.11           mcg-operator.v4.10.10                     Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.493-a866e7c   Route Monitor Operator        0.1.493-a866e7c   route-monitor-operator.v0.1.489-7d9fe90   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.25   True        False         58m     Cluster version is 4.11.25

Comment 3 Rewant 2023-05-04 05:54:14 UTC
Fixed in version 2.0.12