Bug 2209846

Summary: ODF Operator stuck in 'Unknown Failure'
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Anjali <amenon>
Component: odf-operator Assignee: Nitin Goyal <nigoyal>
Status: CLOSED NOTABUG QA Contact: Elad <ebenahar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.10 CC: hnallurv, mparida, muagarwa, nigoyal, ocs-bugs, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-18 06:25:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Anjali 2023-05-25 02:59:16 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The customer is trying to upgrade the ODF operators from v4.10.7 to v4.10.8. Initially the upgrade got stuck with the error "Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline".

- All pods in the openshift-storage namespace are up and running.

- The Ceph cluster is healthy:

[amenon@supportshell-1 must_gather_commands]$ cat ceph_status
  cluster:
    id:     c374b71c-19a4-45e9-bc6d-fb3f90d1b0dd
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 6M)
    mgr: a(active, since 7M)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 6M), 3 in (since 7M)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 9.04k objects, 14 GiB
    usage:   45 GiB used, 3.0 TiB / 3 TiB avail
    pgs:     177 active+clean
 
  io:
    client:   852 B/s rd, 32 KiB/s wr, 1 op/s rd, 3 op/s wr

- We applied the solutions https://access.redhat.com/solutions/6459071 and https://access.redhat.com/solutions/6972585, but they didn't help.

- Then, with the help of SBR-Shift, we followed Steps 1 and 2 from the "For issues related to operator upgrade" section in https://access.redhat.com/solutions/6459071 and then deleted the relevant InstallPlan:

oc get ip/install-hj4ns -n openshift-storage
NAME            CSV                                                                                                   APPROVAL    APPROVED
install-hj4ns   [mcg-operator.v4.10.8, ocs-operator.v4.10.8, odf-csi-addons-operator.v4.10.8, odf-operator.v4.10.8]   Automatic   true

- After this, all OCP operators can be upgraded, but the ODF operator is stuck in 'Unknown Failure' (screenshot attached) and its version is still 4.10.7.

- There are no errors or messages in the ODF must-gather regarding the upgrade to ODF 4.10.8. The upgrade fails to start, and the UI shows "Upgrade status" as "Unknown failure".

[amenon@supportshell-1 oc_output]$ cat csv
NAME                                              DISPLAY                                    VERSION                 REPLACES                                          PHASE
container-security-operator.v3.8.7                Red Hat Quay Container Security Operator   3.8.7                   container-security-operator.v3.8.6                Succeeded
mcg-operator.v4.10.7                              NooBaa Operator                            4.10.7                  mcg-operator.v4.10.6                              Succeeded
ocs-operator.v4.10.7                              OpenShift Container Storage                4.10.7                  ocs-operator.v4.10.6                              Succeeded
odf-csi-addons-operator.v4.10.7                   CSI Addons                                 4.10.7                  odf-csi-addons-operator.v4.10.6                   Succeeded
odf-operator.v4.10.7                              OpenShift Data Foundation                  4.10.7                  odf-operator.v4.10.6                              Succeeded
red-hat-camel-k-operator.v1.10.0-0.1682325781.p   Red Hat Integration - Camel K              1.10.0+0.1682325781.p   red-hat-camel-k-operator.v1.10.0-0.1679561624.p   Succeeded
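
To make the "still on 4.10.7" observation easy to re-check from a saved listing, here is a minimal sketch that filters `oc get csv` tabular output for ODF-related CSVs that are not at a given target version. The function name `stale_odf_csvs` is hypothetical, and the sketch assumes CSV names follow the `<package>.v<version>` pattern seen above and that the ODF packages start with `odf-`, `ocs-`, or `mcg-`.

```shell
# Hypothetical helper, not part of oc or ODF.
# Reads `oc get csv` tabular output on stdin; $1 is the target version.
# Prints "<package> <version>" for each ODF-related CSV not at the target.
stale_odf_csvs() {
  awk -v target="$1" '
    NR == 1 { next }                 # skip the header row
    $1 ~ /^(odf|ocs|mcg)-/ {         # assumption: ODF packages use these prefixes
      i = index($1, ".v")            # CSV names look like <package>.v<version>
      if (i == 0) next
      pkg = substr($1, 1, i - 1)
      ver = substr($1, i + 2)
      if (ver != target) print pkg, ver
    }'
}
```

Against the file shown above, this would be run as `stale_odf_csvs 4.10.8 < csv`.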

[amenon@supportshell-1 oc_output]$ cat installplan
NAME            CSV                    APPROVAL    APPROVED
install-2xmbm   odf-operator.v4.10.7   Automatic   true
install-q8jr9   odf-operator.v4.10.6   Automatic   true
install-w5j67   mcg-operator.v4.10.5   Automatic   true
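
The listing above contains no InstallPlan for 4.10.8. As a rough sketch of how that can be confirmed from saved output, the hypothetical helper `installplans_for_version` below prints the names of InstallPlans whose CSV column references a given version. It assumes the `oc get installplan` column layout shown in this report (NAME first, CSV names optionally wrapped in brackets and separated by commas).

```shell
# Hypothetical helper, not part of oc or OLM.
# Reads `oc get installplan` tabular output on stdin; $1 is the version to look for.
# Prints the name of every InstallPlan that lists a CSV ending in ".v<version>".
installplans_for_version() {
  awk -v target="$1" '
    NR == 1 { next }                 # skip the header row
    {
      gsub(/\[/, " "); gsub(/\]/, " "); gsub(/,/, " ")   # drop list punctuation
      suffix = ".v" target
      for (i = 2; i <= NF; i++) {
        n = length($i) - length(suffix)
        if (n > 0 && substr($i, n + 1) == suffix) { print $1; next }
      }
    }'
}
```

Running `installplans_for_version 4.10.8 < installplan` on the file above prints nothing, which matches the InstallPlan having been deleted and not recreated.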

$ oc get subs -n openshift-storage
NAME                                                                         PACKAGE                   SOURCE             CHANNEL
mcg-operator-stable-4.10-redhat-operators-openshift-marketplace              mcg-operator              redhat-operators   stable-4.10
ocs-operator-stable-4.10-redhat-operators-openshift-marketplace              ocs-operator              redhat-operators   stable-4.10
odf-csi-addons-operator-stable-4.10-redhat-operators-openshift-marketplace   odf-csi-addons-operator   redhat-operators   stable-4.10
odf-operator                   

- The OpenShift Data Foundation operator was unlocked but did not update to the latest version. Its current version is 4.10.7, and the InstallPlan that the customer deleted was the one attempting the upgrade to 4.10.8.
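
One way to see whether a Subscription still has an upgrade pending is to compare its `status.installedCSV` with its `status.currentCSV`. The sketch below is only illustrative: the helper name `pending_subscriptions` is hypothetical, and it assumes its input was produced with the `custom-columns` layout given in the comment (three whitespace-separated columns).

```shell
# Hypothetical helper, not part of oc or OLM.
# Expected input (assumption), e.g. from:
#   oc get sub -n openshift-storage \
#     -o custom-columns=NAME:.metadata.name,INSTALLED:.status.installedCSV,CURRENT:.status.currentCSV
# Prints one line per Subscription whose installed CSV differs from the CSV
# it is progressing to.
pending_subscriptions() {
  awk 'NR > 1 && $2 != $3 { print $1 ": installed " $2 ", progressing to " $3 }'
}
```

A Subscription that appears in this output while no matching InstallPlan exists would be consistent with the stuck state described here.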

Version of all relevant components (if applicable):
[amenon@supportshell-1 oc_output]$ cat csv
NAME                                              DISPLAY                                    VERSION                 REPLACES                                          PHASE
container-security-operator.v3.8.7                Red Hat Quay Container Security Operator   3.8.7                   container-security-operator.v3.8.6                Succeeded
mcg-operator.v4.10.7                              NooBaa Operator                            4.10.7                  mcg-operator.v4.10.6                              Succeeded
ocs-operator.v4.10.7                              OpenShift Container Storage                4.10.7                  ocs-operator.v4.10.6                              Succeeded
odf-csi-addons-operator.v4.10.7                   CSI Addons                                 4.10.7                  odf-csi-addons-operator.v4.10.6                   Succeeded
odf-operator.v4.10.7                              OpenShift Data Foundation                  4.10.7                  odf-operator.v4.10.6                              Succeeded
red-hat-camel-k-operator.v1.10.0-0.1682325781.p   Red Hat Integration - Camel K              1.10.0+0.1682325781.p   red-hat-camel-k-operator.v1.10.0-0.1679561624.p   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.28   True        False         206d    Cluster version is 4.10.28


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No, the cluster is running fine, but the operators cannot be upgraded.


Is there any workaround available to the best of your knowledge?
No


Actual results:

Operators are not getting upgraded to 4.10.8

Expected results:

Operators are successfully upgraded to 4.10.8

Additional info:

- All related logs/m-g available in supportshell under ~/03469870

Comment 4 Malay Kumar parida 2023-05-25 10:20:39 UTC
I am still looking into it. As far as I understand, the customer was facing issues upgrading several operators, including ODF. After some manual fixing, the other operators could be upgraded, but ODF is still not upgrading. So prima facie this looks like an OCP/OLM issue. Another thing that caught my attention, although it might not be related to the problem, is the high number of restarts of the noobaa-operator pod:
noobaa-operator-56457bf44b-jj66q                                  1/1     Running   220 (32h ago)   208d   172.31.12.117   worker4.ocp.rosat.ro   <none>           <none>.
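
For reference, a small sketch for spotting pods like this one in saved `oc get pods` output. The helper name `pods_with_restarts_over` is hypothetical, and it assumes the usual column order NAME, READY, STATUS, RESTARTS, so that the restart count is the fourth field.

```shell
# Hypothetical helper, not part of oc.
# Reads `oc get pods` (or `oc get pods -o wide`) output on stdin; $1 is a threshold.
# Prints "<pod> <restarts>" for pods whose restart count exceeds the threshold.
# The header row is skipped automatically because its fourth field is not numeric.
pods_with_restarts_over() {
  awk -v limit="$1" '$4 ~ /^[0-9]+$/ && $4 + 0 > limit + 0 { print $1, $4 }'
}
```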

Comment 5 Nitin Goyal 2023-05-29 04:55:04 UTC
Hello Anjali, could I please get the latest ODF must-gather?