Bug 1818147

Summary: CSV: ocs-operator.v4.3.0-379.ci is stuck in installing phase after upgrade from 4.2.2 live content
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Petr Balogh <pbalogh>
Component: Multi-Cloud Object Gateway    Assignee: Danny <dzaken>
Status: CLOSED ERRATA QA Contact: Petr Balogh <pbalogh>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.3    CC: dzaken, etamir, jarrpa, madam, nbecker, ocs-bugs, owasserm, shmohan, sostapov, tnielsen
Target Milestone: ---    Keywords: Automation, Regression, Upgrades
Target Release: OCS 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.3.0-rc5 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-14 09:48:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  Logs from operators and few other files about cluster (flags: none)

Description Petr Balogh 2020-03-27 20:19:56 UTC
Description of problem (please be as detailed as possible and provide log snippets):
I see that in our job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/console

the upgrade failed with the CSV stuck in the Installing phase.

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

When I connected to the cluster during the execution I saw:
ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m

So after 77 minutes the ocs-operator pod was still in the same not-ready state.

All other pods look OK to me:
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4wjs4                                            3/3     Running     0          76m
csi-cephfsplugin-m2mbm                                            3/3     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-rczl7                     5/5     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-tgnnh                     5/5     Running     0          77m
csi-cephfsplugin-snnrp                                            3/3     Running     0          77m
csi-rbdplugin-6c8gw                                               3/3     Running     0          76m
csi-rbdplugin-p4l9r                                               3/3     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-48hc6                        5/5     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-lzsq9                        5/5     Running     0          77m
csi-rbdplugin-w5rk2                                               3/3     Running     0          77m
lib-bucket-provisioner-55f74d96f6-xmlrr                           1/1     Running     0          3h24m
noobaa-core-0                                                     1/1     Running     0          76m
noobaa-db-0                                                       1/1     Running     0          76m
noobaa-endpoint-59c4769f6b-4qrhx                                  1/1     Running     0          75m
noobaa-operator-b77ccff86-5lrz9                                   1/1     Running     0          77m
ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m
pod-test-cephfs-990c7858e3eb4341b386707e70f7bc61                  1/1     Running     0          112m
rook-ceph-crashcollector-ip-10-0-136-86-649f5b65b6-x2x2h          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-159-203-958c4dc5b-vvqjj          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-170-123-77c5689f4-7hj4t          1/1     Running     0          72m
rook-ceph-drain-canary-299f24d16a705b54bd0251ece01ad1f5-f4r7hgm   1/1     Running     0          71m
rook-ceph-drain-canary-e285eca2f732a465cc14d563dfbea99c-595kb52   1/1     Running     0          68m
rook-ceph-drain-canary-ee8c22ace5779c4ae7b153f4babe321b-6bvjml4   1/1     Running     0          72m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6988b56fjk2h6   1/1     Running     0          67m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bcd8f6fch9j7   1/1     Running     0          67m
rook-ceph-mgr-a-8467688fd8-k5f9x                                  1/1     Running     0          72m
rook-ceph-mon-a-5b66ffcf95-g88bf                                  1/1     Running     0          76m
rook-ceph-mon-b-747c7fcddb-7mp5r                                  1/1     Running     0          74m
rook-ceph-mon-c-5d948d5449-d8nxq                                  1/1     Running     0          73m
rook-ceph-operator-599dbd974f-jw68l                               1/1     Running     0          77m
rook-ceph-osd-0-77f79bdf5c-4kwrq                                  1/1     Running     0          72m
rook-ceph-osd-1-5674b9d647-j9k6h                                  1/1     Running     0          71m
rook-ceph-osd-2-5757b89f6b-t86wv                                  1/1     Running     0          68m
rook-ceph-osd-prepare-ocs-deviceset-0-0-zrwdz-7fng7               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-1-0-tpssk-cp6fs               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-2-0-psbx6-rks7w               0/1     Completed   0          3h21m
rook-ceph-tools-fc566f885-t92bf                                   1/1     Running     0          3h20m
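For reference, the pod and CSV state above can be checked with the usual commands (the openshift-storage namespace is assumed):

$ oc get pods -n openshift-storage
$ oc get csv -n openshift-storage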


Version of all relevant components (if applicable):
OCS 4.2.2 live content upgraded to 4.3.0-379.ci build.
OCP:  4.3.0-0.nightly-2020-03-04-222846


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
It blocks the upgrade.


Is there any workaround available to the best of your knowledge?
NO

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Not sure. I executed 4 more jobs and will update if it reproduces again.
The same execution was triggered here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/console

So we will see there.

Can this issue be reproduced from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, this worked before.

Steps to Reproduce:
1. Install 4.2.2 from live content
2. Run some FIO
3. Add a catalog source for the internal build.
4. Change the source in the subscription to the added CatalogSource and change the channel from stable-4.2 to stable-4.3 (see the command sketch after this list).
5. Observe the issue described in this BZ.
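A minimal sketch of steps 3 and 4, assuming a CatalogSource named ocs-catalogsource and a Subscription named ocs-operator (both names and the catalog image are placeholders here, not taken from the job):

$ cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: <internal-build-catalog-image>
  displayName: OCS internal build
EOF

$ oc patch subscription.operators.coreos.com ocs-operator -n openshift-storage \
    --type merge \
    -p '{"spec":{"source":"ocs-catalogsource","sourceNamespace":"openshift-marketplace","channel":"stable-4.3"}}'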


Actual results:
Upgrade failed

Expected results:
Upgrade passes.

Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/

Comment 2 Petr Balogh 2020-03-27 20:38:41 UTC
Created attachment 1674175 [details]
Logs from operators and few other files about cluster

Some YAML files I collected from my queries about the CSV, subscription, logs, and so on.

Comment 3 Travis Nielsen 2020-03-27 20:47:23 UTC
I don't see what would be stuck in the Installing phase. The upgrade seems to be complete from the OCS and Rook perspective.
@Jose, do you see which status is causing this issue?

In the StorageCluster status, the ReconcileComplete condition indicates the reconcile completed successfully:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/storagecluster.yaml

    - lastHeartbeatTime: "2020-03-27T17:13:09Z"
      lastTransitionTime: "2020-03-27T14:16:20Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete


The Rook operator was upgraded and completed the upgrade of all the Ceph components.

The status on the CephCluster shows Completed
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/ocs-storagecluster-cephcluster.yaml
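Both statuses can be pulled from a live cluster with something like (resource names assumed to match the defaults referenced above):

$ oc get storagecluster -n openshift-storage -o yaml
$ oc get cephcluster -n openshift-storage -o yaml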

Comment 4 Jose A. Rivera 2020-03-27 20:54:30 UTC
I would look at the olm-operator logs. We should be able to get those from a standard OCP must-gather.
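A sketch of how to collect them (standard OCP tooling; the OLM deployment name below is the usual default):

$ oc adm must-gather
$ oc logs -n openshift-operator-lifecycle-manager deployment/olm-operator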

Comment 5 shylesh 2020-03-27 21:27:58 UTC
Initial analysis from Jose Rivera, captured from Google Chat
=============================


Looking at the storagecluster.yaml:
    - lastHeartbeatTime: "2020-03-27T17:13:09Z"
      lastTransitionTime: "2020-03-27T16:23:20Z"
      message: Waiting on Nooba instance to finish initialization
      reason: NoobaaInitializing
      status: "True"
      type: Progressing


From noobaa.yaml:
  conditions:
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Available
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T14:19:42Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  observedGeneration: 3
  phase: Configuring

From the noobaa-operator logs:

time="2020-03-27T17:13:45Z" level=info msg="âœˆï¸  RPC: system.read_system() Request: <nil>"
time="2020-03-27T17:13:45Z" level=error msg="âš ï¸  RPC: system.read_system() Response Error: Code=INTERNAL Message=Cannot read property 'email' of undefined"
time="2020-03-27T17:13:45Z" level=error msg="failed to read system info: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=warning msg="â³ Temporary Error: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa

Comment 6 shylesh 2020-03-27 23:51:50 UTC
In this run it somehow went fine: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T200206/logs/failed_testcase_ocs_logs_1585342169/test_upgrade_ocs_logs/

Linking the must-gather so that we can work out the difference from the earlier run.

Comment 7 shylesh 2020-03-27 23:56:03 UTC
Output from the above run https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/, courtesy of @pbalogh.



$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          40m
csi-cephfsplugin-8z54x                                            3/3     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          40m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          40m
csi-rbdplugin-62bbp                                               3/3     Running     0          40m
csi-rbdplugin-9slh5                                               3/3     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          41m
csi-rbdplugin-tqkbz                                               3/3     Running     0          40m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          86m
noobaa-core-0                                                     1/1     Running     0          40m
noobaa-db-0                                                       1/1     Running     0          40m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          39m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          41m
ocs-operator-6dd9fd9d8d-ttzp7                                     0/1     Running     0          41m
pod-test-cephfs-418a20805bb742b9af40192feb660a30                  1/1     Running     0          78m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          35m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          34m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          36m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          36m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          34m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          33m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          37m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          40m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          38m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          37m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          41m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          36m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          34m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          36m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          82m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          81m
pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Installing
pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Succeeded

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          128m
csi-cephfsplugin-8z54x                                            3/3     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          129m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          129m
csi-rbdplugin-62bbp                                               3/3     Running     0          129m
csi-rbdplugin-9slh5                                               3/3     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          129m
csi-rbdplugin-tqkbz                                               3/3     Running     0          129m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          175m
noobaa-core-0                                                     1/1     Running     0          129m
noobaa-db-0                                                       1/1     Running     0          129m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          128m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          130m
ocs-operator-6dd9fd9d8d-ttzp7                                     1/1     Running     0          130m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          124m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          123m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          125m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          125m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          123m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          122m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          126m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          129m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          127m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          126m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          130m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          125m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          123m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          125m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          171m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          170m

Comment 8 Danny 2020-03-29 12:45:22 UTC
I found the issue in one of NooBaa's upgrade scripts. Working on a fix.

Comment 9 Michael Adam 2020-03-30 12:17:03 UTC
(In reply to Danny from comment #8)
> I found the issue in one of noobaa's upgrade scripts. working on a fix.

The existence of a patch etc. seems to imply that we should ACK it for 4.3?

Comment 10 Nimrod Becker 2020-03-30 12:19:50 UTC
Adding an ack

Comment 13 Petr Balogh 2020-04-02 12:31:43 UTC
Running verification job here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/console

Comment 14 Petr Balogh 2020-04-02 16:29:31 UTC
Verified - all upgrade tests passed!
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/testReport/

Comment 17 errata-xmlrpc 2020-04-14 09:48:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1437