Description of problem (please be as detailed as possible and provide log snippets):

In our job https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/console the upgrade failed with the CSV stuck in the Installing phase:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

When I connected to the cluster during the execution I saw:

ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m

So after 77 minutes the ocs-operator pod was still in the same state. All other pods look OK to me:

NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4wjs4                                            3/3     Running     0          76m
csi-cephfsplugin-m2mbm                                            3/3     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-rczl7                     5/5     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-tgnnh                     5/5     Running     0          77m
csi-cephfsplugin-snnrp                                            3/3     Running     0          77m
csi-rbdplugin-6c8gw                                               3/3     Running     0          76m
csi-rbdplugin-p4l9r                                               3/3     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-48hc6                        5/5     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-lzsq9                        5/5     Running     0          77m
csi-rbdplugin-w5rk2                                               3/3     Running     0          77m
lib-bucket-provisioner-55f74d96f6-xmlrr                           1/1     Running     0          3h24m
noobaa-core-0                                                     1/1     Running     0          76m
noobaa-db-0                                                       1/1     Running     0          76m
noobaa-endpoint-59c4769f6b-4qrhx                                  1/1     Running     0          75m
noobaa-operator-b77ccff86-5lrz9                                   1/1     Running     0          77m
ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m
pod-test-cephfs-990c7858e3eb4341b386707e70f7bc61                  1/1     Running     0          112m
rook-ceph-crashcollector-ip-10-0-136-86-649f5b65b6-x2x2h          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-159-203-958c4dc5b-vvqjj          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-170-123-77c5689f4-7hj4t          1/1     Running     0          72m
rook-ceph-drain-canary-299f24d16a705b54bd0251ece01ad1f5-f4r7hgm   1/1     Running     0          71m
rook-ceph-drain-canary-e285eca2f732a465cc14d563dfbea99c-595kb52   1/1     Running     0          68m
rook-ceph-drain-canary-ee8c22ace5779c4ae7b153f4babe321b-6bvjml4   1/1     Running     0          72m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6988b56fjk2h6   1/1     Running     0          67m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bcd8f6fch9j7   1/1     Running     0          67m
rook-ceph-mgr-a-8467688fd8-k5f9x                                  1/1     Running     0          72m
rook-ceph-mon-a-5b66ffcf95-g88bf                                  1/1     Running     0          76m
rook-ceph-mon-b-747c7fcddb-7mp5r                                  1/1     Running     0          74m
rook-ceph-mon-c-5d948d5449-d8nxq                                  1/1     Running     0          73m
rook-ceph-operator-599dbd974f-jw68l                               1/1     Running     0          77m
rook-ceph-osd-0-77f79bdf5c-4kwrq                                  1/1     Running     0          72m
rook-ceph-osd-1-5674b9d647-j9k6h                                  1/1     Running     0          71m
rook-ceph-osd-2-5757b89f6b-t86wv                                  1/1     Running     0          68m
rook-ceph-osd-prepare-ocs-deviceset-0-0-zrwdz-7fng7               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-1-0-tpssk-cp6fs               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-2-0-psbx6-rks7w               0/1     Completed   0          3h21m
rook-ceph-tools-fc566f885-t92bf                                   1/1     Running     0          3h20m

Version of all relevant components (if applicable):
OCS 4.2.2 live content, upgraded to the 4.3.0-379.ci build.
OCP: 4.3.0-0.nightly-2020-03-04-222846

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
It blocks the upgrade.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Not sure; 4 more jobs were executed and we will update if we reproduce it again. The same execution was triggered here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/console so we will see there.

Can this issue reproduce from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, this worked before.

Steps to Reproduce:
1. Install 4.2.2 from live content
2. Run some FIO
3. Add a catalog source for the internal build
4. Change the source in the subscription to the added catalogSource and change the channel from stable-4.2 to stable-4.3 (a sketch of this subscription change follows below)
5. We see the issue mentioned in this BZ

Actual results:
Upgrade failed.

Expected results:
Upgrade passes.

Additional info:
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/
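For reference, a minimal sketch of the subscription change from step 4, assuming the OCS subscription is named ocs-operator in openshift-storage and the internal-build catalog source was created as ocs-catalogsource (both names are assumptions for illustration, not taken from this job):

# Hypothetical names: "ocs-operator" (subscription) and "ocs-catalogsource"
# (catalog source) are placeholders; adjust to match the actual cluster.
$ oc patch subscription.operators.coreos.com ocs-operator -n openshift-storage \
    --type merge -p '{"spec":{"source":"ocs-catalogsource","channel":"stable-4.3"}}'

After this, OLM is expected to create the 4.3 CSV and move it from Installing to Succeeded, which is the transition that gets stuck in this bug.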
Created attachment 1674175 [details]
Logs from operators and a few other files about the cluster

Some yaml files I collected from my queries about the csv, subscription, logs and so on.
I don't see what would be stuck in the Installing status. The upgrade seems to be complete from the OCS and Rook perspective. @Jose, do you see what status is causing this issue?

In the StorageCluster status the reconcile indicates it completed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/storagecluster.yaml

  - lastHeartbeatTime: "2020-03-27T17:13:09Z"
    lastTransitionTime: "2020-03-27T14:16:20Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete

The rook operator was upgraded and completed the upgrade of all the ceph components. The status on the CephCluster shows Completed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/ocs-storagecluster-cephcluster.yaml
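For anyone re-checking this on a live cluster rather than in the must-gather, a quick sketch of how to pull the same status (assuming the default resource names ocs-storagecluster and ocs-storagecluster-cephcluster, which are assumptions based on the must-gather paths):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.conditions}'
$ oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.status}'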
I would look at the olm-operator logs. We should be able to get those from a standard OCP must-gather.
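For completeness, a sketch of how those logs can be pulled (these are the standard OLM namespace and deployment names on OCP 4.x):

$ oc logs -n openshift-operator-lifecycle-manager deployment/olm-operator
# catalog-operator logs can also be relevant for subscription/catalog resolution:
$ oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator
# or collect a standard OCP must-gather, which includes these pods' logs:
$ oc adm must-gather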
Initial analysis from Jrivera, captured from gchat
=============================

Looking at the storagecluster.yaml:

  - lastHeartbeatTime: "2020-03-27T17:13:09Z"
    lastTransitionTime: "2020-03-27T16:23:20Z"
    message: Waiting on Nooba instance to finish initialization
    reason: NoobaaInitializing
    status: "True"
    type: Progressing

From noobaa.yaml:

  conditions:
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Available
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T14:19:42Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  observedGeneration: 3
  phase: Configuring

From the noobaa-operator logs:

time="2020-03-27T17:13:45Z" level=info msg="✈️ RPC: system.read_system() Request: <nil>"
time="2020-03-27T17:13:45Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=INTERNAL Message=Cannot read property 'email' of undefined"
time="2020-03-27T17:13:45Z" level=error msg="failed to read system info: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=warning msg="⏳ Temporary Error: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa
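To see the same state on a live cluster, a small sketch (the NooBaa CR created by ocs-operator is normally named "noobaa"; treat the resource names as assumptions):

$ oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}'
$ oc get noobaa noobaa -n openshift-storage -o yaml        # full conditions block
$ oc logs -n openshift-storage deployment/noobaa-operator | grep -i read_system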
In this run it somehow went fine ===>> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T200206/logs/failed_testcase_ocs_logs_1585342169/test_upgrade_ocs_logs/
Linking the must-gather so that we can work out the diff between this and the earlier run.
Output from the above run https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/, courtesy @pbalogh.

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          40m
csi-cephfsplugin-8z54x                                            3/3     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          40m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          40m
csi-rbdplugin-62bbp                                               3/3     Running     0          40m
csi-rbdplugin-9slh5                                               3/3     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          41m
csi-rbdplugin-tqkbz                                               3/3     Running     0          40m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          86m
noobaa-core-0                                                     1/1     Running     0          40m
noobaa-db-0                                                       1/1     Running     0          40m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          39m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          41m
ocs-operator-6dd9fd9d8d-ttzp7                                     0/1     Running     0          41m
pod-test-cephfs-418a20805bb742b9af40192feb660a30                  1/1     Running     0          78m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          35m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          34m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          36m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          36m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          34m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          33m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          37m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          40m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          38m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          37m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          41m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          36m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          34m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          36m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          82m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          81m

pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Installing

pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Succeeded

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          128m
csi-cephfsplugin-8z54x                                            3/3     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          129m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          129m
csi-rbdplugin-62bbp                                               3/3     Running     0          129m
csi-rbdplugin-9slh5                                               3/3     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          129m
csi-rbdplugin-tqkbz                                               3/3     Running     0          129m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          175m
noobaa-core-0                                                     1/1     Running     0          129m
noobaa-db-0                                                       1/1     Running     0          129m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          128m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          130m
ocs-operator-6dd9fd9d8d-ttzp7                                     1/1     Running     0          130m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          124m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          123m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          125m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          125m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          123m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          122m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          126m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          129m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          127m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          126m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          130m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          125m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          123m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          125m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          171m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          170m
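Rather than re-running oc get csv by hand, the Installing-to-Succeeded transition above can be watched directly; a small sketch using the CSV name from this run:

$ oc get csv -n openshift-storage -w
$ oc get csv ocs-operator.v4.3.0-379.ci -n openshift-storage -o jsonpath='{.status.phase}'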
I found the issue in one of NooBaa's upgrade scripts. Working on a fix.
(In reply to Danny from comment #8)
> I found the issue in one of noobaa's upgrade scripts. working on a fix.

The existence of a patch seems to imply that we should ACK this for 4.3?
Adding an ack
Running verification job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/console
Verified - all upgrade tests passed! https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/testReport/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1437