Description of problem (please be as detailed as possible and provide log snippets):

In our job https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/console the upgrade failed with the CSV stuck in the Installing phase:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6056/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

When I connected to the cluster during the execution I saw:

ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m

So after 77 minutes the ocs-operator pod was still in the same state. All other pods look OK to me:

NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4wjs4                                            3/3     Running     0          76m
csi-cephfsplugin-m2mbm                                            3/3     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-rczl7                     5/5     Running     0          76m
csi-cephfsplugin-provisioner-65b59d9dc9-tgnnh                     5/5     Running     0          77m
csi-cephfsplugin-snnrp                                            3/3     Running     0          77m
csi-rbdplugin-6c8gw                                               3/3     Running     0          76m
csi-rbdplugin-p4l9r                                               3/3     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-48hc6                        5/5     Running     0          76m
csi-rbdplugin-provisioner-86c8bc888d-lzsq9                        5/5     Running     0          77m
csi-rbdplugin-w5rk2                                               3/3     Running     0          77m
lib-bucket-provisioner-55f74d96f6-xmlrr                           1/1     Running     0          3h24m
noobaa-core-0                                                     1/1     Running     0          76m
noobaa-db-0                                                       1/1     Running     0          76m
noobaa-endpoint-59c4769f6b-4qrhx                                  1/1     Running     0          75m
noobaa-operator-b77ccff86-5lrz9                                   1/1     Running     0          77m
ocs-operator-6dd9fd9d8d-rrcp6                                     0/1     Running     0          77m
pod-test-cephfs-990c7858e3eb4341b386707e70f7bc61                  1/1     Running     0          112m
rook-ceph-crashcollector-ip-10-0-136-86-649f5b65b6-x2x2h          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-159-203-958c4dc5b-vvqjj          1/1     Running     0          72m
rook-ceph-crashcollector-ip-10-0-170-123-77c5689f4-7hj4t          1/1     Running     0          72m
rook-ceph-drain-canary-299f24d16a705b54bd0251ece01ad1f5-f4r7hgm   1/1     Running     0          71m
rook-ceph-drain-canary-e285eca2f732a465cc14d563dfbea99c-595kb52   1/1     Running     0          68m
rook-ceph-drain-canary-ee8c22ace5779c4ae7b153f4babe321b-6bvjml4   1/1     Running     0          72m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6988b56fjk2h6   1/1     Running     0          67m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5bcd8f6fch9j7   1/1     Running     0          67m
rook-ceph-mgr-a-8467688fd8-k5f9x                                  1/1     Running     0          72m
rook-ceph-mon-a-5b66ffcf95-g88bf                                  1/1     Running     0          76m
rook-ceph-mon-b-747c7fcddb-7mp5r                                  1/1     Running     0          74m
rook-ceph-mon-c-5d948d5449-d8nxq                                  1/1     Running     0          73m
rook-ceph-operator-599dbd974f-jw68l                               1/1     Running     0          77m
rook-ceph-osd-0-77f79bdf5c-4kwrq                                  1/1     Running     0          72m
rook-ceph-osd-1-5674b9d647-j9k6h                                  1/1     Running     0          71m
rook-ceph-osd-2-5757b89f6b-t86wv                                  1/1     Running     0          68m
rook-ceph-osd-prepare-ocs-deviceset-0-0-zrwdz-7fng7               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-1-0-tpssk-cp6fs               0/1     Completed   0          3h21m
rook-ceph-osd-prepare-ocs-deviceset-2-0-psbx6-rks7w               0/1     Completed   0          3h21m
rook-ceph-tools-fc566f885-t92bf                                   1/1     Running     0          3h20m

Version of all relevant components (if applicable):
OCS 4.2.2 live content, upgraded to the 4.3.0-379.ci build.
OCP: 4.3.0-0.nightly-2020-03-04-222846

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
It blocks the upgrade.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Not sure; 4 more jobs were executed and we will update if we reproduce it again. The same execution was triggered here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/console so we will see there.

Can this issue reproduce from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, this worked before.

Steps to Reproduce:
1. Install 4.2.2 from live content
2. Run some FIO
3. Add a catalog source for the internal build
4. Change the source in the subscription to the added catalogSource and change the channel from stable-4.2 to stable-4.3 (a sketch of this subscription change follows below)
5. We see the issue mentioned in this BZ

Actual results:
Upgrade failed.

Expected results:
Upgrade passes.

Additional info:
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/
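For reference, a minimal sketch of the subscription change from step 4, assuming the OCS subscription is named ocs-operator in openshift-storage and the internal-build catalog source was created as ocs-catalogsource (both names are assumptions for illustration, not taken from this job):

# Hypothetical names: "ocs-operator" (subscription) and "ocs-catalogsource"
# (catalog source) are placeholders; adjust to match the actual cluster.
$ oc patch subscription.operators.coreos.com ocs-operator -n openshift-storage \
    --type merge -p '{"spec":{"source":"ocs-catalogsource","channel":"stable-4.3"}}'

After this, OLM is expected to create the 4.3 CSV and move it from Installing to Succeeded, which is the transition that gets stuck in this bug.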
Created attachment 1674175 [details]
Logs from operators and a few other files about the cluster

Some yaml files I collected from my queries about the csv, subscription, logs and so on.
I don't see what would be stuck in the Installing status. The upgrade seems to be complete from the OCS and Rook perspective. @Jose, do you see what status is causing this issue?

In the StorageCluster status the reconcile indicates it completed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/storagecluster.yaml

  - lastHeartbeatTime: "2020-03-27T17:13:09Z"
    lastTransitionTime: "2020-03-27T14:16:20Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete

The rook operator was upgraded and completed the upgrade of all the ceph components. The status on the CephCluster shows Completed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T132049/logs/failed_testcase_ocs_logs_1585323953/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4af88c1c34068413c3e838c4e1156937a44c0b53c6fca26a85b08c3f4280d368/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/ocs-storagecluster-cephcluster.yaml
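For anyone re-checking this on a live cluster rather than in the must-gather, a quick sketch of how to pull the same status (assuming the default resource names ocs-storagecluster and ocs-storagecluster-cephcluster, which are assumptions based on the must-gather paths):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.conditions}'
$ oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.status}'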
I would look at the olm-operator logs. We should be able to get those from a standard OCP must-gather.
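For completeness, a sketch of how those logs can be pulled (these are the standard OLM namespace and deployment names on OCP 4.x):

$ oc logs -n openshift-operator-lifecycle-manager deployment/olm-operator
# catalog-operator logs can also be relevant for subscription/catalog resolution:
$ oc logs -n openshift-operator-lifecycle-manager deployment/catalog-operator
# or collect a standard OCP must-gather, which includes these pods' logs:
$ oc adm must-gather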
Initial analysis from Jrivera, captured from gchat
=============================

Looking at the storagecluster.yaml:

  - lastHeartbeatTime: "2020-03-27T17:13:09Z"
    lastTransitionTime: "2020-03-27T16:23:20Z"
    message: Waiting on Nooba instance to finish initialization
    reason: NoobaaInitializing
    status: "True"
    type: Progressing

From noobaa.yaml:

  conditions:
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Available
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T14:19:42Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2020-03-27T14:19:42Z"
    lastTransitionTime: "2020-03-27T16:24:24Z"
    message: Cannot read property 'email' of undefined
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  observedGeneration: 3
  phase: Configuring

From the noobaa-operator logs:

time="2020-03-27T17:13:45Z" level=info msg="✈️ RPC: system.read_system() Request: <nil>"
time="2020-03-27T17:13:45Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=INTERNAL Message=Cannot read property 'email' of undefined"
time="2020-03-27T17:13:45Z" level=error msg="failed to read system info: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2020-03-27T17:13:45Z" level=warning msg="⏳ Temporary Error: Cannot read property 'email' of undefined" sys=openshift-storage/noobaa
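To see the same state on a live cluster, a small sketch (the NooBaa CR created by ocs-operator is normally named "noobaa"; treat the resource names as assumptions):

$ oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}'
$ oc get noobaa noobaa -n openshift-storage -o yaml        # full conditions block
$ oc logs -n openshift-storage deployment/noobaa-operator | grep -i read_system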
In this run it somehow went fine ===>> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200327T200206/logs/failed_testcase_ocs_logs_1585342169/test_upgrade_ocs_logs/
Linking the must-gather so that we can work out the diff between this and the earlier run.
Output from the above run https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6069/, courtesy @pbalogh.

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          40m
csi-cephfsplugin-8z54x                                            3/3     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          40m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          40m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          40m
csi-rbdplugin-62bbp                                               3/3     Running     0          40m
csi-rbdplugin-9slh5                                               3/3     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          40m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          41m
csi-rbdplugin-tqkbz                                               3/3     Running     0          40m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          86m
noobaa-core-0                                                     1/1     Running     0          40m
noobaa-db-0                                                       1/1     Running     0          40m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          39m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          41m
ocs-operator-6dd9fd9d8d-ttzp7                                     0/1     Running     0          41m
pod-test-cephfs-418a20805bb742b9af40192feb660a30                  1/1     Running     0          78m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          35m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          35m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          34m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          36m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          36m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          34m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          33m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          37m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          40m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          38m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          37m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          41m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          36m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          34m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          36m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          82m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          82m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          81m

pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Installing

pbalogh@MacBook-Pro upgrade-bug $ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.3.0-379.ci      OpenShift Container Storage   4.3.0-379.ci   ocs-operator.v4.2.2   Succeeded

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2gkwd                                            3/3     Running     0          128m
csi-cephfsplugin-8z54x                                            3/3     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-7l2r7                     5/5     Running     0          129m
csi-cephfsplugin-provisioner-65b59d9dc9-87qcm                     5/5     Running     0          129m
csi-cephfsplugin-x5wpx                                            3/3     Running     0          129m
csi-rbdplugin-62bbp                                               3/3     Running     0          129m
csi-rbdplugin-9slh5                                               3/3     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-8kqmk                        5/5     Running     0          129m
csi-rbdplugin-provisioner-86c8bc888d-p58hj                        5/5     Running     0          129m
csi-rbdplugin-tqkbz                                               3/3     Running     0          129m
lib-bucket-provisioner-55f74d96f6-8ll4m                           1/1     Running     0          175m
noobaa-core-0                                                     1/1     Running     0          129m
noobaa-db-0                                                       1/1     Running     0          129m
noobaa-endpoint-64666986b5-r25bb                                  1/1     Running     0          128m
noobaa-operator-b77ccff86-6gcck                                   1/1     Running     0          130m
ocs-operator-6dd9fd9d8d-ttzp7                                     1/1     Running     0          130m
rook-ceph-crashcollector-ip-10-0-140-120-7bd8c65c8d-n8bgx         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-149-101-74c96b48f4-mtvfl         1/1     Running     0          124m
rook-ceph-crashcollector-ip-10-0-165-13-6cc484f7c6-sd8x2          1/1     Running     0          124m
rook-ceph-drain-canary-060f6bb0e5cb84732c4e59ceb119fdf4-76w7kw5   1/1     Running     0          123m
rook-ceph-drain-canary-8ecf54a1e4f30f0a8d3f1911a74f1b81-57lb2vl   1/1     Running     0          125m
rook-ceph-drain-canary-c0722e044791df50d01ec179fd6b83d7-68gqnn8   1/1     Running     0          125m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d5d9c4bn74cg   1/1     Running     0          123m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c4cc794v7br6   1/1     Running     0          122m
rook-ceph-mgr-a-66645fdd4b-zrbb5                                  1/1     Running     0          126m
rook-ceph-mon-a-64c7885cf5-8lhrh                                  1/1     Running     0          129m
rook-ceph-mon-b-59f987689d-bvzlc                                  1/1     Running     0          127m
rook-ceph-mon-c-69747d9497-bglzl                                  1/1     Running     0          126m
rook-ceph-operator-599dbd974f-k7s75                               1/1     Running     0          130m
rook-ceph-osd-0-65b4bf455b-m9cll                                  1/1     Running     0          125m
rook-ceph-osd-1-6bd897d47c-p5jh7                                  1/1     Running     0          123m
rook-ceph-osd-2-58dccc9d46-7lcxf                                  1/1     Running     0          125m
rook-ceph-osd-prepare-ocs-deviceset-0-0-gsd7j-85qq7               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9w86m-xrrtz               0/1     Completed   0          171m
rook-ceph-osd-prepare-ocs-deviceset-2-0-gg9sv-qlgv8               0/1     Completed   0          171m
rook-ceph-tools-fc566f885-t6d2p                                   1/1     Running     0          170m
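Rather than re-running oc get csv by hand, the Installing-to-Succeeded transition above can be watched directly; a small sketch using the CSV name from this run:

$ oc get csv -n openshift-storage -w
$ oc get csv ocs-operator.v4.3.0-379.ci -n openshift-storage -o jsonpath='{.status.phase}'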
I found the issue in one of NooBaa's upgrade scripts. Working on a fix.
(In reply to Danny from comment #8)
> I found the issue in one of noobaa's upgrade scripts. working on a fix.

The existence of a patch seems to imply that we should ACK this for 4.3?
Adding an ack
Running verification job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/console
Verified - all upgrade tests passed! https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6328/testReport/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1437