I've tried to reproduce the issue with the pre-upgrade tests omitted, so we didn't run any workload or noobaa-related pre-upgrade tests, and the upgrade passed this time: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7503/console

$ oc get csv -n openshift-storage
NAME DISPLAY VERSION REPLACES PHASE
lib-bucket-provisioner.v2.0.0 lib-bucket-provisioner 2.0.0 lib-bucket-provisioner.v1.0.0 Succeeded
ocs-operator.v4.4.0-420.ci OpenShift Container Storage 4.4.0-420.ci ocs-operator.v4.3.0 Succeeded

$ oc get pod -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-2whmn 3/3 Running 0 13m
csi-cephfsplugin-provisioner-5d94649b9f-5ghjb 5/5 Running 0 13m
csi-cephfsplugin-provisioner-5d94649b9f-r4zg2 5/5 Running 0 14m
csi-cephfsplugin-qvnb5 3/3 Running 0 13m
csi-cephfsplugin-tns9j 3/3 Running 0 14m
csi-rbdplugin-d7dh7 3/3 Running 0 13m
csi-rbdplugin-provisioner-55c5479c46-vtntm 5/5 Running 0 14m
csi-rbdplugin-provisioner-55c5479c46-z9ms7 5/5 Running 0 13m
csi-rbdplugin-rstlf 3/3 Running 0 13m
csi-rbdplugin-xnqqf 3/3 Running 0 14m
lib-bucket-provisioner-ccc897fc8-68cz2 1/1 Running 0 28m
noobaa-core-0 1/1 Running 0 13m
noobaa-db-0 1/1 Running 0 13m
noobaa-endpoint-f6dc4f7d4-57bjz 1/1 Running 0 13m
noobaa-operator-6c6c99b8b5-6wfwr 1/1 Running 0 14m
ocs-operator-544d9ddd9d-7wqc8 1/1 Running 0 14m
rook-ceph-crashcollector-ip-10-0-135-243-675b79b679-jcv5l 1/1 Running 0 8m20s
rook-ceph-crashcollector-ip-10-0-152-148-856655866d-hp5mr 1/1 Running 0 8m20s
rook-ceph-crashcollector-ip-10-0-164-112-56cd99c8bb-dnqv4 1/1 Running 0 8m20s
rook-ceph-drain-canary-2fc9e245b7de4b0f9ad0328d3e005dc9-68h47bz 1/1 Running 0 8m10s
rook-ceph-drain-canary-45d0b616fbe8a834000bb7eff71e1697-76zsjjs 1/1 Running 0 6m54s
rook-ceph-drain-canary-d0090bac3d67eb71ca8c4c1e541cea51-68t88p2 1/1 Running 0 6m40s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65f9d659pr69b 1/1 Running 0 6m10s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7c55d9b946vqx 1/1 Running 0 5m31s
rook-ceph-mgr-a-787f988597-wtxjj 1/1 Running 0 8m31s
rook-ceph-mon-a-6fdf4b5-l7r2r 1/1 Running 0 12m
rook-ceph-mon-b-757dfb8fc8-fh4h8 1/1 Running 0 10m
rook-ceph-mon-c-fdc5cfcc7-frkxm 1/1 Running 0 9m1s
rook-ceph-operator-5f858bfb9f-7bqk6 1/1 Running 0 14m
rook-ceph-osd-0-5789969fd7-8m7ln 1/1 Running 0 6m55s
rook-ceph-osd-1-5744554c85-w6ld9 1/1 Running 0 8m10s
rook-ceph-osd-2-5bdc44b9f7-2fr9c 1/1 Running 0 6m40s
rook-ceph-osd-prepare-ocs-deviceset-0-0-xxlrl-fx7sx 0/1 Completed 0 59m
rook-ceph-osd-prepare-ocs-deviceset-1-0-9vx2b-97ml2 0/1 Completed 0 59m
rook-ceph-osd-prepare-ocs-deviceset-2-0-sjml9-gw56q 0/1 Completed 0 59m
rook-ceph-tools-6f59b98f4f-stg8r 1/1 Running 0 58m

So I will do another run with the same set of pre/post upgrade tests and will get back to you with the results.
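For reference, a compact way to check just the CSV names and phases after the upgrade (a minimal sketch only; the plain oc get csv output above carries the same information):

$ oc get csv -n openshift-storage \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'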
The OCS and OCP must-gather output and the other logs I collected locally from the attempt above, both before and after the upgrade, are available in this tar file: http://rhsqe-repo.lab.eng.blr.redhat.com/cns/ocs-qe-bugs/bz-1835125.tar.gz

I'm running another execution with all the pre-upgrade tests included, the same way I hit the issue yesterday. The job is here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7517/
Just adding some output of how the pods looked:

$ oc get pod -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-5x22k 3/3 Running 0 45m
csi-cephfsplugin-n7gck 3/3 Running 0 45m
csi-cephfsplugin-provisioner-5d94649b9f-d6g7s 5/5 Running 0 46m
csi-cephfsplugin-provisioner-5d94649b9f-v5rjf 5/5 Running 0 45m
csi-cephfsplugin-xlg6b 3/3 Running 0 46m
csi-rbdplugin-jzxvg 3/3 Running 0 45m
csi-rbdplugin-lsfjt 3/3 Running 0 45m
csi-rbdplugin-plbd8 3/3 Running 0 45m
csi-rbdplugin-provisioner-55c5479c46-9b7w2 5/5 Running 0 46m
csi-rbdplugin-provisioner-55c5479c46-bw48k 5/5 Running 0 45m
noobaa-core-0 0/1 CrashLoopBackOff 6 45m
noobaa-db-0 0/1 Init:CrashLoopBackOff 13 45m
noobaa-endpoint-798db5b9f7-48c46 0/1 CrashLoopBackOff 6 45m
noobaa-operator-6c6c99b8b5-sx75r 1/1 Running 0 46m
ocs-operator-544d9ddd9d-nb9l4 0/1 Running 0 46m
pod-test-cephfs-79757eb9a89643fb9f64a991b3057763 1/1 Running 0 144m
rook-ceph-crashcollector-ip-10-0-129-158-6d7bd7f987-57d9t 1/1 Running 0 40m
rook-ceph-crashcollector-ip-10-0-157-45-65d5cd8487-5gx9g 1/1 Running 0 40m
rook-ceph-crashcollector-ip-10-0-171-32-5fb95bb5c-db596 1/1 Running 0 40m
rook-ceph-drain-canary-85b2bb4d77422e05980cd0e2e324b252-765rg4w 1/1 Running 0 30m
rook-ceph-drain-canary-8af4f9c77d22bcb9e8649fbbce6437fc-76b4njk 1/1 Running 0 40m
rook-ceph-drain-canary-d0e859058b1765fcc8919ac66b55c432-66w74l8 1/1 Running 0 19m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86d8fb54t62b4 1/1 Running 0 19m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cfc54487lrwp8 1/1 Running 0 19m
rook-ceph-mgr-a-664c65f4d6-7rngp 1/1 Running 0 41m
rook-ceph-mon-a-d6747d879-brtwv 1/1 Running 0 43m
rook-ceph-mon-b-85fff45f48-msp6s 1/1 Running 0 42m
rook-ceph-mon-c-5f69ff6d75-mpwfg 1/1 Running 0 41m
rook-ceph-operator-5f858bfb9f-62c5x 1/1 Running 0 46m
rook-ceph-osd-0-67c558ffb7-kz6jk 1/1 Running 0 19m
rook-ceph-osd-1-6f685cd448-v9kf8 1/1 Running 0 40m
rook-ceph-osd-2-94488d6d-nrjxd 1/1 Running 0 30m
rook-ceph-osd-prepare-ocs-deviceset-0-0-lc48t-b4kfn 0/1 Completed 0 162m
rook-ceph-osd-prepare-ocs-deviceset-1-0-q24bx-5xscd 0/1 Completed 0 162m
rook-ceph-osd-prepare-ocs-deviceset-2-0-4bmx4-xzt69 0/1 Completed 0 162m
rook-ceph-tools-6f59b98f4f-8ksgv 1/1 Running 0 161m
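A minimal sketch of how the crashing noobaa pods can be inspected further, using the pod names from the listing above (not the only way to gather this, the must-gather archives contain the same data):

# Why does the DB pod's init container keep crashing?
$ oc describe pod noobaa-db-0 -n openshift-storage
# Recent events in the namespace, oldest first
$ oc get events -n openshift-storage --sort-by=.lastTimestamp
# Logs of the endpoint pod's previous (crashed) run
$ oc logs noobaa-endpoint-798db5b9f7-48c46 -n openshift-storage --previous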
The problem is due to the openshift.io/scc annotation on the noobaa-db pod being set to anyuid instead of restricted. This is already a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1804168

We are now working on discovering which QE test is causing this change. Immediately after install the annotation is still set to restricted, and after some tests we can see it changes to anyuid. We want to find the automated test that causes it to change; working on it with Petr. A quick way to check the current value is sketched below.
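A minimal check of the annotation, assuming the pod name noobaa-db-0 from the listings above:

# Show the SCC the noobaa-db pod was admitted under (expect "restricted", not "anyuid")
$ oc get pod noobaa-db-0 -n openshift-storage -o yaml | grep 'openshift.io/scc'
# Or pull just the annotation value via jsonpath
$ oc get pod noobaa-db-0 -n openshift-storage \
    -o jsonpath="{.metadata.annotations['openshift\.io/scc']}"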
Jacky, I've tested with the provided build 4.4.0-426.ci and still hit the same issue. But I also see this image is used in at least one of the pods:

containerImage: quay.io/ocs-dev/ocs-operator:4.4.0

Boris, can you confirm that quay.io/rhceph-dev/ocs-operator@sha256:1ac6eb090759f94fee54d2af4b73faf2c0bd0af9ace7052902d8198f2b51d1d7 is the one you created for Jacky? A quick way to list which images the pods are actually running is sketched below.

Logs I collected manually: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/upgrade-with-new-build.tar.gz

Those from our automation: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/failed_testcase_ocs_logs_1589558321/
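A minimal sketch for cross-checking the images in use (just one way to do it, the must-gather output also records this):

# Print pod name and container image(s) for every pod in the namespace
$ oc get pods -n openshift-storage \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'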
The corresponding PR has been merged into the ocs-operator master branch and is being backported: https://github.com/openshift/ocs-operator/pull/516
Danny and Jacky have verified this is not upgrade related. The upgrade simply restarted the pods; restarting them before the upgrade results in the same problem.
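A minimal sketch of how the pods can be restarted without an upgrade, assuming deleting them lets their controllers recreate them:

# Delete the noobaa pods and watch them come back (and hit the same crash)
$ oc delete pod noobaa-db-0 noobaa-core-0 -n openshift-storage
$ oc get pod -n openshift-storage -w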
Closing as not a bug, based on the latest insights from the Eng team.