Bug 1835125
Summary: | Restart pods - failing with noobaa-db-0 0/1 Init:CrashLoopBackOff and noobaa-endpoint 0/1 CrashLoopBackOff due to SCC change | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh>
Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker>
Status: | CLOSED NOTABUG | QA Contact: | Petr Balogh <pbalogh>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4 | CC: | assingh, ebenahar, etamir, jarrpa, ocs-bugs, ratamir
Target Milestone: | --- | Keywords: | Automation, Regression, Upgrades
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-05-20 15:23:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 3
Petr Balogh
2020-05-13 10:34:31 UTC
The OCS and OCP must-gather output and the other logs I collected locally from the attempt above, before and after the upgrade, are available in this tar file: http://rhsqe-repo.lab.eng.blr.redhat.com/cns/ocs-qe-bugs/bz-1835125.tar.gz

I am running another execution with the same set of pre-upgrade tests with which I hit the issue yesterday. The job is here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7517/

Adding some output of what the pods looked like:

    $ oc get pod -n openshift-storage
    NAME  READY  STATUS  RESTARTS  AGE
    csi-cephfsplugin-5x22k  3/3  Running  0  45m
    csi-cephfsplugin-n7gck  3/3  Running  0  45m
    csi-cephfsplugin-provisioner-5d94649b9f-d6g7s  5/5  Running  0  46m
    csi-cephfsplugin-provisioner-5d94649b9f-v5rjf  5/5  Running  0  45m
    csi-cephfsplugin-xlg6b  3/3  Running  0  46m
    csi-rbdplugin-jzxvg  3/3  Running  0  45m
    csi-rbdplugin-lsfjt  3/3  Running  0  45m
    csi-rbdplugin-plbd8  3/3  Running  0  45m
    csi-rbdplugin-provisioner-55c5479c46-9b7w2  5/5  Running  0  46m
    csi-rbdplugin-provisioner-55c5479c46-bw48k  5/5  Running  0  45m
    noobaa-core-0  0/1  CrashLoopBackOff  6  45m
    noobaa-db-0  0/1  Init:CrashLoopBackOff  13  45m
    noobaa-endpoint-798db5b9f7-48c46  0/1  CrashLoopBackOff  6  45m
    noobaa-operator-6c6c99b8b5-sx75r  1/1  Running  0  46m
    ocs-operator-544d9ddd9d-nb9l4  0/1  Running  0  46m
    pod-test-cephfs-79757eb9a89643fb9f64a991b3057763  1/1  Running  0  144m
    rook-ceph-crashcollector-ip-10-0-129-158-6d7bd7f987-57d9t  1/1  Running  0  40m
    rook-ceph-crashcollector-ip-10-0-157-45-65d5cd8487-5gx9g  1/1  Running  0  40m
    rook-ceph-crashcollector-ip-10-0-171-32-5fb95bb5c-db596  1/1  Running  0  40m
    rook-ceph-drain-canary-85b2bb4d77422e05980cd0e2e324b252-765rg4w  1/1  Running  0  30m
    rook-ceph-drain-canary-8af4f9c77d22bcb9e8649fbbce6437fc-76b4njk  1/1  Running  0  40m
    rook-ceph-drain-canary-d0e859058b1765fcc8919ac66b55c432-66w74l8  1/1  Running  0  19m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86d8fb54t62b4  1/1  Running  0  19m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cfc54487lrwp8  1/1  Running  0  19m
    rook-ceph-mgr-a-664c65f4d6-7rngp  1/1  Running  0  41m
    rook-ceph-mon-a-d6747d879-brtwv  1/1  Running  0  43m
    rook-ceph-mon-b-85fff45f48-msp6s  1/1  Running  0  42m
    rook-ceph-mon-c-5f69ff6d75-mpwfg  1/1  Running  0  41m
    rook-ceph-operator-5f858bfb9f-62c5x  1/1  Running  0  46m
    rook-ceph-osd-0-67c558ffb7-kz6jk  1/1  Running  0  19m
    rook-ceph-osd-1-6f685cd448-v9kf8  1/1  Running  0  40m
    rook-ceph-osd-2-94488d6d-nrjxd  1/1  Running  0  30m
    rook-ceph-osd-prepare-ocs-deviceset-0-0-lc48t-b4kfn  0/1  Completed  0  162m
    rook-ceph-osd-prepare-ocs-deviceset-1-0-q24bx-5xscd  0/1  Completed  0  162m
    rook-ceph-osd-prepare-ocs-deviceset-2-0-4bmx4-xzt69  0/1  Completed  0  162m
    rook-ceph-tools-6f59b98f4f-8ksgv  1/1  Running  0  161m

The problem is that the openshift.io/scc annotation on the noobaa-db pod is set to anyuid instead of restricted. This is already a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1804168. We are now working on discovering which QE test is causing this change.

Immediately after install the annotation is still set to restricted; after some tests we can see it change to anyuid. We want to find the automated test that is causing the change and are working on it with Petr (see the check sketched at the end of this comment).

Jacky, I've tested with the provided build 4.4.0-426.ci and still hit the same issue. I also see that this image is used in at least one of the pods: containerImage: quay.io/ocs-dev/ocs-operator:4.4.0

Boris, can you confirm that quay.io/rhceph-dev/ocs-operator@sha256:1ac6eb090759f94fee54d2af4b73faf2c0bd0af9ace7052902d8198f2b51d1d7 is the one you created for Jacky?
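To help narrow down the SCC finding above, a minimal check sketch (assuming cluster-admin access on the affected cluster; pod and namespace names are taken from the listing above):

```shell
# Show which SCC admission applied to the pod; per the analysis above this reads
# "anyuid" here instead of the expected "restricted".
oc get pod noobaa-db-0 -n openshift-storage \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Inspect the anyuid SCC for users, groups, or service accounts that a test may have
# added (for example via `oc adm policy add-scc-to-user anyuid -z <sa> -n openshift-storage`).
oc get scc anyuid -o yaml
```

If a NooBaa service account shows up in the anyuid SCC's users list, removing it (e.g. with `oc adm policy remove-scc-from-user anyuid -z <sa> -n openshift-storage`) and restarting the pod should bring it back under the restricted SCC, since anyuid only outranks restricted when it is available to the requesting service account.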
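For the image question above, one way to list the images the pods and the CSV actually reference (a sketch, same namespace assumption as above):

```shell
# Print each pod name together with the container images it is running.
oc get pods -n openshift-storage \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

# The ClusterServiceVersion records the operator image in its containerImage annotation.
oc get csv -n openshift-storage -o yaml | grep containerImage
```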
Logs I collected manually: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/upgrade-with-new-build.tar.gz

Logs from our automation: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/failed_testcase_ocs_logs_1589558321/

The corresponding PR has been merged into the ocs-operator master branch and is being backported: https://github.com/openshift/ocs-operator/pull/516

Danny and Jacky have verified that this is not upgrade related. The upgrade simply restarted the pods; restarting them before the upgrade results in the same problem. Closing as not a bug based on the latest insights from the engineering team.
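For reference, the non-upgrade reproducer described above boils down to restarting the NooBaa pods and re-checking the SCC annotation; a minimal sketch, assuming the default openshift-storage namespace and that the SCC has already flipped to anyuid:

```shell
# Delete the NooBaa pods; their controllers recreate them immediately.
oc delete pod noobaa-db-0 noobaa-core-0 -n openshift-storage

# Watch whether noobaa-db-0 comes back in Init:CrashLoopBackOff ...
oc get pod -n openshift-storage -w

# ... and whether it was re-admitted under anyuid rather than restricted.
oc get pod noobaa-db-0 -n openshift-storage \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'
```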