Bug 1835125
Summary: | Restart pods - failing with noobaa-db-0 0/1 Init:CrashLoopBackOff and noobaa-endpoint 0/1 CrashLoopBackOff due to SCC change | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh>
Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker>
Status: | CLOSED NOTABUG | QA Contact: | Petr Balogh <pbalogh>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4 | CC: | assingh, ebenahar, etamir, jarrpa, ocs-bugs, ratamir
Target Milestone: | --- | Keywords: | Automation, Regression, Upgrades
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-05-20 15:23:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 3
Petr Balogh
2020-05-13 10:34:31 UTC
The OCS and OCP must-gather output and the other logs I collected locally from the attempt above, before and after the upgrade, are available in this tar file: http://rhsqe-repo.lab.eng.blr.redhat.com/cns/ocs-qe-bugs/bz-1835125.tar.gz

I am running another execution with the same set of pre-upgrade tests with which I hit the issue yesterday. The job is here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7517/

Adding some output of what the pods looked like:

    $ oc get pod -n openshift-storage
    NAME  READY  STATUS  RESTARTS  AGE
    csi-cephfsplugin-5x22k  3/3  Running  0  45m
    csi-cephfsplugin-n7gck  3/3  Running  0  45m
    csi-cephfsplugin-provisioner-5d94649b9f-d6g7s  5/5  Running  0  46m
    csi-cephfsplugin-provisioner-5d94649b9f-v5rjf  5/5  Running  0  45m
    csi-cephfsplugin-xlg6b  3/3  Running  0  46m
    csi-rbdplugin-jzxvg  3/3  Running  0  45m
    csi-rbdplugin-lsfjt  3/3  Running  0  45m
    csi-rbdplugin-plbd8  3/3  Running  0  45m
    csi-rbdplugin-provisioner-55c5479c46-9b7w2  5/5  Running  0  46m
    csi-rbdplugin-provisioner-55c5479c46-bw48k  5/5  Running  0  45m
    noobaa-core-0  0/1  CrashLoopBackOff  6  45m
    noobaa-db-0  0/1  Init:CrashLoopBackOff  13  45m
    noobaa-endpoint-798db5b9f7-48c46  0/1  CrashLoopBackOff  6  45m
    noobaa-operator-6c6c99b8b5-sx75r  1/1  Running  0  46m
    ocs-operator-544d9ddd9d-nb9l4  0/1  Running  0  46m
    pod-test-cephfs-79757eb9a89643fb9f64a991b3057763  1/1  Running  0  144m
    rook-ceph-crashcollector-ip-10-0-129-158-6d7bd7f987-57d9t  1/1  Running  0  40m
    rook-ceph-crashcollector-ip-10-0-157-45-65d5cd8487-5gx9g  1/1  Running  0  40m
    rook-ceph-crashcollector-ip-10-0-171-32-5fb95bb5c-db596  1/1  Running  0  40m
    rook-ceph-drain-canary-85b2bb4d77422e05980cd0e2e324b252-765rg4w  1/1  Running  0  30m
    rook-ceph-drain-canary-8af4f9c77d22bcb9e8649fbbce6437fc-76b4njk  1/1  Running  0  40m
    rook-ceph-drain-canary-d0e859058b1765fcc8919ac66b55c432-66w74l8  1/1  Running  0  19m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-86d8fb54t62b4  1/1  Running  0  19m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cfc54487lrwp8  1/1  Running  0  19m
    rook-ceph-mgr-a-664c65f4d6-7rngp  1/1  Running  0  41m
    rook-ceph-mon-a-d6747d879-brtwv  1/1  Running  0  43m
    rook-ceph-mon-b-85fff45f48-msp6s  1/1  Running  0  42m
    rook-ceph-mon-c-5f69ff6d75-mpwfg  1/1  Running  0  41m
    rook-ceph-operator-5f858bfb9f-62c5x  1/1  Running  0  46m
    rook-ceph-osd-0-67c558ffb7-kz6jk  1/1  Running  0  19m
    rook-ceph-osd-1-6f685cd448-v9kf8  1/1  Running  0  40m
    rook-ceph-osd-2-94488d6d-nrjxd  1/1  Running  0  30m
    rook-ceph-osd-prepare-ocs-deviceset-0-0-lc48t-b4kfn  0/1  Completed  0  162m
    rook-ceph-osd-prepare-ocs-deviceset-1-0-q24bx-5xscd  0/1  Completed  0  162m
    rook-ceph-osd-prepare-ocs-deviceset-2-0-4bmx4-xzt69  0/1  Completed  0  162m
    rook-ceph-tools-6f59b98f4f-8ksgv  1/1  Running  0  161m

The problem is that the openshift.io/scc annotation on the noobaa-db pod is set to anyuid instead of restricted. This is already a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1804168. We are now working on discovering which QE test is causing this change.

Immediately after install the annotation is still set to restricted; after some tests we can see it change to anyuid. We want to find the automated test that is causing the change and are working on it with Petr (see the check sketched at the end of this comment).

Jacky, I've tested with the provided build 4.4.0-426.ci and still hit the same issue. I also see that this image is used in at least one of the pods: containerImage: quay.io/ocs-dev/ocs-operator:4.4.0

Boris, can you confirm that quay.io/rhceph-dev/ocs-operator@sha256:1ac6eb090759f94fee54d2af4b73faf2c0bd0af9ace7052902d8198f2b51d1d7 is the one you created for Jacky?
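To help narrow down the SCC finding above, a minimal check sketch (assuming cluster-admin access on the affected cluster; pod and namespace names are taken from the listing above):

```shell
# Show which SCC admission applied to the pod; per the analysis above this reads
# "anyuid" here instead of the expected "restricted".
oc get pod noobaa-db-0 -n openshift-storage \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Inspect the anyuid SCC for users, groups, or service accounts that a test may have
# added (for example via `oc adm policy add-scc-to-user anyuid -z <sa> -n openshift-storage`).
oc get scc anyuid -o yaml
```

If a NooBaa service account shows up in the anyuid SCC's users list, removing it (e.g. with `oc adm policy remove-scc-from-user anyuid -z <sa> -n openshift-storage`) and restarting the pod should bring it back under the restricted SCC, since anyuid only outranks restricted when it is available to the requesting service account.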
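For the image question above, one way to list the images the pods and the CSV actually reference (a sketch, same namespace assumption as above):

```shell
# Print each pod name together with the container images it is running.
oc get pods -n openshift-storage \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

# The ClusterServiceVersion records the operator image in its containerImage annotation.
oc get csv -n openshift-storage -o yaml | grep containerImage
```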
Logs I collected manually: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/upgrade-with-new-build.tar.gz

Logs from our automation: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-upgr/pbalogh-upgr_20200515T142113/logs/failed_testcase_ocs_logs_1589558321/

The corresponding PR has been merged into the ocs-operator master branch and is being backported: https://github.com/openshift/ocs-operator/pull/516

Danny and Jacky have verified that this is not upgrade related. The upgrade simply restarted the pods; restarting them before the upgrade results in the same problem. Closing as not a bug based on the latest insights from the engineering team.
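For reference, the non-upgrade reproducer described above boils down to restarting the NooBaa pods and re-checking the SCC annotation; a minimal sketch, assuming the default openshift-storage namespace and that the SCC has already flipped to anyuid:

```shell
# Delete the NooBaa pods; their controllers recreate them immediately.
oc delete pod noobaa-db-0 noobaa-core-0 -n openshift-storage

# Watch whether noobaa-db-0 comes back in Init:CrashLoopBackOff ...
oc get pod -n openshift-storage -w

# ... and whether it was re-admitted under anyuid rather than restricted.
oc get pod noobaa-db-0 -n openshift-storage \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'
```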