Bug 1854651
| Summary: | Converged Mode: ocs-operator in CrashLoopBackOff with empty labelSelector | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | John Strunk <jstrunk> |
| Component: | ocs-operator | Assignee: | Jose A. Rivera <jarrpa> |
| Status: | CLOSED ERRATA | QA Contact: | akarsha <akrai> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | afrahman, ebenahar, madam, nberry, ocs-bugs, sostapov |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.5.0-482.ci | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-15 10:18:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (John Strunk, 2020-07-07 20:52:57 UTC)
Michael Adam (comment #2):

This is legit. And we need to fix it. Not sure how QE could verify https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)

PR is upstream: https://github.com/openshift/ocs-operator/pull/618

(In reply to Michael Adam from comment #2)
> This is legit. And we need to fix it.
> Not sure how QE could verify
> https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)

To explain: that BZ was for independent mode only, and it seems we're not hitting the crash in the code path for independent mode.

Backport PR: https://github.com/openshift/ocs-operator/pull/623. Merged.

https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OCS%20Build%20Pipeline%204.5/61/ contains the fix: 4.5.0-482.ci

Tested on an AWS IPI environment (converged mode):

- 3 masters
- 3 workers

Versions:

- OCP: 4.5.0-0.nightly-2020-07-14-213353
- OCS: ocs-operator.v4.5.0-487.ci

Steps performed:

1. Created the OCP cluster using ocs-ci.
2. Applied deploy-olm.yaml: `$ oc create -f deploy-olm.yaml`
3. Subscribed through the UI.
4. Applied storagecluster.yaml: `$ oc create -f storagecluster.yaml`

Observations: no problems seen with ocs-operator, and all pods are up and running (a quick way to re-check this state is sketched below, after the spec excerpt).

@John Strunk: are the verification steps correct, or are we missing something? Do we need to validate in independent mode too?

Additional information:

```
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-28bvw                                            3/3     Running     0          13m
csi-cephfsplugin-lfmxk                                            3/3     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-q7hbm                     5/5     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-xj7hp                     5/5     Running     0          13m
csi-cephfsplugin-rddkh                                            3/3     Running     0          13m
csi-rbdplugin-h8bdk                                               3/3     Running     0          13m
csi-rbdplugin-pd9cq                                               3/3     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-6n6bt                         5/5     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-9wmz5                         5/5     Running     0          13m
csi-rbdplugin-tdsxm                                               3/3     Running     0          13m
noobaa-core-0                                                     1/1     Running     0          10m
noobaa-db-0                                                       1/1     Running     0          10m
noobaa-endpoint-758cbdd6d4-hj6wl                                  1/1     Running     0          9m20s
noobaa-operator-5f9d557669-2xg6g                                  1/1     Running     0          16m
ocs-operator-75b4fbfbff-q9t9p                                     1/1     Running     0          16m
rook-ceph-crashcollector-ip-10-0-135-206-7566bc5678-27s5d         1/1     Running     0          12m
rook-ceph-crashcollector-ip-10-0-185-61-78d5ffb9b4-5dwnz          1/1     Running     0          11m
rook-ceph-crashcollector-ip-10-0-199-13-76cf7d686-skk2q           1/1     Running     0          12m
rook-ceph-drain-canary-0e68ef29218a4256e368ebd8f2e7bd14-7cff4dx   1/1     Running     0          11m
rook-ceph-drain-canary-1dbf9852097ecaf2d538dccc5663ece1-65xmb9t   1/1     Running     0          10m
rook-ceph-drain-canary-f3fa4531e5fce199d25d2b6649d283da-69jnb2v   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78dbfbcdjmq7f   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-575f8c6bhr7jd   1/1     Running     0          10m
rook-ceph-mgr-a-6468b57f74-4zvhn                                  1/1     Running     0          11m
rook-ceph-mon-a-7959984c64-t7thn                                  1/1     Running     0          12m
rook-ceph-mon-b-5b4f9fb78b-tj7cp                                  1/1     Running     0          12m
rook-ceph-mon-c-6b6d5ccfd6-vbb2p                                  1/1     Running     0          11m
rook-ceph-operator-7cd55d84f6-hzsbf                               1/1     Running     0          16m
rook-ceph-osd-0-5cb8765454-l4nt2                                  1/1     Running     0          11m
rook-ceph-osd-1-795968c964-nh96d                                  1/1     Running     0          10m
rook-ceph-osd-2-76f96c57c5-gkhh7                                  1/1     Running     0          10m
rook-ceph-osd-prepare-mydeviceset-0-data-0-fx97q-c5x4s            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-1-data-0-jwxf8-wrgt6            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-2-data-0-8x7rb-hp88r            0/1     Completed   0          11m
```

```
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml
...
spec:
  externalStorage: {}
  labelSelector: {}
...
```
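For context, a minimal converged-mode StorageCluster of roughly this shape is enough to exercise the affected path. This is an illustrative sketch only, not the contents of the storagecluster.yaml linked below; the device-set sizing and storageClassName are assumptions for an AWS IPI cluster like the one above.

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  # Serialized as `labelSelector: {}`; in affected versions this empty
  # selector was enough to send ocs-operator into CrashLoopBackOff.
  labelSelector: {}
  storageDeviceSets:
    - name: mydeviceset          # matches the osd-prepare pod names above
      count: 1
      replica: 3
      dataPVCTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Ti       # assumed size; adjust per environment
          storageClassName: gp2  # assumed AWS default storage class
          volumeMode: Block
```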
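To spot the failure mode quickly (or confirm its absence after the fix), it is enough to watch the operator pod status and inspect the stored selector. These are standard oc invocations using the object names from the output above, offered as a suggestion rather than the exact commands QE ran:

```
$ oc get pods -n openshift-storage | grep ocs-operator
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml | grep labelSelector
```

In affected builds the first command shows the ocs-operator pod cycling through CrashLoopBackOff; with the fix it stays Running, as in the listing above, while the second command still prints the empty `labelSelector: {}`.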
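For what it is worth, crashes of this shape in Go operators typically come down to dereferencing an optional `*metav1.LabelSelector` without a nil/empty guard. The sketch below shows only that defensive pattern; it is not the actual change from PR #618, and the `Spec` and `nodeSelector` names are invented for illustration.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// Spec mirrors the relevant part of a StorageCluster spec: the
// labelSelector field is optional, so it can be nil or empty ({}).
type Spec struct {
	LabelSelector *metav1.LabelSelector
}

// nodeSelector converts the optional selector defensively. A nil
// selector falls back to labels.Everything() instead of being
// dereferenced; this guard is the general pattern that prevents a
// panic like the one in this bug. (Illustrative sketch only.)
func nodeSelector(s Spec) (labels.Selector, error) {
	if s.LabelSelector == nil {
		return labels.Everything(), nil
	}
	// LabelSelectorAsSelector maps an empty selector ({}) to
	// labels.Everything() and validates any match expressions.
	return metav1.LabelSelectorAsSelector(s.LabelSelector)
}

func main() {
	for _, s := range []Spec{
		{LabelSelector: nil},                     // field omitted entirely
		{LabelSelector: &metav1.LabelSelector{}}, // labelSelector: {}
	} {
		sel, err := nodeSelector(s)
		if err != nil {
			panic(err)
		}
		// Both cases match every node instead of crashing.
		fmt.Println(sel.Matches(labels.Set{"node-role": "worker"}))
	}
}
```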
deploy-olm.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/deploy-olm.yaml
storagecluster.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/storagecluster.yaml
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/

My understanding is that an empty labelSelector is enough to trigger the bug in affected versions. I have been running successfully with 4.5.0-485.ci using the same StorageCluster that caused the initial panic, so I believe this is fixed.

Moving the BZ to the verified state based on Comment #11 and Comment #12; the conclusion is that with an empty labelSelector we see no problems with ocs-operator, and all pods were up and running.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754