Description of problem (please be as detailed as possible and provide log snippets):

With a fresh installation of OCS from the UI with KMS enabled, the OSD pods went into Init:CrashLoopBackOff and the RGW pod went into CrashLoopBackOff. No operations were performed; the cluster was just kept idle.

[jenkins@temp-jslave-sshreeka-vm ~]$ oc get po
NAME                                                              READY   STATUS                    RESTARTS   AGE
csi-cephfsplugin-28gsb                                            3/3     Running                   0          20h
csi-cephfsplugin-provisioner-fdc478cc-4dnv4                       6/6     Running                   14         10h
csi-cephfsplugin-provisioner-fdc478cc-qwcsk                       6/6     Running                   22         20h
csi-cephfsplugin-qctch                                            3/3     Running                   0          20h
csi-cephfsplugin-wc9cc                                            3/3     Running                   0          20h
csi-rbdplugin-98rcr                                               3/3     Running                   0          20h
csi-rbdplugin-cznf9                                               3/3     Running                   0          20h
csi-rbdplugin-provisioner-64db99d598-9r5jp                        6/6     Running                   12         10h
csi-rbdplugin-provisioner-64db99d598-gc4js                        6/6     Running                   13         10h
csi-rbdplugin-zkxd8                                               3/3     Running                   0          20h
must-gather-x2hkp-helper                                          1/1     Running                   0          108m
noobaa-core-0                                                     1/1     Running                   0          10h
noobaa-db-pg-0                                                    1/1     Terminating               0          20h
noobaa-endpoint-55b4bd44f4-qp8t6                                  1/1     Running                   0          10h
noobaa-operator-76fd5fbfbf-6nv8x                                  1/1     Running                   7          10h
ocs-metrics-exporter-cc6484bf5-tkwht                              1/1     Running                   0          10h
ocs-operator-7fdf7b64bb-q2g7v                                     1/1     Running                   3          10h
rook-ceph-crashcollector-compute-0-5f7667c54b-bmhtx               1/1     Running                   0          20h
rook-ceph-crashcollector-compute-1-7bf885565c-lgp4x               1/1     Running                   0          10h
rook-ceph-crashcollector-compute-2-78d84947f6-scqtm               1/1     Running                   0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fdf95585jwz5   2/2     Running                   0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-759ff5c85njg6   2/2     Running                   0          20h
rook-ceph-mgr-a-5bb88555dd-lz2p2                                  2/2     Running                   0          20h
rook-ceph-mon-a-5bc8f4c4f4-vk8lt                                  2/2     Running                   0          10h
rook-ceph-mon-b-56f458b5f6-w7z8p                                  2/2     Running                   0          20h
rook-ceph-mon-c-5d77b4f488-8vbbj                                  2/2     Running                   0          10h
rook-ceph-operator-6d69bc4586-7tn55                               1/1     Running                   0          10h
rook-ceph-osd-0-85ccc5b9c7-jnd6b                                  0/2     Init:CrashLoopBackOff     125        10h
rook-ceph-osd-1-6cb557bb87-2zj4g                                  0/2     Init:CrashLoopBackOff     128        10h
rook-ceph-osd-2-d8847cf45-ltsdv                                   2/2     Running                   0          20h
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-08ck5d-m92l9      0/1     Completed                 0          20h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-666db85jlzc4   1/2     CrashLoopBackOff          203        10h
rook-ceph-tools-84bc476959-l88fp                                  1/1     Running                   0          148m

Version of all relevant components (if applicable):
ocs-operator.v4.7.0-254.ci
OCP: 4.7.0-0.nightly-2021-02-08-052658

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Set up OCS via the UI with KMS (internal) enabled

Actual results:
OSD and RGW pods are in Init:CrashLoopBackOff / CrashLoopBackOff, and noobaa-db-pg-0 is stuck in Terminating.

Expected results:
All pods should be up and running.

Additional info:
Cluster access: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/290/
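For reference, the encryption/KMS settings the UI applied can be cross-checked against the StorageCluster CR. A minimal sketch, assuming the default StorageCluster name ocs-storagecluster in the openshift-storage namespace; the ConfigMap name below is the one OCS commonly uses for KMS connection details and is an assumption for this build:

# Show the encryption spec the UI wrote into the StorageCluster CR
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{.spec.encryption}'
# Inspect the KMS connection details (ConfigMap name assumed)
$ oc -n openshift-storage get configmap ocs-kms-connection-details -o yaml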
logs?
(In reply to Sébastien Han from comment #2)
> logs?

OCP logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1926617.zip

The must-gather is taking a long time to collect the OCS logs; I will update once the collection completes.

Thanks,
Shreekar
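For the record, OCS logs are typically collected with must-gather along these lines; the image and tag below are assumptions for this 4.7 build, not confirmed from this cluster:

$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7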
Vijay, the error is different:

Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Please open a different BZ.
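For anyone triaging this: that message comes from querying a Ceph daemon's admin socket, which is what the liveness probe does. A manual spot-check could look like the following sketch; the pod, container, and daemon ID are illustrative placeholders, not taken from this cluster:

# Ask the daemon for its status over the admin socket, from inside its container
$ oc -n openshift-storage exec <daemon-pod> -c <daemon-container> -- ceph daemon osd.0 status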
(In reply to Sébastien Han from comment #5)
> Vijay, the error is different: Liveness probe failed: admin_socket:
> exception getting command descriptions: [Errno 2] No such file or directory
> Please open a different BZ.

Thanks for the confirmation. Raised a new BZ for the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1927262
Tested on ocs-operator.v4.7.0-263.ci. Performed flow-based operations such as add capacity and node restart with running IOs. The RGW pods are up and running, but an existing OSD went into Init:CrashLoopBackOff:

rook-ceph-osd-2-696d8df8d4-5hcpf   0/2   Init:CrashLoopBackOff   87   7h7m

Moving to Assigned.
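For triage, the failing init container and its logs can be pulled roughly as follows; the placeholder container name is hypothetical, since the actual init container names vary with the Rook version:

# List the init containers defined on the stuck OSD pod
$ oc -n openshift-storage get pod rook-ceph-osd-2-696d8df8d4-5hcpf -o jsonpath='{.spec.initContainers[*].name}'
# Pull the previous (crashed) run's logs from the failing init container
$ oc -n openshift-storage logs rook-ceph-osd-2-696d8df8d4-5hcpf -c <failing-init-container> --previous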
Tested on ocs-operator.v4.7.0-731.ci with OpenShift version 4.7.0-0.nightly-2021-02-18-110409.

All OSDs are up and running, and add capacity worked; after add capacity, no issues were seen on the existing OSDs.

Moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
Removing the AutomationBacklog keyword. This will be covered by the installation of a KMS-enabled cluster; a specific test case is not needed.