Bug 1926617

Summary: OSDs are in Init:CrashLoopBackOff and RGW is in CrashLoopBackOff on a KMS-enabled cluster

Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: rook
Version: 4.7
Target Release: OCS 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Persona non grata <nobody+410372>
Assignee: Sébastien Han <shan>
QA Contact: Persona non grata <nobody+410372>
Docs Contact:
CC: branto, ebenahar, hnallurv, jijoy, jthottan, madam, muagarwa, ocs-bugs, shan, sostapov, vavuthu
Keywords: AutomationTriaged
Target Milestone: ---
Whiteboard:
Fixed In Version: 4.7.0-731.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Last Closed: 2021-05-19 09:19:00 UTC

Description Persona non grata 2021-02-09 08:29:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On a fresh installation of OCS from the UI with KMS enabled, the OSD pods went into the Init:CrashLoopBackOff state and the RGW pod went into CrashLoopBackOff. No operations were performed; the cluster was simply left idle.

[jenkins@temp-jslave-sshreeka-vm ~]$ oc get po 
NAME                                                              READY   STATUS                  RESTARTS   AGE
csi-cephfsplugin-28gsb                                            3/3     Running                 0          20h
csi-cephfsplugin-provisioner-fdc478cc-4dnv4                       6/6     Running                 14         10h
csi-cephfsplugin-provisioner-fdc478cc-qwcsk                       6/6     Running                 22         20h
csi-cephfsplugin-qctch                                            3/3     Running                 0          20h
csi-cephfsplugin-wc9cc                                            3/3     Running                 0          20h
csi-rbdplugin-98rcr                                               3/3     Running                 0          20h
csi-rbdplugin-cznf9                                               3/3     Running                 0          20h
csi-rbdplugin-provisioner-64db99d598-9r5jp                        6/6     Running                 12         10h
csi-rbdplugin-provisioner-64db99d598-gc4js                        6/6     Running                 13         10h
csi-rbdplugin-zkxd8                                               3/3     Running                 0          20h
must-gather-x2hkp-helper                                          1/1     Running                 0          108m
noobaa-core-0                                                     1/1     Running                 0          10h
noobaa-db-pg-0                                                    1/1     Terminating             0          20h
noobaa-endpoint-55b4bd44f4-qp8t6                                  1/1     Running                 0          10h
noobaa-operator-76fd5fbfbf-6nv8x                                  1/1     Running                 7          10h
ocs-metrics-exporter-cc6484bf5-tkwht                              1/1     Running                 0          10h
ocs-operator-7fdf7b64bb-q2g7v                                     1/1     Running                 3          10h
rook-ceph-crashcollector-compute-0-5f7667c54b-bmhtx               1/1     Running                 0          20h
rook-ceph-crashcollector-compute-1-7bf885565c-lgp4x               1/1     Running                 0          10h
rook-ceph-crashcollector-compute-2-78d84947f6-scqtm               1/1     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fdf95585jwz5   2/2     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-759ff5c85njg6   2/2     Running                 0          20h
rook-ceph-mgr-a-5bb88555dd-lz2p2                                  2/2     Running                 0          20h
rook-ceph-mon-a-5bc8f4c4f4-vk8lt                                  2/2     Running                 0          10h
rook-ceph-mon-b-56f458b5f6-w7z8p                                  2/2     Running                 0          20h
rook-ceph-mon-c-5d77b4f488-8vbbj                                  2/2     Running                 0          10h
rook-ceph-operator-6d69bc4586-7tn55                               1/1     Running                 0          10h
rook-ceph-osd-0-85ccc5b9c7-jnd6b                                  0/2     Init:CrashLoopBackOff   125        10h
rook-ceph-osd-1-6cb557bb87-2zj4g                                  0/2     Init:CrashLoopBackOff   128        10h
rook-ceph-osd-2-d8847cf45-ltsdv                                   2/2     Running                 0          20h
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-08ck5d-m92l9      0/1     Completed               0          20h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-666db85jlzc4   1/2     CrashLoopBackOff        203        10h
rook-ceph-tools-84bc476959-l88fp                                  1/1     Running                 0          148m
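
For triage, a minimal diagnostic sketch of how the failing OSD pod could be inspected (assuming the usual openshift-storage namespace; the pod name is taken from the listing above, and <init-container> is a placeholder for whichever init container the first command reports as crashing):

# List the init containers of the failing OSD pod and their states
oc -n openshift-storage get pod rook-ceph-osd-0-85ccc5b9c7-jnd6b \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Show recent events (failed mounts, probe failures, restarts) for the pod
oc -n openshift-storage describe pod rook-ceph-osd-0-85ccc5b9c7-jnd6b

# Dump the log of the crashing init container from its previous run
oc -n openshift-storage logs rook-ceph-osd-0-85ccc5b9c7-jnd6b -c <init-container> --previous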


Version of all relevant components (if applicable):

ocs-operator.v4.7.0-254.ci
OCP: 4.7.0-0.nightly-2021-02-08-052658

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up OCS via the UI with KMS (internal) enabled
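
For reference, a rough sketch of how the resulting encryption/KMS settings could be checked after the UI deployment (assuming the standard openshift-storage namespace and the default StorageCluster name ocs-storagecluster; the grep patterns are assumptions, not taken from this report):

# Confirm the StorageCluster was created with encryption/KMS enabled
oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml | grep -A 5 -i encryption

# Look for the KMS connection details created alongside it
oc -n openshift-storage get configmap,secret | grep -i kms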



Actual results:
OSD pods are in Init:CrashLoopBackOff and the RGW pod is in CrashLoopBackOff, with noobaa-db-pg-0 stuck in the Terminating state.

Expected results:
All pods should be up and running.

Additional info:

Cluster access: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/290/

Comment 2 Sébastien Han 2021-02-09 08:41:43 UTC
logs?

Comment 3 Persona non grata 2021-02-09 09:23:55 UTC
(In reply to Sébastien Han from comment #2)
> logs?

OCP logs http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1926617.zip
Must-gather is taking a lot of time while collecting the OCS logs; I will update once it is collected.

Thanks,
Shreekar

Comment 5 Sébastien Han 2021-02-10 11:49:38 UTC
Vijay, the error is different: Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Please open a different BZ.
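
As an aside, a small sketch of how liveness-probe failures like the one quoted above can be confirmed (assuming the openshift-storage namespace; the RGW pod name comes from the listing in the description):

# Liveness probe failures surface as Unhealthy events
oc -n openshift-storage get events --field-selector reason=Unhealthy

# Inspect the probe definition and status of the RGW pod directly
oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-666db85jlzc4 | grep -B 2 -A 10 -i liveness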

Comment 6 Vijay Avuthu 2021-02-10 12:21:30 UTC
(In reply to Sébastien Han from comment #5)
> Vijay, the error is different: Liveness probe failed: admin_socket:
> exception getting command descriptions: [Errno 2] No such file or directory
> Please open a different BZ.

Thanks for the confirmation. Raised a new BZ for the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1927262

Comment 8 Persona non grata 2021-02-16 14:26:22 UTC
Tested on ocs-operator.v4.7.0-263.ci and ran flow-based operations such as add capacity and node restart with running IOs. The RGW pods are up and running, but an existing OSD went to:

rook-ceph-osd-2-696d8df8d4-5hcpf                                  0/2     Init:CrashLoopBackOff   87         7h7m

Moving to Assigned.

Comment 11 Persona non grata 2021-02-22 09:39:13 UTC
Tested on ocs-operator.v4.7.0-731.ci with OpenShift version
4.7.0-0.nightly-2021-02-18-110409

All OSDs are up and running and add capacity worked; after add capacity, no issues were seen on the existing OSDs. Moving the bug to Verified.

Comment 14 errata-xmlrpc 2021-05-19 09:19:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 15 Jilju Joy 2021-09-21 06:30:35 UTC
Removing the AutomationBacklog keyword. This will be covered by the installation of a KMS-enabled cluster; a specific test case is not needed.