Bug 1926617 - osds are in Init:CrashLoopBackOff with rgw in CrashLoopBackOff on KMS enabled cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Sébastien Han
QA Contact: Persona non grata
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-09 08:29 UTC by Persona non grata
Modified: 2021-09-21 06:30 UTC
CC List: 11 users

Fixed In Version: 4.7.0-731.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:19:00 UTC
Embargoed:




Links
System ID                               Status  Summary                                               Last Updated
Github openshift rook pull 169          closed  Bug 1926617: better expose vault API error messages   2021-02-19 08:33:19 UTC
Github openshift rook pull 172          closed  Bug 1926617: ceph: do not override existing keys      2021-02-19 08:33:19 UTC
Github rook rook pull 7193              closed  ceph: expose vault curl errors                        2021-02-19 13:31:02 UTC
Github rook rook pull 7230              closed  ceph: embed error in return                           2021-02-16 10:24:47 UTC
Github rook rook pull 7240              open    ceph: do not override existing keys                   2021-02-16 17:43:06 UTC
Red Hat Product Errata RHSA-2021:2041   None    None                                                  2021-05-19 09:19:45 UTC

Description Persona non grata 2021-02-09 08:29:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
With a fresh installation of OCS from the UI with KMS enabled, the OSDs went into the Init:CrashLoopBackOff state and RGW went into CrashLoopBackOff. No operations were performed; the cluster was simply left idle.

[jenkins@temp-jslave-sshreeka-vm ~]$ oc get po 
NAME                                                              READY   STATUS                  RESTARTS   AGE
csi-cephfsplugin-28gsb                                            3/3     Running                 0          20h
csi-cephfsplugin-provisioner-fdc478cc-4dnv4                       6/6     Running                 14         10h
csi-cephfsplugin-provisioner-fdc478cc-qwcsk                       6/6     Running                 22         20h
csi-cephfsplugin-qctch                                            3/3     Running                 0          20h
csi-cephfsplugin-wc9cc                                            3/3     Running                 0          20h
csi-rbdplugin-98rcr                                               3/3     Running                 0          20h
csi-rbdplugin-cznf9                                               3/3     Running                 0          20h
csi-rbdplugin-provisioner-64db99d598-9r5jp                        6/6     Running                 12         10h
csi-rbdplugin-provisioner-64db99d598-gc4js                        6/6     Running                 13         10h
csi-rbdplugin-zkxd8                                               3/3     Running                 0          20h
must-gather-x2hkp-helper                                          1/1     Running                 0          108m
noobaa-core-0                                                     1/1     Running                 0          10h
noobaa-db-pg-0                                                    1/1     Terminating             0          20h
noobaa-endpoint-55b4bd44f4-qp8t6                                  1/1     Running                 0          10h
noobaa-operator-76fd5fbfbf-6nv8x                                  1/1     Running                 7          10h
ocs-metrics-exporter-cc6484bf5-tkwht                              1/1     Running                 0          10h
ocs-operator-7fdf7b64bb-q2g7v                                     1/1     Running                 3          10h
rook-ceph-crashcollector-compute-0-5f7667c54b-bmhtx               1/1     Running                 0          20h
rook-ceph-crashcollector-compute-1-7bf885565c-lgp4x               1/1     Running                 0          10h
rook-ceph-crashcollector-compute-2-78d84947f6-scqtm               1/1     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fdf95585jwz5   2/2     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-759ff5c85njg6   2/2     Running                 0          20h
rook-ceph-mgr-a-5bb88555dd-lz2p2                                  2/2     Running                 0          20h
rook-ceph-mon-a-5bc8f4c4f4-vk8lt                                  2/2     Running                 0          10h
rook-ceph-mon-b-56f458b5f6-w7z8p                                  2/2     Running                 0          20h
rook-ceph-mon-c-5d77b4f488-8vbbj                                  2/2     Running                 0          10h
rook-ceph-operator-6d69bc4586-7tn55                               1/1     Running                 0          10h
rook-ceph-osd-0-85ccc5b9c7-jnd6b                                  0/2     Init:CrashLoopBackOff   125        10h
rook-ceph-osd-1-6cb557bb87-2zj4g                                  0/2     Init:CrashLoopBackOff   128        10h
rook-ceph-osd-2-d8847cf45-ltsdv                                   2/2     Running                 0          20h
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-08ck5d-m92l9      0/1     Completed               0          20h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-666db85jlzc4   1/2     CrashLoopBackOff        203        10h
rook-ceph-tools-84bc476959-l88fp                                  1/1     Running                 0          148m


Version of all relevant components (if applicable):

ocs-operator.v4.7.0-254.ci
ocp : 4.7.0-0.nightly-2021-02-08-052658

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up OCS via the UI with KMS (internal) enabled



Actual results:
OSD pods are in Init:CrashLoopBackOff and the RGW pod is in CrashLoopBackOff, with noobaa-db-pg-0 stuck in the Terminating state.

Expected results:
All pods should be up and running.

Additional info:

Cluster access: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/290/

Comment 2 Sébastien Han 2021-02-09 08:41:43 UTC
logs?

Comment 3 Persona non grata 2021-02-09 09:23:55 UTC
(In reply to Sébastien Han from comment #2)
> logs?

OCP logs http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1926617.zip
Must-gather is taking a lot of time collecting the OCS logs; I will update once I have collected them.

Thanks,
Shreekar

Comment 5 Sébastien Han 2021-02-10 11:49:38 UTC
Vijay, the error is different: Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Please open a different BZ.

Comment 6 Vijay Avuthu 2021-02-10 12:21:30 UTC
(In reply to Sébastien Han from comment #5)
> Vijay, the error is different: Liveness probe failed: admin_socket:
> exception getting command descriptions: [Errno 2] No such file or directory
> Please open a different BZ.

Thanks for the confirmation. Raised new bz for issue: https://bugzilla.redhat.com/show_bug.cgi?id=1927262

Comment 8 Persona non grata 2021-02-16 14:26:22 UTC
Tested on ocs-operator.v4.7.0-263.ci; performed flow-based operations such as add capacity and node restart with running IOs. The RGW pods are up and running, but an existing OSD went into:

rook-ceph-osd-2-696d8df8d4-5hcpf                                  0/2     Init:CrashLoopBackOff   87         7h7m

Moving to Assigned.

Comment 11 Persona non grata 2021-02-22 09:39:13 UTC
Tested on ocs-operator.v4.7.0-731.ci with OpenShift version
4.7.0-0.nightly-2021-02-18-110409

All OSDs are up and running and add capacity worked; after add capacity, no issues were seen on the existing OSDs. Moving the bug to Verified.

Comment 14 errata-xmlrpc 2021-05-19 09:19:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 15 Jilju Joy 2021-09-21 06:30:35 UTC
Removing the AutomationBacklog keyword. This will be covered by the installation of a KMS-enabled cluster; a specific test case is not needed.
