Bug 1926617 - osds are in Init:CrashLoopBackOff with rgw in CrashLoopBackOff on KMS enabled cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Sébastien Han
QA Contact: Persona non grata
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-09 08:29 UTC by Persona non grata
Modified: 2021-09-21 06:30 UTC
CC List: 11 users

Fixed In Version: 4.7.0-731.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:19:00 UTC
Embargoed:




Links
System ID                               Status  Summary                                               Last Updated
Github openshift rook pull 169          closed  Bug 1926617: better expose vault API error messages   2021-02-19 08:33:19 UTC
Github openshift rook pull 172          closed  Bug 1926617: ceph: do not override existing keys      2021-02-19 08:33:19 UTC
Github rook rook pull 7193              closed  ceph: expose vault curl errors                        2021-02-19 13:31:02 UTC
Github rook rook pull 7230              closed  ceph: embed error in return                           2021-02-16 10:24:47 UTC
Github rook rook pull 7240              open    ceph: do not override existing keys                   2021-02-16 17:43:06 UTC
Red Hat Product Errata RHSA-2021:2041   None    None                                                  2021-05-19 09:19:45 UTC

Description Persona non grata 2021-02-09 08:29:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
With a fresh installation of OCS from the UI with KMS enabled, the OSDs went into the Init:CrashLoopBackOff state and RGW went into CrashLoopBackOff. No operations were performed; the cluster was simply left idle.

[jenkins@temp-jslave-sshreeka-vm ~]$ oc get po 
NAME                                                              READY   STATUS                  RESTARTS   AGE
csi-cephfsplugin-28gsb                                            3/3     Running                 0          20h
csi-cephfsplugin-provisioner-fdc478cc-4dnv4                       6/6     Running                 14         10h
csi-cephfsplugin-provisioner-fdc478cc-qwcsk                       6/6     Running                 22         20h
csi-cephfsplugin-qctch                                            3/3     Running                 0          20h
csi-cephfsplugin-wc9cc                                            3/3     Running                 0          20h
csi-rbdplugin-98rcr                                               3/3     Running                 0          20h
csi-rbdplugin-cznf9                                               3/3     Running                 0          20h
csi-rbdplugin-provisioner-64db99d598-9r5jp                        6/6     Running                 12         10h
csi-rbdplugin-provisioner-64db99d598-gc4js                        6/6     Running                 13         10h
csi-rbdplugin-zkxd8                                               3/3     Running                 0          20h
must-gather-x2hkp-helper                                          1/1     Running                 0          108m
noobaa-core-0                                                     1/1     Running                 0          10h
noobaa-db-pg-0                                                    1/1     Terminating             0          20h
noobaa-endpoint-55b4bd44f4-qp8t6                                  1/1     Running                 0          10h
noobaa-operator-76fd5fbfbf-6nv8x                                  1/1     Running                 7          10h
ocs-metrics-exporter-cc6484bf5-tkwht                              1/1     Running                 0          10h
ocs-operator-7fdf7b64bb-q2g7v                                     1/1     Running                 3          10h
rook-ceph-crashcollector-compute-0-5f7667c54b-bmhtx               1/1     Running                 0          20h
rook-ceph-crashcollector-compute-1-7bf885565c-lgp4x               1/1     Running                 0          10h
rook-ceph-crashcollector-compute-2-78d84947f6-scqtm               1/1     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fdf95585jwz5   2/2     Running                 0          10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-759ff5c85njg6   2/2     Running                 0          20h
rook-ceph-mgr-a-5bb88555dd-lz2p2                                  2/2     Running                 0          20h
rook-ceph-mon-a-5bc8f4c4f4-vk8lt                                  2/2     Running                 0          10h
rook-ceph-mon-b-56f458b5f6-w7z8p                                  2/2     Running                 0          20h
rook-ceph-mon-c-5d77b4f488-8vbbj                                  2/2     Running                 0          10h
rook-ceph-operator-6d69bc4586-7tn55                               1/1     Running                 0          10h
rook-ceph-osd-0-85ccc5b9c7-jnd6b                                  0/2     Init:CrashLoopBackOff   125        10h
rook-ceph-osd-1-6cb557bb87-2zj4g                                  0/2     Init:CrashLoopBackOff   128        10h
rook-ceph-osd-2-d8847cf45-ltsdv                                   2/2     Running                 0          20h
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-08ck5d-m92l9      0/1     Completed               0          20h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-666db85jlzc4   1/2     CrashLoopBackOff        203        10h
rook-ceph-tools-84bc476959-l88fp                                  1/1     Running                 0          148m


Version of all relevant components (if applicable):

ocs-operator.v4.7.0-254.ci
ocp : 4.7.0-0.nightly-2021-02-08-052658

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up OCS via the UI with KMS (internal) enabled



Actual results:
OSD pods are in Init:CrashLoopBackOff and the RGW pod is in CrashLoopBackOff, with noobaa-db-pg-0 stuck in the Terminating state.

Expected results:
All pods should be up and running.

Additional info:

Cluster access: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/290/

Comment 2 Sébastien Han 2021-02-09 08:41:43 UTC
logs?

Comment 3 Persona non grata 2021-02-09 09:23:55 UTC
(In reply to Sébastien Han from comment #2)
> logs?

OCP logs http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1926617.zip
Must-gather is taking a lot of time collecting the OCS logs; I will update once I have collected them.

Thanks,
Shreekar

Comment 5 Sébastien Han 2021-02-10 11:49:38 UTC
Vijay, the error is different: Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Please open a different BZ.

Comment 6 Vijay Avuthu 2021-02-10 12:21:30 UTC
(In reply to Sébastien Han from comment #5)
> Vijay, the error is different: Liveness probe failed: admin_socket:
> exception getting command descriptions: [Errno 2] No such file or directory
> Please open a different BZ.

Thanks for the confirmation. Raised new bz for issue: https://bugzilla.redhat.com/show_bug.cgi?id=1927262

Comment 8 Persona non grata 2021-02-16 14:26:22 UTC
Tested on ocs-operator.v4.7.0-263.ci; performed flow-based operations such as add capacity and node restart with running IOs. The RGW pods are up and running, but an existing OSD went into:

rook-ceph-osd-2-696d8df8d4-5hcpf                                  0/2     Init:CrashLoopBackOff   87         7h7m

Moving to Assigned.

Comment 11 Persona non grata 2021-02-22 09:39:13 UTC
Tested on ocs-operator.v4.7.0-731.ci with OpenShift version
4.7.0-0.nightly-2021-02-18-110409

All OSDs are up and running and add capacity worked; after add capacity, no issues were seen on the existing OSDs. Moving the bug to Verified.

Comment 14 errata-xmlrpc 2021-05-19 09:19:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 15 Jilju Joy 2021-09-21 06:30:35 UTC
Removing the AutomationBacklog keyword. This will be covered by the installation of a KMS-enabled cluster; a specific test case is not needed.
