Description of problem (please be as detailed as possible and provide log snippets):

Deployment of ODF 4.13 with Thales KMS is failing because the StorageCluster becomes stuck in the Progressing state. The last status condition shows an Unknown status set while initializing, with nothing afterwards. The cluster in question is still available for troubleshooting; connection details are linked below.

~ oc --kubeconfig kubeconfig -n openshift-storage get storagecluster
NAME                 AGE     PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   5h48m   Progressing              2023-03-20T16:41:52Z   4.13.0

    Last Heartbeat Time:   2023-03-20T16:41:52Z
    Last Transition Time:  2023-03-20T16:41:52Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable

Version of all relevant components (if applicable):
Client Version: 4.13.0-0.nightly-2023-03-19-052243
Kustomize Version: v4.5.7
Server Version: 4.13.0-0.nightly-2023-03-19-052243
Kubernetes Version: v1.26.2+06e8c46
odf-operator version: 4.13.0-107

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, this impacts Thales KMS testing for 4.13.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF with Thales KMS (see Jenkins job below)

Actual results:
StorageCluster stuck in the Progressing state

Expected results:
Successful creation of the StorageCluster

Additional info:
Jenkins job: https://url.corp.redhat.com/959a4f7
Must-gather: https://url.corp.redhat.com/02e3ff8
kubeconfig: https://url.corp.redhat.com/9fdb5b9
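For anyone picking this up from the linked kubeconfig, the stuck condition can be pulled out directly; a minimal sketch using standard oc output options (resource and namespace names are taken from the output above):

# Print just the StorageCluster phase
oc --kubeconfig kubeconfig -n openshift-storage get storagecluster ocs-storagecluster \
  -o jsonpath='{.status.phase}{"\n"}'

# List every status condition as type/status/message, one per line
oc --kubeconfig kubeconfig -n openshift-storage get storagecluster ocs-storagecluster \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'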
I see such errors in the rook-ceph-operator logs:

2023-03-20T17:03:53.706212579Z 2023-03-20 17:03:53.706169 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to fetch kms token secret "ocs-thales-kmip-991dee84fe0b447f88933ce7": client rate limiter Wait returned an error: context canceled

https://url.corp.redhat.com/33595b9
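The failing secret fetch can be cross-checked by hand; a minimal sketch, assuming the secret name from the log line above and the default openshift-storage namespace:

# Confirm the KMS token secret the operator is trying to fetch actually exists
oc -n openshift-storage get secret ocs-thales-kmip-991dee84fe0b447f88933ce7

# Grep the operator log for further KMS validation errors
oc -n openshift-storage logs deploy/rook-ceph-operator | grep -i kms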
There was a default value set in the CRD that was causing these context-canceled errors. That default has been removed from the CRD, so the KMS scenarios will work again now.
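The comment doesn't name the exact field, but a stray default in a CRD schema can be located by scanning its OpenAPI schema; a minimal sketch, assuming jq is available and that the StorageCluster CRD (storageclusters.ocs.openshift.io) is the one that carried the default:

# Dump every schema node in the StorageCluster CRD that declares a default value
oc get crd storageclusters.ocs.openshift.io -o json \
  | jq '.spec.versions[].schema.openAPIV3Schema | .. | objects | select(has("default"))'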
Fixed in version: any of the latest stable 4.13 builds
We are seeing successful deployments with 4.13.0-121. Jenkins: https://url.corp.redhat.com/edbf9db
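For reference, a minimal verification sketch; it assumes Ready is the expected terminal phase for a healthy deployment and uses oc's jsonpath-based wait:

# Block until the StorageCluster leaves Progressing and reports Ready (10-minute timeout)
oc -n openshift-storage wait storagecluster/ocs-storagecluster \
  --for=jsonpath='{.status.phase}'=Ready --timeout=10m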
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742