The problem is still occurring, although I do see the new log line from the PR:

```
2021-08-09 12:12:15.410480 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". cancelling any ongoing orchestration. diff= v1.ClusterSpec{
  ... // 21 identical fields
  CleanupPolicy: {},
  HealthCheck:   {},
  Security: v1.SecuritySpec{
    KeyManagementService: v1.KeyManagementServiceSpec{
      ConnectionDetails: map[string]string{
        "KMS_PROVIDER":       "vault",
        "KMS_SERVICE_NAME":   "vault",
        "VAULT_ADDR":         "https://vault.qe.rh-ocs.com:8200",
+       "VAULT_BACKEND":      "v2",
        "VAULT_BACKEND_PATH": "v2shay",
        "VAULT_CACERT":       "ocs-kms-ca-secret-60g2ng",
        ... // 4 identical entries
      },
      TokenSecretName: "ocs-kms-token",
    },
  },
  LogCollector: {Enabled: true, Periodicity: "24h"},
}
```

The OSDs are still in CrashLoopBackOff (CLBO) for 10-15 minutes.
Here's what I'm seeing:

1. The OCS operator creates the CephCluster CR **without** the vault v2 setting.
2. Rook creates the cluster. The OSDs are unable to connect since the v2 property is not specified.
3. The OCS operator updates the CephCluster **with** the v2 setting, and a new Rook reconcile begins.
4. The PGs are all in the `unknown` state because of step 1, so the operator waits for them to become healthy.
5. After the timeout waiting for the PGs (10 min), the OSDs are updated.
6. Now the OSDs are healthy and, as far as I can see, everything looks good in the cluster.

So we really need the OCS operator to apply the configmap setting with the initial reconcile; updating the CephCluster CR after the initial reconcile does not work. In the meantime, the workaround is to wait for the operator to time out waiting for the PGs to be healthy and then update the OSDs with the correct setting, after which the cluster becomes healthy.

@Arun Looks like you are familiar with this configmap; could you take a look?
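For illustration, this is roughly the shape the CephCluster CR would need at initial creation time (step 1 above) so the OSDs can reach the KV v2 backend on first start. The field names follow Rook's `spec.security.kms` layout, and the address, backend path, and secret names are copied from the diff in this cluster; treat the exact values as environment-specific, not a definitive spec.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
spec:
  security:
    kms:
      # VAULT_BACKEND must be present in the *initial* CR rather than
      # patched in later, otherwise the OSDs start against the wrong
      # KV API version and go into CLBO.
      connectionDetails:
        KMS_PROVIDER: vault
        KMS_SERVICE_NAME: vault
        VAULT_ADDR: "https://vault.qe.rh-ocs.com:8200"
        VAULT_BACKEND: "v2"
        VAULT_BACKEND_PATH: v2shay
        VAULT_CACERT: ocs-kms-ca-secret-60g2ng
      tokenSecretName: ocs-kms-token
```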
We already have a fix in Rook to auto-detect the vault version, tracked through BZ https://bugzilla.redhat.com/show_bug.cgi?id=1975272 with PR https://github.com/rook/rook/pull/8265. It would be redundant to place the same logic in multiple components (OCS operator + Rook). Mudit/Travis, is it possible for us to backport the Rook changes to the 1.6.x branch and update the OCS operator to point to the latest 1.6.x Rook release?
Is there no workaround that would allow the v2 vault setting to be applied during the initial CephCluster CR creation, instead of editing the CR afterwards? Backporting the automatic vault version detection would require:

- Testing in 4.9. This is a big change that needs validation before we consider backporting.
- Confirmation from Seb, as he is much more familiar with the risk of those changes.
I agree with Travis, this is indeed a big change for z-stream. Let's try to find some other solution.
I agree with all the above: it's a large change, and it also relies on another change that brings in new Go dependencies. Those dependencies are in line with Rook 4.9, which will only support Go 1.16 and above, where 4.7 runs on 1.16, so it's a larger issue. The backport would therefore have a much broader impact that is hard to estimate, but it is extremely risky in any case. One more thing: we cannot rely solely on auto-detection. Some customers will refuse to give access to "sys/mount", which is the Vault API path used for auto-detection.
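To make the auto-detection point concrete: Vault's `GET /v1/sys/mounts` endpoint reports each secret engine's mount options, including `options.version` for KV mounts, which is why access to that path is needed. The sketch below is illustrative only, not Rook's actual implementation; it just parses a sys/mounts-style JSON payload to decide whether a mount is KV v1 or v2 (the `kvVersion` helper and the sample payload are assumptions for this example).

```go
// Sketch: determine the KV secret-engine version for a given mount from a
// Vault sys/mounts JSON response. Assumes the caller has already fetched
// the payload, which is exactly what customers denying "sys/mount" access
// would prevent.
package main

import (
	"encoding/json"
	"fmt"
)

// kvVersion returns the "options.version" value for the mount at mountPath
// (e.g. "v2shay/"). When no version option is present, it defaults to "1",
// matching Vault's KV v1 behavior.
func kvVersion(sysMountsJSON []byte, mountPath string) (string, error) {
	var mounts map[string]struct {
		Type    string            `json:"type"`
		Options map[string]string `json:"options"`
	}
	if err := json.Unmarshal(sysMountsJSON, &mounts); err != nil {
		return "", err
	}
	m, ok := mounts[mountPath]
	if !ok {
		return "", fmt.Errorf("mount %q not found", mountPath)
	}
	if v, ok := m.Options["version"]; ok && v != "" {
		return v, nil
	}
	return "1", nil
}

func main() {
	// Abbreviated sample of a sys/mounts response for a KV v2 mount.
	sample := []byte(`{"v2shay/": {"type": "kv", "options": {"version": "2"}}}`)
	v, err := kvVersion(sample, "v2shay/")
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // prints "2"
}
```

When auto-detection is unavailable, the only alternative is an explicit setting such as `VAULT_BACKEND: v2`, which is exactly what this bug needs applied at initial CR creation.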
Arun, are we good here? Can we work on a 4.7.z specific fix?
Removing the need-info as it is moved to rook...
Verified on 4.7.4-244.ci. All OSDs came to a running state without CLBO and are encrypted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.7.4 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3549