Bug 1977609 - [OCS 4.7.2]- OSD in CLBO for ~10-20 mins on using kv-v2 after v2 deployment is supported with fix for Bug 1970583
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.4
Assignee: Sébastien Han
QA Contact: Shay Rozen
URL:
Whiteboard:
Depends On:
Blocks: 1997125
 
Reported: 2021-06-30 07:33 UTC by Neha Berry
Modified: 2021-09-15 13:27 UTC
CC: 10 users

Fixed In Version: v4.7.4-244.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1997125 (view as bug list)
Environment:
Last Closed: 2021-09-15 13:26:57 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 270 0 None closed Bug 1970583: ceph: cancel on-going orchestration on any CR update 2021-06-30 07:50:12 UTC
Github openshift rook pull 286 0 None None None 2021-08-26 07:13:57 UTC
Github rook rook pull 8216 0 None open ceph: cancel on-going orchestration on any CR update 2021-06-30 07:50:12 UTC
Github rook rook pull 8583 0 None None None 2021-08-24 10:01:59 UTC
Red Hat Product Errata RHBA-2021:3549 0 None None None 2021-09-15 13:27:14 UTC

Comment 8 Shay Rozen 2021-08-09 12:35:50 UTC
The problem is still occurring, although I do see the new log message from the PR in the logs:
2021-08-09 12:12:15.410480 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". cancelling any ongoing orchestration. diff=  v1.ClusterSpec{
  	... // 21 identical fields
  	CleanupPolicy: {},
  	HealthCheck:   {},
  	Security: v1.SecuritySpec{
  		KeyManagementService: v1.KeyManagementServiceSpec{
  			ConnectionDetails: map[string]string{
  				"KMS_PROVIDER":       "vault",
  				"KMS_SERVICE_NAME":   "vault",
  				"VAULT_ADDR":         "https://vault.qe.rh-ocs.com:8200",
+ 				"VAULT_BACKEND":      "v2",
  				"VAULT_BACKEND_PATH": "v2shay",
  				"VAULT_CACERT":       "ocs-kms-ca-secret-60g2ng",
  				... // 4 identical entries
  			},
  			TokenSecretName: "ocs-kms-token",
  		},
  	},
  	LogCollector: {Enabled: true, Periodicity: "24h"},
  }
The OSD is still in CLBO for 10-15 minutes.
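The "CR has changed ... cancelling any ongoing orchestration" log above is emitted when the controller spots a difference between the old and new cluster spec (Rook prints the diff with a go-cmp style comparison). A minimal sketch of that change-detection pattern, using the connection-details map from this report; the function name is illustrative, not Rook's actual code:

```go
package main

import "fmt"

// specChanged reports whether the new connection-details map differs
// from the previous one. On any difference the controller cancels the
// in-flight reconcile and starts a new one. Illustrative sketch only.
func specChanged(prev, next map[string]string) bool {
	if len(prev) != len(next) {
		return true
	}
	for k, v := range next {
		if prev[k] != v {
			return true
		}
	}
	return false
}

func main() {
	prev := map[string]string{
		"KMS_PROVIDER":       "vault",
		"VAULT_BACKEND_PATH": "v2shay",
	}
	next := map[string]string{
		"KMS_PROVIDER":       "vault",
		"VAULT_BACKEND_PATH": "v2shay",
		"VAULT_BACKEND":      "v2", // the key added by the CR update
	}
	fmt.Println(specChanged(prev, next))
}
```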

Comment 10 Travis Nielsen 2021-08-09 16:53:06 UTC
Here's what I'm seeing:
1. The OCS operator creates the CephCluster CR **without** the vault v2 setting
2. Rook creates the cluster. The OSDs are not able to connect since the v2 property is not specified.
3. The CephCluster is updated by the OCS operator **with** the v2 setting and a new Rook reconcile begins
4. The PGs are all unknown because of step 1, so the operator waits for them to be healthy.
5. After the timeout to wait for the PGs (10 min), the OSDs are updated
6. Now the OSDs are healthy and, as far as I can see, everything in the cluster looks good.

So we really need the OCS operator to apply the ConfigMap setting during the initial reconcile; updating the CephCluster CR after the initial reconcile does not work.

In the meantime, the workaround is to wait for the operator to time out on the PG health check, after which it updates the OSDs with the correct setting and the cluster becomes healthy.

@Arun It looks like you are familiar with this ConfigMap; could you take a look?

Comment 17 arun kumar mohan 2021-08-11 15:07:03 UTC
We already have a fix in Rook, through BZ https://bugzilla.redhat.com/show_bug.cgi?id=1975272, to auto-detect the Vault version.
PR: https://github.com/rook/rook/pull/8265
It would be redundant to place the same logic in every component (ocs-operator and Rook).

Mudit/Travis, is it possible for us to backport the changes in rook to version 1.6.x and update OCS Operator to point to the latest 1.6.x rook branch?

Comment 18 Travis Nielsen 2021-08-11 21:37:26 UTC
Is there no workaround that would allow the v2 vault setting to be applied during initial CephCluster CR creation, instead of editing it in later?

Backporting detecting vault v2 automatically would require:
- Testing in 4.9. This is a big change that needs validation before we consider backporting.
- Confirmation from Seb as he is much more familiar with the risk of those changes.

Comment 19 Mudit Agarwal 2021-08-16 12:18:35 UTC
I agree with Travis, this is indeed a big change for z-stream. Let's try to find some other solution.

Comment 20 Sébastien Han 2021-08-23 13:03:30 UTC
I agree with all of the above: it is a large change, and it also relies on another change that pulls in some Go dependencies.
Those dependencies are in line with Rook 4.9, which will only support Go 1.16 and above, where 4.7 runs on 1.16, so the issue is larger than it appears.
The backport would therefore have a much broader impact that is hard to estimate, and is extremely risky in any case.

One more thing: we cannot rely solely on auto-detection. Some customers will refuse to give access to "sys/mount", the Vault API path used for auto-detection.
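The auto-detection discussed here works because Vault's mounts listing returns, per secret engine, an options map whose "version" key is "2" for KV v2 and absent (or "1") for KV v1. A minimal sketch of that decision; the function name and the v1 fallback are illustrative assumptions, not Rook's actual implementation:

```go
package main

import "fmt"

// kvVersionFromMount maps a mount's options (as returned by Vault's
// sys/mounts listing) to a KV backend version string. When the options
// are missing, KV v1 is assumed here as the illustrative fallback.
func kvVersionFromMount(options map[string]string) string {
	if v, ok := options["version"]; ok && v == "2" {
		return "v2"
	}
	return "v1"
}

func main() {
	fmt.Println(kvVersionFromMount(map[string]string{"version": "2"}))
	fmt.Println(kvVersionFromMount(nil)) // no options: treated as KV v1
}
```

When the token cannot read the mounts path, as described above, this detection is impossible and the VAULT_BACKEND setting has to be supplied explicitly.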

Comment 21 Mudit Agarwal 2021-08-23 13:59:30 UTC
Arun, are we good here? Can we work on a 4.7.z specific fix?

Comment 22 arun kumar mohan 2021-08-24 10:23:37 UTC
Removing the needinfo as this has moved to Rook...

Comment 29 Shay Rozen 2021-08-30 08:09:39 UTC
Verified on 4.7.4-244.ci. All OSDs came to the Running state without CLBO and are encrypted.

Comment 33 errata-xmlrpc 2021-09-15 13:26:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3549

