Bug 1977609

Summary: [OCS 4.7.2]- OSD in CLBO for ~10-20 mins on using kv-v2 after v2 deployment is supported with fix for Bug 1970583
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Neha Berry <nberry>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Shay Rozen <srozen>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.7
CC: amohan, ebenahar, madam, muagarwa, ocs-bugs, rcyriac, shan, sostapov, srozen, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog, ZStream
Target Release: OCS 4.7.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.7.4-244.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1997125
Environment:
Last Closed: 2021-09-15 13:26:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1997125

Comment 8 Shay Rozen 2021-08-09 12:35:50 UTC
The problem still occurs, although the logs show the new message added by the PR:
2021-08-09 12:12:15.410480 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". cancelling any ongoing orchestration. diff=  v1.ClusterSpec{
  	... // 21 identical fields
  	CleanupPolicy: {},
  	HealthCheck:   {},
  	Security: v1.SecuritySpec{
  		KeyManagementService: v1.KeyManagementServiceSpec{
  			ConnectionDetails: map[string]string{
  				"KMS_PROVIDER":       "vault",
  				"KMS_SERVICE_NAME":   "vault",
  				"VAULT_ADDR":         "https://vault.qe.rh-ocs.com:8200",
+ 				"VAULT_BACKEND":      "v2",
  				"VAULT_BACKEND_PATH": "v2shay",
  				"VAULT_CACERT":       "ocs-kms-ca-secret-60g2ng",
  				... // 4 identical entries
  			},
  			TokenSecretName: "ocs-kms-token",
  		},
  	},
  	LogCollector: {Enabled: true, Periodicity: "24h"},
  }
The OSD is still in CLBO for 10-15 minutes.

Comment 10 Travis Nielsen 2021-08-09 16:53:06 UTC
Here's what I'm seeing:
1. The OCS operator creates the CephCluster CR **without** the vault v2 setting
2. Rook creates the cluster. The OSDs are not able to connect since the v2 property is not specified.
3. The CephCluster is updated by the OCS operator **with** the v2 setting and a new Rook reconcile begins
4. The PGs are all unknown because of step 1, so the operator waits for them to be healthy.
5. After the timeout to wait for the PGs (10 min), the OSDs are updated
6. Now the OSDs are healthy and everything looks good in the cluster that I can see. 

So we really need the OCS operator to apply the configmap setting as part of the initial reconcile. Updating the CephCluster CR after the initial reconcile does not work: the OSDs stay unhealthy until the wait for the PGs times out.
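
A minimal sketch of that idea, written in Python for brevity (the operators themselves are written in Go): the connection details are assembled in a single pass from the config map before the CephCluster CR is first created, so VAULT_BACKEND is already present on the initial reconcile. The function names and the shape of the config map data are hypothetical; only the key names come from the diff in comment 8.

```python
# Hypothetical helper, not the actual OCS operator code.
KMS_KEYS = (
    "KMS_PROVIDER",
    "KMS_SERVICE_NAME",
    "VAULT_ADDR",
    "VAULT_BACKEND",
    "VAULT_BACKEND_PATH",
    "VAULT_CACERT",
)


def build_connection_details(configmap_data):
    """Copy every known KMS key from the config map in one pass."""
    return {k: configmap_data[k] for k in KMS_KEYS if k in configmap_data}


def missing_backend_version(details):
    """True when a Vault KMS is configured but no KV version is given."""
    return details.get("KMS_PROVIDER") == "vault" and "VAULT_BACKEND" not in details


# Example config map contents (values mirror the diff in comment 8).
configmap_data = {
    "KMS_PROVIDER": "vault",
    "KMS_SERVICE_NAME": "vault",
    "VAULT_ADDR": "https://vault.qe.rh-ocs.com:8200",
    "VAULT_BACKEND": "v2",
    "VAULT_BACKEND_PATH": "v2shay",
}

details = build_connection_details(configmap_data)
print(missing_backend_version(details))
```

A check like `missing_backend_version` could be run before the CR is created, so that a missing setting is caught before the first reconcile instead of during it.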

In the meantime, the workaround is to wait: once the operator times out waiting for the PGs to be healthy, it updates the OSDs with the correct setting, and only then does the cluster become healthy.
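
The delay can be illustrated with a toy model of steps 4 and 5. This is only a sketch of the reasoning, not Rook's reconcile code; the function name is hypothetical and the 10-minute value is the timeout quoted in step 5.

```python
PG_WAIT_TIMEOUT_MIN = 10  # timeout quoted in step 5 above


def minutes_until_osds_updated(pgs_healthy_at, timeout=PG_WAIT_TIMEOUT_MIN):
    """Return how long the reconcile waits before it updates the OSDs.

    pgs_healthy_at is the minute at which the PGs become healthy, or None
    if they never do while the OSDs still run with the old settings.
    """
    if pgs_healthy_at is not None and pgs_healthy_at <= timeout:
        return pgs_healthy_at
    return timeout


# The PGs cannot become healthy before the OSDs receive the v2 setting,
# so in this scenario the wait always runs to the full timeout.
print(minutes_until_osds_updated(None))
```

Under this model the OSDs stay in CLBO for at least the full timeout, which matches the 10-15 minutes reported in comment 8.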

@Arun It looks like you are familiar with this configmap. Could you take a look?

Comment 17 arun kumar mohan 2021-08-11 15:07:03 UTC
We already have a fix in rook, through BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1975272, to auto-detect the vault version
PR: https://github.com/rook/rook/pull/8265
It would be redundant to duplicate this logic in both components (OCS-Operator + Rook).

Mudit/Travis, is it possible for us to backport the changes in rook to version 1.6.x and update OCS Operator to point to the latest 1.6.x rook branch?

Comment 18 Travis Nielsen 2021-08-11 21:37:26 UTC
Is there no workaround that would allow the v2 vault setting to be applied during the initial CephCluster CR creation, instead of editing the CR later?

Backporting detecting vault v2 automatically would require:
- Testing in 4.9. This is a big change that needs validation before we consider backporting.
- Confirmation from Seb as he is much more familiar with the risk of those changes.

Comment 19 Mudit Agarwal 2021-08-16 12:18:35 UTC
I agree with Travis, this is indeed a big change for z-stream. Let's try to find some other solution.

Comment 20 Sébastien Han 2021-08-23 13:03:30 UTC
I agree with all of the above. It's a large change, and it also relies on another change that brings in some Go dependencies.
Those deps are in line with Rook 4.9 which will only support Go 1.16 and above where 4.7 runs on 1.16, so it's a larger issue.
So the backport would have a much broader impact, which is hard to estimate but extremely risky in any case.

One more thing: we cannot rely solely on auto-detection. Some customers will refuse to give access to "sys/mounts", the Vault API path used for auto-detection.
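
A sketch of how an explicit setting and auto-detection could be combined, in Python for brevity. It is not the Rook implementation. It assumes the mounts listing has already been fetched and decoded into a dict keyed by mount path (with a trailing slash), with the KV version under "options"; `PermissionError` stands in for Vault denying access to the listing, and a missing version is treated as v1. All function names are hypothetical.

```python
def detect_kv_version(mounts, backend_path):
    """Read the KV version of backend_path from a decoded mounts listing."""
    mount = mounts.get(backend_path.rstrip("/") + "/")
    if mount is None or mount.get("type") != "kv":
        return None
    # Assumption: a KV mount without an explicit version is treated as v1.
    version = (mount.get("options") or {}).get("version", "1")
    return "v" + version


def resolve_backend(details, fetch_mounts):
    """Prefer an explicit VAULT_BACKEND; otherwise try auto-detection."""
    explicit = details.get("VAULT_BACKEND")
    if explicit:
        return explicit
    try:
        mounts = fetch_mounts()
    except PermissionError:
        # The token has no access to the mounts listing: cannot auto-detect.
        return None
    return detect_kv_version(mounts, details.get("VAULT_BACKEND_PATH", ""))


# Illustrative data only; the mount name mirrors the diff in comment 8.
sample_mounts = {"v2shay/": {"type": "kv", "options": {"version": "2"}}}


def denied():
    raise PermissionError("no access to the mounts listing")


print(resolve_backend({"VAULT_BACKEND_PATH": "v2shay"}, lambda: sample_mounts))
print(resolve_backend({"VAULT_BACKEND_PATH": "v2shay"}, denied))
print(resolve_backend({"VAULT_BACKEND": "v2"}, denied))
```

The explicit setting is checked first, so a deployment that denies access to the mounts listing still works as long as VAULT_BACKEND is provided.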

Comment 21 Mudit Agarwal 2021-08-23 13:59:30 UTC
Arun, are we good here? Can we work on a 4.7.z specific fix?

Comment 22 arun kumar mohan 2021-08-24 10:23:37 UTC
Removing the needinfo, as this has been moved to rook.

Comment 29 Shay Rozen 2021-08-30 08:09:39 UTC
Verified on 4.7.4-244.ci. All OSDs came up in Running state without CLBO and are encrypted.

Comment 33 errata-xmlrpc 2021-09-15 13:26:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3549