The problem is still occurring, although I do see the new log line from the PR:

```
2021-08-09 12:12:15.410480 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". cancelling any ongoing orchestration. diff= v1.ClusterSpec{
  ... // 21 identical fields
  CleanupPolicy: {},
  HealthCheck:   {},
  Security: v1.SecuritySpec{
    KeyManagementService: v1.KeyManagementServiceSpec{
      ConnectionDetails: map[string]string{
        "KMS_PROVIDER":       "vault",
        "KMS_SERVICE_NAME":   "vault",
        "VAULT_ADDR":         "https://vault.qe.rh-ocs.com:8200",
+       "VAULT_BACKEND":      "v2",
        "VAULT_BACKEND_PATH": "v2shay",
        "VAULT_CACERT":       "ocs-kms-ca-secret-60g2ng",
        ... // 4 identical entries
      },
      TokenSecretName: "ocs-kms-token",
    },
  },
  LogCollector: {Enabled: true, Periodicity: "24h"},
}
```

The OSDs are still in CrashLoopBackOff (CLBO) for 10-15 minutes.
Here's what I'm seeing:

1. The OCS operator creates the CephCluster CR **without** the vault v2 setting.
2. Rook creates the cluster. The OSDs are unable to connect since the v2 property is not specified.
3. The OCS operator updates the CephCluster **with** the v2 setting, and a new Rook reconcile begins.
4. The PGs are all in the `unknown` state because of step 1, so the operator waits for them to become healthy.
5. After the timeout waiting for the PGs (10 min), the OSDs are updated.
6. Now the OSDs are healthy and, as far as I can see, everything looks good in the cluster.

So we really need the OCS operator to apply the configmap setting with the initial reconcile; updating the CephCluster CR after the initial reconcile does not work. In the meantime, the workaround is to wait for the operator to time out waiting for the PGs to be healthy and then update the OSDs with the correct setting, after which the cluster becomes healthy.

@Arun Looks like you are familiar with this configmap; could you take a look?
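For illustration, this is roughly the shape the CephCluster CR would need at initial creation time (step 1 above) so the OSDs can reach the KV v2 backend on first start. The field names follow Rook's `spec.security.kms` layout, and the address, backend path, and secret names are copied from the diff in this cluster; treat the exact values as environment-specific, not a definitive spec.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
spec:
  security:
    kms:
      # VAULT_BACKEND must be present in the *initial* CR rather than
      # patched in later, otherwise the OSDs start against the wrong
      # KV API version and go into CLBO.
      connectionDetails:
        KMS_PROVIDER: vault
        KMS_SERVICE_NAME: vault
        VAULT_ADDR: "https://vault.qe.rh-ocs.com:8200"
        VAULT_BACKEND: "v2"
        VAULT_BACKEND_PATH: v2shay
        VAULT_CACERT: ocs-kms-ca-secret-60g2ng
      tokenSecretName: ocs-kms-token
```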
We already have a fix in Rook to auto-detect the vault version, tracked through BZ https://bugzilla.redhat.com/show_bug.cgi?id=1975272 with PR https://github.com/rook/rook/pull/8265. It would be redundant to place the same logic in multiple components (OCS operator + Rook). Mudit/Travis, is it possible for us to backport the Rook changes to the 1.6.x branch and update the OCS operator to point to the latest 1.6.x Rook release?
Is there no workaround that would allow the v2 vault setting to be applied during the initial CephCluster CR creation, instead of editing the CR afterwards? Backporting the automatic vault version detection would require:

- Testing in 4.9. This is a big change that needs validation before we consider backporting.
- Confirmation from Seb, as he is much more familiar with the risk of those changes.
I agree with Travis, this is indeed a big change for z-stream. Let's try to find some other solution.
I agree with all the above: it's a large change, and it also relies on another change that brings in new Go dependencies. Those dependencies are in line with Rook 4.9, which will only support Go 1.16 and above, where 4.7 runs on 1.16, so it's a larger issue. The backport would therefore have a much broader impact that is hard to estimate, but it is extremely risky in any case. One more thing: we cannot rely solely on auto-detection. Some customers will refuse to give access to "sys/mount", which is the Vault API path used for auto-detection.
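To make the auto-detection point concrete: Vault's `GET /v1/sys/mounts` endpoint reports each secret engine's mount options, including `options.version` for KV mounts, which is why access to that path is needed. The sketch below is illustrative only, not Rook's actual implementation; it just parses a sys/mounts-style JSON payload to decide whether a mount is KV v1 or v2 (the `kvVersion` helper and the sample payload are assumptions for this example).

```go
// Sketch: determine the KV secret-engine version for a given mount from a
// Vault sys/mounts JSON response. Assumes the caller has already fetched
// the payload, which is exactly what customers denying "sys/mount" access
// would prevent.
package main

import (
	"encoding/json"
	"fmt"
)

// kvVersion returns the "options.version" value for the mount at mountPath
// (e.g. "v2shay/"). When no version option is present, it defaults to "1",
// matching Vault's KV v1 behavior.
func kvVersion(sysMountsJSON []byte, mountPath string) (string, error) {
	var mounts map[string]struct {
		Type    string            `json:"type"`
		Options map[string]string `json:"options"`
	}
	if err := json.Unmarshal(sysMountsJSON, &mounts); err != nil {
		return "", err
	}
	m, ok := mounts[mountPath]
	if !ok {
		return "", fmt.Errorf("mount %q not found", mountPath)
	}
	if v, ok := m.Options["version"]; ok && v != "" {
		return v, nil
	}
	return "1", nil
}

func main() {
	// Abbreviated sample of a sys/mounts response for a KV v2 mount.
	sample := []byte(`{"v2shay/": {"type": "kv", "options": {"version": "2"}}}`)
	v, err := kvVersion(sample, "v2shay/")
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // prints "2"
}
```

When auto-detection is unavailable, the only alternative is an explicit setting such as `VAULT_BACKEND: v2`, which is exactly what this bug needs applied at initial CR creation.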
Arun, are we good here? Can we work on a 4.7.z specific fix?
Removing the need-info as it is moved to rook...
Verified on 4.7.4-244.ci. All OSDs came to a running state without CLBO and are encrypted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.7.4 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3549