Bug 2187197

Summary: Noobaa pods are not coming up after enabling Ceph storageclass encryption
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: narayanspg <ngowda>
Component: csi-driver
Assignee: Rakshith <rar>
Status: CLOSED NOTABUG
QA Contact: krishnaram Karthick <kramdoss>
Severity: medium
Priority: unspecified
Version: 4.13
CC: aindenba, mparida, muagarwa, nbecker, ocs-bugs, odf-bz-bot, pakamble, rar
Keywords: TestBlocker
Target Milestone: ---
Target Release: ---
Hardware: ppc64le
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2023-05-03 04:39:31 UTC

Attachments:
noobaa operator logs (no flags)

Description narayanspg 2023-04-17 07:16:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Noobaa pods are not coming up after enabling Ceph storageclass encryption.
Because of this, the storagecluster does not reach the Ready state, reporting:

Waiting on Nooba instance to finish initialization

noobaa operator logs:
time="2023-04-17T07:11:58Z" level=error msg="ReconcileRootSecret, NewKMS error failed to get the authentication token: authentication returned nil auth info" sys=openshift-storage/noobaa
time="2023-04-17T07:11:58Z" level=info msg="setKMSConditionStatus Invalid" sys=openshift-storage/noobaa
time="2023-04-17T07:11:58Z" level=info msg="SetPhase: temporary error during phase \"Creating\"" sys=openshift-storage/noobaa
time="2023-04-17T07:11:58Z" level=warning msg="⏳ Temporary Error: failed to get the authentication token: authentication returned nil auth info" sys=openshift-storage/noobaa


Version of all relevant components (if applicable):
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# oc describe csv odf-operator.v4.13.0 -n openshift-storage | grep full
Labels:       full_version=4.13.0-165
          f:full_version:
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-ppc64le-2023-02-17-084453   True        False         4h24m   Cluster version is 4.13.0-0.nightly-ppc64le-2023-02-17-084453
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]#


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Not able to progress on the PV encryption feature.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create an OCP cluster and install ODF.
2. Enable storageclass encryption during storagesystem creation.
3. The storagecluster gets stuck in the Progressing state because the noobaa pods do not come up.


Actual results:
The storagecluster is stuck in the Progressing state.

Expected results:
The storagecluster should reach the Ready state.

Additional info:
Attaching noobaa operator logs and must-gather.

Comment 2 narayanspg 2023-04-17 07:39:48 UTC
Created attachment 1957792 [details]
noobaa operator logs

Comment 3 narayanspg 2023-04-17 07:44:06 UTC
We get the below error when creating a PVC with an encryption-enabled storageclass.

PVC: new-sc-pvc-1
Namespace: default
Apr 17, 2023, 12:47 PM
Generated from openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-64888d545-cg8k8_649b40fa-6144-45cd-824b-9a5336059345
14 times in the last 44 minutes
failed to provision volume with StorageClass "newstorageclass-one": rpc error: code = InvalidArgument desc = invalid encryption kms configuration: failed connecting to Vault: failed to get the authentication token: Error making API request. Namespace: admin URL: PUT https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200/v1/auth/kubernetes/login Code: 403. Errors: * permission denied
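
For reference, the failing login call can be replayed by hand from a pod in the cluster; a minimal sketch, assuming the default service account token mount path (the role name here is an assumption for illustration, not taken from the event itself):

```
# Sketch: replay the Vault Kubernetes login the provisioner performs.
# Run from a pod in openshift-storage; token path is the default SA mount.
SA_JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -s -X PUT \
  -H "X-Vault-Namespace: admin" \
  -d "{\"role\": \"odf-rook-ceph-op\", \"jwt\": \"${SA_JWT}\"}" \
  "https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200/v1/auth/kubernetes/login"
```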

Comment 4 Alexander Indenbaum 2023-04-17 16:47:09 UTC
Hello 🖖

Based on the operator log provided, it appears that the issue is related to the retrieval of the Vault token for the specified authentication method.
Error source:
- https://github.com/libopenstorage/secrets/blob/1022cc4d5aeb8bceedfc664b32667755b35e6a15/vault/utils/utils.go#L159-L161
- https://github.com/libopenstorage/secrets/blob/1022cc4d5aeb8bceedfc664b32667755b35e6a15/vault/vault.go#L106-L110

To address this issue, I suggest verifying that the Vault service is operational and that the authentication configuration and the Vault credentials used by the operator to connect to Vault are set up correctly.
To further investigate and resolve the issue, it would be helpful to review the must-gather logs, including the NooBaa CR YAML. Can you please provide these logs?
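
For example, a couple of quick checks (a sketch; the NooBaa CR name assumes a default ODF install):

```
# Sketch: check Vault health from a host with the Vault CLI configured,
# and inspect the NooBaa CR conditions for the KMS status.
vault status
oc -n openshift-storage get noobaa noobaa -o jsonpath='{.status.conditions}'
```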

Thank you.

Comment 5 Alexander Indenbaum 2023-04-17 17:13:33 UTC
Hello 🖖

Based on the information provided in comment #3 https://bugzilla.redhat.com/show_bug.cgi?id=2187197#c3, it seems that a similar error is also originating from ceph-csi rbd.
Error source:
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/internal/rbd/encryption.go#L325-L327
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/internal/kms/vault.go#L288-L291
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/vendor/github.com/libopenstorage/secrets/vault/vault.go#L96-L101
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/vendor/github.com/hashicorp/vault/api/response.go#L118-L124

This suggests that the provided KMS configuration may have issues, since ceph-csi rbd is also encountering problems communicating with Vault, receiving "Code: 403. Errors: * permission denied" when attempting to log in. As a result, it seems likely that the root cause of the problem lies in the Vault configuration.

Comment 6 narayanspg 2023-04-18 02:06:47 UTC
Hi, I am not able to see private messages; please let me know what info is required. I will share the cluster details over IM if you would like access.

Comment 8 narayanspg 2023-04-18 05:44:38 UTC
Tested the Vault connection from a test pod and from the bastion node:

################ From the node I get the below output:
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# vault read auth/kubernetes/role/odf-rook-ceph-osd
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [rook-ceph-osd]
bound_service_account_namespaces    [openshift-storage]
policies                            [odf]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [odf]
token_ttl                           1440h
token_type                          default
ttl                                 1440h

############# From the test pod:

 # ./vault read auth/kubernetes/role/odf-rook-ceph-osd
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [rook-ceph-osd]
bound_service_account_namespaces    [openshift-storage]
policies                            [odf]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [odf]
token_ttl                           1440h
token_type                          default
ttl                                 1440h
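
For completeness, the Kubernetes auth backend's own configuration can be inspected as well (a sketch; requires a Vault policy permitting read on this path):

```
# Sketch: show the auth backend config, including the kubernetes_host
# endpoint that Vault itself calls back to when verifying SA tokens.
vault read auth/kubernetes/config
```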

Comment 9 narayanspg 2023-04-18 05:46:38 UTC
Below are the config details; the exact same configuration is working fine on a non-IBM setup (AWS).


[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# oc describe cm csi-kms-connection-details
Name:         csi-kms-connection-details
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>

Data
====
Vault-test-1:
----
{"encryptionKMSType":"vaulttenantsa","kmsServiceName":"Vault-test-1","vaultAddress":"https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200","vaultBackendPath":"odf/","vaultTLSServerName":"","vaultCAFileName":"","vaultClientCertFileName":"","vaultClientCertKeyFileName":"","vaultAuthMethod":"kubernetes","vaultAuthPath":"/v1/auth/kubernetes/login","vaultAuthNamespace":"","vaultNamespace":"admin"}

BinaryData
====

Events:  <none>
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# oc describe cm ocs-kms-connection-details
Name:         ocs-kms-connection-details
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>

Data
====
KMS_PROVIDER:
----
vault
KMS_SERVICE_NAME:
----
Vault-test-1
VAULT_AUTH_KUBERNETES_ROLE:
----
odf-rook-ceph-op
VAULT_AUTH_METHOD:
----
kubernetes
VAULT_AUTH_MOUNT_PATH:
----
/v1/auth/kubernetes/login
VAULT_ADDR:
----
https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200
VAULT_BACKEND_PATH:
----
odf/
VAULT_NAMESPACE:
----
admin
VAULT_TLS_SERVER_NAME:
----


BinaryData
====

Events:  <none>
[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]#

Comment 10 Alexander Indenbaum 2023-04-18 12:02:55 UTC
Hello @narayanspg  🖖

Based on the NooBaa operator log provided, it appears that the issue is related to the retrieval of the Vault token for the specified authentication method. 
Error source:
- https://github.com/libopenstorage/secrets/blob/1022cc4d5aeb8bceedfc664b32667755b35e6a15/vault/utils/utils.go#L159-L161
- https://github.com/libopenstorage/secrets/blob/1022cc4d5aeb8bceedfc664b32667755b35e6a15/vault/vault.go#L106-L110

Based on the information provided in comment #3 https://bugzilla.redhat.com/show_bug.cgi?id=2187197#c3, it seems that a similar error is also originating from ceph-csi rbd.
Error source:
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/internal/rbd/encryption.go#L325-L327
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/internal/kms/vault.go#L288-L291
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/vendor/github.com/libopenstorage/secrets/vault/vault.go#L96-L101
- https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/vendor/github.com/hashicorp/vault/api/response.go#L118-L124

This suggests that the provided KMS configuration may have issues, since both the NooBaa operator and ceph-csi rbd are encountering problems communicating with Vault, with ceph-csi rbd receiving "Code: 403. Errors: * permission denied" when attempting to log in. As a result, it seems likely that the root cause of the problem lies in the configuration of the Vault credentials.

To address this issue, I suggest verifying that the Vault credentials used by the operators (NooBaa and ceph-csi rbd) to connect to Vault are properly configured.
To further investigate and resolve the issue, it would be helpful to review the must-gather logs, including the NooBaa CR YAML. 

Questions:
- Can you please provide the additional logs mentioned above?
- Could you perform a test verifying that you are able to communicate with Vault using the NooBaa operator and ceph-csi rbd Vault credentials (for example, by replaying the login call as sketched under comment #3)?

Thank you.

Comment 11 narayanspg 2023-04-18 16:52:47 UTC
Hi Alexander,

There is some problem uploading the must-gather with this account, so it is shared over Box for temporary access. You can access it here - https://ibm.box.com/s/ddjwhvw705d5yf9lzbsntzhtbzuvns49

Below is the connection test from the noobaa operator pod.


[root@nara4-cicd-odf-ba4c-sao01-bastion-0 ~]# oc rsh noobaa-operator-76d488695b-wtv6r
#exported variables
sh-5.1$ ./vault read auth/kubernetes/role/odf-rook-ceph-osd
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [rook-ceph-osd]
bound_service_account_namespaces    [openshift-storage]
policies                            [odf]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [odf]
token_ttl                           1440h
token_type                          default
ttl                                 1440h
sh-5.1$ exit
exit


You can also access the cluster with the below details.

web_console_url = "https://console-openshift-console.apps.nara4-cicd-odf-ba4c.redhat.com"
kubeadmin-password/Sm9Yd-3YJJY-CfyxU-ncY5r
etc_hosts_entries = <<EOT

169.57.180.37 api.nara4-cicd-odf-ba4c.redhat.com console-openshift-console.apps.nara4-cicd-odf-ba4c.redhat.com integrated-oauth-server-openshift-authentication.apps.nara4-cicd-odf-ba4c.redhat.com oauth-openshift.apps.nara4-cicd-odf-ba4c.redhat.com prometheus-k8s-openshift-monitoring.apps.nara4-cicd-odf-ba4c.redhat.com grafana-openshift-monitoring.apps.nara4-cicd-odf-ba4c.redhat.com example.apps.nara4-cicd-odf-ba4c.redhat.com

Comment 12 Alexander Indenbaum 2023-04-19 08:01:30 UTC
Hello @narayanspg  🖖,

Thank you for sharing the must-gather with me. Unfortunately, I encountered an error while accessing the web console cluster URL, even though I was connected to the RH VPN. 

The NooBaa CR KMS declaration is:
```yaml

    kms:
      connectionDetails:
        KMS_PROVIDER: vault
        KMS_SERVICE_NAME: Vault-test-1
        VAULT_ADDR: https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200
        VAULT_AUTH_KUBERNETES_ROLE: odf-rook-ceph-op
        VAULT_AUTH_METHOD: kubernetes
        VAULT_AUTH_MOUNT_PATH: /v1/auth/kubernetes/login
        VAULT_BACKEND_PATH: odf/
        VAULT_NAMESPACE: admin
        VAULT_TLS_SERVER_NAME: ""

```

Upon examining the NooBaa CR KMS declaration you provided, it seems that the k8s service account authentication should be used. However, I could not find any information about "odf-rook-ceph-op" in the must-gather. Could you please provide more details about the service account "odf-rook-ceph-op" and its token secret? You can use the following commands to retrieve the information:

kubectl -n <NS> get sa odf-rook-ceph-op -o yaml
kubectl -n <NS> get secret <SA-TOKEN> -o yaml

Also, regarding the VAULT_AUTH_MOUNT_PATH configuration parameter, I'm not entirely sure about its usage. Is there any specific reason for defining it, and how was its value calculated? Does it mean that the service account token gets mounted at a non-default path? According to a ceph-csi PR (https://github.com/ceph/ceph-csi/pull/2322), this value might have been mistakenly taken from the default auth path (note there is no "mount" component in that name). Therefore, I suggest removing the VAULT_AUTH_MOUNT_PATH variable and relying on the library's default value unless there is a good reason to use it.

Could you please try again with that variable removed? A sketch of the change is below.
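
A minimal sketch, assuming the two ConfigMaps shown in comment #9 (only the custom auth path keys are dropped; everything else stays as-is):

```
# Sketch: remove the custom auth path and fall back to the library default.
oc -n openshift-storage edit cm csi-kms-connection-details   # delete "vaultAuthPath" from the Vault-test-1 JSON value
oc -n openshift-storage edit cm ocs-kms-connection-details   # delete the VAULT_AUTH_MOUNT_PATH key
```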

Thank you for your help!

Comment 13 narayanspg 2023-04-20 15:24:29 UTC
Hi Alexander,

We recreated the environment from scratch; this time we created the storagesystem without encryption, and the storagecluster reached the Ready state.

We then created a new storageclass with encryption enabled and created a PVC, and we get the same error.

Below are the details requested.

[root@nara6-cicd-odf-e189-sao01-bastion-0 vault]# vault read auth/kubernetes/role/odf-rook-ceph-op
Key                                 Value
---                                 -----
alias_name_source                   serviceaccount_uid
bound_service_account_names         [rook-ceph-system rook-ceph-osd noobaa]
bound_service_account_namespaces    [openshift-storage]
policies                            [odf]
token_bound_cidrs                   []
token_explicit_max_ttl              0s
token_max_ttl                       0s
token_no_default_policy             false
token_num_uses                      0
token_period                        0s
token_policies                      [odf]
token_ttl                           1440h
token_type                          default
ttl                                 1440h


oc get secret odf-vault-auth-token  -o yaml
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: odf-vault-auth
    kubernetes.io/service-account.uid: 7dae76ea-5aef-46a6-b8f7-b698b16d896b
  creationTimestamp: "2023-04-20T14:15:33Z"
  name: odf-vault-auth-token
  namespace: openshift-storage
  resourceVersion: "469116"
  uid: 122964b6-3bc9-45fd-9a7e-8bbcd47cd48a
type: kubernetes.io/service-account-token

Below are the cluster details :

web_console_url = "https://console-openshift-console.apps.nara6-cicd-odf-e189.redhat.com"
kubeadmin-password/iv4ZG-KbFDS-DvXoN-7DieY

etc_hosts_entries = <<EOT

169.57.180.34 api.nara6-cicd-odf-e189.redhat.com console-openshift-console.apps.nara6-cicd-odf-e189.redhat.com integrated-oauth-server-openshift-authentication.apps.nara6-cicd-odf-e189.redhat.com oauth-openshift.apps.nara6-cicd-odf-e189.redhat.com prometheus-k8s-openshift-monitoring.apps.nara6-cicd-odf-e189.redhat.com grafana-openshift-monitoring.apps.nara6-cicd-odf-e189.redhat.com example.apps.nara6-cicd-odf-e189.redhat.com

With the same configuration and the same Vault instance, it works on a non-IBM cluster.
Thanks,
Narayan

Comment 14 Mudit Agarwal 2023-04-24 06:15:17 UTC
Rakshith/Malay, please take a look in parallel to Noobaa.

Comment 15 Rakshith 2023-04-25 09:19:56 UTC
hey,

please link the must-gather (the temp link does not have the folder anymore).
Did this encryption work with ODF 4.12?
Can you share a link to the documentation you are following to set this up?

Comment 17 Rakshith 2023-04-25 10:57:29 UTC
The exact problem is that Vault needs to talk to the OCP cluster to verify the SA token, but the `OCP_HOST` URL provided is not reachable from the Vault server.

>https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/managing_and_allocating_storage_resources/storage-classes_rhodf#configuring-access-to-kms-using-vaulttenantsa_rhodf

> Step 4: Retrieve the OpenShift cluster endpoint.
```
$ OCP_HOST=$(oc config view --minify --flatten -o jsonpath="{.clusters[0].cluster.server}")
```

Following the steps, I get the following as the endpoint

```
[rakshith@fedora ~]$ OCP_HOST=$(oc config view --minify --flatten -o jsonpath="{.clusters[0].cluster.server}")
[rakshith@fedora ~]$ echo $OCP_HOST
https://api.nara6-cicd-odf-e189.redhat.com:6443
```

But this is not reachable from Vault.

We had to add the domain entries manually on our laptops to access the cluster
```
etc_hosts_entries = <<EOT

169.57.180.34 api.nara6-cicd-odf-e189.redhat.com console-openshift-console.apps.nara6-cicd-odf-e189.redhat.com integrated-oauth-server-openshift-authentication.apps.nara6-cicd-odf-e189.redhat.com oauth-openshift.apps.nara6-cicd-odf-e189.redhat.com prometheus-k8s-openshift-monitoring.apps.nara6-cicd-odf-e189.redhat.com grafana-openshift-monitoring.apps.nara6-cicd-odf-e189.redhat.com example.apps.nara6-cicd-odf-e189.redhat.com
```

You have to find a way to add these entries on the Vault server, or change the endpoint URL so it properly points to the cluster; see the sketch below.
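
For context, in the linked documentation the `OCP_HOST` value is written into Vault's Kubernetes auth configuration, which is the endpoint Vault calls to run the TokenReview; a sketch along those lines (the JWT and CA variable names are placeholders):

```
# Sketch (placeholder variables): OCP_HOST ends up here, and it must be
# reachable from the Vault servers themselves, not just from your laptop.
vault write auth/kubernetes/config \
    token_reviewer_jwt="$SA_JWT_TOKEN" \
    kubernetes_host="$OCP_HOST" \
    kubernetes_ca_cert="$OCP_CACERT"
```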

Comment 18 narayanspg 2023-04-25 12:41:05 UTC
Hi Rakshith,

In comment #11 and comment #13 we tested connectivity to the Vault instance we are using, with the same results. On the non-IBM environment no host entries were added.

We are using Enterprise Vault, which is a service hosted by HashiCorp, so we cannot add host entries there.

Comment 19 Rakshith 2023-04-26 13:59:27 UTC
(In reply to narayanspg from comment #18)
> Hi Rakshith,
> 
> comment #11 and #13 we have tried to test connectivity with Vault instance
> we are using. same are the results. on Non IBM environment there were no
> host entries added. 

It's about Vault being able to talk to the cluster.
I know the other direction works.

Try with OCP_HOST=https://169.57.180.34:6443
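
That is, re-point the auth backend at the routable IP; a minimal sketch (re-supply the other config fields as needed, since this rewrites the config object):

```
# Sketch: make Vault's TokenReview callback target the routable API IP.
export VAULT_NAMESPACE=admin
vault write auth/kubernetes/config \
    kubernetes_host="https://169.57.180.34:6443"
```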

Comment 20 narayanspg 2023-04-28 11:54:45 UTC
Tried with OCP_HOST=https://169.57.180.34:6443 but it didn't work; PVC creation failed.

The same issue is seen on a vSphere environment as well.

Comment 22 Rakshith 2023-05-03 04:39:31 UTC
(In reply to Rakshith from comment #19)
> (In reply to narayanspg from comment #18)
> > Hi Rakshith,
> > 
> > In comment #11 and comment #13 we tested connectivity to the Vault instance
> > we are using, with the same results. On the non-IBM environment no host
> > entries were added.
> 
> It's about Vault being able to talk to the cluster.
> I know the other direction works.
> 
> Try with OCP_HOST=https://169.57.180.34:6443

(In reply to narayanspg from comment #20)
> Tried with OCP_HOST=https://169.57.180.34:6443 but it didn't work; PVC
> creation failed.
> 
> The same issue is seen on a vSphere environment as well.


Closing this BZ, since this is not a product bug: the feature works on AWS with the new HCP Vault service, and on vSphere and other clusters within the VPN when a self-hosted community Vault service was used.

QE needs to create a cluster with a publicly visible endpoint so that Vault can verify the SA token, or
figure out a way for Vault to be able to talk to the OCP cluster.

Please reopen the bug with the necessary justification.

Thanks,

Comment 23 narayanspg 2023-05-11 04:38:29 UTC
Tried with a community Vault instance; after adding host entries on the Vault server, the issue is not seen.