Bug 2097439

Summary: managed-csi StorageClass does not create PVs
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Storage
Assignee: Fabio Bertinatto <fbertina>
Storage sub component: Kubernetes External Components
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: apurty, asheth, jsafrane, mbarnes, pkanthal, shunaik
Version: 4.10
Keywords: ServiceDeliveryImpact
Target Release: 4.10.z
Hardware: x86_64
OS: Linux
Last Closed: 2022-07-20 07:46:10 UTC
Bug Depends On: 2095049    

Description OpenShift BugZilla Robot 2022-06-15 17:01:12 UTC
+++ This bug was initially created as a clone of Bug #2095049 +++

Description of problem:

On a new install of OpenShift on Azure, the managed-csi StorageClass cannot provision new disks, and all PVCs that use this StorageClass are stuck in Pending.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create 4.10 OpenShift cluster in Azure
2. Create a PVC that uses the managed-csi StorageClass and a Pod that mounts it

Actual results:

1. No PV is created for the PVC
2. The azure-disk-csi-driver-controller csi-provisioner logs indicate 404 errors when attempting to fetch a token to authenticate and interact with Azure

Expected results:

1. PV is created and mounted successfully into the Pod

Master Log:

Node Log (of failed PODs):

```
$ oc logs -n openshift-cluster-csi-drivers -l app=azure-disk-csi-driver-controller -c csi-provisioner
{...}
I0608 21:05:26.596290       1 controller.go:1337] provision "test/managed-csi" class "managed-csi": started
I0608 21:05:26.596448       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "test/managed-csi"
I0608 21:05:26.683627       1 controller.go:1075] Final error received, removing PVC a749b800-058b-493e-9943-8c9be54ebbcc from claims in progress
W0608 21:05:26.683656       1 controller.go:934] Retrying syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc", failure 10
E0608 21:05:26.683677       1 controller.go:957] error syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc": failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
I0608 21:05:26.683773       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
{...}
```

(Note the missing SubscriptionID and ResourceGroup in the request URL; this log is pasted as-is, not scrubbed.)

PV Dump:

N/A

PVC Dump:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: managed-csi
spec:
  storageClassName: managed-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: bb
spec:
  containers:
    - name: bb
      image: busybox:1.28
      volumeMounts:
      - mountPath: "/var/www/html"
        name: www
  volumes:
    - name: www
      persistentVolumeClaim:
        claimName: managed-csi
```

StorageClass Dump (if StorageClass used by PV/PVC):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  creationTimestamp: "2022-06-07T16:53:35Z"
  name: managed-csi
  resourceVersion: "532291"
  uid: 8b8bf889-2b03-4671-8990-3c1ab3368cd7
parameters:
  skuname: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Additional info:

--- Additional comment from gmontero on 2022-06-08 21:15:44 UTC ---

wrong subcomponent - switching

--- Additional comment from mbarnes on 2022-06-10 19:07:35 UTC ---

I think I see what might be happening.  My claims need to be fact-checked, though, since I'm not very familiar with this controller.

Looking at the azure-disk-csi-driver-controller (csi-driver container) logs upon startup:


```
I0609 17:36:21.218505       1 azuredisk.go:142]
DRIVER INFORMATION:
-------------------
Build Date: "2022-05-12T09:57:10Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.9.0
Git Commit: 3937bec4f8027a66a9009af5c740a6135bc86f95
Go Version: go1.17.5
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:
I0609 17:36:21.218523       1 azuredisk.go:145] driver userAgent: disk.csi.azure.com/v1.9.0 gc/go1.17.5 (amd64-linux)
I0609 17:36:21.219936       1 azure_disk_utils.go:129] reading cloud config from secret kube-system/azure-cloud-provider
I0609 17:36:21.239303       1 azure_auth.go:119] azure: using client_id+client_secret to retrieve access token
I0609 17:36:21.239386       1 azure_diskclient.go:67] Azure DisksClient using API version: 2021-04-01
```


You can see it's reading its cloud configuration from the kube-system/azure-cloud-provider secret.

Then, upon creating the same PVC as above:

(NOTE: I reformatted the request JSON for better readability.)


```
I0609 18:31:35.168694       1 utils.go:95] GRPC call: /csi.v1.Controller/CreateVolume
I0609 18:31:35.168727       1 utils.go:96] GRPC request: (NOT RELEVANT, OMITTED)
I0609 18:31:35.506522       1 controllerserver.go:274] begin to create azure disk(pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f) account type(Premium_LRS) rg() location() size(1) diskZone() maxShares(0)
E0609 18:31:35.621637       1 utils.go:100] GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
```


Notice how the message with "begin to create azure disk" has "rg()", indicating it has no resource group.

Also, the Azure error message shows the requested resource URI lacks both the subscription ID and resource group:

```
https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01
                                          ^^              ^^
```

The code that initializes the cloud Config struct is here:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/release-1.9/pkg/azureutils/azure_disk_utils.go#L128-L164

Note how, if it's able to read the kube-system/azure-cloud-provider secret, it doesn't even TRY to read the cloud config file (because 'config' is no longer nil).  In other words, the controller appears to assume ALL the cloud configuration lives either in the secret OR in the config file.

At least on ARO, the kube-system/azure-cloud-provider secret only contains the client ID and client secret.  The rest of the cloud configuration lives in /etc/kubernetes/cloud.conf, which is not being read.
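The control flow described above can be sketched as follows. This is a minimal illustration of the secret-first loading anti-pattern, not the driver's actual code; `Config` and `getCloudConfig` are hypothetical stand-ins for the real structs and functions in azure_disk_utils.go.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the driver's cloud config struct.
type Config struct {
	AADClientID     string
	AADClientSecret string
	SubscriptionID  string
	ResourceGroup   string
}

// getCloudConfig mirrors the control flow described above: if the secret
// yields a config, the file is never consulted, so fields that exist only
// in the file (subscription ID, resource group) stay empty.
func getCloudConfig(fromSecret, fromFile *Config) *Config {
	if fromSecret != nil {
		return fromSecret // config file is skipped entirely
	}
	return fromFile
}

func main() {
	// On ARO, the secret carries only the client credentials...
	secret := &Config{AADClientID: "CLIENT-ID", AADClientSecret: "CLIENT-SECRET"}
	// ...while /etc/kubernetes/cloud.conf holds the rest (values are made up).
	file := &Config{SubscriptionID: "sub-123", ResourceGroup: "aro-rg"}

	cfg := getCloudConfig(secret, file)
	// SubscriptionID and ResourceGroup end up empty, which produces the
	// "subscriptions//resourceGroups//" URL seen in the logs.
	fmt.Printf("sub=%q rg=%q\n", cfg.SubscriptionID, cfg.ResourceGroup)
}
```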

--- Additional comment from fbertina on 2022-06-10 20:33:40 UTC ---

@matthew, excellent points.

The CSI driver should handle this scenario better. It's an anti-pattern for a CSI driver to rely on Kubernetes resources for applying configuration options, especially when their names are hardcoded in the CSI driver.

While I'm happy to work with upstream to fix the CSI driver, is there a chance we can work around this in ARO to unblock the customer?

Who creates the kube-system/azure-cloud-provider secret in ARO? I don't have it on my IPI cluster, is it really necessary there? If so, can the owner of this secret add the missing configuration there?

--- Additional comment from mbarnes on 2022-06-12 15:53:52 UTC ---

The ARO Azure Resource Provider (which is Red Hat code) creates the "azure-cloud-provider" secret, and it's used for many things besides this one driver.

```
$ oc get secret -n kube-system azure-cloud-provider -o json | jq -r '.data["cloud-config"]' | base64 -d
aadClientId: CLIENT-ID
aadClientSecret: CLIENT-SECRET
```


The "azure-config-credentials-injector" init-container added in [1] provides a partial solution, if the CSI driver could be made to actually read the merged config file.

Working with what's currently upstream, the CSI driver provides command-line options for overriding the cloud-config secret [2].  If these options could be supplied so that the driver looks for a non-existent secret (e.g. --cloud-config-secret-name="" --cloud-config-secret-namespace=""), it would force the driver to fall back to the config file created by "azure-config-credentials-injector".  (Unsure if the other pod containers provide similar options, or if that's even needed.)

However, the operator reverts any edits to the "azure-disk-csi-driver-controller" replicaset, so it doesn't seem like this is something the customer could do on their own.


[1] https://github.com/openshift/azure-disk-csi-driver-operator/pull/30
[2] https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/fa3afcd4355aea94e9c413f72da615bb44a3bd5b/pkg/azurediskplugin/main.go#L52-L53
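The fallback this workaround relies on can be sketched as follows. It is an illustration of the intended behavior under the flag override, not the driver's actual code; `Config` and `loadConfig` are hypothetical names.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the driver's cloud config.
type Config struct {
	SubscriptionID string
	ResourceGroup  string
}

// loadConfig sketches the fallback the workaround relies on: when
// --cloud-config-secret-name is overridden to "", the secret lookup finds
// nothing and the driver falls back to the merged config file written by
// the azure-config-credentials-injector init container.
func loadConfig(secretName string, secrets map[string]Config, fileCfg Config) Config {
	if cfg, ok := secrets[secretName]; ok {
		return cfg // secret wins; the file is never read
	}
	return fileCfg
}

func main() {
	secrets := map[string]Config{
		"azure-cloud-provider": {}, // only client ID/secret, no sub/rg
	}
	merged := Config{SubscriptionID: "sub-123", ResourceGroup: "aro-rg"}

	// Default behavior: the incomplete secret config is used.
	fmt.Printf("%q\n", loadConfig("azure-cloud-provider", secrets, merged).SubscriptionID)
	// With the flag overridden to "": the lookup misses and the merged file wins.
	fmt.Printf("%q\n", loadConfig("", secrets, merged).SubscriptionID)
}
```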

--- Additional comment from mbarnes on 2022-06-15 12:00:35 UTC ---

Thanks for the quick resolution!

The customer that encountered this is on 4.10, so requesting a backport once the PR is merged.

Comment 2 Wei Duan 2022-07-12 12:49:31 UTC
Verified passed on 4.10.0-0.nightly-2022-07-12-015552

```
$ oc -n openshift-cluster-csi-drivers get pod azure-disk-csi-driver-controller-86ccc95d45-s6cz8 -o yaml | grep cloud-config-secret
    - --cloud-config-secret-name=""
    - --cloud-config-secret-namespace=""

$ oc get pvc -A
NAMESPACE   NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
wduan-1     mypvc-16   Bound    pvc-268c9a2a-f60f-4670-8987-4c104c2cb0f1   1Gi        RWO            managed-csi       29s
wduan       mypvc-03   Bound    pvc-597aaa6d-7c3c-40cf-8475-9d01c9366a79   1Gi        RWO            managed-premium   10h
wduan       mypvc-13   Bound    pvc-ce41d9e3-0864-4ff2-b2f2-73a7cbf344b1   1Gi        RWO            managed-csi       10h
```

Comment 5 errata-xmlrpc 2022-07-20 07:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5568