Bug 2095049 - managed-csi StorageClass does not create PVs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks: 2097439
 
Reported: 2022-06-08 21:13 UTC by bbergen
Modified: 2023-03-03 06:01 UTC (History)
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:17:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift azure-disk-csi-driver-operator pull 49 0 None Merged Bug 2095049: Only use credentials that are provided by the azure-inject-credential… 2022-07-18 07:35:32 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:17:22 UTC

Description bbergen 2022-06-08 21:13:05 UTC
Description of problem:

On a new install of OpenShift in Azure, the managed-csi StorageClass cannot provision new disks, and all PVCs that use this StorageClass are stuck in Pending.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create 4.10 OpenShift cluster in Azure
2. Create a PVC and a Pod that mounts the resulting PV

Actual results:

1. No PV is created for the PVC
2. The azure-disk-csi-driver-controller csi-provisioner logs indicate 404 errors when attempting to fetch a token to authenticate and interact with Azure

Expected results:

1. PV is created and mounted successfully into the Pod

Master Log:

Node Log (of failed PODs):

```
$ oc logs -n openshift-cluster-csi-drivers -l app=azure-disk-csi-driver-controller -c csi-provisioner
{...}
I0608 21:05:26.596290       1 controller.go:1337] provision "test/managed-csi" class "managed-csi": started
I0608 21:05:26.596448       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "test/managed-csi"
I0608 21:05:26.683627       1 controller.go:1075] Final error received, removing PVC a749b800-058b-493e-9943-8c9be54ebbcc from claims in progress
W0608 21:05:26.683656       1 controller.go:934] Retrying syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc", failure 10
E0608 21:05:26.683677       1 controller.go:957] error syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc": failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
I0608 21:05:26.683773       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
{...}
```

(note the missing SubscriptionID and ResourceGroup in the token refresh URL - this log is pasted as-is, not scrubbed)
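To illustrate how those blanks arise, here is a minimal Go sketch (`diskURI` is a hypothetical helper, not the driver's actual code): when the cloud config lacks a subscription ID and resource group, the request path is built with empty segments, and Azure answers with a 404 rather than a config error.

```go
package main

import "fmt"

// diskURI is a hypothetical helper showing how empty config fields end up
// embedded in the request URL: nothing fails locally, the malformed path
// is simply sent to Azure, which returns 404.
func diskURI(subscriptionID, resourceGroup, diskName string) string {
	return fmt.Sprintf(
		"https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/disks/%s",
		subscriptionID, resourceGroup, diskName)
}

func main() {
	// With an incomplete cloud config, both fields are empty strings,
	// yielding the double-slash URL seen in the provisioner log above:
	fmt.Println(diskURI("", "", "pvc-a749b800-058b-493e-9943-8c9be54ebbcc"))
}
```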

PV Dump:

N/A

PVC Dump:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: managed-csi
spec:
  storageClassName: managed-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: bb
spec:
  containers:
    - name: bb
      image: busybox:1.28
      volumeMounts:
      - mountPath: "/var/www/html"
        name: www
  volumes:
    - name: www
      persistentVolumeClaim:
        claimName: managed-csi
```

StorageClass Dump (if StorageClass used by PV/PVC):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  creationTimestamp: "2022-06-07T16:53:35Z"
  name: managed-csi
  resourceVersion: "532291"
  uid: 8b8bf889-2b03-4671-8990-3c1ab3368cd7
parameters:
  skuname: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Additional info:

Comment 1 Gabe Montero 2022-06-08 21:15:44 UTC
wrong subcomponent - switching

Comment 2 Matthew Barnes 2022-06-10 19:07:35 UTC
I think I see what might be happening. My claims need to be fact-checked, though, since I'm not very familiar with this controller.

Looking at the azure-disk-csi-driver-controller (csi-driver container) logs upon startup:


```
I0609 17:36:21.218505       1 azuredisk.go:142]
DRIVER INFORMATION:
-------------------
Build Date: "2022-05-12T09:57:10Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.9.0
Git Commit: 3937bec4f8027a66a9009af5c740a6135bc86f95
Go Version: go1.17.5
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:
I0609 17:36:21.218523       1 azuredisk.go:145] driver userAgent: disk.csi.azure.com/v1.9.0 gc/go1.17.5 (amd64-linux)
I0609 17:36:21.219936       1 azure_disk_utils.go:129] reading cloud config from secret kube-system/azure-cloud-provider
I0609 17:36:21.239303       1 azure_auth.go:119] azure: using client_id+client_secret to retrieve access token
I0609 17:36:21.239386       1 azure_diskclient.go:67] Azure DisksClient using API version: 2021-04-01
```


You can see it's reading its cloud configuration from the kube-system/azure-cloud-provider secret.

Then, upon creating the same PVC as above:

(NOTE: I reformatted the request JSON for better readability.)


```
I0609 18:31:35.168694       1 utils.go:95] GRPC call: /csi.v1.Controller/CreateVolume
I0609 18:31:35.168727       1 utils.go:96] GRPC request: (NOT RELEVANT, OMITTED)
I0609 18:31:35.506522       1 controllerserver.go:274] begin to create azure disk(pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f) account type(Premium_LRS) rg() location() size(1) diskZone() maxShares(0)
E0609 18:31:35.621637       1 utils.go:100] GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body:  Endpoint https://login.microsoftonline.com/oauth2/token
```


Notice how the message with "begin to create azure disk" has "rg()", indicating it has no resource group.

Also, the Azure error message shows the requested resource URI lacks both the subscription ID and resource group:

https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01
                                          ^^              ^^

The code that initializes the cloud Config struct is here:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/release-1.9/pkg/azureutils/azure_disk_utils.go#L128-L164

Note how, if it's able to read the kube-system/azure-cloud-provider secret, it doesn't even TRY to read the cloud config file (because 'config' is no longer nil). In other words, the controller appears to assume ALL the cloud configuration lives either in the secret OR in the config file.

At least on ARO, the kube-system/azure-cloud-provider secret only contains the client ID and client secret.  The rest of the cloud configuration lives in /etc/kubernetes/cloud.conf, which is not being read.
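To make the failure mode concrete, here is a minimal Go sketch of the precedence described above (the names and types are illustrative, not the driver's actual identifiers): a non-nil result from the secret short-circuits the file fallback entirely, so fields present only in /etc/kubernetes/cloud.conf are silently lost.

```go
package main

import "fmt"

// Config loosely mirrors the cloud-config fields relevant to this bug.
type Config struct {
	AADClientID     string
	AADClientSecret string
	SubscriptionID  string
	ResourceGroup   string
}

// loadConfig sketches the precedence described above: if the secret
// yields any config at all, the file is never consulted. There is no
// merging of the two sources.
func loadConfig(fromSecret, fromFile *Config) *Config {
	if fromSecret != nil {
		return fromSecret // cloud.conf is never read
	}
	return fromFile
}

func main() {
	// On ARO the secret carries only the client credentials:
	secret := &Config{AADClientID: "client-id", AADClientSecret: "client-secret"}
	file := &Config{SubscriptionID: "sub-id", ResourceGroup: "my-rg"}

	cfg := loadConfig(secret, file)
	// Both fields come back empty, matching the rg() and the 404 URL above:
	fmt.Printf("subscriptionID=%q resourceGroup=%q\n", cfg.SubscriptionID, cfg.ResourceGroup)
}
```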

Comment 3 Fabio Bertinatto 2022-06-10 20:33:40 UTC
@matthew, excellent points.

The CSI driver should handle this scenario better. It's an anti-pattern for a CSI driver to rely on Kubernetes resources for applying configuration options, especially when their names are hardcoded in the CSI driver.

While I'm happy to work with upstream to fix the CSI driver, is there a chance we can find a workaround in ARO to unblock the customer?

Who creates the kube-system/azure-cloud-provider secret in ARO? I don't have it on my IPI cluster; is it really necessary there? If so, can the owner of this secret add the missing configuration there?

Comment 4 Matthew Barnes 2022-06-12 15:53:52 UTC
The ARO Azure Resource Provider (which is Red Hat code) creates the "azure-cloud-provider" secret, and it's used for many things besides this one driver.

```
$ oc get secret -n kube-system azure-cloud-provider -o json | jq -r '.data["cloud-config"]' | base64 -d
aadClientId: CLIENT-ID
aadClientSecret: CLIENT-SECRET
```


The "azure-config-credentials-injector" init-container added in [1] provides a partial solution, if the CSI driver could be made to actually read the merged config file.

Working with what's currently upstream, the CSI driver provides command-line options for overriding the cloud-config secret [2]. If these options could be supplied to make the driver look for a non-existent secret (e.g. --cloud-config-secret-name="" --cloud-config-secret-namespace=""), it would force the driver to fall back to the config file created by "azure-config-credentials-injector". (Unsure whether the other pod containers provide similar options, or whether that's even needed.)

However, the operator reverts any edits to the "azure-disk-csi-driver-controller" ReplicaSet, so this doesn't seem like something the customer could do on their own.


[1] https://github.com/openshift/azure-disk-csi-driver-operator/pull/30
[2] https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/fa3afcd4355aea94e9c413f72da615bb44a3bd5b/pkg/azurediskplugin/main.go#L52-L53

Comment 5 Matthew Barnes 2022-06-15 12:00:35 UTC
Thanks for the quick resolution!

The customer that encountered this is on 4.10, so requesting a backport once the PR is merged.

Comment 7 Wei Duan 2022-06-20 08:02:13 UTC
Issue reproduced on 4.10.15 on an ARO cluster.

Verified pass on 4.11.0-0.ci-2022-06-17-201634

```
$ oc -n openshift-cluster-csi-drivers get pod azure-disk-csi-driver-controller-7f87c479db-gjm7f -o yaml | grep config
    - --cloud-config-secret-name=""
    - --cloud-config-secret-namespace=""

$ oc get pod,pvc
NAME                               READY   STATUS    RESTARTS   AGE
pod/mydeploy-16-7db899cb84-26sgl   1/1     Running   0          97s

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-16   Bound    pvc-e87841b3-b039-48b7-8bb7-204d18c4b827   1Gi        RWO            managed-csi    97s
```


Will double-check on an accepted 4.11 nightly and also verify there is no impact on a common OCP cluster.

Comment 9 Wei Duan 2022-06-22 08:07:03 UTC
Verified on 4.11.0-0.nightly-2022-06-21-151125.

Comment 11 errata-xmlrpc 2022-08-10 11:17:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

