Description of problem:

On a new install of OpenShift in Azure, the managed-csi StorageClass cannot provision new disks, and all PVCs that use this StorageClass are stuck in Pending.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a 4.10 OpenShift cluster in Azure
2. Create a PVC and a Pod that mounts the resulting PV

Actual results:
1. No PV is created for the PVC
2. The azure-disk-csi-driver-controller csi-provisioner logs show 404 errors when attempting to fetch a token to authenticate and interact with Azure

Expected results:
1. The PV is created and mounted successfully into the Pod

Master Log:

Node Log (of failed PODs):
```
$ oc logs -n openshift-cluster-csi-drivers -l app=azure-disk-csi-driver-controller -c csi-provisioner
{...}
I0608 21:05:26.596290       1 controller.go:1337] provision "test/managed-csi" class "managed-csi": started
I0608 21:05:26.596448       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "test/managed-csi"
I0608 21:05:26.683627       1 controller.go:1075] Final error received, removing PVC a749b800-058b-493e-9943-8c9be54ebbcc from claims in progress
W0608 21:05:26.683656       1 controller.go:934] Retrying syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc", failure 10
E0608 21:05:26.683677       1 controller.go:957] error syncing claim "a749b800-058b-493e-9943-8c9be54ebbcc": failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: Endpoint https://login.microsoftonline.com/oauth2/token
I0608 21:05:26.683773       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"managed-csi", UID:"a749b800-058b-493e-9943-8c9be54ebbcc", APIVersion:"v1", ResourceVersion:"1169244", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "managed-csi": rpc error: code = Unknown desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-a749b800-058b-493e-9943-8c9be54ebbcc?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: Endpoint https://login.microsoftonline.com/oauth2/token
{...}
```
(Note the missing SubscriptionID and ResourceGroup in the token refresh URL; this log is pasted as-is, not scrubbed.)

PV Dump: N/A

PVC Dump:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: managed-csi
spec:
  storageClassName: managed-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: bb
spec:
  containers:
    - name: bb
      image: busybox:1.28
      volumeMounts:
        - mountPath: "/var/www/html"
          name: www
  volumes:
    - name: www
      persistentVolumeClaim:
        claimName: managed-csi
```

StorageClass Dump (if StorageClass used by PV/PVC):
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  creationTimestamp: "2022-06-07T16:53:35Z"
  name: managed-csi
  resourceVersion: "532291"
  uid: 8b8bf889-2b03-4671-8990-3c1ab3368cd7
parameters:
  skuname: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Additional info:
wrong subcomponent - switching
I think I see what might be happening. My claims need to be fact-checked, though, since I'm not very familiar with this controller.

Looking at the azure-disk-csi-driver-controller (csi-driver container) logs upon startup:

```
I0609 17:36:21.218505       1 azuredisk.go:142]
DRIVER INFORMATION:
-------------------
Build Date: "2022-05-12T09:57:10Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.9.0
Git Commit: 3937bec4f8027a66a9009af5c740a6135bc86f95
Go Version: go1.17.5
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:
I0609 17:36:21.218523       1 azuredisk.go:145] driver userAgent: disk.csi.azure.com/v1.9.0 gc/go1.17.5 (amd64-linux)
I0609 17:36:21.219936       1 azure_disk_utils.go:129] reading cloud config from secret kube-system/azure-cloud-provider
I0609 17:36:21.239303       1 azure_auth.go:119] azure: using client_id+client_secret to retrieve access token
I0609 17:36:21.239386       1 azure_diskclient.go:67] Azure DisksClient using API version: 2021-04-01
```

You can see it's reading its cloud configuration from the kube-system/azure-cloud-provider secret.

Then, upon creating the same PVC as above (NOTE: I reformatted the request JSON for better readability):

```
I0609 18:31:35.168694       1 utils.go:95] GRPC call: /csi.v1.Controller/CreateVolume
I0609 18:31:35.168727       1 utils.go:96] GRPC request: (NOT RELEVANT, OMITTED)
I0609 18:31:35.506522       1 controllerserver.go:274] begin to create azure disk(pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f) account type(Premium_LRS) rg() location() size(1) diskZone() maxShares(0)
E0609 18:31:35.621637       1 utils.go:100] GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: Endpoint https://login.microsoftonline.com/oauth2/token
```

Notice how the "begin to create azure disk" message has "rg()", indicating it has no resource group. Also, the Azure error message shows the requested resource URI lacks both the subscription ID and the resource group:

```
https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01
                                          ^^               ^^
```

The code that initializes the cloud Config struct is here:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/release-1.9/pkg/azureutils/azure_disk_utils.go#L128-L164

Note how, if it is able to read the kube-system/azure-cloud-provider secret, it doesn't even TRY to read the cloud config file (because 'config' is no longer nil). In other words, the controller appears to assume that ALL of the cloud configuration lives either in the secret OR in the config file. At least on ARO, the kube-system/azure-cloud-provider secret contains only the client ID and client secret; the rest of the cloud configuration lives in /etc/kubernetes/cloud.conf, which is not being read.
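The secret-over-file precedence described above can be sketched as follows. This is a minimal illustration, not the driver's actual code; the `cloudConfig` struct and `loadConfig` helper are invented for this example.

```go
package main

import "fmt"

// cloudConfig holds only the fields relevant to this bug; the real
// driver's Config struct has many more.
type cloudConfig struct {
	AADClientID     string
	AADClientSecret string
	SubscriptionID  string
	ResourceGroup   string
}

// loadConfig sketches the precedence in azure_disk_utils.go: if the
// cloud-provider secret was readable, it becomes the ENTIRE config and the
// on-disk cloud.conf is never consulted, even when the secret is partial.
func loadConfig(fromSecret, fromFile *cloudConfig) *cloudConfig {
	if fromSecret != nil {
		return fromSecret
	}
	return fromFile
}

func main() {
	// On ARO the secret carries only the credentials...
	secret := &cloudConfig{AADClientID: "CLIENT-ID", AADClientSecret: "CLIENT-SECRET"}
	// ...while subscription ID and resource group live in /etc/kubernetes/cloud.conf.
	file := &cloudConfig{SubscriptionID: "sub-id", ResourceGroup: "my-rg"}

	cfg := loadConfig(secret, file)
	// The empty values here are what produce the malformed request URL
	// (".../subscriptions//resourceGroups//...") seen in the logs above.
	fmt.Printf("subscriptionID=%q resourceGroup=%q\n", cfg.SubscriptionID, cfg.ResourceGroup)
}
```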
@matthew, excellent points. The CSI driver should handle this scenario better. It's an anti-pattern for a CSI driver to rely on Kubernetes resources for applying configuration options, especially when their names are hardcoded in the CSI driver. While I'm happy to work with upstream to fix the CSI driver, is there a chance we can work around this in ARO to unblock the customer? Who creates the kube-system/azure-cloud-provider secret in ARO? I don't have it on my IPI cluster; is it really necessary there? If so, can the owner of this secret add the missing configuration there?
The ARO Azure Resource Provider (which is Red Hat code) creates the "azure-cloud-provider" secret, and it's used for many things besides this one driver.

```
$ oc get secret -n kube-system azure-cloud-provider -o json | jq -r '.data["cloud-config"]' | base64 -d
aadClientId: CLIENT-ID
aadClientSecret: CLIENT-SECRET
```

The "azure-config-credentials-injector" init container added in [1] provides a partial solution, if the CSI driver could be made to actually read the merged config file.

Working with what's currently upstream, the CSI driver provides command-line options for overriding the cloud-config secret [2]. If these options could be supplied to make the driver look for a non-existent secret (e.g. --cloud-config-secret-name="" --cloud-config-secret-namespace=""), it would force the driver to fall back to the config file created by "azure-config-credentials-injector". (Unsure if the other pod containers provide similar options, or if that's even needed.)

However, the operator reverts any edits to the "azure-disk-csi-driver-controller" replica set, so this doesn't seem like something the customer could do on their own.

[1] https://github.com/openshift/azure-disk-csi-driver-operator/pull/30
[2] https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/fa3afcd4355aea94e9c413f72da615bb44a3bd5b/pkg/azurediskplugin/main.go#L52-L53
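The fallback those flags would trigger can be sketched like this. The helper and its inputs are hypothetical, invented for illustration; the driver's real option handling lives in the main.go linked at [2].

```go
package main

import "fmt"

// pickConfigSource sketches the proposed workaround: overriding the secret
// name/namespace to "" means no secret can be found, so the driver would
// fall back to the merged config file written by
// azure-config-credentials-injector.
func pickConfigSource(secretName, secretNamespace string, secretReadable func(name, ns string) bool) string {
	if secretName != "" && secretNamespace != "" && secretReadable(secretName, secretNamespace) {
		return "secret"
	}
	return "file"
}

func main() {
	// Stand-in for an API lookup: only the ARO-created secret exists.
	exists := func(name, ns string) bool {
		return name == "azure-cloud-provider" && ns == "kube-system"
	}

	// Default flags: the partial ARO secret is found and used.
	fmt.Println(pickConfigSource("azure-cloud-provider", "kube-system", exists))
	// Overridden flags: the lookup is skipped and the config file is read.
	fmt.Println(pickConfigSource("", "", exists))
}
```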
Thanks for the quick resolution! The customer that encountered this is on 4.10, so requesting a backport once the PR is merged.
Issue reproduced on 4.10.15 on an ARO cluster. Verified pass on 4.11.0-0.ci-2022-06-17-201634:

```
$ oc -n openshift-cluster-csi-drivers get pod azure-disk-csi-driver-controller-7f87c479db-gjm7f -o yaml | grep config
    - --cloud-config-secret-name=""
    - --cloud-config-secret-namespace=""

$ oc get pod,pvc
NAME                               READY   STATUS    RESTARTS   AGE
pod/mydeploy-16-7db899cb84-26sgl   1/1     Running   0          97s

NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-16     Bound    pvc-e87841b3-b039-48b7-8bb7-204d18c4b827   1Gi        RWO            managed-csi    97s
```

Will double-check on an accepted 4.11 nightly, and also check that there is no impact on a common OCP cluster.
Verified on 4.11.0-0.nightly-2022-06-21-151125.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069