Bug 2095049
Summary: | managed-csi StorageClass does not create PVs | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | bbergen
Component: | Storage | Assignee: | Fabio Bertinatto <fbertina>
Storage sub component: | Kubernetes External Components | QA Contact: | Wei Duan <wduan>
Status: | CLOSED ERRATA | Severity: | high
Priority: | unspecified | CC: | apurty, asheth, jsafrane, mbarnes
Version: | 4.10 | Keywords: | ServiceDeliveryImpact
Target Release: | 4.11.0 | Hardware: | x86_64
OS: | Linux | Doc Type: | No Doc Update
Last Closed: | 2022-08-10 11:17:00 UTC | Type: | Bug
Bug Blocks: | 2097439 | |
Description: bbergen, 2022-06-08 21:13:05 UTC
wrong subcomponent - switching

---

I think I see what might be happening. My claims need fact-checking, though, since I'm not very familiar with this controller.

Looking at the azure-disk-csi-driver-controller (csi-driver container) logs upon startup:

I0609 17:36:21.218505 1 azuredisk.go:142]
DRIVER INFORMATION:
-------------------
Build Date: "2022-05-12T09:57:10Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.9.0
Git Commit: 3937bec4f8027a66a9009af5c740a6135bc86f95
Go Version: go1.17.5
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:
I0609 17:36:21.218523 1 azuredisk.go:145] driver userAgent: disk.csi.azure.com/v1.9.0 gc/go1.17.5 (amd64-linux)
I0609 17:36:21.219936 1 azure_disk_utils.go:129] reading cloud config from secret kube-system/azure-cloud-provider
I0609 17:36:21.239303 1 azure_auth.go:119] azure: using client_id+client_secret to retrieve access token
I0609 17:36:21.239386 1 azure_diskclient.go:67] Azure DisksClient using API version: 2021-04-01

You can see it's reading its cloud configuration from the kube-system/azure-cloud-provider secret.

Then, upon creating the same PVC as above:

(NOTE: I reformatted the request JSON for better readability.)

I0609 18:31:35.168694 1 utils.go:95] GRPC call: /csi.v1.Controller/CreateVolume
I0609 18:31:35.168727 1 utils.go:96] GRPC request: (NOT RELEVANT, OMITTED)
I0609 18:31:35.506522 1 controllerserver.go:274] begin to create azure disk(pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f) account type(Premium_LRS) rg() location() size(1) diskZone() maxShares(0)
E0609 18:31:35.621637 1 utils.go:100] GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: Endpoint https://login.microsoftonline.com/oauth2/token

Notice how the "begin to create azure disk" message has "rg()", indicating it has no resource group. The Azure error message likewise shows that the requested resource URI lacks both the subscription ID and the resource group (note the empty path segments after "subscriptions/" and "resourceGroups/"):

https://management.azure.com/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pvc-4f483e5f-2a57-48e2-b3d1-4ffc3ccfe40f?api-version=2021-04-01

The code that initializes the cloud Config struct is here:

https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/release-1.9/pkg/azureutils/azure_disk_utils.go#L128-L164

Note how, if it's able to read the kube-system/azure-cloud-provider secret, it doesn't even TRY to read the cloud config file (because 'config' is no longer nil). In other words, the controller appears to assume that ALL of the cloud configuration lives either in the secret OR in the config file.

At least on ARO, the kube-system/azure-cloud-provider secret only contains the client ID and client secret. The rest of the cloud configuration lives in /etc/kubernetes/cloud.conf, which is not being read.
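To make that either/or behavior concrete, here is a minimal Go sketch of the pattern (names are illustrative only, not the driver's actual API; see the link above for the real implementation):

```go
package main

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getCloudConfig sketches the either/or logic described above: when the
// hardcoded secret can be read, its contents are treated as the complete
// cloud config and the config file is never consulted -- even if, as on
// ARO, the secret holds only aadClientId and aadClientSecret.
func getCloudConfig(client kubernetes.Interface) ([]byte, error) {
	secret, err := client.CoreV1().Secrets("kube-system").Get(
		context.TODO(), "azure-cloud-provider", metav1.GetOptions{})
	if err == nil {
		if cfg, ok := secret.Data["cloud-config"]; ok {
			// Secret found: return it verbatim. Fields it does not
			// carry (subscriptionId, resourceGroup, tenantId, ...)
			// stay empty, producing the rg() / empty-URI symptoms
			// in the logs above.
			return cfg, nil
		}
	}
	// Fallback: only reached when the secret is missing or unreadable.
	return os.ReadFile("/etc/kubernetes/cloud.conf")
}
```

The fix discussed below works by forcing this fallback branch, pointing the driver at a non-existent secret so that it reads the config file instead.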
---

@matthew, excellent points. The CSI driver should handle this scenario better. It's an anti-pattern for a CSI driver to rely on Kubernetes resources for its configuration, especially when their names are hardcoded in the driver. While I'm happy to work with upstream to fix the CSI driver, is there a chance we can work around this in ARO to unblock the customer?

---

Who creates the kube-system/azure-cloud-provider secret in ARO? I don't have it on my IPI cluster; is it really necessary there? If so, can the owner of this secret add the missing configuration there?

---

The ARO Azure Resource Provider (which is Red Hat code) creates the "azure-cloud-provider" secret, and it's used for many things besides this one driver.

$ oc get secret -n kube-system azure-cloud-provider -o json | jq -r '.data["cloud-config"]' | base64 -d
aadClientId: CLIENT-ID
aadClientSecret: CLIENT-SECRET

The "azure-config-credentials-injector" init container added in [1] provides a partial solution, if the CSI driver could be made to actually read the merged config file.

Working with what's currently upstream, the CSI driver provides command-line options for overriding the cloud-config secret [2]. If these options could be supplied so that the driver looks for a non-existent secret (e.g. --cloud-config-secret-name="" --cloud-config-secret-namespace=""), it would force the driver to fall back to the config file created by "azure-config-credentials-injector". (Unsure whether the other pod containers provide similar options, or whether that's even needed.)

However, the operator reverts any edits to the "azure-disk-csi-driver-controller" replicaset, so this doesn't seem like something the customer could do on their own.

[1] https://github.com/openshift/azure-disk-csi-driver-operator/pull/30
[2] https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/fa3afcd4355aea94e9c413f72da615bb44a3bd5b/pkg/azurediskplugin/main.go#L52-L53

---

Thanks for the quick resolution! The customer that encountered this is on 4.10, so requesting a backport once the PR is merged.

---

Issue reproduced on 4.10.15 on an ARO cluster.

Verified pass on 4.11.0-0.ci-2022-06-17-201634:

$ oc -n openshift-cluster-csi-drivers get pod azure-disk-csi-driver-controller-7f87c479db-gjm7f -o yaml | grep config
- --cloud-config-secret-name=""
- --cloud-config-secret-namespace=""

$ oc get pod,pvc
NAME                               READY   STATUS    RESTARTS   AGE
pod/mydeploy-16-7db899cb84-26sgl   1/1     Running   0          97s

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-16      Bound    pvc-e87841b3-b039-48b7-8bb7-204d18c4b827   1Gi        RWO            managed-csi    97s

Will double-check on an accepted 4.11 nightly and also confirm there is no impact on a common OCP cluster.

Verified on 4.11.0-0.nightly-2022-06-21-151125.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069