Description of problem:
When AWS CSI migration is enabled, the metrics for cloud provider error requests appear to be lost.

Version-Release number of selected component (if applicable):
4.8.0-fc.0

How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster on AWS
2. Enable CSI migration
3. Create a PVC with the gp2 storageclass
4. Expand this PVC 2 times

Actual results:
Resizing failed with the following error:
Warning VolumeResizeFailed 1s external-resizer ebs.csi.aws.com (combined from similar events): resize volume "pvc-f6286614-1d0f-411d-afb3-323d6a4c605b" by resizer "ebs.csi.aws.com" failed: rpc error: code = Internal desc = Could not resize volume "vol-064cf733d700d2365": could not modify AWS volume "vol-064cf733d700d2365": VolumeModificationRateExceeded: You've reached the maximum modification rate per volume limit. Wait at least 6 hours between modifications per EBS volume. status code: 400, request id: d58b1b51-00d2-4e26-a1cc-380a6b3b182e

Check the metrics:
Neither "cloudprovider_aws_api_request_errors" nor "csi_sidecar_operations_errors" metrics are present.

Expected results:
The metrics for error requests or operations are still available.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
Yeah, this is a known issue: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/806
Just for the record, we had previously talked about this in a team meeting and we decided that we do need cloud metrics before CSI migration goes GA. However, it's OK to not have it in Tech Preview (4.8).
*** Bug 1956791 has been marked as a duplicate of this bug. ***
fixed here https://github.com/openshift/aws-ebs-csi-driver-operator/pull/125
Verified with: 4.8.0-0.nightly-2021-05-12-184904

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "en" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=cloudprovider_aws_api_request_errors'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   876    0   876    0     0  41714      0 --:--:-- --:--:-- --:--:-- 43800
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cloudprovider_aws_api_request_errors",
          "container": "driver-kube-rbac-proxy",
          "endpoint": "driver-m",
          "instance": "10.0.162.1:9206",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "pod": "aws-ebs-csi-driver-controller-7d49867c85-bgbpv",
          "request": "DescribeVolumesModifications",
          "service": "aws-ebs-csi-driver-controller-metrics"
        },
        "value": [
          1620886264.24,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "cloudprovider_aws_api_request_errors",
          "container": "driver-kube-rbac-proxy",
          "endpoint": "driver-m",
          "instance": "10.0.162.1:9206",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "pod": "aws-ebs-csi-driver-controller-7d49867c85-bgbpv",
          "request": "ModifyVolume",
          "service": "aws-ebs-csi-driver-controller-metrics"
        },
        "value": [
          1620886264.24,
          "9"
        ]
      }
    ]
  }
}
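
For anyone who wants to script this check rather than read the JSON by eye, here is a minimal Python sketch that totals the error counts per AWS API request from a Prometheus instant-query response shaped like the one in the verification output. The function name errors_by_request and the SAMPLE constant are hypothetical names made up for this illustration, and the sample is trimmed to the "request" label and values from that output. An empty result corresponds to the originally reported symptom of the metric being absent.

```python
import json
from collections import defaultdict


def errors_by_request(response: dict) -> dict:
    """Sum sample values per 'request' label in an instant-query response.

    Hypothetical helper for illustration; it only assumes the
    status/data/result layout of the Prometheus HTTP API vector result.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed: %r" % response.get("status"))

    totals = defaultdict(float)
    for sample in response["data"]["result"]:
        request = sample["metric"].get("request", "<unknown>")
        # Each vector sample is [unix_timestamp, "value as string"].
        _timestamp, value = sample["value"]
        totals[request] += float(value)
    return dict(totals)


# Trimmed copy of the response from the verification output
# (only the labels needed for this sketch are kept).
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "cloudprovider_aws_api_request_errors",
                  "request": "DescribeVolumesModifications"},
       "value": [1620886264.24, "1"]},
      {"metric": {"__name__": "cloudprovider_aws_api_request_errors",
                  "request": "ModifyVolume"},
       "value": [1620886264.24, "9"]}
    ]
  }
}
""")

if __name__ == "__main__":
    totals = errors_by_request(SAMPLE)
    if not totals:
        # An empty result vector is what the original report observed.
        print("metric missing: no samples returned")
    for request, count in sorted(totals.items()):
        print(request, count)
```

In practice the output of the curl command could be piped into such a script instead of jq; that wiring is left out here because it depends on the local setup.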
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438