In the last few days I have noticed several StatefulSet tests (which depend on PVCs and detach behaving reasonably) time out waiting for detach. This is usually a leading indicator of a detach race / bug. The tests are flaking in roughly 1 in 10 runs, so this isn't happening all the time, but failure to detach is bad. We need to triage this to determine what is going on.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/52/pull-ci-openshift-service-ca-operator-master-e2e-aws/319#openshift-tests-sig-apps-statefulset-k8sio-basic-statefulset-functionality-statefulsetbasic-should-perform-rolling-updates-and-roll-backs-of-template-modifications-with-pvcs-suiteopenshiftconformanceparallel-suitek8s

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s] (15m8s)

Apr 29 14:28:50.978: INFO: Still waiting for pvs of statefulset to disappear:
pvc-07169b19-6a89-11e9-9e43-0ab516d65a44: {Phase:Failed Message:Error deleting EBS volume "vol-081919e38a0970c39" since volume is currently attached to "i-0efeaa5d10ef5d981" Reason:}
---------------------------------------------------------
Received interrupt. Running AfterSuite...
^C again to terminate immediately
Part of the code is clearly running into quota issues:

I0429 14:27:54.993644 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: Error deleting EBS volume "vol-081919e38a0970c39" since volume is currently attached to "i-0efeaa5d10ef5d981"
I0429 14:28:48.328103 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: error deleting EBS volume "vol-081919e38a0970c39": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 8e7efc52-8f96-4e42-a908-9959413f8bba"
..
..
E0429 14:32:51.346054 1 goroutinemap.go:150] Operation for "delete-pvc-07169b19-6a89-11e9-9e43-0ab516d65a44[078ce9ee-6a89-11e9-9e43-0ab516d65a44]" failed. No retries permitted until 2019-04-29 14:32:59.34602468 +0000 UTC m=+2672.681919500 (durationBeforeRetry 8s). Error: "error deleting EBS volume \"vol-081919e38a0970c39\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 7902311f-e683-4ada-9cc4-4c2f269fa707\""
..

.. and it eventually detaches:

I0429 14:34:52.149830 1 aws_util.go:67] Successfully deleted EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39
Err - it eventually deletes, rather. The detach, FWIW, happened correctly at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"
I debugged this to some extent and found that this e2e run made 600 DescribeVolumes calls in total. The total number of storage operations that could trigger DescribeVolumes was small, though, so I suspect we kept looping, waiting for volumes to reach the desired state by calling DescribeVolumes repeatedly.

I also looked at API usage across the whole account; these were the top users from aws-sdk:

414382 aws-sdk-go/1.14.12:
  163625 ec2.DescribeInstances
  125678 ec2.DescribeVolumes
   24278 ec2.DescribeSecurityGroups
   10054 elasticloadbalancing.DescribeLoadBalancers
    8481 ec2.AttachVolume
    8455 ec2.DetachVolume
    7745 ec2.CreateVolume
    7450 ec2.DeleteVolume
    5013 ec2.DescribeSubnets
    4991 ec2.DescribeRouteTables
    4601 elasticloadbalancing.ModifyLoadBalancerAttributes
    4601 elasticloadbalancing.DescribeLoadBalancerAttributes

I am guessing aws-sdk-go/1.14.12 is the Kubernetes client. The installer is still the top user, though:

  255441 iam.GetRole
  239358 iam.GetUser
   32121 ec2.DeleteNatGateway
   28028 iam.ListRoles
   26878 iam.ListUsers
   22798 ec2.DeleteNetworkInterface (Client.InvalidParameterValue)
   18387 elasticloadbalancing.DescribeLoadBalancers
   18345 ec2.DeleteSubnet (Client.RequestLimitExceeded)

There isn't anything immediately obvious pointing to a bug in our code. We have made many small tweaks in the past to reduce mutable API usage in k8s at the cost of immutable API usage. I need to keep looking.
Created attachment 1560513 [details] storage operation
Created attachment 1560514 [details] cloudprovider API usage
Just to put a timescale on this [1]:

failed: (15m8s) 2019-04-29T14:29:02 "[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]"

With detach at 14:27:50 and delete at 14:34:52 per earlier comments, we're detaching over a minute before the failure message, but deleting over five minutes after it. Are the tests blocked on the detach or the delete? Maybe the detach, but there was subsequent activity that didn't have time to complete?

As a short-term mitigation, can we bump the timeout for this test by 10 minutes or so?

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/52/pull-ci-openshift-service-ca-operator-master-e2e-aws/319/build-log.txt
We aren't blocked on detach - detach happened at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"

But for some reason, when the volume delete fails because of API quota issues, the event does not change to the API-quota message, so from the outside, looking only at build-log.txt, it appears that the test is blocked on detach.

One way to somewhat improve this would be to prevent the delete operation from starting until detach has succeeded; currently they run concurrently. Traditionally we block two operations on the same volume from running concurrently, but because attach/detach and delete are managed by different controllers, in this case we don't.
Detach and delete happening concurrently is a known issue - https://github.com/kubernetes/kubernetes/issues/31511 . It is not easy to fix, though, because volume deletion and detach are managed by different controllers, and sharing state between the two is going to be problematic.
Pushed to 4.2 since it seems risky to fix now, and the delete eventually happens.
I debugged some more and found out why detach was not attempted for volume vol-081919e38a0970c39 between 14:14 and 14:27. What I learned is that vol-081919e38a0970c39 is not the problematic volume in this test run. It is a red herring; the volume that is the actual root cause of this bug is vol-0826a4064906dd466.

So what happened is: this test creates a StatefulSet with 3 replicas. It makes various changes to the StatefulSet and expects the updates to transition correctly. During one of the changes to one of the pods, vol-0826a4064906dd466 refuses to detach, and hence the pod does not transition to the correct state:

openshift-kube-controller-manager_kube-controller-manager-ip-10-0-158-12.ec2.internal_kube-controller-manager-4.log:E0429 14:23:49.173151 1 attacher.go:270] Error detaching volumeID "vol-0826a4064906dd466": error detaching EBS volume "vol-0826a4064906dd466" from "i-09f3d8ff56e1ae785": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: b50b6338-9e9d-4f7c-902d-849afd79265c"

This causes the other 2 pods to hang around doing nothing until the entire StatefulSet is deleted. At that point, the API quota issue strikes again and logs the volume.
So we have two problems:

1. We don't detach in high-load scenarios (this will continue to affect CI, so I expect high priority on this post-4.1).
2. We are hitting rate limiting in our CI account, which needs either code changes or rate-limit changes to correct (or better, both).

Anything else?
Also:

3. We have known concurrency issues (this will also continue to affect CI, so I expect high priority in 4.2).
AWS has three API quotas (https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html). They're counted per account, not per client.

Describe actions:

OpenShift (338240 total):
  163625 ec2.DescribeInstances
  125678 ec2.DescribeVolumes
   24278 ec2.DescribeSecurityGroups
   10054 elasticloadbalancing.DescribeLoadBalancers
    5013 ec2.DescribeSubnets
    4991 ec2.DescribeRouteTables
    4601 elasticloadbalancing.DescribeLoadBalancerAttributes

Installer (586437 total):
  255441 iam.GetRole
  239358 iam.GetUser
   28028 iam.ListRoles
   26878 iam.ListUsers
   18387 elasticloadbalancing.DescribeLoadBalancers
   18345 ec2.DeleteSubnet

Modify actions:

OpenShift (36732 total):
   8481 ec2.AttachVolume
   8455 ec2.DetachVolume
   7745 ec2.CreateVolume
   7450 ec2.DeleteVolume
   4601 elasticloadbalancing.ModifyLoadBalancerAttributes

Installer (73264 total):
  32121 ec2.DeleteNatGateway
  22798 ec2.DeleteNetworkInterface
  18345 ec2.DeleteSubnet

From the numbers it's clear that the installer should be fixed first.
We'll continue to improve the OpenShift storage part in this BZ; I filed https://bugzilla.redhat.com/show_bug.cgi?id=1708262 for the installer part.
There have been at least 3 upstream PRs (which I will link shortly) to further reduce API usage in k8s. Let's backport those and close this. One major concern is that it is not immediately obvious when API quota errors are hit; we are tracking bubbling those errors up as events in a separate BZ.
We should backport:

1. https://github.com/kubernetes/kubernetes/pull/78276
2. https://github.com/kubernetes/kubernetes/pull/78601
3. https://github.com/kubernetes/kubernetes/pull/78140
I've submitted a PR [1] to backport these into the openshift/origin master branch. [1] https://github.com/openshift/origin/pull/23424
Moving this to POST as a PR [1] has been submitted. [1] https://github.com/openshift/origin/pull/23424
1. Create 6 StatefulSets like the one below:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web0
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: aosqe/hello-openshift
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: ww0
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: ww0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

2. Delete them, then re-create the StatefulSets.

Repeated the above 10 times and did not hit the AWS quota. Clusterversion is 4.2.0-0.nightly-2019-08-15-232721.

Updating this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922