Bug 1704356
| Summary: | [stability] Some e2e tests failing due to unable to detach EBS volume in under 10m | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Storage | Assignee: | Christian Huffman <chuffman> |
| Status: | CLOSED ERRATA | QA Contact: | Chao Yang <chaoyang> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | aos-bugs, aos-storage-staff, bbennett, hekumar, jsafrane, nagrawal, vlaad, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:28:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | storage operation, cloudprovider API usage | | |
Description
Clayton Coleman
2019-04-29 15:45:40 UTC
Part of the code is clearly running into quota issues:

I0429 14:27:54.993644 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: Error deleting EBS volume "vol-081919e38a0970c39" since volume is currently attached to "i-0efeaa5d10ef5d981"
I0429 14:28:48.328103 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: error deleting EBS volume "vol-081919e38a0970c39": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 8e7efc52-8f96-4e42-a908-9959413f8bba"
..
..
E0429 14:32:51.346054 1 goroutinemap.go:150] Operation for "delete-pvc-07169b19-6a89-11e9-9e43-0ab516d65a44[078ce9ee-6a89-11e9-9e43-0ab516d65a44]" failed. No retries permitted until 2019-04-29 14:32:59.34602468 +0000 UTC m=+2672.681919500 (durationBeforeRetry 8s). Error: "error deleting EBS volume \"vol-081919e38a0970c39\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 7902311f-e683-4ada-9cc4-4c2f269fa707\""
..

and it eventually detaches:

I0429 14:34:52.149830 1 aws_util.go:67] Successfully deleted EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39

err - it eventually deletes. The detach, FWIW, happened correctly at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"

I debugged this to some extent and found that this e2e run made 600 DescribeVolumes calls in total. The number of storage operations that could trigger DescribeVolumes was small, so I suspect we kept wait-looping for the volume to reach the desired state by calling DescribeVolumes repeatedly.
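As an illustration of the wait-loop pattern described above, here is a minimal sketch (not the in-tree cloud-provider code) of polling DescribeVolumes with exponential backoff, so that RequestLimitExceeded throttling makes the poll interval grow instead of hammering the API. The waitForVolumeAvailable helper and its backoff parameters are assumptions for illustration.

// Sketch: poll an EBS volume's state with exponential backoff instead of a
// tight wait loop, treating RequestLimitExceeded as a retriable condition.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForVolumeAvailable (hypothetical helper) polls DescribeVolumes until
// the volume reports "available" (i.e. fully detached), backing off
// exponentially between attempts.
func waitForVolumeAvailable(client *ec2.EC2, volumeID string) error {
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first retry after 5s
		Factor:   2.0,             // then 10s, 20s, 40s, ...
		Steps:    8,
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
			VolumeIds: []*string{aws.String(volumeID)},
		})
		if err != nil {
			if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "RequestLimitExceeded" {
				return false, nil // throttled: wait for the next backoff step
			}
			return false, err // any other error aborts the wait
		}
		if len(out.Volumes) == 1 && aws.StringValue(out.Volumes[0].State) == ec2.VolumeStateAvailable {
			return true, nil
		}
		return false, nil // still attached or in transition; keep waiting
	})
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	if err := waitForVolumeAvailable(client, "vol-081919e38a0970c39"); err != nil {
		fmt.Println("volume never became available:", err)
	}
}

The point of the sketch is only that each additional DescribeVolumes poll counts against the shared, account-level quota, so backing off aggressively under throttling reduces exactly the pressure shown in the logs above.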
I also looked at the API usage across the whole account, and these were the top callers from aws-sdk:
414382 aws-sdk-go/1.14.12:
163625 ec2.DescribeInstances
125678 ec2.DescribeVolumes
24278 ec2.DescribeSecurityGroups
10054 elasticloadbalancing.DescribeLoadBalancers
8481 ec2.AttachVolume
8455 ec2.DetachVolume
7745 ec2.CreateVolume
7450 ec2.DeleteVolume
5013 ec2.DescribeSubnets
4991 ec2.DescribeRouteTables
4601 elasticloadbalancing.ModifyLoadBalancerAttributes
4601 elasticloadbalancing.DescribeLoadBalancerAttributes
I am guessing the aws-sdk-go user agent is the Kubernetes client. The installer is still the top user, though:
255441 iam.GetRole
239358 iam.GetUser
32121 ec2.DeleteNatGateway
28028 iam.ListRoles
26878 iam.ListUsers
22798 ec2.DeleteNetworkInterface (Client.InvalidParameterValue)
18387 elasticloadbalancing.DescribeLoadBalancers
18345 ec2.DeleteSubnet (Client.RequestLimitExceeded)
Nothing immediately obvious points to our code being wrong. We have made many small tweaks in the past to reduce mutating API usage in Kubernetes, at the cost of more read-only API usage. I need to keep looking.
Created attachment 1560513 [details]
storage operation
Created attachment 1560514 [details]
cloudprovider API usage
Just to put a timescale on this [1]:

failed: (15m8s) 2019-04-29T14:29:02 "[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]"

With the detach at 14:27:50 and the delete at 14:34:52 per the earlier comments, we're detaching over a minute before the failure message, although we're deleting over five minutes after the failure message. Are the tests blocked on the detach or the delete? Maybe the detach, but there was subsequent activity that didn't have time to complete? As a short-term mitigation, can we bump the timeout for this test by 10 minutes or so?

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/52/pull-ci-openshift-service-ca-operator-master-e2e-aws/319/build-log.txt

We aren't blocked on the detach; the detach happened at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"

But for some reason, when the volume delete fails because of API quota issues, the event does not change to the API quota message, so from the outside, if you only look at build-log.txt, it appears that the test is blocked on the detach. One way we could somewhat improve this is to prevent the delete operation until the detach has succeeded (a sketch of this idea follows below). Currently they run concurrently. Traditionally we block two operations on the same volume from running concurrently, but in this case, because attach/detach and delete are managed by different controllers, we don't.

Detach and delete happening concurrently is a known issue - https://github.com/kubernetes/kubernetes/issues/31511 . It is not easy to fix, though, because volume deletion and detach are managed by different controllers, and sharing state between the two is going to be problematic.

Pushed to 4.2 since it seems risky to fix now, and the delete eventually happens.

I debugged some more and found out why a detach was not attempted for volume vol-081919e38a0970c39 between 14:14 and 14:27. What I learned is that vol-081919e38a0970c39 is not the problematic volume in this test run. It is a red herring; the volume that is the actual root cause of this bug is vol-0826a4064906dd466. What happened is: this test creates a StatefulSet with 3 replicas, makes various changes to it, and expects the updates to transition correctly. During one of the changes to one of the pods, vol-0826a4064906dd466 refuses to detach, and hence the pod does not successfully transition to the correct state:

openshift-kube-controller-manager_kube-controller-manager-ip-10-0-158-12.ec2.internal_kube-controller-manager-4.log:E0429 14:23:49.173151 1 attacher.go:270] Error detaching volumeID "vol-0826a4064906dd466": error detaching EBS volume "vol-0826a4064906dd466" from "i-09f3d8ff56e1ae785": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: b50b6338-9e9d-4f7c-902d-849afd79265c"

This causes the other 2 pods to just hang around doing nothing until the entire StatefulSet is deleted. At that point the API quota issue strikes again and logs the volume. So we have two problems:

1. We don't detach in high-load scenarios (this will continue to affect CI, so I expect high priority on this post-4.1).
2. We are hitting rate limiting in our CI account, which needs either code changes or rate-limit changes to correct (or, better, both).

Also:

3. We have known concurrency issues (this will continue to affect CI, so I also expect high priority in 4.2).
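A minimal sketch of the delete-gating idea mentioned above, assuming nothing about how the in-tree controllers are actually wired: check via DescribeVolumes that the volume reports no attachments before spending a DeleteVolume call that is guaranteed to fail. The deleteIfDetached helper name and its error messages are assumptions for illustration.

// Sketch: only call DeleteVolume once the volume reports zero attachments;
// otherwise return an error so the caller can requeue and retry later.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteIfDetached (hypothetical helper) deletes the EBS volume only when
// DescribeVolumes shows it is fully detached.
func deleteIfDetached(client *ec2.EC2, volumeID string) error {
	out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: []*string{aws.String(volumeID)},
	})
	if err != nil {
		return fmt.Errorf("describing %s: %w", volumeID, err)
	}
	if len(out.Volumes) != 1 {
		return fmt.Errorf("expected one volume for %s, got %d", volumeID, len(out.Volumes))
	}
	if n := len(out.Volumes[0].Attachments); n > 0 {
		// Still attached: skip the doomed DeleteVolume call and retry later.
		return fmt.Errorf("volume %s still has %d attachment(s); retry later", volumeID, n)
	}
	_, err = client.DeleteVolume(&ec2.DeleteVolumeInput{VolumeId: aws.String(volumeID)})
	return err
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	if err := deleteIfDetached(client, "vol-081919e38a0970c39"); err != nil {
		fmt.Println(err)
	}
}

As noted above, doing this in the real controllers is harder, because detach is driven by the attach/detach controller while deletion is driven by the PV controller, and they do not share this state.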
AWS has three API quotas (https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html). They're counted per account, not per client.

Describe actions:

OpenShift (338240 total):
163625 ec2.DescribeInstances
125678 ec2.DescribeVolumes
24278 ec2.DescribeSecurityGroups
10054 elasticloadbalancing.DescribeLoadBalancers
5013 ec2.DescribeSubnets
4991 ec2.DescribeRouteTables
4601 elasticloadbalancing.DescribeLoadBalancerAttributes

Installer (586437 total):
255441 iam.GetRole
239358 iam.GetUser
28028 iam.ListRoles
26878 iam.ListUsers
18387 elasticloadbalancing.DescribeLoadBalancers
18345 ec2.DeleteSubnet

Modify actions:

OpenShift (36732 total):
8481 ec2.AttachVolume
8455 ec2.DetachVolume
7745 ec2.CreateVolume
7450 ec2.DeleteVolume
4601 elasticloadbalancing.ModifyLoadBalancerAttributes

Installer (73264 total):
32121 ec2.DeleteNatGateway
22798 ec2.DeleteNetworkInterface
18345 ec2.DeleteSubnet

From the numbers it's clear that the installer should be fixed first. We'll continue to improve the OpenShift storage part in this BZ; I filed https://bugzilla.redhat.com/show_bug.cgi?id=1708262 for the installer part.

There have been at least 3 PRs upstream (which I will link shortly) to reduce API usage further in Kubernetes. Let's backport those and close this. One major concern is that it is not immediately obvious when API quota errors are hit; we are tracking bubbling those errors up as events in a separate BZ. We should backport:

1. https://github.com/kubernetes/kubernetes/pull/78276
2. https://github.com/kubernetes/kubernetes/pull/78601
3. https://github.com/kubernetes/kubernetes/pull/78140

I've submitted a PR [1] to backport these into the openshift/origin master branch.

[1] https://github.com/openshift/origin/pull/23424

Moving this to POST as a PR [1] has been submitted.

[1] https://github.com/openshift/origin/pull/23424
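To make the goal of reducing API usage concrete, here is one generic way to cut DescribeVolumes traffic, shown purely as an illustration and not as what the linked PRs implement: batch several volume IDs into a single DescribeVolumes request instead of issuing one call per volume. The describeVolumesBatched helper is an assumption for illustration.

// Sketch: one DescribeVolumes call for a whole set of volume IDs, returning
// the reported state keyed by volume ID.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func describeVolumesBatched(client *ec2.EC2, volumeIDs []string) (map[string]string, error) {
	out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: aws.StringSlice(volumeIDs),
	})
	if err != nil {
		return nil, err
	}
	states := make(map[string]string, len(volumeIDs))
	for _, v := range out.Volumes {
		states[aws.StringValue(v.VolumeId)] = aws.StringValue(v.State)
	}
	return states, nil
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	states, err := describeVolumesBatched(client, []string{
		"vol-081919e38a0970c39",
		"vol-0826a4064906dd466",
	})
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	for id, state := range states {
		fmt.Printf("%s: %s\n", id, state)
	}
}

Pagination (NextToken) is omitted for brevity; with many volumes a real implementation would also page through results.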
1. Create 6 StatefulSets like the one below:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web0
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: aosqe/hello-openshift
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: ww0
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: ww0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
2. Delete them; then re-create these StatefulSets.
Repeated the above 10 times and did not hit the AWS quota.
Clusterversion is 4.2.0-0.nightly-2019-08-15-232721.
Updating this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922