Bug 1704356
| Summary: | [stability] Some e2e tests failing due to unable to detach EBS volume in under 10m | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Storage | Assignee: | Christian Huffman <chuffman> |
| Status: | CLOSED ERRATA | QA Contact: | Chao Yang <chaoyang> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | aos-bugs, aos-storage-staff, bbennett, hekumar, jsafrane, nagrawal, vlaad, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:28:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | storage operation, cloudprovider API usage | | |
Description
Clayton Coleman
2019-04-29 15:45:40 UTC
Part of the code is clearly running into quota issues:

I0429 14:27:54.993644 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: Error deleting EBS volume "vol-081919e38a0970c39" since volume is currently attached to "i-0efeaa5d10ef5d981"
I0429 14:28:48.328103 1 aws_util.go:63] Error deleting EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39: error deleting EBS volume "vol-081919e38a0970c39": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 8e7efc52-8f96-4e42-a908-9959413f8bba"
..
..
E0429 14:32:51.346054 1 goroutinemap.go:150] Operation for "delete-pvc-07169b19-6a89-11e9-9e43-0ab516d65a44[078ce9ee-6a89-11e9-9e43-0ab516d65a44]" failed. No retries permitted until 2019-04-29 14:32:59.34602468 +0000 UTC m=+2672.681919500 (durationBeforeRetry 8s). Error: "error deleting EBS volume \"vol-081919e38a0970c39\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 7902311f-e683-4ada-9cc4-4c2f269fa707\""
..

and it eventually detaches:

I0429 14:34:52.149830 1 aws_util.go:67] Successfully deleted EBS Disk volume aws://us-east-1b/vol-081919e38a0970c39

err - it eventually deletes. The detach, FWIW, happened correctly at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"

I debugged this to some extent and found that this e2e run made 600 DescribeVolumes calls in total. The number of storage operations that could trigger DescribeVolumes was small, so I suspect we kept wait-looping for the volume to reach the desired state by calling DescribeVolumes repeatedly.
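As an illustration of the wait-loop pattern described above, here is a minimal sketch (not the in-tree cloud-provider code) of polling DescribeVolumes with exponential backoff, so that RequestLimitExceeded throttling makes the poll interval grow instead of hammering the API. The waitForVolumeAvailable helper and its backoff parameters are assumptions for illustration.

// Sketch: poll an EBS volume's state with exponential backoff instead of a
// tight wait loop, treating RequestLimitExceeded as a retriable condition.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForVolumeAvailable (hypothetical helper) polls DescribeVolumes until
// the volume reports "available" (i.e. fully detached), backing off
// exponentially between attempts.
func waitForVolumeAvailable(client *ec2.EC2, volumeID string) error {
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first retry after 5s
		Factor:   2.0,             // then 10s, 20s, 40s, ...
		Steps:    8,
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
			VolumeIds: []*string{aws.String(volumeID)},
		})
		if err != nil {
			if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "RequestLimitExceeded" {
				return false, nil // throttled: wait for the next backoff step
			}
			return false, err // any other error aborts the wait
		}
		if len(out.Volumes) == 1 && aws.StringValue(out.Volumes[0].State) == ec2.VolumeStateAvailable {
			return true, nil
		}
		return false, nil // still attached or in transition; keep waiting
	})
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	if err := waitForVolumeAvailable(client, "vol-081919e38a0970c39"); err != nil {
		fmt.Println("volume never became available:", err)
	}
}

The point of the sketch is only that each additional DescribeVolumes poll counts against the shared, account-level quota, so backing off aggressively under throttling reduces exactly the pressure shown in the logs above.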
I also looked at the API usage across the whole account, and these were the top callers from aws-sdk:
414382 aws-sdk-go/1.14.12:
163625 ec2.DescribeInstances
125678 ec2.DescribeVolumes
24278 ec2.DescribeSecurityGroups
10054 elasticloadbalancing.DescribeLoadBalancers
8481 ec2.AttachVolume
8455 ec2.DetachVolume
7745 ec2.CreateVolume
7450 ec2.DeleteVolume
5013 ec2.DescribeSubnets
4991 ec2.DescribeRouteTables
4601 elasticloadbalancing.ModifyLoadBalancerAttributes
4601 elasticloadbalancing.DescribeLoadBalancerAttributes
I am guessing the aws-sdk-go user agent is the Kubernetes client. The installer is still the top user, though:
255441 iam.GetRole
239358 iam.GetUser
32121 ec2.DeleteNatGateway
28028 iam.ListRoles
26878 iam.ListUsers
22798 ec2.DeleteNetworkInterface (Client.InvalidParameterValue)
18387 elasticloadbalancing.DescribeLoadBalancers
18345 ec2.DeleteSubnet (Client.RequestLimitExceeded)
Nothing immediately obvious points to our code being wrong. We have made many small tweaks in the past to reduce mutating API usage in Kubernetes, at the cost of more read-only API usage. I need to keep looking.
Created attachment 1560513 [details]
storage operation
Created attachment 1560514 [details]
cloudprovider API usage
Just to put a timescale on this [1]:

failed: (15m8s) 2019-04-29T14:29:02 "[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]"

With the detach at 14:27:50 and the delete at 14:34:52 per the earlier comments, we're detaching over a minute before the failure message, although we're deleting over five minutes after the failure message. Are the tests blocked on the detach or the delete? Maybe the detach, but there was subsequent activity that didn't have time to complete? As a short-term mitigation, can we bump the timeout for this test by 10 minutes or so?

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/52/pull-ci-openshift-service-ca-operator-master-e2e-aws/319/build-log.txt

We aren't blocked on the detach; the detach happened at:

I0429 14:27:50.454737 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-07169b19-6a89-11e9-9e43-0ab516d65a44" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-081919e38a0970c39") on node "ip-10-0-144-221.ec2.internal"

But for some reason, when the volume delete fails because of API quota issues, the event does not change to the API quota message, so from the outside, if you only look at build-log.txt, it appears that the test is blocked on the detach. One way we could somewhat improve this is to prevent the delete operation until the detach has succeeded (a sketch of this idea follows below). Currently they run concurrently. Traditionally we block two operations on the same volume from running concurrently, but in this case, because attach/detach and delete are managed by different controllers, we don't.

Detach and delete happening concurrently is a known issue - https://github.com/kubernetes/kubernetes/issues/31511 . It is not easy to fix, though, because volume deletion and detach are managed by different controllers, and sharing state between the two is going to be problematic.

Pushed to 4.2 since it seems risky to fix now, and the delete eventually happens.

I debugged some more and found out why a detach was not attempted for volume vol-081919e38a0970c39 between 14:14 and 14:27. What I learned is that vol-081919e38a0970c39 is not the problematic volume in this test run. It is a red herring; the volume that is the actual root cause of this bug is vol-0826a4064906dd466. What happened is: this test creates a StatefulSet with 3 replicas, makes various changes to it, and expects the updates to transition correctly. During one of the changes to one of the pods, vol-0826a4064906dd466 refuses to detach, and hence the pod does not successfully transition to the correct state:

openshift-kube-controller-manager_kube-controller-manager-ip-10-0-158-12.ec2.internal_kube-controller-manager-4.log:E0429 14:23:49.173151 1 attacher.go:270] Error detaching volumeID "vol-0826a4064906dd466": error detaching EBS volume "vol-0826a4064906dd466" from "i-09f3d8ff56e1ae785": "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: b50b6338-9e9d-4f7c-902d-849afd79265c"

This causes the other 2 pods to just hang around doing nothing until the entire StatefulSet is deleted. At that point the API quota issue strikes again and logs the volume. So we have two problems:

1. We don't detach in high-load scenarios (this will continue to affect CI, so I expect high priority on this post-4.1).
2. We are hitting rate limiting in our CI account, which needs either code changes or rate-limit changes to correct (or, better, both).

Also:

3. We have known concurrency issues (this will continue to affect CI, so I also expect high priority in 4.2).
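A minimal sketch of the delete-gating idea mentioned above, assuming nothing about how the in-tree controllers are actually wired: check via DescribeVolumes that the volume reports no attachments before spending a DeleteVolume call that is guaranteed to fail. The deleteIfDetached helper name and its error messages are assumptions for illustration.

// Sketch: only call DeleteVolume once the volume reports zero attachments;
// otherwise return an error so the caller can requeue and retry later.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteIfDetached (hypothetical helper) deletes the EBS volume only when
// DescribeVolumes shows it is fully detached.
func deleteIfDetached(client *ec2.EC2, volumeID string) error {
	out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: []*string{aws.String(volumeID)},
	})
	if err != nil {
		return fmt.Errorf("describing %s: %w", volumeID, err)
	}
	if len(out.Volumes) != 1 {
		return fmt.Errorf("expected one volume for %s, got %d", volumeID, len(out.Volumes))
	}
	if n := len(out.Volumes[0].Attachments); n > 0 {
		// Still attached: skip the doomed DeleteVolume call and retry later.
		return fmt.Errorf("volume %s still has %d attachment(s); retry later", volumeID, n)
	}
	_, err = client.DeleteVolume(&ec2.DeleteVolumeInput{VolumeId: aws.String(volumeID)})
	return err
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	if err := deleteIfDetached(client, "vol-081919e38a0970c39"); err != nil {
		fmt.Println(err)
	}
}

As noted above, doing this in the real controllers is harder, because detach is driven by the attach/detach controller while deletion is driven by the PV controller, and they do not share this state.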
AWS has three API quotas (https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html). They're counted per account, not per client.

Describe actions:

OpenShift (338240 total):
163625 ec2.DescribeInstances
125678 ec2.DescribeVolumes
24278 ec2.DescribeSecurityGroups
10054 elasticloadbalancing.DescribeLoadBalancers
5013 ec2.DescribeSubnets
4991 ec2.DescribeRouteTables
4601 elasticloadbalancing.DescribeLoadBalancerAttributes

Installer (586437 total):
255441 iam.GetRole
239358 iam.GetUser
28028 iam.ListRoles
26878 iam.ListUsers
18387 elasticloadbalancing.DescribeLoadBalancers
18345 ec2.DeleteSubnet

Modify actions:

OpenShift (36732 total):
8481 ec2.AttachVolume
8455 ec2.DetachVolume
7745 ec2.CreateVolume
7450 ec2.DeleteVolume
4601 elasticloadbalancing.ModifyLoadBalancerAttributes

Installer (73264 total):
32121 ec2.DeleteNatGateway
22798 ec2.DeleteNetworkInterface
18345 ec2.DeleteSubnet

From the numbers it's clear that the installer should be fixed first. We'll continue to improve the OpenShift storage part in this BZ; I filed https://bugzilla.redhat.com/show_bug.cgi?id=1708262 for the installer part.

There have been at least 3 PRs upstream (which I will link shortly) to reduce API usage further in Kubernetes. Let's backport those and close this. One major concern is that it is not immediately obvious when API quota errors are hit; we are tracking bubbling those errors up as events in a separate BZ. We should backport:

1. https://github.com/kubernetes/kubernetes/pull/78276
2. https://github.com/kubernetes/kubernetes/pull/78601
3. https://github.com/kubernetes/kubernetes/pull/78140

I've submitted a PR [1] to backport these into the openshift/origin master branch.

[1] https://github.com/openshift/origin/pull/23424

Moving this to POST as a PR [1] has been submitted.

[1] https://github.com/openshift/origin/pull/23424
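To make the goal of reducing API usage concrete, here is one generic way to cut DescribeVolumes traffic, shown purely as an illustration and not as what the linked PRs implement: batch several volume IDs into a single DescribeVolumes request instead of issuing one call per volume. The describeVolumesBatched helper is an assumption for illustration.

// Sketch: one DescribeVolumes call for a whole set of volume IDs, returning
// the reported state keyed by volume ID.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func describeVolumesBatched(client *ec2.EC2, volumeIDs []string) (map[string]string, error) {
	out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: aws.StringSlice(volumeIDs),
	})
	if err != nil {
		return nil, err
	}
	states := make(map[string]string, len(volumeIDs))
	for _, v := range out.Volumes {
		states[aws.StringValue(v.VolumeId)] = aws.StringValue(v.State)
	}
	return states, nil
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	states, err := describeVolumesBatched(client, []string{
		"vol-081919e38a0970c39",
		"vol-0826a4064906dd466",
	})
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	for id, state := range states {
		fmt.Printf("%s: %s\n", id, state)
	}
}

Pagination (NextToken) is omitted for brevity; with many volumes a real implementation would also page through results.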
1. Create 6 StatefulSets like the one below:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web0
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: aosqe/hello-openshift
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: ww0
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: ww0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
2. Delete them; then re-create these StatefulSets.
Repeated the above 10 times and did not hit the AWS quota.
Clusterversion is 4.2.0-0.nightly-2019-08-15-232721.
Updating this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922