Description of problem:
We're seeing lots of AccessPointNotFoundException failures (36,000+ today).

Version-Release number of selected component (if applicable):
atomic-openshift-3.4.0.39-1

Steps to Reproduce:
1. Upgrade cluster to 3.4 from 3.3
2. Have 100 nodes running

Actual results:
Seeing lots of AccessPointNotFoundException in the CloudTrail logs (36k today). Seeing lots of "Warning: CreatingLoadBalancerFailed" in the events log on the master.

Expected results:

Additional info:
{
    "apiVersion": "2012-06-01",
    "awsRegion": "us-east-1",
    "errorCode": "AccessPointNotFoundException",
    "errorMessage": "There is no ACTIVE Load Balancer named 'ad2a1e56acddf11e6a7440e63b9c1c48'",
    "eventID": "0a165ef2-1712-4b82-908a-ed93eececa3b",
    "eventName": "DescribeLoadBalancers",
    "eventSource": "elasticloadbalancing.amazonaws.com",
    "eventTime": "2017-01-20T00:02:27Z",
    "eventType": "AwsApiCall",
    "eventVersion": "1.04",
    "recipientAccountId": "REDACTED",
    "requestID": "bac95152-dea3-11e6-ac34-3560770da0b9",
    "requestParameters": {
        "loadBalancerNames": [
            "ad2a1e56acddf11e6a7440e63b9c1c48"
        ]
    },
    "responseElements": null,
    "sourceIPAddress": "52.206.63.195",
    "userAgent": "aws-sdk-go/1.0.8 (go1.7.4; linux; amd64)",
    "userIdentity": {
        "accessKeyId": "REDACTED",
        "accountId": "REDACTED",
        "arn": "arn:aws:iam::REDACTED:user/cloud_provider",
        "principalId": "REDACTED",
        "type": "IAMUser",
        "userName": "cloud_provider"
    }
}

There are also many attempts to create load balancers in the event log, which would explain the increase in AccessPointNotFoundException errors.
[removed]   1h   1h   1   wildfly                   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/wildfly: Throttling: Rate exceeded
[removed]   1h   1h   1   mongodb                   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/mongodb: Throttling: Rate exceeded
[removed]   1h   1h   1   nodejs-mongo-persistent   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/nodejs-mongo-persistent: Throttling: Rate exceeded
[removed]   1h   1h   1   hello-openshift           Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/hello-openshift: Throttling: Rate exceeded
[removed]   1h   1h   1   nodejs-mongo-persistent   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/nodejs-mongo-persistent: Throttling: Rate exceeded

Previously, in 3.3, dakini worked around this issue by changing all the service types from LoadBalancer to ClusterIP. See also: https://bugzilla.redhat.com/show_bug.cgi?id=1415839#c2
You are getting throttled by AWS "Throttling: Rate exceeded". There's an upstream issue at https://github.com/kubernetes/kubernetes/issues/22906 where the request limiter isn't getting used when there are permissions problems. Is that what you are hitting?
Yes, it looks like that is the same issue we're seeing.
The root cause of the repetition is lack of permission to access the AWS APIs. Can you see if that is the problem?
If permissions were causing the majority of the repetition, we would expect to receive a significant number of AccessDenied errors back from the API. But as you can see here, the closest we're getting is a handful of Client.UnauthorizedOperation errors.

[sedgar@use-tower1 20]$ logfile=~/preview_cloudtrail_20170120.txt
[sedgar@use-tower1 20]$ for error in $(grep errorCode $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode $logfile |grep -c $error ; done
-------- AccessPointNotFoundException ---------
30561
-------- Client.IncorrectState ---------
1
-------- Client.InvalidVolume.NotFound ---------
234
-------- Client.RequestLimitExceeded ---------
221550
-------- Client.UnauthorizedOperation ---------
135
-------- Client.VolumeInUse ---------
144042
-------- Server.InternalError ---------
1
-------- Server.Unavailable ---------
166

In addition, when we look at the cause of the AccessPointNotFoundException errors, we can see they are generated by DescribeLoadBalancers events.

Here are our current IAM permissions for the cloud_provider user, in case that helps clear up any permissions concerns:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:AttachVolume",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:DeleteVolume",
                "ec2:DescribeInstances",
                "elasticloadbalancing:DescribeInstanceHealth",
                "elasticloadbalancing:DescribeLoadBalancerAttributes",
                "elasticloadbalancing:DescribeLoadBalancers",
                "ec2:DescribeVolumes",
                "ec2:DetachVolume",
                "kms:CreateGrant"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow",
            "Sid": "Stmt1438195894000"
        }
    ]
}
Sorry, I wasn't clear. I meant that the root cause of the repetition in the upstream issue was permission problems that caused Kubernetes to retry and get throttled. I wonder if we are just generating more requests than the default API throttle allows. Do you know if we can get Amazon to increase it? I'll look into why we are doing so many DescribeLoadBalancers calls, but it would be interesting to know why so many VolumeInUse errors are being thrown... those look like the long pole.
The VolumeInUse errors have been going on for a long time, since we upgraded to 3.3, iirc. We've been throttled by AWS about 100,000 - 200,000 times per day since then. I had a ticket open with them for a while, but it didn't seem that they would be able to increase our limit.

My interpretation of the problem has been that the volume detach operations are not retried often enough. DeleteVolume operations are being retried about every 30 seconds (though with 3.4.1.2 we'll be getting an exponential backoff that should help), but as far as I know we only attempt the detach one time. So the volumes are often still attached when the DeleteVolume requests are attempted. This is the bug I have open for the detach issue: https://bugzilla.redhat.com/show_bug.cgi?id=1377486#c22

I'm working around it by detaching all Failed PV volumes every 5 minutes. If my script stops running for any reason, the number of built-up Failed PVs and subsequent DeleteVolume attempts causes enough throttling that it affects the users. So that's definitely a significant part of the issue we're seeing here. Anything we can do to reduce any of these API hits would be really useful.

iirc, there was also an issue with Kubernetes doing a DescribeInstances call for every single instance, every time it tries to attach a volume. Or something like that. Mark Turansky knows the details.
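For reference, a minimal sketch of the kind of workaround script described above. This is an illustration, not the actual script: the `detach_failed_pvs` name is made up, the jsonpath expressions assume the usual PV object layout, and DRY_RUN=1 just prints the AWS commands instead of running them.

```shell
# Hypothetical sketch: list PVs whose phase is Failed, extract each one's
# EBS volume ID, and detach it. Assumes `oc` and `aws` are on PATH.
detach_failed_pvs() {
  oc get pv -o jsonpath='{range .items[?(@.status.phase=="Failed")]}{.metadata.name}{"\n"}{end}' |
  while read -r pv; do
    # volumeID looks like aws://us-east-1d/vol-0123abcd; keep the last field
    vol=$(oc get pv "$pv" -o jsonpath='{.spec.awsElasticBlockStore.volumeID}')
    vol_id=${vol##*/}
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "aws ec2 detach-volume --volume-id $vol_id"
    else
      aws ec2 detach-volume --volume-id "$vol_id"
    fi
  done
}
```

Something like this could run from cron every 5 minutes, which matches the cadence described above.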
Anyway, here are yesterday's logs to get an idea of how things are going:

[sedgar@use-tower1 29]$ logfile=~/preview_cloudtrail_20170129.txt
[sedgar@use-tower1 29]$ for error in $(grep errorCode $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode $logfile |grep -c $error ; done
-------- AccessPointNotFoundException ---------
106365
-------- Client.InvalidVolume.NotFound ---------
1091
-------- Client.RequestLimitExceeded ---------
533546
-------- Client.UnauthorizedOperation ---------
284
-------- Client.VolumeInUse ---------
428147
-------- Server.InternalError ---------
2
-------- Server.Unavailable ---------
255
[sedgar@use-tower1 29]$ for event in $(grep eventName $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $event ---------" ; grep eventName $logfile |grep -c $event ; done
-------- AttachVolume ---------
538643
-------- CreateSnapshot ---------
1459
-------- CreateTags ---------
3011
-------- CreateVolume ---------
112
-------- DeleteNetworkInterface ---------
9
-------- DeleteSnapshot ---------
73
-------- DeleteVolume ---------
116000
-------- DescribeInstances ---------
1213717
-------- DescribeLoadBalancers ---------
106641
-------- DescribeSubnets ---------
293
-------- DescribeVolumes ---------
89428
-------- DetachVolume ---------
83
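As an aside, the counting loops above re-read the whole log once per distinct value (one grep per error code or event name). A single-pass equivalent could look like this, assuming the same one-field-per-line CloudTrail dump format; the `tally` helper name is made up.

```shell
# Single-pass tally: count occurrences of each value of a given JSON key
# (e.g. errorCode or eventName) in the flattened CloudTrail log, instead of
# re-grepping the file once per distinct value.
tally() {
  key=$1
  logfile=$2
  awk -v key="\"$key\"" '$1 == key ":" {
         gsub(/[":,]/, "", $2)   # strip quotes and the trailing comma
         n[$2]++
       }
       END { for (v in n) printf "-------- %s --------- %d\n", v, n[v] }' "$logfile"
}
```

Usage would be e.g. `tally errorCode "$logfile"` or `tally eventName "$logfile"`.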
Given that we are hitting the throttle and will retry, is there something more that you think we can do when creating a load balancer?
I've double-checked that we are doing backoff on load balancer creation, at vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go line 227 (commit eb2ef25bfa17c52b903c5a6bc3f91534d3710a0e). Given all of that, it looks like we aren't the culprit that's using up the leases, and it looks like we will retry creation when the throttle eases.
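For illustration only (the real implementation is the Go backoff logic in servicecontroller.go referenced above), the retry-with-exponential-backoff idea looks roughly like this; the `retry_with_backoff` name and the delay values are made up.

```shell
# Generic retry helper: run a command, and on failure retry up to max_tries
# times, doubling the sleep between attempts (1s, 2s, 4s, ...). This is the
# behavior we want when AWS answers "Throttling: Rate exceeded".
retry_with_backoff() {
  max_tries=5
  delay=1
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example: `retry_with_backoff aws elb describe-load-balancers --load-balancer-names "$lb"`.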