Description of problem:
We're seeing lots of AccessPointNotFoundException failures (36,000+ today).

Version-Release number of selected component (if applicable):
atomic-openshift-3.4.0.39-1

Steps to Reproduce:
1. Upgrade cluster to 3.4 from 3.3
2. Have 100 nodes running

Actual results:
Seeing lots of AccessPointNotFoundException in the CloudTrail logs (36k today). Seeing lots of "Warning: CreatingLoadBalancerFailed" in the events log on the master.

Expected results:

Additional info:
{
    "apiVersion": "2012-06-01",
    "awsRegion": "us-east-1",
    "errorCode": "AccessPointNotFoundException",
    "errorMessage": "There is no ACTIVE Load Balancer named 'ad2a1e56acddf11e6a7440e63b9c1c48'",
    "eventID": "0a165ef2-1712-4b82-908a-ed93eececa3b",
    "eventName": "DescribeLoadBalancers",
    "eventSource": "elasticloadbalancing.amazonaws.com",
    "eventTime": "2017-01-20T00:02:27Z",
    "eventType": "AwsApiCall",
    "eventVersion": "1.04",
    "recipientAccountId": "REDACTED",
    "requestID": "bac95152-dea3-11e6-ac34-3560770da0b9",
    "requestParameters": {
        "loadBalancerNames": [
            "ad2a1e56acddf11e6a7440e63b9c1c48"
        ]
    },
    "responseElements": null,
    "sourceIPAddress": "52.206.63.195",
    "userAgent": "aws-sdk-go/1.0.8 (go1.7.4; linux; amd64)",
    "userIdentity": {
        "accessKeyId": "REDACTED",
        "accountId": "REDACTED",
        "arn": "arn:aws:iam::REDACTED:user/cloud_provider",
        "principalId": "REDACTED",
        "type": "IAMUser",
        "userName": "cloud_provider"
    }
}

There are also many attempts to create load balancers in the event log, which would explain the increase in AccessPointNotFoundException errors.
[removed]   1h   1h   1   wildfly                   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/wildfly: Throttling: Rate exceeded
[removed]   1h   1h   1   mongodb                   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/mongodb: Throttling: Rate exceeded
[removed]   1h   1h   1   nodejs-mongo-persistent   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/nodejs-mongo-persistent: Throttling: Rate exceeded
[removed]   1h   1h   1   hello-openshift           Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/hello-openshift: Throttling: Rate exceeded
[removed]   1h   1h   1   nodejs-mongo-persistent   Service   Warning   CreatingLoadBalancerFailed   {service-controller }   Error creating load balancer (will retry): Error getting LB for service [removed]/nodejs-mongo-persistent: Throttling: Rate exceeded

Previously, in 3.3, dakini worked around this issue by changing all the service types from LoadBalancer to ClusterIP. See also: https://bugzilla.redhat.com/show_bug.cgi?id=1415839#c2
You are getting throttled by AWS "Throttling: Rate exceeded". There's an upstream issue at https://github.com/kubernetes/kubernetes/issues/22906 where the request limiter isn't getting used when there are permissions problems. Is that what you are hitting?
Yes, it looks like that is the same issue we're seeing.
The root cause of the repetition is lack of permission to access the AWS APIs. Can you see if that is the problem?
If permissions were causing the majority of the repetition, we would expect to receive a significant number of AccessDenied errors back from the API. But as you can see here, the closest we're getting is a handful of Client.UnauthorizedOperation errors.

[sedgar@use-tower1 20]$ logfile=~/preview_cloudtrail_20170120.txt
[sedgar@use-tower1 20]$ for error in $(grep errorCode $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode $logfile |grep -c $error ; done
-------- AccessPointNotFoundException ---------
30561
-------- Client.IncorrectState ---------
1
-------- Client.InvalidVolume.NotFound ---------
234
-------- Client.RequestLimitExceeded ---------
221550
-------- Client.UnauthorizedOperation ---------
135
-------- Client.VolumeInUse ---------
144042
-------- Server.InternalError ---------
1
-------- Server.Unavailable ---------
166

In addition, when we look at the cause of the AccessPointNotFoundException errors, we can see they are generated by DescribeLoadBalancers events.

Here are our current IAM permissions for the cloud_provider user, in case that helps clear up any permissions concerns:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:AttachVolume",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:DeleteVolume",
                "ec2:DescribeInstances",
                "elasticloadbalancing:DescribeInstanceHealth",
                "elasticloadbalancing:DescribeLoadBalancerAttributes",
                "elasticloadbalancing:DescribeLoadBalancers",
                "ec2:DescribeVolumes",
                "ec2:DetachVolume",
                "kms:CreateGrant"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow",
            "Sid": "Stmt1438195894000"
        }
    ]
}
Sorry, I wasn't clear. I meant that the root cause of the repetition in the upstream issue was permission problems that caused Kubernetes to retry and get throttled. I wonder if we are just generating more requests than the default API throttle allows. Do you know if we can get Amazon to increase it? I'll look into why we are doing so many DescribeLoadBalancers calls, but it would be interesting to know why so many VolumeInUse errors are being thrown... those look like the long pole.
The VolumeInUse errors have been going on for a long time, since we upgraded to 3.3, iirc. We've been throttled by AWS about 100,000 - 200,000 times per day since then. I had a ticket open with them for a while, but it didn't seem that they would be able to increase our limit.

My interpretation of the problem has been that the volume detach operations are not retried often enough. DeleteVolume operations are being retried about every 30 seconds (though with 3.4.1.2 we'll be getting an exponential backoff that should help), but as far as I know we only attempt the detach one time. So the volumes are often still attached when the DeleteVolume requests are attempted. This is the bug I have open for the detach issue: https://bugzilla.redhat.com/show_bug.cgi?id=1377486#c22

I'm working around it by detaching all Failed PV volumes every 5 minutes. If my script stops running for any reason, the number of built-up Failed PVs and subsequent DeleteVolume attempts causes enough throttling that it affects the users. So that's definitely a significant part of the issue we're seeing here. Anything we can do to reduce any of these API hits would be really useful.

iirc, there was also an issue with Kubernetes doing a DescribeInstances call for every single instance, every time it tries to attach a volume. Or something like that. Mark Turansky knows the details.
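For reference, a minimal sketch of the kind of workaround script described above. This is an illustration, not the actual script: the `detach_failed_pvs` name is made up, the jsonpath expressions assume the usual PV object layout, and DRY_RUN=1 just prints the AWS commands instead of running them.

```shell
# Hypothetical sketch: list PVs whose phase is Failed, extract each one's
# EBS volume ID, and detach it. Assumes `oc` and `aws` are on PATH.
detach_failed_pvs() {
  oc get pv -o jsonpath='{range .items[?(@.status.phase=="Failed")]}{.metadata.name}{"\n"}{end}' |
  while read -r pv; do
    # volumeID looks like aws://us-east-1d/vol-0123abcd; keep the last field
    vol=$(oc get pv "$pv" -o jsonpath='{.spec.awsElasticBlockStore.volumeID}')
    vol_id=${vol##*/}
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "aws ec2 detach-volume --volume-id $vol_id"
    else
      aws ec2 detach-volume --volume-id "$vol_id"
    fi
  done
}
```

Something like this could run from cron every 5 minutes, which matches the cadence described above.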
Anyway, here are yesterday's logs to get an idea of how things are going:

[sedgar@use-tower1 29]$ logfile=~/preview_cloudtrail_20170129.txt
[sedgar@use-tower1 29]$ for error in $(grep errorCode $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode $logfile |grep -c $error ; done
-------- AccessPointNotFoundException ---------
106365
-------- Client.InvalidVolume.NotFound ---------
1091
-------- Client.RequestLimitExceeded ---------
533546
-------- Client.UnauthorizedOperation ---------
284
-------- Client.VolumeInUse ---------
428147
-------- Server.InternalError ---------
2
-------- Server.Unavailable ---------
255
[sedgar@use-tower1 29]$ for event in $(grep eventName $logfile |sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $event ---------" ; grep eventName $logfile |grep -c $event ; done
-------- AttachVolume ---------
538643
-------- CreateSnapshot ---------
1459
-------- CreateTags ---------
3011
-------- CreateVolume ---------
112
-------- DeleteNetworkInterface ---------
9
-------- DeleteSnapshot ---------
73
-------- DeleteVolume ---------
116000
-------- DescribeInstances ---------
1213717
-------- DescribeLoadBalancers ---------
106641
-------- DescribeSubnets ---------
293
-------- DescribeVolumes ---------
89428
-------- DetachVolume ---------
83
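As an aside, the counting loops above re-read the whole log once per distinct value (one grep per error code or event name). A single-pass equivalent could look like this, assuming the same one-field-per-line CloudTrail dump format; the `tally` helper name is made up.

```shell
# Single-pass tally: count occurrences of each value of a given JSON key
# (e.g. errorCode or eventName) in the flattened CloudTrail log, instead of
# re-grepping the file once per distinct value.
tally() {
  key=$1
  logfile=$2
  awk -v key="\"$key\"" '$1 == key ":" {
         gsub(/[":,]/, "", $2)   # strip quotes and the trailing comma
         n[$2]++
       }
       END { for (v in n) printf "-------- %s --------- %d\n", v, n[v] }' "$logfile"
}
```

Usage would be e.g. `tally errorCode "$logfile"` or `tally eventName "$logfile"`.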
Given that we are hitting the throttle and will retry, is there something more that you think we can do when creating a load balancer?
I've double-checked that we are doing backoff on load balancer creation, at vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go line 227 (commit eb2ef25bfa17c52b903c5a6bc3f91534d3710a0e). Given all of that, it looks like we aren't the culprit that's using up the leases, and it looks like we will retry creation when the throttle eases.
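For illustration only (the real implementation is the Go backoff logic in servicecontroller.go referenced above), the retry-with-exponential-backoff idea looks roughly like this; the `retry_with_backoff` name and the delay values are made up.

```shell
# Generic retry helper: run a command, and on failure retry up to max_tries
# times, doubling the sleep between attempts (1s, 2s, 4s, ...). This is the
# behavior we want when AWS answers "Throttling: Rate exceeded".
retry_with_backoff() {
  max_tries=5
  delay=1
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example: `retry_with_backoff aws elb describe-load-balancers --load-balancer-names "$lb"`.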