Bug 1381745 - Controllers shut down under heavy AWS API throttling
Summary: Controllers shut down under heavy AWS API throttling
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Paul Morie
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-04 21:36 UTC by Stefanie Forrester
Modified: 2016-10-26 18:10 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-26 18:10:22 UTC
Target Upstream Version:
Embargoed:



Description Stefanie Forrester 2016-10-04 21:36:54 UTC
Description of problem:

The atomic-openshift-master-controllers process is restarting frequently in two environments where AWS throttling is present. I believe the throttling is occurring because of bug https://bugzilla.redhat.com/show_bug.cgi?id=1367229, but ideally the controllers should remain running even when many requests are being throttled.

What I'm seeing in prod is hundreds of these events per day:

Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: W1004 20:51:33.785019  110809 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 6s
Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: F1004 20:51:33.797506  110809 start_master.go:568] Controller graceful shutdown requested
Oct 04 20:51:34 ip-172-31-10-25.ec2.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
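
Ideally the retry handler's delay would be enough on its own, i.e. the controllers would absorb throttling by backing off and retrying rather than exiting. As a rough illustration only (not the actual retry_handler.go code; callWithBackoff and requestLimitExceeded are made-up names), a minimal Go sketch of that delay-and-retry pattern:

package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// requestLimitExceeded stands in for the AWS error code returned when the
// account is being throttled.
var requestLimitExceeded = errors.New("RequestLimitExceeded")

// callWithBackoff retries fn with an increasing delay when AWS reports
// throttling, and only returns the error after maxRetries attempts instead
// of letting it escalate into a process shutdown.
func callWithBackoff(fn func() error, maxRetries int) error {
	delay := 1 * time.Second
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		if !errors.Is(err, requestLimitExceeded) {
			return err // not throttling; surface immediately
		}
		log.Printf("inserting delay before AWS request to avoid RequestLimitExceeded: %s", delay)
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("giving up after %d retries", maxRetries)
}

func main() {
	// Simulated ec2::DeleteVolume call that is throttled twice before succeeding.
	attempts := 0
	err := callWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return requestLimitExceeded
		}
		return nil
	}, 5)
	fmt.Println("result:", err, "attempts:", attempts)
}

The point of the sketch is only that a throttled request should end in a delayed retry (or, at worst, a failed request), not in the fatal "Controller graceful shutdown requested" path seen above.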

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64

How reproducible:

Hundreds of times per day in Prod. It doesn't appear to be happening in INT or STG, but it has also happened more than a dozen times in ded-int-aws, where AWS throttling messages are present.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Derek Carr 2016-10-05 18:56:37 UTC
related https://github.com/kubernetes/kubernetes/issues/33088

Comment 2 Derek Carr 2016-10-07 15:25:34 UTC
This is a duplicate of 1377483

*** This bug has been marked as a duplicate of bug 1377483 ***

Comment 3 Derek Carr 2016-10-07 15:26:52 UTC
Tracking the online issue separately from the non-online issue.

Comment 4 Stefanie Forrester 2016-10-07 15:43:05 UTC
I thought I'd mention a difference between the two bugs that are now marked as duplicates: 

bz 1377483 caused a harder crash: the controllers did not start back up afterwards, and it left core files behind on the filesystem after the crashes. This bug (bz 1381745), by contrast, is a graceful shutdown that recovers, so the controllers are able to keep running.

Also, I wanted to give an update on the impact of this bug for Ops. The issue is less severe than it was a few days ago: I was able to get the controller crashes down to about 11-15 per day by decreasing the number of AWS API calls made for DeleteVolume. With fewer API calls there's less throttling, which means Ops isn't being impacted as badly by this anymore.

This is the bug related to the DeleteVolume requests: https://bugzilla.redhat.com/show_bug.cgi?id=1377486#c22
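
For illustration, a minimal Go sketch of the kind of client-side pacing that cuts down DeleteVolume call volume (deleteVolume, the pending list, and the 500ms budget are assumptions for the example, not the actual OpenShift code):

package main

import (
	"fmt"
	"time"
)

// deleteVolume stands in for the real ec2::DeleteVolume call.
func deleteVolume(id string) error {
	fmt.Println("deleting volume", id)
	return nil
}

func main() {
	pending := []string{"vol-1", "vol-2", "vol-3"}

	// Allow at most one DeleteVolume call per 500ms (assumed budget) so the
	// controllers never burst above the AWS request limit.
	limiter := time.NewTicker(500 * time.Millisecond)
	defer limiter.Stop()

	for _, id := range pending {
		<-limiter.C // wait for the next slot before issuing the call
		if err := deleteVolume(id); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}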

Comment 5 Paul Morie 2016-10-26 14:02:00 UTC
Stefanie-

Is this still something you're running into?

Comment 6 Stefanie Forrester 2016-10-26 15:44:15 UTC
No, we're not hitting the issue anymore because the API DoS problem has been fixed (bz #1367229). Since we're not being throttled, there are no more controller crashes.

Comment 7 Derek Carr 2016-10-26 17:58:37 UTC
Stefanie - does this bug need to remain open if the issue no longer exists?

Comment 8 Derek Carr 2016-10-26 17:59:23 UTC
Stefanie - and if not, can an RFE process be used to handle throttling specific concerns?

Comment 9 Stefanie Forrester 2016-10-26 18:10:22 UTC
Derek, we can close this one. It probably doesn't make sense to add support for DoSing your cloud service provider :)

