Bug 1381745

Summary: Controllers shut down under heavy AWS API throttling
Product: OpenShift Container Platform Reporter: Stefanie Forrester <dakini>
Component: Node Assignee: Paul Morie <pmorie>
Status: CLOSED NOTABUG QA Contact: DeShuai Ma <dma>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.3.0 CC: aos-bugs, bingli, decarr, jokerman, mifiedle, mmccomas, wmeng
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-26 18:10:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Stefanie Forrester 2016-10-04 21:36:54 UTC
Description of problem:

The atomic-openshift-master-controllers process is restarting frequently in two environments where AWS throttling is present. I believe the throttling is caused by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1367229. Ideally, though, the controllers should remain running even when many requests are being throttled.

What I'm seeing in prod is hundreds of these events per day:

Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: W1004 20:51:33.785019  110809 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 6s
Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: F1004 20:51:33.797506  110809 start_master.go:568] Controller graceful shutdown requested
Oct 04 20:51:34 ip-172-31-10-25.ec2.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
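
For anyone wanting to quantify how often this happens, here is a minimal sketch that counts the "Controller graceful shutdown requested" messages per day. It assumes the journal output has been saved as text in the same format as the lines above (month, zero-padded day, time, host). The function name count_shutdowns_per_day is hypothetical and not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Assumed line prefix, as in the excerpt above: "Oct 04 20:51:33 <host> ...".
# Captures the "Oct 04" part so events can be grouped by day.
_DAY = re.compile(r"^([A-Z][a-z]{2} \d{2}) \d{2}:\d{2}:\d{2} ")
_SHUTDOWN = "Controller graceful shutdown requested"


def count_shutdowns_per_day(lines):
    """Return a Counter mapping 'Mon DD' to the number of shutdown messages."""
    counts = Counter()
    for line in lines:
        if _SHUTDOWN not in line:
            continue
        match = _DAY.match(line)
        if match:  # skip lines that do not carry the assumed timestamp prefix
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a saved log file (for example, `count_shutdowns_per_day(open(path))` with a path of your choosing) gives one count per day, which makes it easier to compare environments or to check whether a mitigation helped.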

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64

How reproducible:

Hundreds of times per day in Prod. It doesn't appear to be happening in INT and STG. It also happened more than a dozen times in ded-int-aws, where AWS throttling messages are present.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Derek Carr 2016-10-05 18:56:37 UTC
Related: https://github.com/kubernetes/kubernetes/issues/33088

Comment 2 Derek Carr 2016-10-07 15:25:34 UTC
This is a duplicate of 1377483

*** This bug has been marked as a duplicate of bug 1377483 ***

Comment 3 Derek Carr 2016-10-07 15:26:52 UTC
Tracking the online issue separately from the non-online issue.

Comment 4 Stefanie Forrester 2016-10-07 15:43:05 UTC
I thought I'd mention a difference between the two bugs that are now marked as duplicates: 

bz 1377483 caused a harder crash (the controllers did not start back up afterwards) and left core files behind on the file system. This bug (bz 1381745), by contrast, does a graceful shutdown and recovers, so the controllers are able to keep running.

Also, I wanted to give an update on the impact of this bug for Ops. This issue is less severe than it was a few days ago. I was able to get the controller crashes down to about 11-15 per day by decreasing the number of AWS API calls made for DeleteVolume. With fewer API calls there's less throttling, so Ops isn't being impacted as badly anymore.
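
As an illustration of the "fewer calls, less throttling" effect, here is a generic sketch of retrying a throttled request with exponential backoff instead of giving up. This is not the code in retry_handler.go and not how the controllers behave. ThrottledError, call_with_backoff, and all parameter values are hypothetical names and defaults chosen for the example.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as RequestLimitExceeded."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request(); on ThrottledError, wait and retry with a doubling delay.

    Re-raises the last ThrottledError once max_attempts calls have failed.
    The sleep function is injectable so the logic can be exercised without waiting.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, capped so it never exceeds max_delay.
            delay = min(delay * 2, max_delay)
```

Spacing retries out this way lowers the request rate while the provider is throttling, which is the same lever as reducing the number of DeleteVolume calls by hand.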

This is the bug related to the DeleteVolume requests: https://bugzilla.redhat.com/show_bug.cgi?id=1377486#c22

Comment 5 Paul Morie 2016-10-26 14:02:00 UTC
Stefanie-

Is this still something you're running into?

Comment 6 Stefanie Forrester 2016-10-26 15:44:15 UTC
No, we're not hitting the issue anymore because the API DoS problem has been fixed (bz #1367229). Since we're not being throttled, there are no more controller crashes.

Comment 7 Derek Carr 2016-10-26 17:58:37 UTC
Stefanie - does this bug need to remain open if the issue no longer exists?

Comment 8 Derek Carr 2016-10-26 17:59:23 UTC
Stefanie - and if not, can an RFE process be used to handle throttling-specific concerns?

Comment 9 Stefanie Forrester 2016-10-26 18:10:22 UTC
Derek, we can close this one. It probably doesn't make sense to add support for DoSing your cloud service provider :)