Description of problem:
The atomic-openshift-master-controllers process is restarting frequently in two environments where AWS throttling is present. I believe the throttling is caused by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1367229. Ideally, though, the controllers should remain running even when many requests are being throttled. In prod I'm seeing hundreds of these events per day:

Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: W1004 20:51:33.785019  110809 retry_handler.go:55] Inserting delay before AWS request (ec2::DeleteVolume) to avoid RequestLimitExceeded: 6s
Oct 04 20:51:33 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[110809]: F1004 20:51:33.797506  110809 start_master.go:568] Controller graceful shutdown requested
Oct 04 20:51:34 ip-172-31-10-25.ec2.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64

How reproducible:
Hundreds of times per day in prod. It does not appear to be happening in INT or STG. It has also happened more than a dozen times in ded-int-aws, where AWS throttling messages are present.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Related: https://github.com/kubernetes/kubernetes/issues/33088
This is a duplicate of bug 1377483.

*** This bug has been marked as a duplicate of bug 1377483 ***
Tracking the online issue separately from the non-online issue.
I thought I'd mention a difference between the two bugs that are now marked as duplicates: bz 1377483 caused a harder crash (the controllers did not start back up afterwards) and left core files behind on the file system. This bug (bz 1381745), by contrast, triggers a graceful shutdown and recovers, so the controllers are able to keep running.

Also, an update on the impact of this bug for Ops: the issue is less severe than it was a few days ago. I was able to reduce the controller crashes to about 11-15 per day by decreasing the number of AWS DeleteVolume API calls. With fewer API calls there is less throttling, so Ops is no longer being impacted as badly. The DeleteVolume requests are covered in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1377486#c22
Stefanie - is this still something you're running into?
No, we're not hitting the issue anymore because the API DoS problem has been fixed (bz #1367229). Since we're not being throttled, there are no more controller crashes.
Stefanie - does this bug need to remain open if the issue no longer exists?
Stefanie - and if not, can an RFE process be used to handle throttling-specific concerns?
Derek, we can close this one. It probably doesn't make sense to add support for DoSing your cloud service provider :)