Bug 1690069 - openshift-install times out creating infrastructure resources
Summary: openshift-install times out creating infrastructure resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Joel Diaz
QA Contact: Oleg Nesterov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-18 17:34 UTC by Corey Daley
Modified: 2019-06-04 10:46 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The in-cluster cloud-credential-operator could loop constantly, updating CredentialsRequest object(s). Consequence: Excessive API usage. Fix: The way CredentialsRequest conditions are updated was changed so that timestamps are not needlessly updated on every sync. Result: Proper exponential backoff is possible in the face of error conditions.
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:02 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:46:10 UTC)

Description Corey Daley 2019-03-18 17:34:27 UTC
Description of problem:
 high `iam.GetUser` count for the cred operator

867962  cloud-credential-operator-iam-ro:
    860032  iam.GetUser

Comment 1 W. Trevor King 2019-03-18 17:59:26 UTC
For example, it occasionally syncs twice per second:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5800/artifacts/e2e-aws/pods/openshift-cloud-credential-operator_cloud-credential-operator-769dfbf44-qnlnt_manager.log.gz | gunzip | grep 'syncing cluster operator status' | tail
time="2019-03-18T13:37:40Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:37:40Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:07Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:07Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:08Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:08Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:11Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:11Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:12Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:12Z" level=debug msg="syncing cluster operator status"

Comment 2 Ben Parees 2019-03-18 18:05:42 UTC
This is also the suspected cause of upgrade failures resulting from rate limiting as seen here:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/263/

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"

Comment 3 Joel Diaz 2019-03-18 23:15:34 UTC
(In reply to W. Trevor King from comment #1)
It runs syncOperatorStatus() for every CredentialsRequest object being reconciled: https://github.com/openshift/cloud-credential-operator/blob/master/pkg/controller/credentialsrequest/credentialsrequest_controller.go#L262-L267
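
As a rough illustration (not the operator's actual code), the pattern behind those duplicated log lines is roughly the following: the cluster-operator status sync runs at the tail of every CredentialsRequest reconcile, so several CredentialsRequests reconciling close together emit several "syncing cluster operator status" messages. The types below are simplified stand-ins for the controller-runtime ones.

// Minimal sketch of the per-reconcile status sync; types are simplified stand-ins.
package main

import "log"

// reconcileRequest is a stand-in for controller-runtime's reconcile.Request.
type reconcileRequest struct {
	Namespace string
	Name      string
}

type reconciler struct{}

// Reconcile handles a single CredentialsRequest.
func (r *reconciler) Reconcile(req reconcileRequest) error {
	// ... per-CredentialsRequest work: fetch the object, mint/validate creds ...

	// Runs once per reconcile, regardless of which CredentialsRequest triggered it.
	r.syncOperatorStatus()
	return nil
}

func (r *reconciler) syncOperatorStatus() {
	log.Println("syncing cluster operator status")
	// ... recompute and write the ClusterOperator status ...
}

func main() {
	r := &reconciler{}
	// Two CredentialsRequests reconciled back to back -> two status-sync log lines.
	_ = r.Reconcile(reconcileRequest{Namespace: "openshift-cloud-credential-operator", Name: "openshift-machine-api"})
	_ = r.Reconcile(reconcileRequest{Namespace: "openshift-cloud-credential-operator", Name: "openshift-image-registry"})
}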

I do see a lot of these "data was migrated, resyncing" messages, and from looking at the code it migrates in two stages (first spec, then status),

Comment 4 Joel Diaz 2019-03-18 23:16:39 UTC
^ That last comment was unfinished... I was just trying to cc dgoodwin.

Comment 5 Devan Goodwin 2019-03-19 11:58:25 UTC
Multiple syncs per second wouldn't worry me too much, since it processes multiple credentials in parallel, but if it goes on for any length of time then I'd start getting worried.

I do see in the log Trevor linked above that there's about 5x more migration from v1beta1 than I'd expect; limiting to the openshift-machine-api cred, for example, I see 10 migrations, all of them preceded by:

time="2019-03-18T13:41:07Z" level=warning msg="found legacy v1beta1 AWSProviderSpec, upgrading to v1" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api

Always the Spec.

Comment 6 Devan Goodwin 2019-03-19 12:14:44 UTC
I think what is happening is that I forgot to update the cred requests in the various repos from v1beta1 to v1. Our code migrates them, then every few minutes the CVO comes along and restores the old spec (something external appears to be reverting them). I think if we update those component repos to v1 it should reduce this quite a bit.

This wouldn't explain a hotloop; it may not even actually be a hotloop, just an increase in requests from syncing a bit more often than it should.

We can probably also look into storing the last synced generation and staying away from AWS if we know these creds were successfully validated within the last X hours.
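
A minimal sketch of what that check might look like, assuming hypothetical status fields (LastSyncGeneration, LastSyncTimestamp); these names are illustrative, not necessarily the operator's actual fields:

// Sketch of a "skip AWS if validated recently" check; field names are assumptions.
package main

import (
	"fmt"
	"time"
)

const syncValidityWindow = time.Hour

// credRequestStatus is an illustrative stand-in for the CredentialsRequest status.
type credRequestStatus struct {
	Provisioned        bool
	LastSyncGeneration int64
	LastSyncTimestamp  time.Time
}

// needsCloudSync reports whether the reconcile loop needs to talk to AWS at all,
// or can short-circuit because nothing changed and the creds were validated recently.
func needsCloudSync(generation int64, status credRequestStatus, now time.Time) bool {
	if !status.Provisioned {
		return true // never successfully synced
	}
	if generation != status.LastSyncGeneration {
		return true // spec changed since the last successful sync
	}
	return now.Sub(status.LastSyncTimestamp) > syncValidityWindow
}

func main() {
	now := time.Now()
	status := credRequestStatus{Provisioned: true, LastSyncGeneration: 3, LastSyncTimestamp: now.Add(-10 * time.Minute)}
	fmt.Println(needsCloudSync(3, status, now)) // false: validated recently, skip AWS
	fmt.Println(needsCloudSync(4, status, now)) // true: spec generation changed
}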

Comment 7 Devan Goodwin 2019-03-19 12:30:18 UTC
So steps here:

 1. Update cred requests in openshift/machine-api-operator, openshift/cluster-ingress-operator, and openshift/cluster-image-registry-operator; just flip v1beta1 to v1 in the definitions.
 2. Store the last successfully synced generation and a timestamp, and don't hit AWS if the creds were validated within, say, the last hour.

I was tempted to also implement hotloop detection for error scenarios resyncing too often but we can probably let exponential backoff take over there.

Comment 9 Chance Zibolski 2019-03-19 18:43:56 UTC
Changed the title to reflect the currently assumed issue. To centralize whatever is currently causing this, we can add to this BZ other flakes/issues where infra resources time out during creation that might be related to some components using most of the API rate-limit quota.

Comment 11 W. Trevor King 2019-03-20 19:11:32 UTC
(In reply to Ben Parees from comment #10)
> Mitigation (reduce frequency of flapping):
> https://github.com/openshift/cloud-credential-operator/pull/44

This is still open.

> fixes (avoid flapping entirely):
> https://github.com/openshift/cluster-ingress-operator/pull/177
> https://github.com/openshift/machine-api-operator/pull/250
> https://github.com/openshift/cluster-image-registry-operator/pull/243

These have all landed.

Comment 12 W. Trevor King 2019-03-20 23:08:41 UTC
(In reply to Corey Daley from comment #0)
> Description of problem:
>  high `iam.GetUser` count for the cred operator
> 
> 867962  cloud-credential-operator-iam-ro:
>     860032  iam.GetUser

Following up on this, Joel had suggested looking for runaway users, and when I did, it turned out we had a single cluster in this bucket responsible for the vast majority of these.  Pulling that out into its own bucket:

  852102  49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx: <- the crazy runaway
    852102  iam.GetUser
  ...
  15860 cloud-credential-operator-iam-ro:
    7930  iam.GetUser
    3965  iam.ListAccessKeys
    3965  iam.GetUserPolicy

That's much, much more reasonable :p.  I dunno what the runaway was doing (I removed the user), but this removes the cred operator from my list of API suspects.  Total AWS API calls over the 48 hours ending on the 18th (excluding humans and some boring accounts) are now:

  852102	49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx:
  59354	terraform:
  51828	destroyer:
  39365	aws-sdk-go/1.14.12:
  18029	openshift-machine-api:
  15860	cloud-credential-operator-iam-ro:
  15437	aws-elasticloadbalancing:
  8697	cloud-credential-operator:
  1354	aws-ec2:
  ... small fry ...

It's likely that the single runaway cluster was the cause of a fair bit of our rate limiting.

Comment 16 Joel Diaz 2019-03-21 17:01:18 UTC
Where do we go from here on this BZ? The rogue IAM user making 90+% of the calls has been identified/cleaned up.

Comment 17 Ben Parees 2019-03-21 17:02:12 UTC
I think once all the PRs from https://bugzilla.redhat.com/show_bug.cgi?id=1690069#c10 are merged, you can move it to MODIFIED.

Comment 19 Ben Parees 2019-03-21 19:06:39 UTC
We are still seeing this class of failure (timeout creating routing infra resources) despite https://bugzilla.redhat.com/show_bug.cgi?id=1690069#c12

So while it's good that this is resolved, we're probably going to end up opening a new bug to resume the chase of why this is still happening:

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"

Comment 20 W. Trevor King 2019-03-21 23:30:31 UTC
> We are still seeing this class of failure...

We're doing better though: https://raw.githubusercontent.com/wking/openshift-release/c9aacff244de5e2c9710ac970e3c3c1b7970776a/deck-build-log/deck-build-log.png

This failure accounts for 32/178 of our *-e2e-aws* failures across the whole CI system so far on 2019-03-21Z, and it was 119/448 for 2019-03-20Z.  From the nightly data:

$ curl -s https://s3.amazonaws.com/aws-athena-query-results-460538899914-us-east-1/agents-and-events.csv | ./agents-and-events.py | grep '^[0-9]'
765249	terraform:
721717	destroyer:
588239	49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx:
408388	aws-sdk-go/1.14.12:
256796	cloud-credential-operator:
174820	openshift-machine-api:
161536	aws-elasticloadbalancing:
126485	cloud-credential-operator-iam-ro:
15081	aws-ec2:
...

with the script from [1], so you can see the crazy account starting to fall off the leaderboard, with the installer ('terraform' for creation and 'destroyer' for deletion) as the new leader, and aws-sdk-go/1.14.12 (presumably the Kubernetes control plane) coming in a somewhat distant second.  The fact that Terraform and our destroy code are neck and neck suggests there may not be all that much more fat to cut in the destroy implementation (which has historically been chatty), so I expect further reductions to be hard-fought.  How do we feel about starting to shard out over multiple AWS accounts if we can't talk them into raising our API quotas (if that's something they even do)?

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
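
For anyone reproducing this kind of leaderboard without the linked Python script, here is a rough Go sketch of the same per-consumer tally; it assumes the input has already been reduced to "consumer,event" CSV rows, and the real columns in the nightly Athena dump may differ:

// Rough sketch: tally AWS API calls per consumer from consumer,event CSV rows on stdin.
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sort"
)

func main() {
	rows, err := csv.NewReader(os.Stdin).ReadAll()
	if err != nil {
		panic(err)
	}

	totals := map[string]int{}
	for _, row := range rows {
		if len(row) == 0 {
			continue
		}
		// row[0]: consumer, e.g. "terraform" or "cloud-credential-operator-iam-ro".
		// A second column with the event name (e.g. "iam.GetUser") could feed the
		// per-event breakdowns shown in the indented lines above.
		totals[row[0]]++
	}

	consumers := make([]string, 0, len(totals))
	for c := range totals {
		consumers = append(consumers, c)
	}
	sort.Slice(consumers, func(i, j int) bool { return totals[consumers[i]] > totals[consumers[j]] })

	for _, c := range consumers {
		fmt.Printf("%d\t%s:\n", totals[c], c)
	}
}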

Comment 21 W. Trevor King 2019-03-22 06:28:30 UTC
Follow-up discussion for aws_route.to_nat_gw in particular is in bug 1691516.

Comment 22 W. Trevor King 2019-03-25 19:36:52 UTC
Hmm, I'm seeing a lot of cloud-credential-operator GetUser requests over the weekend, although this time it's the cloud-credential-operator account:

$ head -n3 api-consumers-2019-03-25.txt 
551030	cloud-credential-operator:
	532234	iam.GetUser
	7762	iam.CreateAccessKey
$ grep '^[0-9]' api-consumers-2019-03-25.txt | head -n3
551030	cloud-credential-operator:
220669	aws-sdk-go/1.14.12:
110954	terraform:

Any ideas about why that would be higher now?  I'll see if I can narrow down the potential sources.

Comment 24 Wei Sun 2019-04-10 02:56:34 UTC
Please help check whether this can be verified against the latest build.

Comment 25 W. Trevor King 2019-04-10 03:08:20 UTC
[1], linked from comment 22, is a public script summarizing a public, nightly CSV dump.  You can check to see yesterday's main AWS API consumers.  I haven't checked this week, but the last two issues we had with the cred operator were runaway Hive-launched clusters that I had only partially reaped.  I'm not aware of any excessive consumers in recent, intact clusters.  It's hard to definitively declare something like this resolved, but we can always reopen if we see it again ;).

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events

Comment 26 Devan Goodwin 2019-04-10 10:40:44 UTC
Just yesterday I accidentally surfaced a cred-minter hotloop that could explain what we were seeing in this bug. We had rate limiting that should have kicked in but didn't, due to attempts to record errors in status; I am working on a fix.

It would require another problem beforehand (what that problem was, we don't know), but if a persistent problem was preventing the cred minter from working, then depending on its type, it could trigger a constant hotloop and eat up all your API calls.

I have no idea how we can reproduce this in the wild; I had to do it by throwing a hard error in the code at the right spot. Nonetheless, let's keep this open, and I will update when we have the fix.
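
For reference, a minimal sketch of the shape of the fix described in the Doc Text: only bump condition timestamps when something actually changed, so repeated error reconciles stop forcing status updates and the workqueue's exponential backoff can take over. The Condition type and helper below are illustrative stand-ins, not the operator's actual API.

// Sketch: update a condition without churning timestamps when nothing changed.
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for the operator's condition type.
type Condition struct {
	Type               string
	Status             bool
	Reason             string
	Message            string
	LastProbeTime      time.Time
	LastTransitionTime time.Time
}

// setCondition updates (or appends) a condition, but only bumps the timestamps
// when something actually changed. If nothing changed, the slice is left
// untouched, so no status update (and no fresh requeue) is triggered.
func setCondition(conds []Condition, c Condition, now time.Time) ([]Condition, bool) {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status == c.Status && conds[i].Reason == c.Reason && conds[i].Message == c.Message {
			return conds, false // identical: don't touch timestamps
		}
		c.LastProbeTime = now
		if conds[i].Status != c.Status {
			c.LastTransitionTime = now
		} else {
			c.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = c
		return conds, true
	}
	c.LastProbeTime = now
	c.LastTransitionTime = now
	return append(conds, c), true
}

func main() {
	now := time.Now()
	conds, changed := setCondition(nil, Condition{Type: "CredentialsProvisionFailure", Status: true, Reason: "AWSError", Message: "throttled"}, now)
	fmt.Println(changed) // true: new condition recorded
	_, changed = setCondition(conds, Condition{Type: "CredentialsProvisionFailure", Status: true, Reason: "AWSError", Message: "throttled"}, now.Add(time.Minute))
	fmt.Println(changed) // false: identical error, no timestamp churn, backoff preserved
}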

Comment 27 Devan Goodwin 2019-04-12 11:38:13 UTC
This fix is now merged; not sure how this can be QE'd. :(

Comment 28 Oleg Nesterov 2019-04-23 12:12:01 UTC
In testing, I did an upgrade from 4.0.0-0.nightly-2019-04-22-005059 to 4.0.0-0.nightly-2019-04-22-192613.

Comment 29 Ben Bennett 2019-04-29 17:46:07 UTC
Sorry, I have to re-open this because we are seeing the error a lot in recent runs:
  https://search.svc.ci.openshift.org/?search=aws_route_table_association.private_routing&maxAge=168h&context=2&type=all

Specifically https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/914:
  level=error msg="Error: Error applying plan:" 
  level=error 
  level=error msg="1 error occurred:" 
  level=error msg="\t* module.vpc.aws_route_table_association.private_routing[0]: 1 error occurred:" 
  level=error msg="\t* aws_route_table_association.private_routing.0: timeout while waiting for state to become 'success' (timeout: 5m0s)"

Comment 30 Joel Diaz 2019-04-29 18:04:32 UTC
If this BZ is going to track the above kind of errors, it should be reassigned to someone who would look at these kinds of problems. The cloud API timeouts above happen during the Terraform stage, before cloud-cred-operator is even installed.

Comment 31 Devan Goodwin 2019-04-29 18:07:16 UTC
Yeah, we might need a new bug for that issue, Ben; this one was a specific problem where we were running out of API quota, which doesn't look quite like the 'success' timeout (but I'm not certain).

Comment 32 Scott Dodson 2019-04-29 20:10:05 UTC
Moving back to verified. Per comment #29, we need to do some analysis to find out which component is believed to be the bad actor and file a bug against that component.

Comment 33 Scott Dodson 2019-04-29 20:11:15 UTC
Sorry, I was less than clear about who needs to do what. What I meant is the reporter needs to do the analysis and open a new bug, not the assignee of this bug.

Comment 34 W. Trevor King 2019-04-29 23:07:39 UTC
I pointed Ben here based on comment 9:

> We can add other flakes/issues related to infra resources timing out during creation that might be related to some components using most of the API rate limit quota to this bz to centralize whatever is currently causing this.

I agree that we should reassign to the installer team if we want to go that route.  If we want this issue to just be about the cred operator's GetUser calls, we should go back to the original "high `iam.GetUser` count for the cred operator" summary here.  Thoughts?

Comment 36 errata-xmlrpc 2019-06-04 10:46:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

