Description of problem:
high `iam.GetUser` count for the cred operator

867962 cloud-credential-operator-iam-ro:
  860032 iam.GetUser
For example, it occasionally syncs twice per second:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5800/artifacts/e2e-aws/pods/openshift-cloud-credential-operator_cloud-credential-operator-769dfbf44-qnlnt_manager.log.gz | gunzip | grep 'syncing cluster operator status' | tail
time="2019-03-18T13:37:40Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:37:40Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:07Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:07Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:08Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:08Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:11Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:11Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:12Z" level=debug msg="syncing cluster operator status"
time="2019-03-18T13:41:12Z" level=debug msg="syncing cluster operator status"
This is also the suspected cause of upgrade failures resulting from rate limiting, as seen here:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/263/

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"
(In reply to W. Trevor King from comment #1)

It will do the syncOperatorStatus() for every CredentialsRequest object being reconciled:
https://github.com/openshift/cloud-credential-operator/blob/master/pkg/controller/credentialsrequest/credentialsrequest_controller.go#L262-L267

I do see a lot of these "data was migrated, resyncing" messages, and from looking at the code it migrates in two stages (first spec, then status),
^ That last comment was unfinished... I was just trying to cc dgoodwin.
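To make the scaling concrete: every reconciled CredentialsRequest triggers one ClusterOperator status sync, so a handful of objects churning at once produces the repeated "syncing cluster operator status" lines above. Below is a minimal sketch of that pattern, with hypothetical helper names rather than the operator's actual API (the real controller is at the link in the previous comment):

package credentialsrequest

// Minimal sketch of the per-CredentialsRequest status sync described above.
// The type and helper names here (reconciler, getCredentialsRequest,
// syncCredentials, syncOperatorStatus) are hypothetical stand-ins, not the
// operator's actual API.

import (
	"context"
	"log"
)

// credentialsRequest stands in for the CredentialsRequest custom resource.
type credentialsRequest struct {
	Namespace string
	Name      string
}

type reconciler struct {
	getCredentialsRequest func(ctx context.Context, namespace, name string) (*credentialsRequest, error)
	syncCredentials       func(ctx context.Context, cr *credentialsRequest) error // may call AWS IAM
	syncOperatorStatus    func(ctx context.Context) error                         // writes the ClusterOperator status
}

func (r *reconciler) reconcile(ctx context.Context, namespace, name string) error {
	cr, err := r.getCredentialsRequest(ctx, namespace, name)
	if err != nil {
		return err
	}

	syncErr := r.syncCredentials(ctx, cr)

	// The status sync runs once per reconciled CredentialsRequest, so a burst
	// of reconciles (for example, repeated spec migrations) produces a matching
	// burst of "syncing cluster operator status" log lines.
	if statusErr := r.syncOperatorStatus(ctx); statusErr != nil {
		log.Printf("failed to sync operator status: %v", statusErr)
	}
	return syncErr
}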
Multiple per second wouldn't worry me too much, as it would process multiple credentials in parallel, but if it goes on for any length of time then I'd start getting worried. I do see in the log Trevor linked above that there is about 5x more migration from v1beta1 than I'd expect. Limiting to the openshift-machine-api cred, for example, I see 10 migrations, all of them preceded by:

time="2019-03-18T13:41:07Z" level=warning msg="found legacy v1beta1 AWSProviderSpec, upgrading to v1" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api

Always the Spec.
I think what is happening is that I forgot to update the cred requests in the various repos from v1beta1 to v1. Our code migrates them, then every few minutes the CVO comes along and restores the old spec (something external appears to be reverting them back). I think if we update those component repos to v1 it should reduce this quite a bit.

This wouldn't explain a hotloop; it may not even actually be a hotloop, just an increase in requests due to syncing a bit more often than it should. We can probably also look into storing the last synced generation and just staying away from AWS if we know these creds were successfully validated within the last X hours.
So steps here:

1. Update the cred requests in openshift/machine-api-operator, openshift/cluster-ingress-operator, and openshift/cluster-image-registry-operator; just flip v1beta1 to v1 in the definitions.
2. Store the last successfully synced generation and a timestamp, and don't hit AWS if we're within, say, 1 hour (see the sketch below).

I was tempted to also implement hotloop detection for error scenarios resyncing too often, but we can probably let exponential backoff take over there.
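For step 2, the check could look roughly like the following. This is a minimal sketch with hypothetical field names (Provisioned, LastSyncGeneration, LastSyncTimestamp) and a 1-hour window; the real fields and policy are whatever the eventual change settles on.

package credminter

import "time"

const recheckInterval = time.Hour

type credentialsRequestStatus struct {
	Provisioned        bool
	LastSyncGeneration int64
	LastSyncTimestamp  time.Time
}

// needsAWSSync reports whether we actually have to talk to AWS, or whether the
// previously validated credentials are recent enough to skip the round trip.
func needsAWSSync(generation int64, status credentialsRequestStatus, now time.Time) bool {
	if !status.Provisioned {
		return true // never successfully synced
	}
	if generation != status.LastSyncGeneration {
		return true // the spec changed since the last successful sync
	}
	if now.Sub(status.LastSyncTimestamp) > recheckInterval {
		return true // periodic re-validation of existing credentials
	}
	return false
}

The generation comparison keeps a spec change from being ignored for an hour, while the timestamp keeps unchanged credentials from being re-validated against AWS on every reconcile.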
Changed the title to reflect the currently assumed issue. To centralize whatever is currently causing this, we can add to this BZ any other flakes/issues where infra resources time out during creation and that might be related to some components using most of the API rate-limit quota.
Mitigation (reduce frequency of flapping):
https://github.com/openshift/cloud-credential-operator/pull/44

Fixes (avoid flapping entirely):
https://github.com/openshift/cluster-ingress-operator/pull/177
https://github.com/openshift/machine-api-operator/pull/250
https://github.com/openshift/cluster-image-registry-operator/pull/243
(In reply to Ben Parees from comment #10)
> Mitigation (reduce frequency of flapping):
> https://github.com/openshift/cloud-credential-operator/pull/44

This is still open.

> fixes (avoid flapping entirely):
> https://github.com/openshift/cluster-ingress-operator/pull/177
> https://github.com/openshift/machine-api-operator/pull/250
> https://github.com/openshift/cluster-image-registry-operator/pull/243

These have all landed.
(In reply to Corey Daley from comment #0)
> Description of problem:
> high `iam.GetUser` count for the cred operator
>
> 867962 cloud-credential-operator-iam-ro:
>   860032 iam.GetUser

Following up on this, Joel had suggested looking for runaway users, and when I did, it turned out we had a single cluster in this bucket responsible for the vast majority of these. Pulling that out into its own bucket:

852102 49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx: <- the crazy runaway
  852102 iam.GetUser
...
15860 cloud-credential-operator-iam-ro:
  7930 iam.GetUser
  3965 iam.ListAccessKeys
  3965 iam.GetUserPolicy

That's much, much more reasonable :p. I dunno what the runaway was doing (I removed the user), but this removes the cred operator from my list of API suspects. Total AWS API calls over the 48 hours ending on the 18th (excluding humans and some boring accounts) are now:

852102 49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx:
59354 terraform:
51828 destroyer:
39365 aws-sdk-go/1.14.12:
18029 openshift-machine-api:
15860 cloud-credential-operator-iam-ro:
15437 aws-elasticloadbalancing:
8697 cloud-credential-operator:
1354 aws-ec2:
... small fry ...

It's likely that the single runaway cluster was the cause of a fair bit of our rate limiting.
Where do we go from here on this BZ? The rogue IAM user making 90+% of the calls has been identified/cleaned up.
I think once all the PRs from https://bugzilla.redhat.com/show_bug.cgi?id=1690069#c10 are merged, you can move it to MODIFIED.
> https://github.com/openshift/cloud-credential-operator/pull/44
> https://github.com/openshift/cluster-ingress-operator/pull/177
> https://github.com/openshift/machine-api-operator/pull/250
> https://github.com/openshift/cluster-image-registry-operator/pull/243

^ have all landed, plus another one we missed previously:
https://github.com/openshift/cloud-credential-operator/pull/45
We are still seeing this class of failure (timeout creating routing infra resources) despite https://bugzilla.redhat.com/show_bug.cgi?id=1690069#c12

So while this is good to be resolved, we're probably going to end up opening a new bug to resume the chase of why this is still happening:

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"
> We are still seeing this class of failure...

We're doing better though: https://raw.githubusercontent.com/wking/openshift-release/c9aacff244de5e2c9710ac970e3c3c1b7970776a/deck-build-log/deck-build-log.png

This failure is 32/178 for our *-e2e-aws* failures across the whole CI system so far in 2019-03-21Z, and it was 119/448 for 2019-03-20Z. From the nightly data:

$ curl -s https://s3.amazonaws.com/aws-athena-query-results-460538899914-us-east-1/agents-and-events.csv | ./agents-and-events.py | grep '^[0-9]'
765249 terraform:
721717 destroyer:
588239 49671d7b-6bd0-4728-cloud-credential-operator-iam-ro-82djx:
408388 aws-sdk-go/1.14.12:
256796 cloud-credential-operator:
174820 openshift-machine-api:
161536 aws-elasticloadbalancing:
126485 cloud-credential-operator-iam-ro:
15081 aws-ec2:
...

with the script from [1], so you can see the crazy account starting to fall off the leaderboard, with the installer ('terraform' for creation and 'destroyer' for deletion) as the new leader, and aws-sdk-go/1.14.12 (presumably the Kubernetes control plane) coming in a somewhat distant second. The fact that Terraform and our destroy code are neck and neck suggests there may not be all that much more fat to cut in the destroy implementation (which has historically been chatty), so I expect further reduction to be hard-fought. How do we feel about starting to shard out over multiple AWS accounts if we can't talk them into raising our API quotas (if that's a thing they do)?

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
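For anyone who wants the same leaderboard without the Python script, here is a rough Go equivalent. It assumes, purely for illustration, a CSV whose first column is the user agent/user name and whose second column is the event name; the real Athena export's layout is whatever agents-and-events.py actually parses.

package main

// Rough equivalent of the per-agent tally from [1], under the assumed CSV
// layout described above; not the schema of the real export.

import (
	"encoding/csv"
	"fmt"
	"os"
	"sort"
)

func main() {
	records, err := csv.NewReader(os.Stdin).ReadAll()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	totals := map[string]int{}
	for i, record := range records {
		if i == 0 || len(record) < 2 {
			continue // skip the header row and malformed lines
		}
		totals[record[0]]++
	}

	// Sort agents by descending call count to get the leaderboard view above.
	agents := make([]string, 0, len(totals))
	for agent := range totals {
		agents = append(agents, agent)
	}
	sort.Slice(agents, func(i, j int) bool { return totals[agents[i]] > totals[agents[j]] })

	for _, agent := range agents {
		fmt.Printf("%d %s:\n", totals[agent], agent)
	}
}

Usage would be the same pipe as above, e.g. curl -s .../agents-and-events.csv | go run tally.go.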
Follow up discussion for aws_route.to_nat_gw in particular in bug 1691516.
Hmm, I'm seeing a lot of cloud-credential-operator GetUser requests over the weekend, although this time it's the cloud-credential-operator account:

$ head -n3 api-consumers-2019-03-25.txt
551030 cloud-credential-operator:
  532234 iam.GetUser
  7762 iam.CreateAccessKey
$ grep '^[0-9]' api-consumers-2019-03-25.txt | head -n3
551030 cloud-credential-operator:
220669 aws-sdk-go/1.14.12:
110954 terraform:

Any ideas about why that would be higher now? I'll see if I can narrow down the potential sources.
Please check whether this can be verified against the latest build.
[1], linked from comment 22, is a public script summarizing a public, nightly CSV dump. You can check it to see yesterday's main AWS API consumers. I haven't checked this week, but the last two issues we had with the cred operator were runaway Hive-launched clusters that I had only partially reaped. I'm not aware of any excessive consumers in recent, intact clusters. It's hard to definitively declare something like this resolved, but we can always reopen if we see it again ;).

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
Just yesterday I accidentally surfaced a cred minter hotloop that could explain what we were seeing in this bug. We had rate limiting that should have kicked in but didn't, due to attempts to record errors in status; I am working on a fix. It would require another problem beforehand (what that problem was, we don't know), but if a persistent problem was preventing the cred minter from working, then depending on its type it could trigger a constant hotloop and eat up all your API calls. I have no idea how we can reproduce this in the wild; I had to do it by throwing a hard error in the code in the right spot. Nonetheless, let's keep this open and I will update when we have the fix.
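For context on why recording errors in status can defeat the rate limiting: the controller watches its own CredentialsRequests, so each status write can generate a watch event that requeues the object immediately, and the workqueue's exponential backoff never gets a chance to accumulate. One general mitigation, sketched below with hypothetical types (not necessarily the shape of the actual fix), is to only write status when it actually changes:

package credminter

import (
	"context"
	"reflect"
)

type conditions struct {
	CredentialsProvisionFailure bool
	Message                     string
}

type statusWriter interface {
	UpdateStatus(ctx context.Context, namespace, name string, c conditions) error
}

// recordFailure updates the failure condition only when it differs from what
// is already stored, so a persistent error does not generate an endless stream
// of status writes, each of which would trigger another reconcile.
func recordFailure(ctx context.Context, w statusWriter, namespace, name string, current, desired conditions) error {
	if reflect.DeepEqual(current, desired) {
		return nil // nothing changed; avoid a write that would re-trigger us
	}
	return w.UpdateStatus(ctx, namespace, name, desired)
}

Client-side rate limiting of the AWS calls themselves would be another belt-and-suspenders option.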
This fix is now merged; not sure how this can be QE'd. :(
In tests, I did an upgrade from 4.0.0-0.nightly-2019-04-22-005059 to 4.0.0-0.nightly-2019-04-22-192613.
Sorry, I have to re-open this because we are seeing the error a lot in recent runs:
https://search.svc.ci.openshift.org/?search=aws_route_table_association.private_routing&maxAge=168h&context=2&type=all

Specifically https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/914:

level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.private_routing[0]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.private_routing.0: timeout while waiting for state to become 'success' (timeout: 5m0s)"
If this BZ is going to track the above kind of errors, it should be reassigned to someone who will look at these kinds of problems. The cloud API timeouts above happen during the Terraform stage, before the cloud-credential-operator is even installed.
Yeah, we might need a new bug for that issue, Ben; this one was a specific problem where we were running out of API quota, which doesn't look quite like the 'success' timeout (but I'm not certain).
Moving back to VERIFIED. Per comment #29, we need to do some analysis to find out which component is believed to be the bad actor and file a bug against that component.
Sorry, I was less than clear about who needs to do what. What I meant is the reporter needs to do the analysis and open a new bug, not the assignee of this bug.
I pointed Ben here based on comment 9:

> We can add other flakes/issues related to infra resources timing out during creation that might be related to some components using most of the API rate limit quota to this bz to centralize whatever is currently causing this.

I agree that we should reassign to the installer team if we want to go that route. If we want this issue to just be about the cred operator's GetUser calls, we should go back to the original "high `iam.GetUser` count for the cred operator" summary here. Thoughts?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758