Description of problem:

Our CI AWS account recently hit its S3 bucket limit, primarily because the S3 bucket for the image registry was not being destroyed. See the attached bucket list for 08-15-2019.

How reproducible:

Around 50-65 buckets a day are being leaked. DPP has a script running to clean up these buckets once they are more than 5h old, but this should be considered a backup for leaked resources (and will soon be replaced with alerting instead of automatic cleanup).

Additional info:

This may belong with the registry team, but starting here for triage.
One of the leaked buckets that James cleaned up:

  Removing bucket ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso
  {
    "TagSet": [
      {
        "Key": "openshift_creationDate",
        "Value": "2019-08-15T23:55:03.991537+00:00"
      },
      {
        "Key": "Name",
        "Value": "ci-op-9w240v43-cd80e-plppw-image-registry"
      },
      {
        "Key": "kubernetes.io/cluster/ci-op-9w240v43-cd80e-plppw",
        "Value": "owned"
      }
    ]
  }

Searching for the namespace prefix in ci-search-next [1] turned up [2], where the installer log shows the bucket being deleted:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-metering/900/pull-ci-operator-framework-operator-metering-master-metering-e2e-aws/982/artifacts/metering-e2e-aws/installer/.openshift_install.log | grep ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso
  time="2019-08-15T23:23:14Z" level=debug msg=Emptied arn="arn:aws:s3:::ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso"
  time="2019-08-15T23:23:15Z" level=info msg=Deleted arn="arn:aws:s3:::ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso"

With Athena:

  SELECT eventtime, eventname, useridentity.username, useragent, requestparameters errorcode, errormessage
  FROM "default"."cloudtrail_logs_cloud_trail_test_clayton"
  WHERE from_iso8601_timestamp(eventtime) > from_iso8601_timestamp('2019-08-15T20:00Z')
    AND from_iso8601_timestamp(eventtime) < from_iso8601_timestamp('2019-08-15T23:59Z')
    AND eventname IN ('CreateBucket')
  ORDER BY eventtime;

Saving the results as create-bucket.csv, you can see the registry operator creating the bucket and then recreating it two seconds after the installer deleted it:

  $ grep ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso create-bucket.csv
  "2019-08-15T23:05:59Z","CreateBucket","ci-op-9w240v43-cd80e-openshift-image-registry-5r6lq","[aws-sdk-go/1.21.5 (go1.12.5; linux; amd64) openshift.io cluster-image-registry-operator/4.0.0-322-g683925b-dirty]","{""host"":[""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso.s3.amazonaws.com""],""bucketName"":""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso""}",
  "2019-08-15T23:23:17Z","CreateBucket","ci-op-9w240v43-cd80e-openshift-image-registry-5r6lq","[aws-sdk-go/1.21.5 (go1.12.5; linux; amd64) openshift.io cluster-image-registry-operator/4.0.0-322-g683925b-dirty]","{""host"":[""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso.s3.amazonaws.com""],""bucketName"":""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso""}",

So this is a dup of the still-unfixed bug 1740933.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=ci-op-9w240v43&maxAge=336h&context=2&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-metering/900/pull-ci-operator-framework-operator-metering-master-metering-e2e-aws/982

*** This bug has been marked as a duplicate of bug 1740933 ***
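
For anyone wanting to check other leaked buckets the same way, the comparison above (installer "Deleted" timestamp vs. CloudTrail CreateBucket timestamp) could be scripted. The following is only a rough sketch: the function names are made up for illustration, and it assumes the installer log lines and the Athena CSV export look exactly like the excerpts quoted in this bug (second-resolution UTC timestamps, request parameters as JSON in the fifth CSV column).

```python
import csv
import io
import json
import re
from datetime import datetime, timezone

# Matches installer log lines such as:
#   time="2019-08-15T23:23:15Z" level=info msg=Deleted arn="arn:aws:s3:::<bucket>"
DELETED_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=info msg=Deleted '
    r'arn="arn:aws:s3:::(?P<bucket>[^"]+)"'
)


def parse_time(value):
    """Parse timestamps like 2019-08-15T23:23:15Z into aware datetimes."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def deletions(install_log_text):
    """Map bucket name -> time the installer logged it as Deleted."""
    found = {}
    for match in DELETED_RE.finditer(install_log_text):
        found[match.group("bucket")] = parse_time(match.group("time"))
    return found


def creations(csv_text):
    """Yield (bucket name, creation time) for each CreateBucket row."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip the header row and anything that is not a CreateBucket event.
        if len(row) < 5 or row[1] != "CreateBucket":
            continue
        params = json.loads(row[4])
        yield params["bucketName"], parse_time(row[0])


def recreated_after_delete(install_log_text, csv_text):
    """Return (bucket, deleted_at, created_at) for post-delete creations."""
    deleted = deletions(install_log_text)
    return sorted(
        (name, deleted[name], created)
        for name, created in creations(csv_text)
        if name in deleted and created > deleted[name]
    )
```

Feeding it the contents of .openshift_install.log and create-bucket.csv should list each bucket that was created again after the installer reported deleting it, along with both timestamps.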
From James's destroy logs:

  $ grep 'Removing bucket' deleted-buckets-2019-08-19-0800-PST.txt | wc -l
  539
  $ grep 'kubernetes.io/cluster' deleted-buckets-2019-08-19-0800-PST.txt | wc -l
  487

So while we expect all of these to be a result of the instance-destroy vs. registry S3 creation race discussed in bug 1740933, 487 of them would also have been avoided by the retry-ARN-deletion discussed in bug 1740935.
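
The two grep counts can be combined into a single summary. This is a minimal sketch with hypothetical function names; like the grep pipeline, it assumes each removed bucket contributes exactly one "Removing bucket" line and each tagged bucket exactly one line containing kubernetes.io/cluster.

```python
def count_lines_containing(text, needle):
    """Equivalent of `grep needle | wc -l` for fixed-string needles."""
    return sum(1 for line in text.splitlines() if needle in line)


def summarize(text):
    """Count removed buckets and how many carried the cluster tag."""
    removed = count_lines_containing(text, "Removing bucket")
    tagged = count_lines_containing(text, "kubernetes.io/cluster")
    return {"removed": removed, "tagged": tagged, "untagged": removed - tagged}
```

With the counts above (539 removed, 487 tagged), the difference is 52 buckets that presumably had no kubernetes.io/cluster tag when they were cleaned up, so a tag-based retry would not have found them.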