Bug 1743313

Summary: image-registry s3 bucket is not deleted during destroy
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Reporter: Justin Pierce <jupierce>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: Johnny Liu <jialiu>
Docs Contact:
CC: wking
Version: 4.2.0   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-19 16:32:39 UTC
Type: Bug

Description Justin Pierce 2019-08-19 15:45:58 UTC
Description of problem:
Our CI AWS account recently hit its S3 bucket limit, primarily because the image-registry S3 bucket is not being deleted during cluster destroy. See the attached bucket list from 2019-08-15.

How reproducible:
Around 50-65 buckets a day are being leaked. DPP has a script running that cleans up these buckets once they are more than 5 hours old, but that should be treated as a backstop for leaked resources (and will soon be replaced with alerting instead of automatic cleanup).
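
For reference, a minimal sketch of that kind of backstop cleanup (not the actual DPP script; it assumes the AWS CLI, jq, GNU date, and the "-image-registry-" bucket-name pattern):

$ cutoff=$(date -u -d '5 hours ago' +%Y-%m-%dT%H:%M:%SZ)
$ aws s3api list-buckets --output json |
    jq -r --arg cutoff "$cutoff" '
      .Buckets[]
      | select(.Name | test("-image-registry-"))   # registry buckets only
      | select(.CreationDate < $cutoff)            # older than the cutoff
      | .Name' |
    while read -r bucket; do
      echo "leaked candidate: ${bucket}"
      # aws s3 rb "s3://${bucket}" --force   # would empty and delete; deliberately commented out
    done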

Additional info:
This may belong with the registry team, but starting here for triage.

Comment 3 W. Trevor King 2019-08-19 16:32:39 UTC
One of the leaked buckets that James cleaned up:

Removing bucket ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso
{
    "TagSet": [
        {
            "Key": "openshift_creationDate",
            "Value": "2019-08-15T23:55:03.991537+00:00"
        },
        {
            "Key": "Name",
            "Value": "ci-op-9w240v43-cd80e-plppw-image-registry"
        },
        {
            "Key": "kubernetes.io/cluster/ci-op-9w240v43-cd80e-plppw",
            "Value": "owned"
        }
    ]
}
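
(For reference, that tag listing is the output format of aws s3api get-bucket-tagging; it was presumably captured before the removal with something like:)

$ aws s3api get-bucket-tagging \
    --bucket ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso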

Searching for the namespace prefix in ci-search-next [1] turned up [2], where the bucket was deleted:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-metering/900/pull-ci-operator-framework-operator-metering-master-metering-e2e-aws/982/artifacts/metering-e2e-aws/installer/.openshift_install.log | grep ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso
time="2019-08-15T23:23:14Z" level=debug msg=Emptied arn="arn:aws:s3:::ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso"
time="2019-08-15T23:23:15Z" level=info msg=Deleted arn="arn:aws:s3:::ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso"

With Athena:

SELECT eventtime,
    eventname,
    useridentity.username,
    useragent,
    requestparameters,
    errorcode,
    errormessage
FROM "default"."cloudtrail_logs_cloud_trail_test_clayton"
WHERE from_iso8601_timestamp(eventtime) > from_iso8601_timestamp('2019-08-15T20:00Z')
  AND from_iso8601_timestamp(eventtime) < from_iso8601_timestamp('2019-08-15T23:59Z')
  AND eventname IN ('CreateBucket')
ORDER BY eventtime;
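
(One way to export those results to a create-bucket.csv from the CLI; the results bucket and the create-bucket.sql file holding the query above are placeholders, not from this run:)

$ qid=$(aws athena start-query-execution \
    --query-string file://create-bucket.sql \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://<athena-results-bucket>/ \
    --output text --query QueryExecutionId)
$ aws athena get-query-execution --query-execution-id "$qid" \
    --query QueryExecution.Status.State --output text   # poll until SUCCEEDED
$ aws s3 cp "s3://<athena-results-bucket>/${qid}.csv" create-bucket.csv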

Feeding those results into create-bucket.csv, you can see the bucket being recreated in that time range (the second CreateBucket lands two seconds after the installer's delete above):

$ grep ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso create-bucket.csv 
"2019-08-15T23:05:59Z","CreateBucket","ci-op-9w240v43-cd80e-openshift-image-registry-5r6lq","[aws-sdk-go/1.21.5 (go1.12.5; linux; amd64) openshift.io cluster-image-registry-operator/4.0.0-322-g683925b-dirty]","{""host"":[""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso.s3.amazonaws.com""],""bucketName"":""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso""}",
"2019-08-15T23:23:17Z","CreateBucket","ci-op-9w240v43-cd80e-openshift-image-registry-5r6lq","[aws-sdk-go/1.21.5 (go1.12.5; linux; amd64) openshift.io cluster-image-registry-operator/4.0.0-322-g683925b-dirty]","{""host"":[""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso.s3.amazonaws.com""],""bucketName"":""ci-op-9w240v43-cd80e-plppw-image-registry-us-east-1-fblgrpryso""}",

So this is a dup of the still-unfixed bug 1740933.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=ci-op-9w240v43&maxAge=336h&context=2&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-metering/900/pull-ci-operator-framework-operator-metering-master-metering-e2e-aws/982

*** This bug has been marked as a duplicate of bug 1740933 ***

Comment 4 W. Trevor King 2019-08-19 16:36:46 UTC
From James' destroy logs:

$ grep 'Removing bucket' deleted-buckets-2019-08-19-0800-PST.txt | wc -l
539
$ grep 'kubernetes.io/cluster' deleted-buckets-2019-08-19-0800-PST.txt | wc -l
487

So while we expect all of these are a result of the instance-destroy vs. registry-S3-creation race discussed in bug 1740933, 487 of them (the ones whose destroy-log entries still show a kubernetes.io/cluster tag) would also have been avoided by the retry-ARN-deletion approach discussed in bug 1740935.