Bug 1683648

Summary: Installation in AWS environment fails with 'level=fatal msg="failed to initialize the cluster: Cluster operator openshift-cloud-credential-operator is reporting a failure: 4 of 4 credentials requests are failing to sync."'
Product: OpenShift Container Platform Reporter: David Caldwell <dcaldwel>
Component: Cloud Compute Assignee: Joel Diaz <jdiaz>
Status: CLOSED DUPLICATE QA Contact: Jianwei Hou <jhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0 CC: aos-bugs, aos-cloud, dcaldwel, dgoodwin, jdiaz, jokerman, mmccomas, ocasalsa, rhowe, sperezto, sponnaga, suchaudh, wking
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-15 23:48:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1664187    

Comment 6 Ryan Howe 2019-03-04 14:27:27 UTC
I believe the root of the issue is that the AWS account lacks the required permissions.

Error ends up coming from here:

   https://github.com/openshift/cloud-credential-operator/blob/release-4.0/pkg/controller/credentialsrequest/status.go#L108-L143

Because the account lacks IAM authorization, the cloud-credential-operator fails to create the credentials.

The bug here is more with the operator failing because of the permission issues. In this case it should just pass the original creds through (with an install-time warning about this less-secure approach) instead of failing later.

Comment 7 Joel Diaz 2019-03-04 16:22:51 UTC
What are the annotations on the secret kube-system/aws-creds?

More interestingly, what AWS IAM permissions was the cluster installed with?

Comment 9 Joel Diaz 2019-03-05 14:50:12 UTC
I suppose it depends on exactly how the IAM permissions are organized/granted. You should be able to get your IAM permissions with something like this:

aws iam get-user | jq -r .User.UserName (using the IAM creds used for cluster installation)

aws iam list-groups-for-user --user-name <USER_NAME>

For each group above:
aws iam list-group-policies --group-name <GROUP_NAME>
aws iam list-attached-group-policies --group-name <GROUP_NAME>

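For each inline policy returned by 'list-group-policies':
aws iam get-group-policy --group-name <GROUP_NAME> --policy-name <POLICY_NAME>

Each of those 'get-group-policy' commands should dump the actual policy document. Managed policies returned by 'list-attached-group-policies' are not covered by 'get-group-policy'; those have to be read with 'aws iam get-policy' and 'aws iam get-policy-version' using the policy ARN.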
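
The manual steps above can be strung together in a short shell function. This is only a sketch: the function name dump_group_policies is made up for illustration, it assumes the AWS CLI is configured with the IAM creds used for cluster installation, and it uses the CLI's --query / --output text options in place of jq. It only walks inline group policies, and it will stop with the same AccessDenied errors if the user lacks a permission such as GetGroupPolicy.

```shell
# Hypothetical helper: print every inline group policy for the current IAM user.
dump_group_policies() {
  # get-user without --user-name describes the caller's own IAM user.
  user=$(aws iam get-user --query 'User.UserName' --output text) || return 1

  group_names=$(aws iam list-groups-for-user --user-name "$user" \
    --query 'Groups[].GroupName' --output text) || return 1

  for group in $group_names; do
    policy_names=$(aws iam list-group-policies --group-name "$group" \
      --query 'PolicyNames[]' --output text) || return 1
    for policy in $policy_names; do
      echo "== group: $group, policy: $policy"
      aws iam get-group-policy --group-name "$group" --policy-name "$policy"
    done
  done
}
```

Calling dump_group_policies with no arguments prints one header line per group/policy pair followed by the output of get-group-policy for that pair.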

Comment 13 Joel Diaz 2019-03-06 16:59:20 UTC
The installer can and does change permissions from release to release.

I can't really see the IAM permissions, since you don't have the GetGroupPolicy permission.

Can you grab all the CredentialsRequest objects with 'oc get credentialsrequest --all-namespaces -o yaml'?

And the logs from the cloud-credential-operator:
oc get pods -n openshift-cloud-credential-operator

Then get the logs for that pod:
oc logs -n openshift-cloud-credential-operator POD_NAME_FROM_PREV_COMMAND
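
The two log-gathering steps can also be combined so the pod name does not have to be copied by hand. A minimal sketch, assuming 'oc' is logged in to the affected cluster; the function name cco_logs is hypothetical. If the pod runs more than one container, 'oc logs' will also need a -c CONTAINER argument.

```shell
# Hypothetical helper: dump logs from every pod in the operator's namespace.
cco_logs() {
  ns=openshift-cloud-credential-operator
  # '-o name' prints one 'pod/<name>' entry per line.
  for pod in $(oc get pods -n "$ns" -o name); do
    echo "== $pod"
    oc logs -n "$ns" "$pod"
  done
}
```

Running 'cco_logs > cco.log' leaves a single file that can be attached to the bug.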

Comment 15 David Caldwell 2019-03-11 09:19:10 UTC
If this issue is fixed in the 0.14 installer, maybe this BZ should be closed?

Comment 16 Joel Diaz 2019-03-11 13:51:10 UTC
If I had to guess, this might be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1685729

Is the original issue reproducible with 0.14?

Comment 18 W. Trevor King 2019-03-13 23:01:02 UTC
We still see something like this in CI:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/127/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Failing").message'
Cluster operator openshift-cloud-credential-operator is reporting a failure: 2 of 4 credentials requests are failing to sync.
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/127/artifacts/e2e-aws-upgrade/pods/openshift-cloud-credential-operator_cloud-credential-operator-5b5b6bd77-kmbs7_manager.log.gz | gunzip | grep -2 level=error | tail
--
time="2019-03-13T22:44:54Z" level=debug msg="creating read AWS client" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2019-03-13T22:45:56Z" level=info msg="validating cloud cred secret" controller=secretannotator
time="2019-03-13T22:47:34Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: error querying current username: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp: lookup iam.amazonaws.com on 172.30.0.10:53: read udp 10.128.0.65:55377->172.30.0.10:53: i/o timeout" controller=secretannotator
time="2019-03-13T22:48:34Z" level=error msg="error getting user: {\n\n}" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp: i/o timeout"
time="2019-03-13T22:48:34Z" level=error msg="error determining whether a credentials update is needed" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="unable to read info for username {\n\n}: RequestError: send request failed\ncaused by: Post https://iam.amazonaws.com/: dial tcp: i/o timeout"
time="2019-03-13T22:48:34Z" level=error msg="error syncing credentials: <nil>" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2019-03-13T22:48:34Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2019-03-13T22:48:34Z" level=debug msg="updating credentials request status" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2019-03-13T22:48:34Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials

Dunno what's up with that; maybe networking died?
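
To see which credentials requests the error lines belong to, the same log stream can be reduced to the distinct cr= values. This is a rough sketch that assumes the logfmt-style lines shown above (space-separated key=value fields, with cr= unquoted); the function name failing_crs is made up for illustration.

```shell
# Hypothetical helper: list the distinct cr= values seen on level=error lines.
failing_crs() {
  grep 'level=error' | grep -o ' cr=[^ ]*' | sed 's/^ cr=//' | sort -u
}
```

It reads the log on stdin, so it can replace the 'grep -2 level=error | tail' stage at the end of the curl pipeline above, after gunzip.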

Comment 19 W. Trevor King 2019-03-15 23:48:29 UTC
I'm going to close this as a dup of bug 1687881.  If you can reproduce this issue since that bug was fixed, comment here with logs and we can re-open.

*** This bug has been marked as a duplicate of bug 1687881 ***