Created attachment 1848192 [details]
must-gather.tar.gz

Description:
Starting from OCP 4.8, we support the STS (Security Token Service) format for cloud credentials.

In a non-STS cluster, the secret for components looks like:

$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data"
{
  "aws_access_key_id": "QUtJQVVNUUFXXXXXXXXX",
  "aws_secret_access_key": "Um42ekNsZmpORXXXXXXX",
  "credentials": "W2RlZmF1bHRdCmF3c19hY2Nlc3XXXXX="
}

And the .data.credentials field looks like:

$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data.credentials" | base64 -d
[default]
aws_access_key_id = QUtJQVVNUUFXXXXXXXXX
aws_secret_access_key = Um42ekNsZmpORXXXXXXX

##Note## The .data.credentials field doesn't always exist in non-STS clusters.

In an STS cluster, the secret .data field looks like:

$ oc get secret cloud-credentials -o json | jq -r ".data"
{
  "credentials": "W2RlZmF1bHRdCnJvbGVfYXJuXXXXXXXX"
}

And the .data.credentials looks like:

$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data.credentials" | base64 -d
[default]
role_arn = arn:aws:iam::301721915996:role/lwansts1229-7128-openshift-image-registry-installer-cloud-creden
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

Currently the cloud-network-config-controller does not recognize the .data.credentials field:
https://github.com/openshift/cloud-network-config-controller/blob/61b8877b693c6a59d6cffcad3d042c1bf89d8023/pkg/cloudprovider/aws.go#L31-L43

Related code in the image-registry operator:
https://github.com/openshift/cluster-image-registry-operator/blob/4fa9b89852c9c317847e32cfc47c7ab7ede8dffe/pkg/storage/s3/s3.go#L843-L863

How reproducible:
Always

Steps to Reproduce:
1. Launch an STS cluster using a 4.10 payload with the cloud-network-config operator already included (refer to the doc: https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-sts.html).

Actual result:
The installation fails and the cloud-network-config-controller pod crashes with the error below:

###
$ oc logs cloud-network-config-controller-c48c4b84c-s5p25
W1229 08:33:47.740717       1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1229 08:33:47.741578       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I1229 08:33:47.761840       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
F1229 08:33:47.762139       1 main.go:98] Error building cloud provider client, err: %vunable to read secret data, err: open /etc/secret/cloudprovider/aws_access_key_id: no such file or directory
goroutine 25 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x3939700, 0x3, 0x0, 0xc00012a070, 0x1, {0x2d0cebb, 0x20}, 0xc00032d400, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
###

Expected result:
The installation should succeed.
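For illustration, below is a minimal Go sketch of how a client could support both secret layouts, following the same idea as the image-registry operator code linked above (pointing the AWS SDK's shared-config loader at the mounted "credentials" file and falling back to static keys). The mount path comes from the error log above; the helper name newAWSSession and the exact fallback logic are assumptions, not the actual fix:

###
package cloudprovider

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
)

// Mount path taken from the error log above.
const secretMountPath = "/etc/secret/cloudprovider"

// newAWSSession builds a session from whichever credential layout the secret
// provides. Hypothetical helper for illustration only.
func newAWSSession(region string) (*session.Session, error) {
	// STS-format secrets contain a "credentials" file holding a shared
	// credentials profile (role_arn + web_identity_token_file).
	sharedCredsFile := filepath.Join(secretMountPath, "credentials")
	if _, err := os.Stat(sharedCredsFile); err == nil {
		return session.NewSessionWithOptions(session.Options{
			Config:            aws.Config{Region: aws.String(region)},
			SharedConfigState: session.SharedConfigEnable,
			SharedConfigFiles: []string{sharedCredsFile},
		})
	}

	// Fall back to the non-STS layout with plain static keys.
	id, err := os.ReadFile(filepath.Join(secretMountPath, "aws_access_key_id"))
	if err != nil {
		return nil, fmt.Errorf("unable to read secret data, err: %v", err)
	}
	secret, err := os.ReadFile(filepath.Join(secretMountPath, "aws_secret_access_key"))
	if err != nil {
		return nil, fmt.Errorf("unable to read secret data, err: %v", err)
	}
	return session.NewSession(&aws.Config{
		Region:      aws.String(region),
		Credentials: credentials.NewStaticCredentials(string(id), string(secret), ""),
	})
}
###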
The issue blocks all STS-related cluster testing; adding the TestBlocker keyword.
Does this only apply to AWS? If so, this will be fixed by PR: https://github.com/openshift/cloud-network-config-controller/pull/13
Currently we support token credentials only on AWS and GCP, and only the AWS platform has this issue. I will verify it with the fix PR.
Verified with a cluster-bot image with the fix PR merged. The crash issue has been fixed, but the operator hits another issue:

####
caused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token
caused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory, requeuing in node workqueue
E0104 02:53:03.460524       1 controller.go:165] error syncing 'ip-10-0-169-247.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-169-247.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-169-247.us-east-2.compute.internal, err: WebIdentityErr: failed fetching WebIdentity token:
caused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token
caused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory, requeuing in node workqueue
####

@jdiaz is the developer of the Cloud Credential Operator; I think he can give some information if you need help.
Based on the error message, it looks like the projected ServiceAccount token has not been mounted at /var/run/secrets/openshift/serviceaccount/token. Looking at the GitHub repo, I didn't see where the Deployment manifest is defined, but I would expect the Deployment that runs 'cloud-network-config-controller' to have a volume mount that looks like https://github.com/openshift/cluster-image-registry-operator/blob/master/manifests/07-operator.yaml#L75-L77
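For reference, a rough sketch of what such a projected ServiceAccount token volume could look like when built with the client-go core/v1 types. The "openshift" audience and the mount path come from this bug report; the volume name and expiration value are assumptions:

###
package manifests

import (
	corev1 "k8s.io/api/core/v1"
)

// boundSATokenVolume returns a projected ServiceAccount token volume and the
// matching mount. Names and expiration are illustrative, not the actual manifest.
func boundSATokenVolume() (corev1.Volume, corev1.VolumeMount) {
	expiration := int64(3600) // assumed expiration
	vol := corev1.Volume{
		Name: "bound-sa-token",
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{
				Sources: []corev1.VolumeProjection{{
					ServiceAccountToken: &corev1.ServiceAccountTokenProjection{
						// Matches the "aud" claim expected by the OIDC provider.
						Audience:          "openshift",
						ExpirationSeconds: &expiration,
						Path:              "token",
					},
				}},
			},
		},
	}
	mount := corev1.VolumeMount{
		Name:      "bound-sa-token",
		MountPath: "/var/run/secrets/openshift/serviceaccount",
		ReadOnly:  true,
	}
	return vol, mount
}
###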
The Hypershift fix also includes PR: https://github.com/openshift/cluster-network-operator/pull/1268/files, which mounts the projected service account token. I suspect that should fix the problem completely, and maybe we should retest once that PR merges.
I saw some other error logs, please check:

###
E0113 01:50:55.536951       1 controller.go:165] error syncing 'ip-10-0-184-222.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-184-222.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-184-222.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: 3572c767-04c8-42c0-86c1-ba0f2c5d5091, requeuing in node workqueue
E0113 01:50:55.554710       1 controller.go:165] error syncing 'ip-10-0-202-212.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-202-212.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-202-212.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: 896bdc0d-23dd-44fa-be42-52e5a4bf7893, requeuing in node workqueue
I0113 01:53:39.559676       1 controller.go:160] Dropping key 'ip-10-0-143-215.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578314       1 controller.go:160] Dropping key 'ip-10-0-202-212.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578350       1 controller.go:160] Dropping key 'ip-10-0-191-236.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578356       1 controller.go:160] Dropping key 'ip-10-0-198-79.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578363       1 controller.go:160] Dropping key 'ip-10-0-184-222.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578365       1 controller.go:160] Dropping key 'ip-10-0-145-35.us-east-2.compute.internal' from the node workqueue
###

The CredentialsRequest I used for the cloud-network-config-controller is:

cat > "${creds_dir}/0000_50_cloud-network_00_credentials-request.yaml" <<EOF
---
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-cloud-network-config-controller
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - ec2:DescribeInstances
      - ec2:DescribeInstanceStatus
      - ec2:DescribeInstanceTypes
      - ec2:UnassignPrivateIpAddresses
      - ec2:AssignPrivateIpAddresses
      - ec2:UnassignIpv6Addresses
      - ec2:AssignIpv6Addresses
      - ec2:DescribeSubnets
      - ec2:DescribeNetworkInterfaces
      effect: Allow
      resource: '*'
  secretRef:
    name: cloud-credentials
    namespace: openshift-cloud-network-config-controller
  serviceAccountNames:
  - cloud-network-config-controllers
EOF

Attached must-gather logs.
@Casey Could you please have a look at this bug? Not sure how STS credentials should work and what might be missing. PR: https://github.com/openshift/cluster-network-operator/pull/1268/files mentioned that AssumeRoleWithWebIdentity should work on AWS, but it seems it doesn't
Tested using payload 4.10.0-0.nightly-2022-01-17-182202 with PR 1275 merged and hit the same issue. @jdiaz Could you help take a look?

I checked s3/key.json:

###
{
  "keys": [
    {
      "use": "sig",
      "kty": "RSA",
      "kid": "0zCe_XCNey2bBDh9fYuVVCitvEYP7bbKgFb_jOpuTEU",
      "alg": "RS256",
      "n": "uN0XAwS4shiBQinahpsbrHOCgAT4jssPWjEDs2jerBOGPETySxDlc295HkuZCb8tVLZdu3PKQlxXDEIzBuTO1DW2yWTHONVu7tnEywtuOJVgnNTu-95LU1oIFLZMnxjVmuaZSygTgDQI_p_K9wmiDmQvGudJUHVokd0N0rM63Qf1JE1UmDOgdS6jfHbDKgBLI3lZfMi7Xfdb9YMQ8JG8fIh86yB6iPWPyasNqwSAUiIoTPdnjMm6s2mFM-AVMof3z7Fs4MThFRpyNkQUKfxhBOZRsLwgoK1H0Z1N9F5VAcRTC6tIkBFYY_ty74KQtzTy1neZNam8mRVPavjGdnUbse9J3Fn7kS5d9iylGg-2WDFNnrjwUTU3Cqh8BGkFV5UubAWLjald-0-yJUEC4EuBGH0sJkYnVHTONUNFmyO4F1nbhwISZQNtjayROKDRRdSYuEkC1KKWYZiNbmn8bUCjDqD3f5Sy_D0DQEqdLbv9t5Bry0stugwwjmdS_NamMSVpZtCfVcP0B0seqMU8kIhDEE_sYQpG72NgLLgg1JZEMV-8FafoL2z3_uJ3ZEJc7ECc-nM_TEDfEPObEO6orZhK87tXfPr4bomAMZGchSEE3lZhaj5y61iRYtUPZ6aYR3n5TZKY9PRzX6Z0-oABt_bYCJsJ80bw5pkeP8Q5XOoVWF8",
      "e": "AQAB"
    }
  ]
}
###

The token for CCNC is:

###
$ oc rsh cloud-network-config-controller-7dbdf9bb85-4xct7
sh-4.4$ cat /var/run/secrets/openshift/serviceaccount/token | awk -F. '{ print $1 }' | base64 -d
{"alg":"RS256","kid":"0zCe_XCNey2bBDh9fYuVVCitvEYP7bbKgFb_jOpuTEU"}base64: invalid input
sh-4.4$ cat /var/run/secrets/openshift/serviceaccount/token | awk -F. '{ print $2 }' | base64 -d
{"aud":["openshift"],"exp":1642478458,"iat":1642474858,"iss":"https://lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com","kubernetes.io":{"namespace":"openshift-cloud-network-config-controller","pod":{"name":"cloud-network-config-controller-7dbdf9bb85-4xct7","uid":"6f9dfb1b-f9b5-480b-ae80-70cca4b87ecc"},"serviceaccount":{"name":"cloud-network-config-controller","uid":"4b630e33-3e18-45ed-bd77-5e07f94573c5"}},"nbf":1642474858,"sub":"system:serviceaccount:openshift-cloud-network-config-controller:cloud-network-config-controller"}
###

I can't see what the issue is, but the pod shows this error:

###
E0118 03:08:20.806257       1 controller.go:165] error syncing 'ip-10-0-213-232.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-213-232.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-213-232.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: a7f6aa50-fb30-40cc-99b9-ed92c6497aee, requeuing in node workqueue
###
Also, the Trust Relationship is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::301721915996:oidc-provider/lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com:sub": "system:serviceaccount:openshift-cloud-network-config-controller:cloud-network-config-controllers"
        }
      }
    }
  ]
}
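One way to isolate whether the 403 comes from the role's trust policy rather than from the controller is to call AssumeRoleWithWebIdentity directly with the mounted token. A rough Go sketch, assuming the token path from the secret shown earlier; the role ARN is a placeholder you would take from the cloud-credentials secret:

###
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	// Placeholder: take role_arn from the cloud-credentials secret.
	roleARN := "<role_arn from the cloud-credentials secret>"
	tokenPath := "/var/run/secrets/openshift/serviceaccount/token"

	token, err := os.ReadFile(tokenPath)
	if err != nil {
		log.Fatalf("unable to read token: %v", err)
	}

	// AssumeRoleWithWebIdentity is an unsigned STS call, so no other
	// credentials are needed here.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-2")))
	out, err := sts.New(sess).AssumeRoleWithWebIdentity(&sts.AssumeRoleWithWebIdentityInput{
		RoleArn:          aws.String(roleARN),
		RoleSessionName:  aws.String("ccnc-sts-debug"),
		WebIdentityToken: aws.String(string(token)),
	})
	if err != nil {
		// An AccessDenied here with a valid token usually means the trust
		// policy's ":sub" condition does not match the token's "sub" claim.
		log.Fatalf("AssumeRoleWithWebIdentity failed: %v", err)
	}
	fmt.Println("assumed role:", aws.StringValue(out.AssumedRoleUser.Arn))
}
###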
@cdc A GCP workload identity cluster has a similar issue. The installation can succeed there, so I didn't notice the issue before. Do you want to track it in this bug, or should I create another bug for the GCP platform?

###
E0118 05:22:55.659257       1 controller.go:165] error syncing 'lwanstsg0118-5lwqc-worker-c-b2g8k.c.openshift-qe.internal': error retrieving the private IP configuration for node: lwanstsg0118-5lwqc-worker-c-b2g8k.c.openshift-qe.internal, err: error retrieving instance associated with node, err: Get "https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/lwanstsg0118-5lwqc-worker-c-b2g8k?alt=json&prettyPrint=false": oauth2/google: status code 403: {
  "error": {
    "code": 403,
    "message": "The caller does not have permission",
    "status": "PERMISSION_DENIED"
  }
}
, requeuing in node workqueue
###
No, PR 1283 can't fix the issue. When I create an STS cluster, I already manually add the serviceAccountNames field to the CredentialsRequest. I use the ccoctl tool to create the STS-related manifests following the official doc https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing. If there is no serviceAccountNames field, the ccoctl tool hits an error like:

#####
2021/12/07 11:02:07 Failed to process IAM Roles: Failed while processing each CredentialsRequest: error while creating Role policy document for openshift-cluster-api-aws: CredentialsRequest must provide ServieAccounts to bind the Role policy to
#####

So I had manually added serviceAccountNames before I launched the cluster.
@jdiaz, sorry for the trouble above. I checked with Casey: I gave serviceAccountNames the value "cloud-network-config-controllers", but it should be "cloud-network-config-controller" without the trailing 's'. Please ignore the comments above.
The permission issue has been fixed; moving the bug to VERIFIED.

####pod logs####
I0118 11:34:35.683463       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0118 11:34:35.684267       1 controller.go:88] Starting node controller
I0118 11:34:35.684278       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0118 11:34:35.684761       1 controller.go:88] Starting secret controller
I0118 11:34:35.684771       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0118 11:34:35.684780       1 controller.go:88] Starting cloud-private-ip-config controller
I0118 11:34:35.684786       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0118 11:34:35.687749       1 controller.go:182] Assigning key: ip-10-0-194-96.us-east-2.compute.internal to node workqueue
I0118 11:34:35.688394       1 controller.go:182] Assigning key: ip-10-0-141-229.us-east-2.compute.internal to node workqueue
I0118 11:34:35.688412       1 controller.go:182] Assigning key: ip-10-0-186-235.us-east-2.compute.internal to node workqueue
I0118 11:34:35.784845       1 controller.go:96] Starting cloud-private-ip-config workers
I0118 11:34:35.784862       1 controller.go:96] Starting node workers
I0118 11:34:35.784872       1 controller.go:102] Started cloud-private-ip-config workers
I0118 11:34:35.784880       1 controller.go:102] Started node workers
I0118 11:34:35.784896       1 controller.go:96] Starting secret workers
I0118 11:34:35.784922       1 controller.go:102] Started secret workers
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056