Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2036029

Summary: Newly added cloud-network-config operator doesn't support AWS STS format credentials
Product: OpenShift Container Platform
Reporter: wang lin <lwan>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: wang lin <lwan>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: anbhat, bpickard, lwan
Version: 4.10
Keywords: TestBlocker
Target Milestone: ---
Flags: lwan: needinfo-
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:36:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  must-gather.tar.gz (flags: none)

Description wang lin 2021-12-29 11:13:03 UTC
Created attachment 1848192 [details]
must-gather.tar.gz

Description:
Starting from OCP 4.8, we support the STS (Security Token Service) format cloud credential.
In a non-STS cluster, the secret for components looks like:
$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data"
{
  "aws_access_key_id": "QUtJQVVNUUFXXXXXXXXX",
  "aws_secret_access_key": "Um42ekNsZmpORXXXXXXX",
  "credentials": "W2RlZmF1bHRdCmF3c19hY2Nlc3XXXXX="
}
And the .data.credentials field looks like:
$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data.credentials" | base64 -d
[default]
aws_access_key_id = QUtJQVVNUUFXXXXXXXXX
aws_secret_access_key = Um42ekNsZmpORXXXXXXX
##Note##
The .data.credentials field doesn't always exist in non-STS clusters.

In an STS cluster, the secret .data field looks like:
$ oc get secret cloud-credentials -o json | jq -r ".data"
{
  "credentials": "W2RlZmF1bHRdCnJvbGVfYXJuXXXXXXXX"
}
And the .data.credentials looks like:
$ oc get secret -n openshift-image-registry installer-cloud-credentials -o json | jq -r ".data.credentials" | base64 -d
[default]
role_arn = arn:aws:iam::301721915996:role/lwansts1229-7128-openshift-image-registry-installer-cloud-creden
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token

Currently the cloud-network-config operator doesn't recognize the .data.credentials field in
https://github.com/openshift/cloud-network-config-controller/blob/61b8877b693c6a59d6cffcad3d042c1bf89d8023/pkg/cloudprovider/aws.go#L31-L43

Related code from the image-registry operator, for reference:
https://github.com/openshift/cluster-image-registry-operator/blob/4fa9b89852c9c317847e32cfc47c7ab7ede8dffe/pkg/storage/s3/s3.go#L843-L863
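
For illustration, a minimal sketch of how a client could handle both secret layouts with aws-sdk-go: prefer the mounted "credentials" file (which the SDK can resolve, including role_arn/web_identity_token_file profiles) and fall back to the static key pair otherwise. The mount path matches the path in the crash message under "Actual result" below; the package layout and function name are assumptions for illustration, not the operator's actual code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
)

// secretMount is where the cloud credential secret is mounted
// (same path as in the crash message under "Actual result").
const secretMount = "/etc/secret/cloudprovider"

// newAWSSession builds an AWS session from either secret layout.
func newAWSSession(region string) (*session.Session, error) {
	credFile := filepath.Join(secretMount, "credentials")
	if _, err := os.Stat(credFile); err == nil {
		// STS-style secret: a shared-credentials file that may contain
		// role_arn + web_identity_token_file; let the SDK resolve it.
		return session.NewSessionWithOptions(session.Options{
			SharedConfigState: session.SharedConfigEnable,
			SharedConfigFiles: []string{credFile},
			Config:            aws.Config{Region: aws.String(region)},
		})
	}
	// Non-STS secret: plain access key ID and secret access key files.
	id, err := os.ReadFile(filepath.Join(secretMount, "aws_access_key_id"))
	if err != nil {
		return nil, fmt.Errorf("unable to read secret data, err: %w", err)
	}
	key, err := os.ReadFile(filepath.Join(secretMount, "aws_secret_access_key"))
	if err != nil {
		return nil, fmt.Errorf("unable to read secret data, err: %w", err)
	}
	return session.NewSession(&aws.Config{
		Region: aws.String(region),
		Credentials: credentials.NewStaticCredentials(
			strings.TrimSpace(string(id)), strings.TrimSpace(string(key)), ""),
	})
}

func main() {
	if _, err := newAWSSession("us-east-2"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("AWS session created")
}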

How reproducible:
Always

Steps to Reproduce:
1. Launch an STS cluster using a 4.10 payload with the cloud-network-config operator already included (refer to doc: https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-sts.html)

Actual result:
The installation fails and the cloud-network-config-controller pod crashes with the error below:
###
$ oc logs cloud-network-config-controller-c48c4b84c-s5p25
W1229 08:33:47.740717       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1229 08:33:47.741578       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I1229 08:33:47.761840       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
F1229 08:33:47.762139       1 main.go:98] Error building cloud provider client, err: %vunable to read secret data, err: open /etc/secret/cloudprovider/aws_access_key_id: no such file or directory
goroutine 25 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x3939700, 0x3, 0x0, 0xc00012a070, 0x1, {0x2d0cebb, 0x20}, 0xc00032d400, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd

Expected result:
The installation should succeed

Comment 1 wang lin 2021-12-29 11:18:00 UTC
The issue blocks all STS-related cluster testing; adding the TestBlocker keyword.

Comment 2 Alexander Constantinescu 2022-01-03 09:41:43 UTC
Does this only apply to AWS? If so, this will be fixed by PR: https://github.com/openshift/cloud-network-config-controller/pull/13

Comment 3 wang lin 2022-01-04 01:27:16 UTC
Currently we support token credentials only on AWS and GCP, and only the AWS platform has this issue. I will verify it with the fix PR.

Comment 4 wang lin 2022-01-04 03:45:35 UTC
Verified with a cluster-bot image that has the fix PR merged; the crash issue has been fixed, but the operator hits another issue:

####
caused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token
caused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory, requeuing in node workqueue
E0104 02:53:03.460524       1 controller.go:165] error syncing 'ip-10-0-169-247.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-169-247.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-169-247.us-east-2.compute.internal, err: WebIdentityErr: failed fetching WebIdentity token: 
caused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token
caused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory, requeuing in node workqueue

@jdiaz is the developer of the Cloud Credential Operator; I think he can give some information if you need help.

Comment 5 Joel Diaz 2022-01-04 12:37:53 UTC
Based on the error message, it looks like the projected ServiceAccount token has not been mounted into /var/run/secrets/openshift/serviceaccount/token.

Looking at the GitHub repo, I didn't see where the Deployment manifest is defined, but I would expect the Deployment that runs this 'cloud-network-config-controller' software to have a volume mount that looks like https://github.com/openshift/cluster-image-registry-operator/blob/master/manifests/07-operator.yaml#L75-L77
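
As a rough sketch (an assumption about the shape, not the actual manifest; the volume name, audience, and expiration here are illustrative), the projected ServiceAccount token volume and mount expressed with the k8s.io/api/core/v1 Go types would look something like:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative values only; the real Deployment may use different
	// names, audience, and expiration.
	expiration := int64(3600)

	volume := corev1.Volume{
		Name: "bound-sa-token",
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{
				Sources: []corev1.VolumeProjection{{
					ServiceAccountToken: &corev1.ServiceAccountTokenProjection{
						Audience:          "openshift",
						ExpirationSeconds: &expiration,
						Path:              "token",
					},
				}},
			},
		},
	}

	mount := corev1.VolumeMount{
		Name:      "bound-sa-token",
		ReadOnly:  true,
		MountPath: "/var/run/secrets/openshift/serviceaccount",
	}

	fmt.Printf("volume: %+v\nmount: %+v\n", volume, mount)
}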

Comment 8 Alexander Constantinescu 2022-01-06 08:56:42 UTC
The Hypershift fix also includes PR: https://github.com/openshift/cluster-network-operator/pull/1268/files, which loads the projected service account token. I suspect that should fix the problem completely, and maybe we should retest once that PR merges.

Comment 10 wang lin 2022-01-13 03:40:46 UTC
I saw some other error logs, please check:
###
E0113 01:50:55.536951       1 controller.go:165] error syncing 'ip-10-0-184-222.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-184-222.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-184-222.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: 3572c767-04c8-42c0-86c1-ba0f2c5d5091, requeuing in node workqueue
E0113 01:50:55.554710       1 controller.go:165] error syncing 'ip-10-0-202-212.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-202-212.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-202-212.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: 896bdc0d-23dd-44fa-be42-52e5a4bf7893, requeuing in node workqueue
I0113 01:53:39.559676       1 controller.go:160] Dropping key 'ip-10-0-143-215.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578314       1 controller.go:160] Dropping key 'ip-10-0-202-212.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578350       1 controller.go:160] Dropping key 'ip-10-0-191-236.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578356       1 controller.go:160] Dropping key 'ip-10-0-198-79.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578363       1 controller.go:160] Dropping key 'ip-10-0-184-222.us-east-2.compute.internal' from the node workqueue
I0113 01:53:39.578365       1 controller.go:160] Dropping key 'ip-10-0-145-35.us-east-2.compute.internal' from the node workqueue

The CredentialsRequest I used for the cloud-network-config-controller is:
cat > "${creds_dir}/0000_50_cloud-network_00_credentials-request.yaml" <<EOF
---
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-cloud-network-config-controller
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - ec2:DescribeInstances
      - ec2:DescribeInstanceStatus
      - ec2:DescribeInstanceTypes
      - ec2:UnassignPrivateIpAddresses
      - ec2:AssignPrivateIpAddresses
      - ec2:UnassignIpv6Addresses
      - ec2:AssignIpv6Addresses
      - ec2:DescribeSubnets
      - ec2:DescribeNetworkInterfaces
      effect: Allow
      resource: '*'
  secretRef:
    name: cloud-credentials
    namespace: openshift-cloud-network-config-controller
  serviceAccountNames:
  - cloud-network-config-controllers
EOF

Attached must-gather logs.

Comment 13 Alexander Constantinescu 2022-01-13 14:27:47 UTC
@Casey

Could you please have a look at this bug? I'm not sure how STS credentials should work and what might be missing. PR: https://github.com/openshift/cluster-network-operator/pull/1268/files mentions that AssumeRoleWithWebIdentity should work on AWS, but it seems it doesn't.

Comment 16 wang lin 2022-01-18 04:16:24 UTC
Tested using payload 4.10.0-0.nightly-2022-01-17-182202 with PR 1275 merged; the same issue persists.

@jdiaz Could you help take a look? I checked s3/key.json, which is:
###
{
    "keys": [
        {
            "use": "sig",
            "kty": "RSA",
            "kid": "0zCe_XCNey2bBDh9fYuVVCitvEYP7bbKgFb_jOpuTEU",
            "alg": "RS256",
            "n": "uN0XAwS4shiBQinahpsbrHOCgAT4jssPWjEDs2jerBOGPETySxDlc295HkuZCb8tVLZdu3PKQlxXDEIzBuTO1DW2yWTHONVu7tnEywtuOJVgnNTu-95LU1oIFLZMnxjVmuaZSygTgDQI_p_K9wmiDmQvGudJUHVokd0N0rM63Qf1JE1UmDOgdS6jfHbDKgBLI3lZfMi7Xfdb9YMQ8JG8fIh86yB6iPWPyasNqwSAUiIoTPdnjMm6s2mFM-AVMof3z7Fs4MThFRpyNkQUKfxhBOZRsLwgoK1H0Z1N9F5VAcRTC6tIkBFYY_ty74KQtzTy1neZNam8mRVPavjGdnUbse9J3Fn7kS5d9iylGg-2WDFNnrjwUTU3Cqh8BGkFV5UubAWLjald-0-yJUEC4EuBGH0sJkYnVHTONUNFmyO4F1nbhwISZQNtjayROKDRRdSYuEkC1KKWYZiNbmn8bUCjDqD3f5Sy_D0DQEqdLbv9t5Bry0stugwwjmdS_NamMSVpZtCfVcP0B0seqMU8kIhDEE_sYQpG72NgLLgg1JZEMV-8FafoL2z3_uJ3ZEJc7ECc-nM_TEDfEPObEO6orZhK87tXfPr4bomAMZGchSEE3lZhaj5y61iRYtUPZ6aYR3n5TZKY9PRzX6Z0-oABt_bYCJsJ80bw5pkeP8Q5XOoVWF8",
            "e": "AQAB"
        }
    ]
}

Token for CCNC is:
###
$ oc rsh cloud-network-config-controller-7dbdf9bb85-4xct7
sh-4.4$ cat /var/run/secrets/openshift/serviceaccount/token | awk -F. '{ print $1 }' | base64 -d
{"alg":"RS256","kid":"0zCe_XCNey2bBDh9fYuVVCitvEYP7bbKgFb_jOpuTEU"}base64: invalid input
sh-4.4$ cat /var/run/secrets/openshift/serviceaccount/token | awk -F. '{ print $2 }' | base64 -d
{"aud":["openshift"],"exp":1642478458,"iat":1642474858,"iss":"https://lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com","kubernetes.io":{"namespace":"openshift-cloud-network-config-controller","pod":{"name":"cloud-network-config-controller-7dbdf9bb85-4xct7","uid":"6f9dfb1b-f9b5-480b-ae80-70cca4b87ecc"},"serviceaccount":{"name":"cloud-network-config-controller","uid":"4b630e33-3e18-45ed-bd77-5e07f94573c5"}},"nbf":1642474858,"sub":"system:serviceaccount:openshift-cloud-network-config-controller:cloud-network-config-controller"}


I can't see any issue with it, but the pod shows this error:
###
E0118 03:08:20.806257       1 controller.go:165] error syncing 'ip-10-0-213-232.us-east-2.compute.internal': error retrieving the private IP configuration for node: ip-10-0-213-232.us-east-2.compute.internal, err: error: cannot list ec2 instance for node: ip-10-0-213-232.us-east-2.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: a7f6aa50-fb30-40cc-99b9-ed92c6497aee, requeuing in node workqueue

Comment 17 wang lin 2022-01-18 04:21:48 UTC
Also, the Trust Relationship is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::301721915996:oidc-provider/lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "lwansts0118-14958-oidc.s3.us-east-2.amazonaws.com:sub": "system:serviceaccount:openshift-cloud-network-config-controller:cloud-network-config-controllers"
        }
      }
    }
  ]
}

Comment 18 wang lin 2022-01-18 07:57:00 UTC
@cdc A GCP workload identity cluster has a similar issue, but the installation can succeed, so I didn't notice the issue before. Do you want to use one bug to track it, or should I create another bug for the GCP platform?

###
E0118 05:22:55.659257       1 controller.go:165] error syncing 'lwanstsg0118-5lwqc-worker-c-b2g8k.c.openshift-qe.internal': error retrieving the private IP configuration for node: lwanstsg0118-5lwqc-worker-c-b2g8k.c.openshift-qe.internal, err: error retrieving instance associated with node, err: Get "https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/lwanstsg0118-5lwqc-worker-c-b2g8k?alt=json&prettyPrint=false": oauth2/google: status code 403: {
  "error": {
    "code": 403,
    "message": "The caller does not have permission",
    "status": "PERMISSION_DENIED"
  }
}
, requeuing in node workqueue

Comment 20 wang lin 2022-01-18 09:34:26 UTC
No, PR 1283 can't fix the issue. When I create an STS cluster, I already manually add the serviceAccountNames field to the CredentialsRequest. I use the ccoctl tool to create STS-related manifests following this official doc: https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing

If there is no serviceAccountNames field, the ccoctl tool will hit an issue like:
#####
2021/12/07 11:02:07 Failed to process IAM Roles: Failed while processing each CredentialsRequest: error while creating Role policy document for openshift-cluster-api-aws: CredentialsRequest must provide ServieAccounts to bind the Role policy to
#####

So I had manually added serviceAccountNames before launching the cluster.

Comment 21 wang lin 2022-01-18 11:11:31 UTC
@jdiaz, sorry for the trouble above. I checked with Casey: I gave serviceAccountNames the value "cloud-network-config-controllers", but it should be cloud-network-config-controller without the trailing 's'. Please ignore the comments above.

Comment 22 wang lin 2022-01-18 12:51:03 UTC
The permission issue has been fixed; I will move the bug to Verified.

####pod logs####
I0118 11:34:35.683463       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0118 11:34:35.684267       1 controller.go:88] Starting node controller
I0118 11:34:35.684278       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0118 11:34:35.684761       1 controller.go:88] Starting secret controller
I0118 11:34:35.684771       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0118 11:34:35.684780       1 controller.go:88] Starting cloud-private-ip-config controller
I0118 11:34:35.684786       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0118 11:34:35.687749       1 controller.go:182] Assigning key: ip-10-0-194-96.us-east-2.compute.internal to node workqueue
I0118 11:34:35.688394       1 controller.go:182] Assigning key: ip-10-0-141-229.us-east-2.compute.internal to node workqueue
I0118 11:34:35.688412       1 controller.go:182] Assigning key: ip-10-0-186-235.us-east-2.compute.internal to node workqueue
I0118 11:34:35.784845       1 controller.go:96] Starting cloud-private-ip-config workers
I0118 11:34:35.784862       1 controller.go:96] Starting node workers
I0118 11:34:35.784872       1 controller.go:102] Started cloud-private-ip-config workers
I0118 11:34:35.784880       1 controller.go:102] Started node workers
I0118 11:34:35.784896       1 controller.go:96] Starting secret workers
I0118 11:34:35.784922       1 controller.go:102] Started secret workers

Comment 27 errata-xmlrpc 2022-03-10 16:36:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056