Description of problem:
When installing an STS cluster on a disconnected network, the image registry hits WebIdentityErr when it tries to call AssumeRoleWithWebIdentity. I created a VPC interface endpoint for regional STS and set a custom endpoint in install-config.yaml, so the operator should be able to reach the STS service through the regional STS VPC endpoint. However, the image registry only sends requests to the global STS endpoint (https://sts.amazonaws.com/), and then goes Degraded because the request times out.

The error messages from the installation log:

    level=debug msg=Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-03-15-144314: 667 of 669 done (99% complete), waiting on image-registry
    level=debug msg=Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-03-15-144314: 667 of 669 done (99% complete), waiting on image-registry
    level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
    level=info msg=Cluster operator image-registry Available is False with DeploymentNotFound: NodeCADaemonAvailable: The daemon set node-ca has available replicas
    level=info msg=Available: The deployment does not exist
    level=info msg=ImagePrunerAvailable: Pruner CronJob has been created
    level=info msg=Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: WebIdentityErr: failed to retrieve credentials
    level=info msg=Progressing: caused by: RequestError: send request failed
    level=info msg=Progressing: caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout
    level=error msg=Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-15-144314

How reproducible:
always

Steps to Reproduce:
1. Create the VPC and related network resources through CloudFormation, and make sure there is a VPC interface endpoint for STS, like:

    stsEndpoint:
      Type: AWS::EC2::VPCEndpoint
      Properties:
        PrivateDnsEnabled: true
        VpcEndpointType: Interface
        SecurityGroupIds:
        - !Ref EndpointSecurityGroup
        SubnetIds:
        - !Ref PrivateSubnet
        - !If [DoAz2, !Ref PrivateSubnet2, !Ref "AWS::NoValue"]
        ServiceName: !Join
        - ''
        - - com.amazonaws.
          - !Ref 'AWS::Region'
          - .sts
        VpcId: !Ref VPC

2. Create install-config.yaml and add the custom STS endpoint, like:

    platform:
      aws:
        region: us-east-2
        subnets:
        - subnet-0e500f52xxxxxxxxx
        - subnet-08562e4fxxxxxxxxx
        serviceEndpoints:
        - name: sts
          url: https://sts.us-east-2.amazonaws.com

3. Prepare the OIDC service for token validation. Related doc: https://deploy-preview-29545--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing-manual-config

4. Install the cluster:

    ./openshift-install create cluster --log-level debug --dir cluster1

Actual results:
The installation fails while the image-registry cluster operator is initializing.

Expected results:
The operator should send requests to the regional STS endpoint defined in install-config.yaml, and the installation should succeed.

Additional info:
OCP 4.7 has the same issue.
Other operators (such as ingress and machine-api) work well after I created the STS VPC interface endpoint and set the custom endpoint in install-config.yaml.
An STS cluster can't be installed successfully on a disconnected network, so I added the testblocker keyword.
Moving to the image-registry component. Other AWS-interacting components came up okay, and CCO is not involved in creating the credentials when the cluster is in STS mode.
Adding more background: by default, our operators always send STS requests to the global STS endpoint (https://sts.amazonaws.com) rather than a regional STS endpoint (such as https://sts.us-east-1.amazonaws.com). On a connected network this is fine because the global endpoint is reachable. On a disconnected network, we need to create an interface VPC endpoint for STS in the cluster VPC so that the operators' STS requests can reach the AWS STS service. That VPC endpoint can only be a regional endpoint, so we need to put the regional STS endpoint into install-config.yaml before creating the cluster, to let the operators know about it:

    platform:
      aws:
        region: us-east-2
        subnets:
        - subnet-0e500f52xxxxxxxxx
        - subnet-08562e4fxxxxxxxxx
        serviceEndpoints:
        - name: sts
          url: https://sts.us-east-2.amazonaws.com

After the serviceEndpoints field is set in install-config.yaml, other operators such as machine-api and ingress respect the setting and send STS requests to the regional STS endpoint, but image-registry doesn't.
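To make the expected behavior concrete, here is a minimal sketch of the lookup described above: prefer the URL configured for "sts" in serviceEndpoints and fall back to the global endpoint only when no override exists. It is an illustration, not any operator's code; the function names resolve_endpoint and sts_endpoint are hypothetical, and the list-of-dicts input simply mirrors the serviceEndpoints entries from install-config.yaml.

```python
from typing import Optional

# Global STS endpoint named in this report; used only when no override is set.
GLOBAL_STS_ENDPOINT = "https://sts.amazonaws.com"


def resolve_endpoint(service_endpoints: list, service: str) -> Optional[str]:
    """Return the custom URL configured for `service`, or None if absent.

    Hypothetical helper: `service_endpoints` mirrors the serviceEndpoints
    list (dicts with "name" and "url" keys).
    """
    for entry in service_endpoints:
        if entry.get("name") == service:
            return entry.get("url")
    return None


def sts_endpoint(service_endpoints: list) -> str:
    """Prefer the configured regional STS endpoint, else the global one."""
    return resolve_endpoint(service_endpoints, "sts") or GLOBAL_STS_ENDPOINT


if __name__ == "__main__":
    configured = [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]
    print(sts_endpoint(configured))  # regional endpoint from the config
    print(sts_endpoint([]))          # falls back to the global endpoint
```

With the override present, the request would go to the regional endpoint that the VPC interface endpoint serves; without it, the fallback is the global endpoint, which is unreachable on a disconnected network.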
As far as I can see, only the installer has a generic handler for serviceEndpoints [1] and can be aware of STS endpoints. machine-api-operator doesn't use aws-sdk-go at all (maybe it doesn't interact with AWS?). cluster-ingress-operator and cluster-image-registry-operator both have a hardcoded list of services that are checked in serviceEndpoints, so I'm a bit surprised that only the image registry operator has the problem; both should be affected.

I don't know how to create a disconnected cluster, so I can only draft a PR blindly. wang, if I draft a PR, will you be able to launch a cluster with it?

[1]: https://github.com/openshift/installer/blob/7ba3f375977b7e2a0adc856db3f258f2c53b8aef/pkg/asset/installconfig/aws/session.go#L47-L48
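For illustration only, the difference between a hardcoded service list and a generic handler can be sketched like this. This is not the operators' or the installer's actual code; the allowlist contents ("s3") and both function names are hypothetical, chosen just to show how an override for a service missing from such a list would be silently ignored.

```python
# Hypothetical allowlist standing in for a hardcoded list of services.
# The real lists in the operators may differ; "s3" is only an example.
HARDCODED_SERVICES = ("s3",)


def overrides_with_allowlist(service_endpoints: list, allowlist: tuple) -> dict:
    """Keep only overrides whose service name appears in the allowlist."""
    return {
        entry["name"]: entry["url"]
        for entry in service_endpoints
        if entry["name"] in allowlist
    }


def overrides_generic(service_endpoints: list) -> dict:
    """Keep every configured override, whatever the service name."""
    return {entry["name"]: entry["url"] for entry in service_endpoints}


if __name__ == "__main__":
    configured = [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]
    # With the hypothetical allowlist, the sts override is dropped.
    print(overrides_with_allowlist(configured, HARDCODED_SERVICES))
    # The generic handler keeps it.
    print(overrides_generic(configured))
```

Under these assumptions, a component using the allowlist approach would keep talking to the default STS endpoint even though an override is configured, which matches the symptom reported here.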
FWIW, with the way I understand machine-api, the machine-api AWS piece lives here https://github.com/openshift/cluster-api-provider-aws
Hi Oleg, yes, I can launch a cluster with your PR. Our QE team has an environment for installing this kind of cluster.
In my test, machine-api does seem to respect serviceEndpoints.

Without the serviceEndpoints setting, machine-api sends requests to the global STS endpoint (https://sts.amazonaws.com), hits the error below in a disconnected environment, and can't create worker machines:

    0610 07:31:11.779058 1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.amazonaws.com/\": dial tcp 52.46.134.192:443: i/o timeout" "name"="lwandissts-qdf97-worker-us-east-2b-wj2ng" "namespace"="openshift-machine-api"

    $ oc get nodes
    NAME                                        STATUS   ROLES    AGE   VERSION
    ip-10-0-49-113.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-70-33.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-79-65.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571

With the serviceEndpoints setting, it works well and is able to create worker machines:

    $ oc get nodes
    NAME                                        STATUS   ROLES    AGE   VERSION
    ip-10-0-49-30.us-east-2.compute.internal    Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-50-182.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-53-255.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-67-79.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-78-157.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-79-232.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
wang, can you check https://github.com/openshift/cluster-image-registry-operator/pull/699 ?
Hi Oleg, I tested this with a cluster-bot image that includes PR #699. The image-registry cluster operator is healthy, and I no longer see the `Post "https://sts.amazonaws.com/": timeout` error.

    $ oc get clusterversion
    NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         57m     Cluster version is 4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest

The co status:

    $ oc get co image-registry
    NAME             VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
    image-registry   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         False      135m

The Infrastructure CR:

    $ oc get infrastructure
    NAME      AGE
    cluster   155m
    [cloud-user@preserve-lwan-edcrfv ~]$ oc get infrastructure cluster -o yaml
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      creationTimestamp: "2021-06-25T02:32:39Z"
      generation: 1
      name: cluster
      resourceVersion: "679"
      uid: b33c272f-ac73-4620-81d6-d53e4eefcab9
    spec:
      cloudConfig:
        name: ""
      platformSpec:
        aws:
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS
    status:
      apiServerInternalURI: https://api-int.lwanstsdis0625.qe.devcluster.openshift.com:6443
      apiServerURL: https://api.lwanstsdis0625.qe.devcluster.openshift.com:6443
      controlPlaneTopology: HighlyAvailable
      etcdDiscoveryDomain: ""
      infrastructureName: lwanstsdis0625-7g2kb
      infrastructureTopology: HighlyAvailable
      platform: AWS
      platformStatus:
        aws:
          region: us-east-2
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS

The image-registry log:

    $ oc logs cluster-image-registry-operator-5d95cf4847-2d2gl -n openshift-image-registry | grep "sts"
    (no output)
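As a side note for anyone repeating this verification, the endpoint check against the Infrastructure CR can be scripted. The sketch below assumes the CR has been fetched as JSON (for example with `oc get infrastructure cluster -o json`) and loaded into a dict; the function name aws_service_endpoints and the trimmed SAMPLE document are hypothetical, with field names taken from the YAML shown above.

```python
import json


def aws_service_endpoints(infrastructure: dict) -> dict:
    """Map service name -> URL from status.platformStatus.aws.serviceEndpoints.

    Returns an empty dict when the field is missing.
    """
    status = infrastructure.get("status") or {}
    aws = (status.get("platformStatus") or {}).get("aws") or {}
    return {e["name"]: e["url"] for e in aws.get("serviceEndpoints") or []}


# Trimmed sample with the same shape as the Infrastructure CR in this comment.
SAMPLE = json.loads("""
{
  "kind": "Infrastructure",
  "status": {
    "platformStatus": {
      "type": "AWS",
      "aws": {
        "region": "us-east-2",
        "serviceEndpoints": [
          {"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}
        ]
      }
    }
  }
}
""")

if __name__ == "__main__":
    endpoints = aws_service_endpoints(SAMPLE)
    print(endpoints.get("sts", "no sts override configured"))
```

For a live cluster, the sample could be replaced by `json.load(sys.stdin)` with the `oc` output piped in; that variant is left out here so the sketch runs without a cluster.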
This was verified before the PR merged, so I will move it to Verified directly.
I hadn't noticed that the cluster-bot image was a 4.8 build, so I retested on a 4.9 nightly build, and it worked as expected. But Oleg, I see this change hasn't been merged to 4.8. Will we backport it to 4.8?

Test result:

    $ oc get co image-registry
    NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    image-registry   4.9.0-0.nightly-2021-06-28-114004   True        False         False      93m

    $ oc get infrastructure cluster -o yaml
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      creationTimestamp: "2021-06-29T02:26:24Z"
      generation: 1
      name: cluster
      resourceVersion: "688"
      uid: 0ad5d320-192d-4993-82a5-031ea0d2e2c2
    spec:
      cloudConfig:
        name: ""
      platformSpec:
        aws:
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS
    status:
      apiServerInternalURI: https://api-int.lwanipid0629.qe.devcluster.openshift.com:6443
      apiServerURL: https://api.lwanipid0629.qe.devcluster.openshift.com:6443
      controlPlaneTopology: HighlyAvailable
      etcdDiscoveryDomain: ""
      infrastructureName: lwanipid0629-ztgmk
      infrastructureTopology: HighlyAvailable
      platform: AWS
      platformStatus:
        aws:
          region: us-east-2
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS
When upgrading from 4.9 to 4.10, I hit this issue, so I'm raising the Priority/Severity. Details follow:

    [cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
    warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
    warning: --allow-upgrade-with-warnings is bypassing: already upgrading.

      Reason: ImageVerificationFailed
      Message: Unable to apply registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5: the image may not be safe to use

    Updating to release image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08

    [cloud-user@preserve-olm-env jian]$ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.9.0-0.nightly-2021-09-17-210126   True        True          4m34s   Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe to apply this update

    [cloud-user@preserve-olm-env jian]$ oc get clusterversion version -o yaml
    apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    metadata:
      creationTimestamp: "2021-09-22T04:08:46Z"
      generation: 4
      name: version
      resourceVersion: "47959"
      uid: 3389b705-45d2-4a50-8eea-aa22249def23
    spec:
      channel: stable-4.9
      clusterID: 2a995974-0127-42ad-a867-398aab50523b
      desiredUpdate:
        force: false
        image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
        version: ""
      upstream: https://amd64.ocp.releases.ci.openshift.org/graph
(In reply to Catherinen from comment #22)
> When upgrading the 4.9 to 4.10, meet this issue. raise the
> Priority/Severity, as follows:
>
> [cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image
> registry.ci.openshift.org/ocp/release@sha256:
> af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
> --allow-explicit-upgrade --allow-upgrade-with-warnings
> warning: The requested upgrade image is not one of the available updates.
> You have used --allow-explicit-upgrade to the update to proceed anyway
> warning: --allow-upgrade-with-warnings is bypassing: already upgrading.

Neither your target image nor your original image is signed, so you should use '--force' in the upgrade command. This is a different issue from the original bug.

`oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings --force`