Bug 1939842
| Field | Value |
|---|---|
| Summary | Image registry Degraded caused by requesting to aws sts global endpoint timeout when installing sts cluster in a disconnected network |
| Product | OpenShift Container Platform |
| Reporter | wang lin <lwan> |
| Component | Image Registry |
| Assignee | Oleg Bulatov <obulatov> |
| Status | CLOSED WONTFIX |
| QA Contact | XiuJuan Wang <xiuwang> |
| Severity | high |
| Priority | high |
| Version | 4.8 |
| CC | aos-bugs, arane, djohnsto, jdiaz, jshu, lwan, obulatov, xiuwang, yunjiang |
| Keywords | TestBlocker |
| Target Release | 4.9.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| | 1974499, 1977184 |
| Last Closed | 2021-08-30 11:10:11 UTC |
| Type | Bug |
| Bug Blocks | 1977184 |
Description
wang lin
2021-03-17 08:09:38 UTC
Can't install an STS cluster successfully in a disconnected network, so I added the TestBlocker keyword.

Moving to the image-registry component. Other AWS-interacting components came up okay, and CCO is not involved in creating the credentials when the cluster is in STS mode.

Adding more info as background: by default, our operators always send STS requests to the global STS endpoint (https://sts.amazonaws.com) rather than a regional STS endpoint (like https://sts.us-east-1.amazonaws.com). In a connected network that's OK because the global endpoint is accessible. In a disconnected network, we need to create an interface VPC endpoint for STS in the cluster VPC so that the operators' STS requests can reach the AWS STS service, but that endpoint can only be a regional one. We then need to put this regional STS endpoint into install-config.yaml before creating the cluster, to let the operators know:

    platform:
      aws:
        region: us-east-2
        subnets:
        - subnet-0e500f52xxxxxxxxx
        - subnet-08562e4fxxxxxxxxx
        serviceEndpoints:
        - name: sts
          url: https://sts.us-east-2.amazonaws.com

After setting this serviceEndpoints field in install-config.yaml, other operators such as machine-api and ingress all respect the setting and send STS requests to the regional STS endpoint, but image-registry doesn't.

As far as I can see, only the installer has a generic handler for serviceEndpoints [1] and can be aware of STS endpoints. machine-api-operator doesn't use aws-sdk-go at all (it doesn't interact with AWS?), while cluster-ingress-operator and cluster-image-registry-operator both have a hardcoded list of services that are checked in serviceEndpoints, so I'm a bit surprised that only the image registry operator has the problem. They both should have problems. I don't know how to create a disconnected cluster, so I can only draft a PR blindly. wang, if I draft a PR, will you be able to launch a cluster with it?
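The serviceEndpoints lookup discussed above can be sketched in Go as follows. This is a minimal illustration, not code from any of the operators: the `ServiceEndpoint` type and the `resolveEndpoint` function are hypothetical names, and the only details taken from this report are the name/url override format and the two STS URLs.

```go
package main

import "fmt"

// ServiceEndpoint mirrors one name/url pair under serviceEndpoints.
// Hypothetical type for illustration; not the operators' real type.
type ServiceEndpoint struct {
	Name string
	URL  string
}

// globalSTSEndpoint is the default endpoint mentioned in this report.
const globalSTSEndpoint = "https://sts.amazonaws.com"

// resolveEndpoint returns the override URL configured for the given
// service, or the fallback when no non-empty override is present.
func resolveEndpoint(service, fallback string, overrides []ServiceEndpoint) string {
	for _, ep := range overrides {
		if ep.Name == service && ep.URL != "" {
			return ep.URL
		}
	}
	return fallback
}

func main() {
	overrides := []ServiceEndpoint{
		{Name: "sts", URL: "https://sts.us-east-2.amazonaws.com"},
	}
	// With the override, the regional endpoint is selected.
	fmt.Println(resolveEndpoint("sts", globalSTSEndpoint, overrides))
	// Without it, the lookup falls back to the global endpoint.
	fmt.Println(resolveEndpoint("sts", globalSTSEndpoint, nil))
}
```

A component that consults only a fixed set of service names would fall through to the global endpoint for any service missing from that set, which would be consistent with the timeout reported here.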
[1]: https://github.com/openshift/installer/blob/7ba3f375977b7e2a0adc856db3f258f2c53b8aef/pkg/asset/installconfig/aws/session.go#L47-L48

FWIW, the way I understand machine-api, the machine-api AWS piece lives here: https://github.com/openshift/cluster-api-provider-aws

Hi Oleg, yes, I can launch a cluster with your PR; our QE has an environment for installing such clusters.

In my test, machine-api does seem to respect serviceEndpoints.

Without the serviceEndpoints setting, machine-api sends requests to the global STS endpoint (https://sts.amazonaws.com), hits the error below in a disconnected environment, and is unable to create worker machines:

    0610 07:31:11.779058 1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.amazonaws.com/\": dial tcp 52.46.134.192:443: i/o timeout" "name"="lwandissts-qdf97-worker-us-east-2b-wj2ng" "namespace"="openshift-machine-api"

    $ oc get nodes
    NAME                                        STATUS   ROLES    AGE   VERSION
    ip-10-0-49-113.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-70-33.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-79-65.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571

With the serviceEndpoints setting, it works well and is able to create worker machines:
    $ oc get nodes
    NAME                                        STATUS   ROLES    AGE   VERSION
    ip-10-0-49-30.us-east-2.compute.internal    Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-50-182.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-53-255.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-67-79.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
    ip-10-0-78-157.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
    ip-10-0-79-232.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571

wang, can you check https://github.com/openshift/cluster-image-registry-operator/pull/699 ?

Hi Oleg, I tested this with a cluster-bot image with PR #699 merged. The image-registry cluster operator status is healthy, and I no longer see the `Post "https://sts.amazonaws.com/": timeout` error.

    $ oc get clusterversion
    NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         57m     Cluster version is 4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest

    # The co status
    $ oc get co image-registry
    NAME             VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
    image-registry   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         False      135m

    # The Infrastructure CR
    $ oc get infrastructure
    NAME      AGE
    cluster   155m
    [cloud-user@preserve-lwan-edcrfv ~]$ oc get infrastructure cluster -o yaml
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      creationTimestamp: "2021-06-25T02:32:39Z"
      generation: 1
      name: cluster
      resourceVersion: "679"
      uid: b33c272f-ac73-4620-81d6-d53e4eefcab9
    spec:
      cloudConfig:
        name: ""
      platformSpec:
        aws:
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS
    status:
      apiServerInternalURI: https://api-int.lwanstsdis0625.qe.devcluster.openshift.com:6443
      apiServerURL: https://api.lwanstsdis0625.qe.devcluster.openshift.com:6443
      controlPlaneTopology: HighlyAvailable
      etcdDiscoveryDomain: ""
      infrastructureName: lwanstsdis0625-7g2kb
      infrastructureTopology: HighlyAvailable
      platform: AWS
      platformStatus:
        aws:
          region: us-east-2
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS

    # The image-registry log
    $ oc logs cluster-image-registry-operator-5d95cf4847-2d2gl -n openshift-image-registry | grep "sts"

The grep produced no output.

This was verified before the PR merged, so I will move it to Verified directly.

I hadn't noticed that the cluster-bot image was a 4.8 version, so I retested on a 4.9 nightly build; it worked as expected. But Oleg, I see this change hasn't been merged to 4.8. Will we backport it to 4.8?

Test result:

    $ oc get co image-registry
    NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    image-registry   4.9.0-0.nightly-2021-06-28-114004   True        False         False      93m

    $ oc get infrastructure cluster -o yaml
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      creationTimestamp: "2021-06-29T02:26:24Z"
      generation: 1
      name: cluster
      resourceVersion: "688"
      uid: 0ad5d320-192d-4993-82a5-031ea0d2e2c2
    spec:
      cloudConfig:
        name: ""
      platformSpec:
        aws:
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS
    status:
      apiServerInternalURI: https://api-int.lwanipid0629.qe.devcluster.openshift.com:6443
      apiServerURL: https://api.lwanipid0629.qe.devcluster.openshift.com:6443
      controlPlaneTopology: HighlyAvailable
      etcdDiscoveryDomain: ""
      infrastructureName: lwanipid0629-ztgmk
      infrastructureTopology: HighlyAvailable
      platform: AWS
      platformStatus:
        aws:
          region: us-east-2
          serviceEndpoints:
          - name: sts
            url: https://sts.us-east-2.amazonaws.com
        type: AWS

This comment was flagged as spam, view the edit history to see the original text if required.

(In reply to Catherinen from comment #22)
> When upgrading the 4.9 to 4.10, meet this issue.
> raise the Priority/Severity, as follows:
>
> [cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
> warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
> warning: --allow-upgrade-with-warnings is bypassing: already upgrading.

Your target and original images are both unsigned, so you should use '--force' in the upgrade command. This is a different issue from the one in the original bug.

`oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings --force`