Bug 1939842

Summary:	Image registry Degraded because requests to the AWS STS global endpoint time out when installing an STS cluster in a disconnected network
Product:	OpenShift Container Platform	Reporter:	wang lin <lwan>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED WONTFIX	QA Contact:	XiuJuan Wang <xiuwang>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.8	CC:	aos-bugs, arane, djohnsto, jdiaz, jshu, lwan, obulatov, xiuwang, yunjiang
Target Milestone:	---	Keywords:	TestBlocker
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1974499 1977184 (view as bug list)		Environment:
Last Closed:	2021-08-30 11:10:11 UTC	Type:	Bug
Bug Blocks:	1977184

Description wang lin 2021-03-17 08:09:38 UTC

Description of problem:
When installing an STS cluster in a disconnected network, the image registry hits WebIdentityErr when it tries to call AssumeRoleWithWebIdentity. I created a VPC interface endpoint for regional STS and set a custom endpoint in install-config.yaml, so the registry should be able to reach the STS service through the regional VPC endpoint. However, the image registry only sends requests to the global STS endpoint (https://sts.amazonaws.com/) and then goes Degraded because the request times out.

The error message from the installation log output:
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-03-15-144314: 667 of 669 done (99% complete), waiting on image-registry
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-03-15-144314: 667 of 669 done (99% complete), waiting on image-registry
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=info msg=Cluster operator image-registry Available is False with DeploymentNotFound: NodeCADaemonAvailable: The daemon set node-ca has available replicas
level=info msg=Available: The deployment does not exist
level=info msg=ImagePrunerAvailable: Pruner CronJob has been created
level=info msg=Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: WebIdentityErr: failed to retrieve credentials
level=info msg=Progressing: caused by: RequestError: send request failed
level=info msg=Progressing: caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout
level=error msg=Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
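
To tell quickly from a log like the one above which STS endpoint a component actually tried to reach, the timed-out URL can be pulled out of the log text. The following is only a minimal sketch: the function names are hypothetical, and the "regional" pattern assumes the standard aws partition host format sts.<region>.amazonaws.com.

```python
import re
from urllib.parse import urlparse

# Matches: Post "https://host/": dial tcp <ip:port>: i/o timeout
# Backslash-escaped quotes (as seen in controller logs) are accepted too.
_POST_TIMEOUT = re.compile(
    r'Post \\?"(?P<url>https?://[^"\\]+)\\?": dial tcp [^ ]+ i/o timeout'
)
# Assumption: regional STS hosts look like sts.<region>.amazonaws.com.
_REGIONAL_STS = re.compile(r"^sts\.[a-z0-9-]+\.amazonaws\.com$")


def classify_sts_host(host):
    """Label a host as the global STS endpoint, a regional one, or other."""
    if host == "sts.amazonaws.com":
        return "global"
    if _REGIONAL_STS.match(host):
        return "regional"
    return "other"


def timed_out_endpoints(log_text):
    """Return (host, kind) pairs for every timed-out POST in log_text."""
    results = []
    for match in _POST_TIMEOUT.finditer(log_text):
        host = urlparse(match.group("url")).hostname or ""
        results.append((host, classify_sts_host(host)))
    return results


if __name__ == "__main__":
    sample = (
        'level=info msg=Progressing: caused by: Post '
        '"https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout'
    )
    print(timed_out_endpoints(sample))
```

A "global" result for the image-registry line means the custom regional endpoint was not used for that request.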


Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-15-144314

How reproducible:
always

Steps to Reproduce: 
1. Create the VPC and related network resources through CloudFormation, and make sure there is a VPC interface endpoint for STS, like:
stsEndpoint:
   Type: AWS::EC2::VPCEndpoint
   Properties:
     PrivateDnsEnabled: true
     VpcEndpointType: Interface
     SecurityGroupIds:
     - !Ref EndpointSecurityGroup
     SubnetIds:
     - !Ref PrivateSubnet
     - !If [DoAz2, !Ref PrivateSubnet2, !Ref "AWS::NoValue"]
     ServiceName: !Join
     - ''
     - - com.amazonaws.
       - !Ref 'AWS::Region'
       - .sts
     VpcId: !Ref VPC
 
2. Create install-config.yaml and add a custom STS endpoint, like:
platform:
 aws:
   region: us-east-2
   subnets:
   - subnet-0e500f52xxxxxxxxx
   - subnet-08562e4fxxxxxxxxx
   serviceEndpoints:
   - name: sts
     url: https://sts.us-east-2.amazonaws.com
3. Prepare the OIDC service for token validation; related doc: https://deploy-preview-29545--osdocs.netlify.app/openshift-enterprise/latest/authentication/managing_cloud_provider_credentials/cco-mode-sts.html#sts-mode-installing-manual-config
4. Install the cluster:
./openshift-install create cluster --log-level debug --dir cluster1
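
Before running the install, it may help to check that the VPC endpoint service name from step 1 and the serviceEndpoints URL from step 2 agree with the configured region. The sketch below is an illustration only: the helper names are hypothetical, the install-config is passed in as an already-parsed dict (no YAML loader is assumed), and the URL pattern assumes the standard aws partition.

```python
def expected_sts_names(region):
    """Derive the STS VPC endpoint service name and regional URL for a region."""
    return {
        # Same value the CloudFormation !Join in step 1 produces.
        "vpc_service_name": f"com.amazonaws.{region}.sts",
        # Assumption: standard aws partition host naming.
        "regional_url": f"https://sts.{region}.amazonaws.com",
    }


def check_install_config(config):
    """Return a list of problems with the sts serviceEndpoints entry."""
    aws = config.get("platform", {}).get("aws", {})
    region = aws.get("region")
    if not region:
        return ["platform.aws.region is not set"]

    expected = expected_sts_names(region)["regional_url"]
    sts_entries = [
        e for e in aws.get("serviceEndpoints", []) if e.get("name") == "sts"
    ]
    problems = []
    if not sts_entries:
        problems.append("no serviceEndpoints entry named 'sts'")
    for entry in sts_entries:
        url = entry.get("url", "").rstrip("/")
        if url != expected:
            problems.append(f"sts url {url!r} does not match {expected!r}")
    return problems


if __name__ == "__main__":
    install_config = {
        "platform": {
            "aws": {
                "region": "us-east-2",
                "serviceEndpoints": [
                    {"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}
                ],
            }
        }
    }
    print(expected_sts_names("us-east-2"))
    print(check_install_config(install_config))
```

An empty problem list only means the config is self-consistent; it says nothing about whether each operator honours the setting.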

Actual results:
The installation failed on cluster image-registry initialization.

Expected results:
The operator should send requests to the regional STS endpoint defined in install-config.yaml, and the installation should succeed.

Additional info:
OCP 4.7 has the same issue.
Other operators (such as ingress and machine-api) work well after I created the STS VPC interface endpoint and set the custom endpoint in install-config.yaml.

Comment 3 wang lin 2021-05-07 09:07:30 UTC

An STS cluster can't be installed successfully in a disconnected network, so I added the TestBlocker keyword.

Comment 4 Joel Diaz 2021-05-17 15:05:37 UTC

Moving to the image-registry component. Other AWS-interacting components came up okay, and CCO is not involved in creating the credentials when the cluster is in STS mode.

Comment 5 wang lin 2021-06-04 02:39:28 UTC

Adding more info as background:

  By default, our operators always send STS requests to the global STS endpoint (https://sts.amazonaws.com) rather than a regional STS endpoint (like https://sts.us-east-1.amazonaws.com).
  In a connected network this is fine, because the global endpoint is accessible.
  In a disconnected network, however, we need to create an interface VPC endpoint for STS in the cluster VPC so that the operators' STS requests can reach the AWS STS service. That VPC endpoint can only be a regional endpoint, so we have to put the regional STS endpoint into install-config.yaml before creating the cluster to let the operators know about it.
##
platform:
 aws:
   region: us-east-2
   subnets:
   - subnet-0e500f52xxxxxxxxx
   - subnet-08562e4fxxxxxxxxx
   serviceEndpoints:
   - name: sts
     url: https://sts.us-east-2.amazonaws.com
##

  After setting the serviceEndpoints field in install-config.yaml, other operators such as machine-api and ingress all respect the setting and send STS requests to the regional STS endpoint, but image-registry doesn't.
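
The behaviour expected from every operator can be sketched as a simple lookup: use the serviceEndpoints override for a service if one exists, otherwise fall back to the default (global) endpoint. This is a conceptual illustration with a hypothetical function name, not the code of any operator or of the AWS SDK.

```python
GLOBAL_STS = "https://sts.amazonaws.com"


def resolve_endpoint(service, service_endpoints, default):
    """Return the override URL for service, or default if none is set."""
    for entry in service_endpoints:
        if entry["name"] == service:
            return entry["url"]
    return default


if __name__ == "__main__":
    overrides = [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]
    # With the override, STS traffic stays on the regional (VPC) endpoint.
    print(resolve_endpoint("sts", overrides, GLOBAL_STS))
    # Without it, requests go to the global endpoint, which is unreachable
    # from a disconnected network.
    print(resolve_endpoint("sts", [], GLOBAL_STS))
```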

Comment 6 Oleg Bulatov 2021-06-08 14:56:06 UTC

As far as I can see, only the installer has a generic handler for serviceEndpoints [1] and can be aware of sts endpoints. machine-api-operator doesn't use aws-sdk-go at all (it doesn't interact with AWS?), while cluster-ingress-operator and cluster-image-registry-operator both have a hardcoded list of services that are checked in serviceEndpoints, so I'm a bit surprised that only the image registry operator has the problem. They both should have problems.

I don't know how to create a disconnected cluster, so I can draft a PR only blindly.

wang, if I draft a PR, will you be able to launch a cluster with it?

[1]: https://github.com/openshift/installer/blob/7ba3f375977b7e2a0adc856db3f258f2c53b8aef/pkg/asset/installconfig/aws/session.go#L47-L48
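
To illustrate why a hardcoded service list would matter, here is a small sketch of how overrides for services outside such a list end up silently ignored. The function name and the example list ("s3" only) are hypothetical and are not taken from any operator's source.

```python
def split_overrides(service_endpoints, handled_services):
    """Split overrides into those a component applies and those it ignores."""
    used = {
        e["name"]: e["url"]
        for e in service_endpoints
        if e["name"] in handled_services
    }
    ignored = sorted(
        e["name"] for e in service_endpoints if e["name"] not in handled_services
    )
    return used, ignored


if __name__ == "__main__":
    overrides = [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]
    # Hypothetical component that only looks for an "s3" override.
    print(split_overrides(overrides, {"s3"}))
    # Hypothetical component that also looks for "sts".
    print(split_overrides(overrides, {"s3", "sts"}))
```

In the first call the sts override lands in the ignored list, which would leave STS requests going to the default endpoint.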

Comment 7 Joel Diaz 2021-06-08 15:02:46 UTC

FWIW, with the way I understand machine-api, the machine-api AWS piece lives here https://github.com/openshift/cluster-api-provider-aws

Comment 8 wang lin 2021-06-09 09:30:43 UTC

Hi Oleg, yes, I can launch a cluster with your PR; our QE team has an environment for installing such clusters.

Comment 9 wang lin 2021-06-10 07:53:19 UTC

In my test, machine-api seems to respect serviceEndpoints.

## Without the serviceEndpoints setting, machine-api sends requests to the global STS endpoint (https://sts.amazonaws.com), hits the error below in a disconnected environment, and cannot create worker machines:

0610 07:31:11.779058       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.amazonaws.com/\": dial tcp 52.46.134.192:443: i/o timeout" "name"="lwandissts-qdf97-worker-us-east-2b-wj2ng" "namespace"="openshift-machine-api" 

$ oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-49-113.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
ip-10-0-70-33.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
ip-10-0-79-65.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571


## With the serviceEndpoints setting, it works well and is able to create worker machines.

$ oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-49-30.us-east-2.compute.internal    Ready    worker   22h   v1.21.0-rc.0+7f76571
ip-10-0-50-182.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
ip-10-0-53-255.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571
ip-10-0-67-79.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+7f76571
ip-10-0-78-157.us-east-2.compute.internal   Ready    worker   22h   v1.21.0-rc.0+7f76571
ip-10-0-79-232.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+7f76571

Comment 13 Oleg Bulatov 2021-06-24 11:04:55 UTC

wang, can you check https://github.com/openshift/cluster-image-registry-operator/pull/699 ?

Comment 14 wang lin 2021-06-25 05:49:36 UTC

Hi Oleg, I tested this with a cluster-bot image that has PR #699 merged in. The image-registry cluster operator is healthy, and I no longer see the `Post "https://sts.amazonaws.com/": timeout` error.

$ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         57m     Cluster version is 4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest


#The co status
$ oc get co image-registry
NAME             VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.0-0.ci.test-2021-06-25-014202-ci-ln-r5pbs0t-latest   True        False         False      135m


#the Infrastructure CR
$ oc get infrastructure
NAME      AGE
cluster   155m
$ oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2021-06-25T02:32:39Z"
  generation: 1
  name: cluster
  resourceVersion: "679"
  uid: b33c272f-ac73-4620-81d6-d53e4eefcab9
spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: sts
        url: https://sts.us-east-2.amazonaws.com
    type: AWS
status:
  apiServerInternalURI: https://api-int.lwanstsdis0625.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.lwanstsdis0625.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: lwanstsdis0625-7g2kb
  infrastructureTopology: HighlyAvailable
  platform: AWS
  platformStatus:
    aws:
      region: us-east-2
      serviceEndpoints:
      - name: sts
        url: https://sts.us-east-2.amazonaws.com
    type: AWS


#the image-registry log
$ oc logs cluster-image-registry-operator-5d95cf4847-2d2gl -n openshift-image-registry  | grep "sts"
Nothing output
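
The manual check of the Infrastructure CR above could also be scripted. The sketch below assumes the CR is exported as JSON (for example with `oc get infrastructure cluster -o json`) and compares the sts entry in spec.platformSpec with the one in status.platformStatus; the function names are hypothetical.

```python
import json


def _find_sts(entries):
    """Return the url of the entry named 'sts', or None if absent."""
    return next((e.get("url") for e in entries if e.get("name") == "sts"), None)


def sts_endpoint_status(infra):
    """Summarise the sts serviceEndpoints entry in spec and status."""
    spec = (
        infra.get("spec", {})
        .get("platformSpec", {})
        .get("aws", {})
        .get("serviceEndpoints", [])
    )
    status = (
        infra.get("status", {})
        .get("platformStatus", {})
        .get("aws", {})
        .get("serviceEndpoints", [])
    )
    spec_url, status_url = _find_sts(spec), _find_sts(status)
    return {
        "spec": spec_url,
        "status": status_url,
        "consistent": spec_url is not None and spec_url == status_url,
    }


if __name__ == "__main__":
    # Trimmed-down copy of the CR shown above, as JSON.
    raw = """
    {"spec": {"platformSpec": {"type": "AWS", "aws": {"serviceEndpoints":
        [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]}}},
     "status": {"platformStatus": {"type": "AWS", "aws": {"region": "us-east-2",
        "serviceEndpoints":
        [{"name": "sts", "url": "https://sts.us-east-2.amazonaws.com"}]}}}}
    """
    print(sts_endpoint_status(json.loads(raw)))
```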

Comment 16 wang lin 2021-06-28 11:33:04 UTC

I verified this before the PR merged, so I'm moving it to Verified directly.

Comment 17 wang lin 2021-06-29 04:20:40 UTC

I hadn't noticed that the cluster-bot image was a 4.8 build, so I retested on a 4.9 nightly build, and it worked as expected.

But Oleg, I see this change hasn't been merged to 4.8. Will we backport it to 4.8?


##test result
$ oc get co image-registry
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.9.0-0.nightly-2021-06-28-114004   True        False         False      93m  

$ oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2021-06-29T02:26:24Z"
  generation: 1
  name: cluster
  resourceVersion: "688"
  uid: 0ad5d320-192d-4993-82a5-031ea0d2e2c2
spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: sts
        url: https://sts.us-east-2.amazonaws.com
    type: AWS
status:
  apiServerInternalURI: https://api-int.lwanipid0629.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.lwanipid0629.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: lwanipid0629-ztgmk
  infrastructureTopology: HighlyAvailable
  platform: AWS
  platformStatus:
    aws:
      region: us-east-2
      serviceEndpoints:
      - name: sts
        url: https://sts.us-east-2.amazonaws.com
    type: AWS

Comment 22 Catherinen 2022-11-21 09:56:16 UTC Comment hidden (spam)

Comment 23 XiuJuan Wang 2022-11-23 02:18:02 UTC

(In reply to Catherinen from comment #22)
> When upgrading the 4.9 to 4.10, meet this issue. raise the
> Priority/Severity, as follows:
> 
> [cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image
> registry.ci.openshift.org/ocp/release@sha256:
> af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
> --allow-explicit-upgrade  --allow-upgrade-with-warnings
> warning: The requested upgrade image is not one of the available updates. 
> You have used --allow-explicit-upgrade to the update to proceed anyway
> warning: --allow-upgrade-with-warnings is bypassing: already upgrading.

Neither your target image nor your original image is signed, so you should use '--force' in the upgrade command.
This is a different issue from the original bug.
`oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade  --allow-upgrade-with-warnings --force`