Description of problem:
install cluster in custom region as below but ingress is degraded with DNSReady=False

spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: ec2
        url: https://ec2.af-south-1.amazonaws.com
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.af-south-1.amazonaws.com
      - name: s3
        url: https://s3.af-south-1.amazonaws.com
      - name: tagging
        url: https://tagging.af-south-1.amazonaws.com
    type: AWS

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-210224

How reproducible:
100%

Steps to Reproduce:
1. install 4.6 cluster in custom region
2.
3.

Actual results:
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-08-04-210224   True        False         True       113m

$ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml
<---snip--->
  - lastTransitionTime: "2020-08-05T08:29:18Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-05bug-hrdw5-int kubernetes.io/cluster/yunjiang-05bug-hrdw5:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-05T08:38:01Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

$ oc -n openshift-ingress-operator get dnsrecords.ingress.operator.openshift.io -oyaml
<---snip--->
  status:
    observedGeneration: 1
    zones:
    - conditions:
      - lastTransitionTime: "2020-08-05T08:29:01Z"
        message: "The DNS provider failed to ensure the record: failed to find hosted zone for record: failed to get tagged resources: InvalidSignatureException: Credential should be scoped to a valid region, not 'us-east-1'. \n\tstatus code: 400, request id: d5e77ec8-4c39-443e-b1c3-1435fd64a3a6"
        reason: ProviderError
        status: "True"
        type: Failed
      dnsZone:
        tags:
          Name: yunjiang-05bug-hrdw5-int
          kubernetes.io/cluster/yunjiang-05bug-hrdw5: owned
    - conditions:
      - lastTransitionTime: "2020-08-05T08:29:11Z"
        message: The DNS provider succeeded in ensuring the record
        reason: ProviderSuccess
        status: "False"
        type: Failed
      dnsZone:
        id: Z3B3KOVA3TRCWP

Expected results:
ingress should not be degraded

Additional info:
See https://bugzilla.redhat.com/show_bug.cgi?id=1862065#c5; the route53 endpoint was not specified when installing the cluster.
The tagging API is used to find the hosted zone of the ELB provisioned by the router's LB service. The tagging API needs to be in the same region as the route53 endpoint. Since route53 is non-regionalized, the endpoint supports either no region ID or us-east-1. Can you test with the following tagging endpoint and let us know if the issue gets resolved?

- name: tagging
  url: https://tagging.us-east-1.amazonaws.com
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c2
Thanks Daneyon, it works with the following tagging endpoint:

- name: tagging
  url: https://tagging.us-east-1.amazonaws.com

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-08-10-200500   True        False         False      27m

$ oc get infrastructures.config.openshift.io/cluster -oyaml
<---snip--->
spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: ec2
        url: https://ec2.af-south-1.amazonaws.com
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.af-south-1.amazonaws.com
      - name: s3
        url: https://s3.af-south-1.amazonaws.com
      - name: tagging
        url: https://tagging.us-east-1.amazonaws.com
    type: AWS

$ oc -n openshift-ingress-operator get dnsrecords/default-wildcard -oyaml
<---snip--->
spec:
  dnsName: '*.apps.hongli-cusreg.qe.devcluster.openshift.com.'
  recordTTL: 30
  recordType: CNAME
  targets:
  - ab7fe7b58ff224c8ab6076fa84f3d169-1535927524.af-south-1.elb.amazonaws.com
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-08-11T02:05:37Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      tags:
        Name: hongli-cusreg-9dwx9-int
        kubernetes.io/cluster/hongli-cusreg-9dwx9: owned
  - conditions:
    - lastTransitionTime: "2020-08-11T02:05:45Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: Z3B3KOVA3TRCWP
Reopening since I consider this a bug. The ingress operator should verify the following for ingress-related service endpoints:

1. The route53 endpoint is either "https://route53.amazonaws.com" or "https://route53.us-east-1.amazonaws.com" (based upon https://docs.aws.amazon.com/general/latest/gr/r53.html)
2. The tagging endpoint is "https://tagging.us-east-1.amazonaws.com". This region is required to get the hosted zone (route 53) of the elb.
3. The elb endpoint is "https://elasticloadbalancing.${REGION}.amazonaws.com", where ${REGION} is taken from infrastructure/cluster platformStatus.AWS.Region.
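For illustration only, the three checks listed in this comment could be sketched as below. This is not the operator's code: the function name `validate_ingress_endpoints` and the dict layout are hypothetical, and the allowed values are just the ones quoted in this comment, so they only describe the commercial AWS partition.

```python
# Hypothetical sketch of the three endpoint checks proposed in this comment.
# The allowed values are copied from the comment (commercial partition only).

ROUTE53_ENDPOINTS = {
    "https://route53.amazonaws.com",
    "https://route53.us-east-1.amazonaws.com",
}
TAGGING_ENDPOINT = "https://tagging.us-east-1.amazonaws.com"


def validate_ingress_endpoints(endpoints, region):
    """Return a list of problems; an empty list means every check passed.

    endpoints: list of {"name": ..., "url": ...} dicts, shaped like the
               serviceEndpoints entries in infrastructures/cluster.
    region:    the cluster region (platformStatus.AWS.Region).
    """
    elb_endpoint = f"https://elasticloadbalancing.{region}.amazonaws.com"
    problems = []
    for ep in endpoints:
        name, url = ep["name"], ep["url"].rstrip("/")
        if name == "route53" and url not in ROUTE53_ENDPOINTS:
            problems.append(f"route53 endpoint {url} is not an allowed Route 53 endpoint")
        elif name == "tagging" and url != TAGGING_ENDPOINT:
            problems.append(f"tagging endpoint {url} should be {TAGGING_ENDPOINT}")
        elif name == "elasticloadbalancing" and url != elb_endpoint:
            problems.append(f"elasticloadbalancing endpoint {url} should be {elb_endpoint}")
        # Other services (ec2, s3, ...) are not ingress-related and are ignored.
    return problems
```

Run against the endpoints from the original report (region af-south-1, tagging pointed at af-south-1), this sketch would flag only the tagging entry.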
*** Bug 1862065 has been marked as a duplicate of this bug. ***
met same issue on us-gov-west-1 region: >> install-config: platform: aws: region: us-gov-west-1 amiID: ami-0dc4fa6c subnets: - subnet-085b5e8aa10376edd - subnet-0bb780712c00ec3d6 serviceEndpoints: - name: ec2 url: https://ec2.us-gov-west-1.amazonaws.com [1] - name: elasticloadbalancing url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com [1] - name: s3 url: https://s3.us-gov-west-1.amazonaws.com [1] - name: tagging url: https://tagging.us-gov-west-1.amazonaws.com [2] >> install log level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-08-16-072105: 99% complete" level=debug msg="Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" level=debug msg="Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" level=error msg="Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\": dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" level=info msg="Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: 
Request to \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\" not successfull yet" level=info msg="Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\" failed: dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\": dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.6.0-0.nightly-2020-08-16-072105" level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default" level=info msg="Cluster operator insights Disabled is False with AsExpected: " level=fatal msg="failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get 
\"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host" >> oc get dns.config -oyaml - lastTransitionTime: "2020-08-17T03:16:02Z" message: 'The record failed to provision in some zones: [{ map[Name:yun4-4w5fl-int kubernetes.io/cluster/yun4-4w5fl:owned]}]' reason: FailedZones status: "False" type: DNSReady - lastTransitionTime: "2020-08-17T03:21:34Z" message: 'One or more other status conditions indicate a degraded state: DNSReady=False' reason: DegradedConditions status: "True" type: Degraded [1] https://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html [2] https://docs.aws.amazon.com/general/latest/gr/rande.html
@Daneyon, please correct me if any of the service endpoints are wrong, especially the tagging endpoint. Thanks. Adding the TestBlocker label since this blocks all testing against the us-gov regions. Feel free to remove it if a workaround is available.
(In reply to Yunfei Jiang from comment #8)
> >> oc get dns.config -oyaml

Correction: the command should be `oc -n openshift-ingress-operator get ingresscontroller/default -oyaml`

> - lastTransitionTime: "2020-08-17T03:16:02Z"
>   message: 'The record failed to provision in some zones: [{ map[Name:yun4-4w5fl-int kubernetes.io/cluster/yun4-4w5fl:owned]}]'
>   reason: FailedZones
>   status: "False"
>   type: DNSReady
> - lastTransitionTime: "2020-08-17T03:21:34Z"
>   message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
>   reason: DegradedConditions
>   status: "True"
>   type: Degraded
I’m adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
@Yunfei, I have removed the restriction for the tagging API to use region us-gov-east-1 for GovCloud in https://github.com/openshift/cluster-ingress-operator/pull/441. Since [1] in https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c8 does not specify any tagging API details, can you test whether us-gov-east-1 and us-gov-west-1 work? If us-gov-west-1 does not work, I'll update PR 441 to add the restriction back in.
@Daneyon Since your pull request is not merged, I cannot test against your build. The following results are based on 4.6.0-0.nightly-2020-08-23-214712.

Test1: After adding https://tagging.us-gov-west-1.amazonaws.com as IAM endpoints:

  - lastTransitionTime: "2020-08-24T03:08:50Z"
    message: The record isn't present in any zones.
    reason: NoZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-24T03:15:15Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

Test2: without any endpoints:

  - lastTransitionTime: "2020-08-24T03:02:46Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-govnoep-9lf88-int kubernetes.io/cluster/yunjiang-govnoep-9lf88:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-24T03:08:54Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

Per Abhinav's note https://bugzilla.redhat.com/show_bug.cgi?id=1866723#c8, service endpoints are not required (in install-config) when installing a cluster into a us-gov region.
Yunfei, https://github.com/openshift/cluster-ingress-operator/pull/441 recently merged. Can you test the following:

1. The Ingress Operator should report expected status conditions when [1] reports either us-gov-east-1 or us-gov-west-1 regions.
2. The Ingress Operator should report expected status conditions when [1] reports either us-gov-east-1 or us-gov-west-1 regions and infrastructures/cluster is configured with the following AWS GovCloud custom endpoints:

$ REGION=$(oc get infrastructures/cluster -o jsonpath={.status.platformStatus.aws.region})

endpoints:
- name: route53
  url: https://route53.us-gov.amazonaws.com
- name: tagging
  url: https://tagging.${REGION}.amazonaws.com
- name: elasticloadbalancing
  url: https://elasticloadbalancing.${REGION}.amazonaws.com

[1] oc get infrastructures/cluster -o jsonpath={.status.platformStatus.aws.region}
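As a convenience for test 2, the endpoint list above could be generated from the region rather than typed by hand. A minimal sketch in Python, assuming only the two GovCloud regions named in this comment; the helper name `govcloud_ingress_endpoints` is hypothetical and not part of any OpenShift tooling.

```python
# Hypothetical helper: expand the ${REGION} template from this comment into
# concrete serviceEndpoints entries for a GovCloud cluster.

GOVCLOUD_REGIONS = ("us-gov-east-1", "us-gov-west-1")


def govcloud_ingress_endpoints(region):
    """Return the three ingress-related endpoints for the given region."""
    if region not in GOVCLOUD_REGIONS:
        raise ValueError(f"{region} is not one of {GOVCLOUD_REGIONS}")
    return [
        # Route 53 uses a single partition-wide endpoint in GovCloud.
        {"name": "route53", "url": "https://route53.us-gov.amazonaws.com"},
        {"name": "tagging", "url": f"https://tagging.{region}.amazonaws.com"},
        {"name": "elasticloadbalancing",
         "url": f"https://elasticloadbalancing.{region}.amazonaws.com"},
    ]


if __name__ == "__main__":
    # Print the entries in the same shape as the list in this comment.
    for ep in govcloud_ingress_endpoints("us-gov-west-1"):
        print(f"- name: {ep['name']}\n  url: {ep['url']}")
```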
Hello Daneyon, All tests fail. version: 4.6.0-0.nightly-2020-08-27-005538 >> TEST 1 install-config: <!--snip--> platform: aws: region: us-gov-west-1 amiID: ami-0dc4fa6c subnets: - subnet-02da8cfb48ddaa17d - subnet-00ab96300aa21e259 <!--snip--> ./oc get infrastructures/cluster -ojson | jq .status.platformStatus.aws { "region": "us-gov-west-1" } result: ./oc -n openshift-ingress-operator get ingresscontroller/default -oyaml - lastTransitionTime: "2020-08-27T06:10:11Z" message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-govnoep2-l7z8x-int kubernetes.io/cluster/yunjiang-govnoep2-l7z8x:owned]}]' reason: FailedZones status: "False" type: DNSReady - lastTransitionTime: "2020-08-27T06:15:49Z" message: 'One or more other status conditions indicate a degraded state: DNSReady=False' reason: DegradedConditions status: "True" type: Degraded >> TEST 2 install-config: <!--snip--> platform: aws: region: us-gov-west-1 amiID: ami-0dc4fa6c subnets: - subnet-0bad2fc475a264fe4 - subnet-0d79f6a8d47f6ae5f serviceEndpoints: - name: elasticloadbalancing url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com - name: tagging url: https://tagging.us-gov-west-1.amazonaws.com - name: route53 url: https://route53.us-gov.amazonaws.com <!--snip--> ./oc get infrastructures/cluster -ojson | jq .status.platformStatus.aws { "region": "us-gov-west-1", "serviceEndpoints": [ { "name": "elasticloadbalancing", "url": "https://elasticloadbalancing.us-gov-west-1.amazonaws.com" }, { "name": "route53", "url": "https://route53.us-gov.amazonaws.com" }, { "name": "tagging", "url": "https://tagging.us-gov-west-1.amazonaws.com" } ] } result: ./oc -n openshift-ingress-operator get ingresscontroller/default -oyaml - lastTransitionTime: "2020-08-27T06:06:18Z" message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-gov3ep-85tkc-int kubernetes.io/cluster/yunjiang-gov3ep-85tkc:owned]}]' reason: FailedZones status: "False" type: DNSReady - lastTransitionTime: 
"2020-08-27T06:11:38Z" message: 'One or more other status conditions indicate a degraded state: DNSReady=False' reason: DegradedConditions status: "True" type: Degraded
Yunfei, thank you for working through the test cases. Can you attach the ingress operator logs so I can diagnose why the two tests are failing?
Created attachment 1712914 [details] ingress log without endpoints
Created attachment 1712916 [details] ingress log with 3 endpoints
Daneyon, ingress logs attached
Per my understanding, the Secrets will be created automatically during the installation process, so there is no need to provide them manually. I checked the cloud-credential-operator (in comment 24's clusters); they are not in manual mode. @Abhinav, please help confirm whether we need to provide Secrets for operators (e.g. ingress) manually when installing a private cluster in a us-gov region (please refer to Daneyon's comment 25). Thanks!
> @Abhinav, Please help to confirm that do we need to provide Secrets for operators (e.g. ingress) manually when install a private cluster in us-gov region? Thanks! ( please refer to Daneyon's comment 25 )

Credential operator should provide secrets in GovCloud regions similar to any other commercial region. There should be no need for manual steps.
Yunfei, I successfully tested https://github.com/openshift/cluster-ingress-operator/pull/454 with and without us-gov-west-1 custom endpoints. Can you keep this cluster running until this PR merges? Since region is immutable, I was unable to test against the us-gov-east-1 region. Can you provide a us-gov-east-1 cluster so I can verify the PR works as expected in both regions?
Yunfei, https://github.com/openshift/cluster-ingress-operator/pull/454 merged, so please test when you have a moment and update this BZ with your findings.
Thanks @yunfei for having a cluster in a usgov region ready. Tested ingress-related cases and all passed. Will check the cluster with custom endpoints later.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-021653   True        False         3h25m   Cluster version is 4.6.0-0.nightly-2020-09-09-021653

spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws: {}
    type: AWS
status:
<---snip--->
  platform: AWS
  platformStatus:
    aws:
      region: us-gov-west-1
    type: AWS

$ oc -n openshift-ingress-operator get dnsrecords/default-wildcard -o yaml
<---snip--->
spec:
  dnsName: '*.apps.yunjiang-debug5.qe.devcluster.openshift.com.'
  recordTTL: 30
  recordType: CNAME
  targets:
  - internal-ae1c79eb0900e41f6832a1f37c6fc8a1-1651558077.us-gov-west-1.elb.amazonaws.com
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-09-09T04:49:06Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
Tested another cluster with usgov region and custom endpoints and passed as well, so moving to verified. $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.6.0-0.nightly-2020-09-09-021653 True False 75m Cluster version is 4.6.0-0.nightly-2020-09-09-021653 spec: cloudConfig: name: "" platformSpec: aws: serviceEndpoints: - name: elasticloadbalancing url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com - name: tagging url: https://tagging.us-gov-west-1.amazonaws.com - name: route53 url: https://route53.us-gov.amazonaws.com type: AWS status: <---snip---> platform: AWS platformStatus: aws: region: us-gov-west-1 serviceEndpoints: - name: elasticloadbalancing url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com - name: route53 url: https://route53.us-gov.amazonaws.com - name: tagging url: https://tagging.us-gov-west-1.amazonaws.com spec: dnsName: '*.apps.yunjiang-debug7.qe.devcluster.openshift.com.' recordTTL: 30 recordType: CNAME targets: - internal-ad61fd41e2ca545658d7a1c49de4782d-1800445331.us-gov-west-1.elb.amazonaws.com status: observedGeneration: 1 zones: - conditions: - lastTransitionTime: "2020-09-09T08:21:51Z" message: The DNS provider succeeded in ensuring the record reason: ProviderSuccess status: "False" type: Failed $ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml <---snip---> - lastTransitionTime: "2020-09-09T08:22:26Z" message: The record is provisioned in all reported zones. reason: NoFailedZones status: "True" type: DNSReady - lastTransitionTime: "2020-09-09T08:28:53Z" status: "False" type: Degraded
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Changing back to POST due to https://github.com/openshift/cluster-ingress-operator/pull/457 (improved provider validation).
Yunfei, thanks for testing. It appears that the installer is not properly handling the region requirements of the tagging client as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c50.

time="2020-09-16T09:39:14Z" level=info msg="Creating infrastructure resources..."
time="2020-09-16T09:39:14Z" level=debug msg="resolved AWS service tagging (us-gov-east-1) to \"https://tagging.us-gov-west-1.amazonaws.com\""
time="2020-09-16T09:39:14Z" level=debug msg="Tagging arn:aws-us-gov:ec2:us-gov-east-1:225746144451:subnet/subnet-063f1fb3fd3c529b2 with kubernetes.io/cluster/yunjiang-16i3-dnrfs: shared"
time="2020-09-16T09:39:14Z" level=debug msg="Tagging arn:aws-us-gov:ec2:us-gov-east-1:225746144451:subnet/subnet-0c959be9654a4867b with kubernetes.io/cluster/yunjiang-16i3-dnrfs: shared"
time="2020-09-16T09:39:14Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": InvalidSignatureException: Credential should be scoped to a valid region, not 'us-gov-east-1'. \n\tstatus code: 400, request id: 5ce26265-7cd9-4490-922b-f9c21543e351"

Although the region is "us-gov-east-1", the tagging client config should use "us-gov-west-1". Note: the route53 client has similar requirements (see https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c50 for details). Reassigning to the installer team for further investigation.
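One way to read the log above is that the request was signed for us-gov-east-1 but sent to a tagging endpoint in us-gov-west-1. Assuming that reading is right (it is inferred from the log, not confirmed), a rough pre-flight check could compare the two regions. The helper names below are hypothetical, and the hostname parsing is a heuristic that only understands the `<service>.<region>.amazonaws.com` form; it would misread partition-wide hosts such as route53.us-gov.amazonaws.com.

```python
from urllib.parse import urlparse


def endpoint_region(url):
    """Best-effort region from a <service>.<region>.amazonaws.com URL.

    Returns None for hosts that do not match that four-label form,
    e.g. the global https://route53.amazonaws.com endpoint.
    """
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    if len(parts) == 4 and parts[-2:] == ["amazonaws", "com"]:
        return parts[1]
    return None


def signing_region_mismatch(signing_region, endpoint_url):
    """True when the endpoint is regional and its region differs from the
    region the client would sign requests for."""
    region = endpoint_region(endpoint_url)
    return region is not None and region != signing_region


if __name__ == "__main__":
    # Values taken from the installer log in this comment.
    print(signing_region_mismatch(
        "us-gov-east-1", "https://tagging.us-gov-west-1.amazonaws.com"))
```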
The installer needs access to the resources tagging endpoint in the same region as the cluster to tag the subnets. So this is invalid for us-gov-west-1

```
serviceEndpoints:
- name: elasticloadbalancing
  url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
- name: tagging
  url: https://tagging.us-gov-west-1.amazonaws.com
- name: route53
  url: https://route53.us-gov.amazonaws.com
```

I want to reiterate that GovCloud is not a custom region and it does not require service endpoints. A user setting invalid service endpoints is an invalid configuration, not a bug.
So #Comment 34 and #Comment 42 are not true? So for GovCloud, do we just support us-gov-east-1 and us-gov-west-1 without any custom service endpoints, or is it just that the elb/tagging/route53 service endpoints are not required? We already verified that it works in regions us-gov-east-1 and us-gov-west-1 without service endpoints. If this is the expected result, then we can move this bug to verified and file a new BZ for documentation stating that service endpoints are not required for GovCloud. @Daneyon any thoughts?
Thanks Yunfei for launching a cluster in us-gov-east with serviceEndpoints, no issues found.

# oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-09-23-022756   True        False         False      19h

spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
      - name: tagging
        url: https://tagging.us-gov-east-1.amazonaws.com
      - name: route53
        url: https://route53.us-gov.amazonaws.com
    type: AWS
status:
<---snip--->
  platform: AWS
  platformStatus:
    aws:
      region: us-gov-east-1
      serviceEndpoints:
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
      - name: route53
        url: https://route53.us-gov.amazonaws.com
      - name: tagging
        url: https://tagging.us-gov-east-1.amazonaws.com
    type: AWS
Tested in both us-gov-east and us-gov-west region and with/without serviceEndpoints, all four clusters are working well. Thank you Daneyon!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196