Bug 1866299

Summary:

ingress is degraded with DNSReady=False when installing cluster in AWS custom region

Product:

OpenShift Container Platform

Reporter:

Hongan Li <hongli>

Component:

Networking

Assignee:

Daneyon Hansen <dhansen>

Networking sub component:

router

QA Contact:

Hongan Li <hongli>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

high

CC:

adahiya, amcdermo, aos-bugs, bleanhar, dhansen, jokerman, sgreene, yunjiang, zhsun

Version:

4.6

Keywords:

Reopened

Target Milestone:

---

Target Release:

4.6.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2020-10-27 16:24:53 UTC

Type:

Bug

Attachments:

Description	Flags
ingress log without endpoints	none
ingress log with 3 endpoints	none

Description Hongan Li 2020-08-05 10:38:26 UTC

Description of problem:
Installing a cluster in a custom region with the configuration below leaves ingress degraded with DNSReady=False:
  spec:
    cloudConfig:
      name: ""
    platformSpec:
      aws:
        serviceEndpoints:
        - name: ec2
          url: https://ec2.af-south-1.amazonaws.com
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.af-south-1.amazonaws.com
        - name: s3
          url: https://s3.af-south-1.amazonaws.com
        - name: tagging
          url: https://tagging.af-south-1.amazonaws.com
      type: AWS


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-210224

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.6 cluster in a custom region.

Actual results:
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-08-04-210224   True        False         True       113m

$ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml
<---snip--->
  - lastTransitionTime: "2020-08-05T08:29:18Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-05bug-hrdw5-int
      kubernetes.io/cluster/yunjiang-05bug-hrdw5:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-05T08:38:01Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded


$ oc -n openshift-ingress-operator get dnsrecords.ingress.operator.openshift.io -oyaml
<---snip--->
  status:
    observedGeneration: 1
    zones:
    - conditions:
      - lastTransitionTime: "2020-08-05T08:29:01Z"
        message: "The DNS provider failed to ensure the record: failed to find hosted
          zone for record: failed to get tagged resources: InvalidSignatureException:
          Credential should be scoped to a valid region, not 'us-east-1'. \n\tstatus
          code: 400, request id: d5e77ec8-4c39-443e-b1c3-1435fd64a3a6"
        reason: ProviderError
        status: "True"
        type: Failed
      dnsZone:
        tags:
          Name: yunjiang-05bug-hrdw5-int
          kubernetes.io/cluster/yunjiang-05bug-hrdw5: owned
    - conditions:
      - lastTransitionTime: "2020-08-05T08:29:11Z"
        message: The DNS provider succeeded in ensuring the record
        reason: ProviderSuccess
        status: "False"
        type: Failed
      dnsZone:
        id: Z3B3KOVA3TRCWP
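
Note on reading the output above: the DNSRecord condition type is `Failed`, so `status: "True"` marks a zone where provisioning failed and `status: "False"` marks a zone that succeeded. A minimal Python sketch of that reading follows, using a hand-trimmed copy of the status above. The helper name `failed_zones` is hypothetical and not part of any OpenShift tooling.

```python
# Hypothetical helper: list the zones whose "Failed" condition is "True".
# The dict is a trimmed, hand-copied subset of the DNSRecord status above.
status = {
    "zones": [
        {
            "conditions": [
                {"type": "Failed", "status": "True", "reason": "ProviderError"},
            ],
            "dnsZone": {"tags": {"Name": "yunjiang-05bug-hrdw5-int"}},
        },
        {
            "conditions": [
                {"type": "Failed", "status": "False", "reason": "ProviderSuccess"},
            ],
            "dnsZone": {"id": "Z3B3KOVA3TRCWP"},
        },
    ]
}


def failed_zones(record_status):
    """Return (dnsZone, reason) pairs for zones whose Failed condition is "True"."""
    failed = []
    for zone in record_status.get("zones", []):
        for cond in zone.get("conditions", []):
            if cond.get("type") == "Failed" and cond.get("status") == "True":
                failed.append((zone.get("dnsZone", {}), cond.get("reason")))
    return failed


for dns_zone, reason in failed_zones(status):
    print(dns_zone, reason)
```

Only the tag-identified zone is reported here, which matches the zone named in the ingresscontroller's FailedZones message.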



Expected results:
ingress should not be degraded

Additional info:

Comment 1 Hongan Li 2020-08-05 10:40:23 UTC

See https://bugzilla.redhat.com/show_bug.cgi?id=1862065#c5; the route53 endpoint was not specified when installing the cluster.

Comment 2 Daneyon Hansen 2020-08-06 16:55:22 UTC

The tagging API is used to find the hosted zone of the ELB provisioned by the router's LB service. The tagging API needs to be in the same region as the route53 endpoint. Since route53 is non-regionalized, the endpoint supports either no region ID or us-east-1. Can you test with the following tagging endpoint and let us know whether the issue is resolved?

- name: tagging
  url: https://tagging.us-east-1.amazonaws.com

Comment 3 Daneyon Hansen 2020-08-06 17:34:56 UTC

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c2

Comment 4 Hongan Li 2020-08-11 02:45:42 UTC

Thanks Daneyon, it works with the following tagging endpoint:
      - name: tagging
        url: https://tagging.us-east-1.amazonaws.com

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-08-10-200500   True        False         False      27m

$ oc get infrastructures.config.openshift.io/cluster -oyaml
<---snip--->
spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: ec2
        url: https://ec2.af-south-1.amazonaws.com
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.af-south-1.amazonaws.com
      - name: s3
        url: https://s3.af-south-1.amazonaws.com
      - name: tagging
        url: https://tagging.us-east-1.amazonaws.com
    type: AWS

$ oc -n openshift-ingress-operator get dnsrecords/default-wildcard -oyaml
<---snip--->
spec:
  dnsName: '*.apps.hongli-cusreg.qe.devcluster.openshift.com.'
  recordTTL: 30
  recordType: CNAME
  targets:
  - ab7fe7b58ff224c8ab6076fa84f3d169-1535927524.af-south-1.elb.amazonaws.com
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-08-11T02:05:37Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      tags:
        Name: hongli-cusreg-9dwx9-int
        kubernetes.io/cluster/hongli-cusreg-9dwx9: owned
  - conditions:
    - lastTransitionTime: "2020-08-11T02:05:45Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: Z3B3KOVA3TRCWP

Comment 5 Daneyon Hansen 2020-08-12 21:02:59 UTC

Reopening since I consider this a bug. The ingress operator should verify the following for ingress-related service endpoints:

1. The route53 endpoint is either "https://route53.amazonaws.com" or "https://route53.us-east-1.amazonaws.com" (based upon https://docs.aws.amazon.com/general/latest/gr/r53.html)
2. The tagging endpoint is "https://tagging.us-east-1.amazonaws.com". This region is required to get the hosted zone (route 53) of the elb.
3. The elb endpoint is "https://elasticloadbalancing.${REGION}.amazonaws.com", where ${REGION} is taken from infrastructure/cluster platformStatus.AWS.Region.
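
The three checks above could be sketched roughly as follows. This is an illustration in Python of the rules as proposed in this comment, not the operator's actual code. The function name `validate_endpoints` and the error strings are hypothetical, and the rules cover only the commercial AWS endpoints listed here.

```python
# Sketch of the three checks proposed in this comment; not the operator's code.
ROUTE53_ALLOWED = {
    "https://route53.amazonaws.com",
    "https://route53.us-east-1.amazonaws.com",
}
TAGGING_REQUIRED = "https://tagging.us-east-1.amazonaws.com"


def validate_endpoints(endpoints, region):
    """Return error strings for ingress-related endpoints that break the rules.

    endpoints: list of {"name": ..., "url": ...} dicts (serviceEndpoints).
    region: the cluster region, as in platformStatus.aws.region.
    """
    expected_elb = f"https://elasticloadbalancing.{region}.amazonaws.com"
    errors = []
    for ep in endpoints:
        name, url = ep["name"], ep["url"]
        if name == "route53" and url not in ROUTE53_ALLOWED:
            errors.append(f"route53: unexpected endpoint {url}")
        elif name == "tagging" and url != TAGGING_REQUIRED:
            errors.append(f"tagging: expected {TAGGING_REQUIRED}, got {url}")
        elif name == "elasticloadbalancing" and url != expected_elb:
            errors.append(f"elasticloadbalancing: expected {expected_elb}, got {url}")
    return errors


# Endpoints from the original description of this bug (region af-south-1).
reported = [
    {"name": "ec2", "url": "https://ec2.af-south-1.amazonaws.com"},
    {"name": "elasticloadbalancing",
     "url": "https://elasticloadbalancing.af-south-1.amazonaws.com"},
    {"name": "s3", "url": "https://s3.af-south-1.amazonaws.com"},
    {"name": "tagging", "url": "https://tagging.af-south-1.amazonaws.com"},
]
for problem in validate_endpoints(reported, "af-south-1"):
    print(problem)
```

With the endpoints from the description, only the tagging endpoint is flagged. Endpoints unrelated to ingress (ec2, s3) are ignored by this sketch.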

Comment 6 Daneyon Hansen 2020-08-12 21:15:37 UTC

*** Bug 1862065 has been marked as a duplicate of this bug. ***

Comment 8 Yunfei Jiang 2020-08-18 11:12:57 UTC

Hit the same issue in the us-gov-west-1 region:

>> install-config:

platform:
  aws:
    region: us-gov-west-1
    amiID: ami-0dc4fa6c
    subnets:
    - subnet-085b5e8aa10376edd
    - subnet-0bb780712c00ec3d6
    serviceEndpoints:
    - name: ec2
      url: https://ec2.us-gov-west-1.amazonaws.com  # [1]
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com  # [1]
    - name: s3
      url: https://s3.us-gov-west-1.amazonaws.com  # [1]
    - name: tagging
      url: https://tagging.us-gov-west-1.amazonaws.com  # [2]

>> install log

level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-08-16-072105: 99% complete"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"
level=error msg="Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\": dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"
level=info msg="Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: Request to \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\" not successfull yet"
level=info msg="Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\" failed: dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \"https://oauth-openshift.apps.yun4.qe.devcluster.openshift.com/healthz\": dial tcp: lookup oauth-openshift.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"
level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.6.0-0.nightly-2020-08-16-072105"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=fatal msg="failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health): Get \"https://console-openshift-console.apps.yun4.qe.devcluster.openshift.com/health\": dial tcp: lookup console-openshift-console.apps.yun4.qe.devcluster.openshift.com on 172.30.0.10:53: no such host"

>> oc get dns.config -oyaml

  - lastTransitionTime: "2020-08-17T03:16:02Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yun4-4w5fl-int kubernetes.io/cluster/yun4-4w5fl:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-17T03:21:34Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded


[1] https://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html
[2] https://docs.aws.amazon.com/general/latest/gr/rande.html

Comment 9 Yunfei Jiang 2020-08-18 11:19:32 UTC

@Daneyon, please correct me if any of the service endpoints are wrong, especially the tagging endpoint. Thanks.

Adding the TestBlocker label since this blocks all testing against the us-gov regions.
Feel free to remove it if a workaround is available.

Comment 10 Yunfei Jiang 2020-08-19 03:19:21 UTC

(In reply to Yunfei Jiang from comment #8)

> >> oc get dns.config -oyaml

Correction: the command should be `oc -n openshift-ingress-operator get ingresscontroller/default -oyaml`

>   - lastTransitionTime: "2020-08-17T03:16:02Z"
>     message: 'The record failed to provision in some zones: [{
> map[Name:yun4-4w5fl-int kubernetes.io/cluster/yun4-4w5fl:owned]}]'
>     reason: FailedZones
>     status: "False"
>     type: DNSReady
>   - lastTransitionTime: "2020-08-17T03:21:34Z"
>     message: 'One or more other status conditions indicate a degraded state:
> DNSReady=False'
>     reason: DegradedConditions
>     status: "True"
>     type: Degraded

Comment 12 Daneyon Hansen 2020-08-21 20:08:15 UTC

I’m adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 13 Daneyon Hansen 2020-08-21 23:38:41 UTC

@Yunfei,

I have removed the restriction for the tagging api to use region us-gov-east-1 for GovCloud in https://github.com/openshift/cluster-ingress-operator/pull/441. Since [1] in https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c8 does not specify any tagging api details, can you test whether us-gov-east-1 and us-gov-west-1 work? If us-gov-west-1 does not work, I'll update PR 441 to add the restriction back in.

Comment 14 Yunfei Jiang 2020-08-24 09:56:41 UTC

@Daneyon

Since your pull request is not merged, I cannot test against your build.

The following results are based on 4.6.0-0.nightly-2020-08-23-214712.

Test1: After adding https://tagging.us-gov-west-1.amazonaws.com as IAM endpoints:

  - lastTransitionTime: "2020-08-24T03:08:50Z"
    message: The record isn't present in any zones.
    reason: NoZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-24T03:15:15Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

Test2: without any endpoints:

  - lastTransitionTime: "2020-08-24T03:02:46Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-govnoep-9lf88-int kubernetes.io/cluster/yunjiang-govnoep-9lf88:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-24T03:08:54Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

Per Abhinav's note in https://bugzilla.redhat.com/show_bug.cgi?id=1866723#c8, service endpoints are not required (in install-config) when installing a cluster into a us-gov region.

Comment 18 Daneyon Hansen 2020-08-26 20:57:34 UTC

Yunfei,

https://github.com/openshift/cluster-ingress-operator/pull/441 recently merged. Can you test the following:

1. The Ingress Operator should report expected status conditions when [1] reports either us-gov-east-1 or us-gov-west-1 regions.
2. The Ingress Operator should report expected status conditions when [1] reports either us-gov-east-1 or us-gov-west-1 regions and infrastructures/cluster is configured with the following AWS GovCloud custom endpoints:

$ REGION=$(oc get infrastructures/cluster -o jsonpath='{.status.platformStatus.aws.region}')
endpoints:
  - name: route53
    url: https://route53.us-gov.amazonaws.com
  - name: tagging
    url: https://tagging.${REGION}.amazonaws.com
  - name: elasticloadbalancing
    url: https://elasticloadbalancing.${REGION}.amazonaws.com


[1] oc get infrastructures/cluster -o jsonpath='{.status.platformStatus.aws.region}'
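
For reference, the endpoint list requested in test case 2 can be derived from the region value alone. The sketch below is a Python illustration under that assumption. The function name `govcloud_endpoints` is hypothetical, and only the two GovCloud regions named in this comment are accepted.

```python
# Hypothetical helper that builds the GovCloud custom endpoints listed above
# from the cluster region reported by infrastructures/cluster.
GOVCLOUD_REGIONS = ("us-gov-east-1", "us-gov-west-1")


def govcloud_endpoints(region):
    """Return the route53, tagging and ELB endpoints for a GovCloud region."""
    if region not in GOVCLOUD_REGIONS:
        raise ValueError(f"not a GovCloud region covered by this test: {region}")
    return [
        # The route53 endpoint is the same for both GovCloud regions.
        {"name": "route53", "url": "https://route53.us-gov.amazonaws.com"},
        {"name": "tagging", "url": f"https://tagging.{region}.amazonaws.com"},
        {"name": "elasticloadbalancing",
         "url": f"https://elasticloadbalancing.{region}.amazonaws.com"},
    ]


for endpoint in govcloud_endpoints("us-gov-west-1"):
    print(f"- name: {endpoint['name']}")
    print(f"  url: {endpoint['url']}")
```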

Comment 19 Yunfei Jiang 2020-08-27 08:46:43 UTC

Hello Daneyon,

All tests fail.

version: 4.6.0-0.nightly-2020-08-27-005538


>> TEST 1

install-config:
<!--snip-->
platform:
  aws:
    region: us-gov-west-1
    amiID: ami-0dc4fa6c
    subnets:
    - subnet-02da8cfb48ddaa17d
    - subnet-00ab96300aa21e259
<!--snip-->

./oc get infrastructures/cluster -ojson | jq .status.platformStatus.aws
{
  "region": "us-gov-west-1"
}


result:
./oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
  - lastTransitionTime: "2020-08-27T06:10:11Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-govnoep2-l7z8x-int kubernetes.io/cluster/yunjiang-govnoep2-l7z8x:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-27T06:15:49Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded


>> TEST 2

install-config:
<!--snip-->
platform:
  aws:
    region: us-gov-west-1
    amiID: ami-0dc4fa6c
    subnets:
    - subnet-0bad2fc475a264fe4
    - subnet-0d79f6a8d47f6ae5f
    serviceEndpoints:
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com
    - name: tagging
      url: https://tagging.us-gov-west-1.amazonaws.com
    - name: route53
      url: https://route53.us-gov.amazonaws.com

<!--snip-->

./oc get infrastructures/cluster -ojson | jq .status.platformStatus.aws
{
  "region": "us-gov-west-1",
  "serviceEndpoints": [
    {
      "name": "elasticloadbalancing",
      "url": "https://elasticloadbalancing.us-gov-west-1.amazonaws.com"
    },
    {
      "name": "route53",
      "url": "https://route53.us-gov.amazonaws.com"
    },
    {
      "name": "tagging",
      "url": "https://tagging.us-gov-west-1.amazonaws.com"
    }
  ]
}


result:
./oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
  - lastTransitionTime: "2020-08-27T06:06:18Z"
    message: 'The record failed to provision in some zones: [{ map[Name:yunjiang-gov3ep-85tkc-int kubernetes.io/cluster/yunjiang-gov3ep-85tkc:owned]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-08-27T06:11:38Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

Comment 20 Daneyon Hansen 2020-08-27 15:25:46 UTC

Yunfei,

Thank you for working through the test cases. Can you attach the ingress operator logs so I can diagnose why the two tests are failing?

Comment 21 Yunfei Jiang 2020-08-28 07:10:24 UTC

Created attachment 1712914
ingress log without endpoints

Comment 22 Yunfei Jiang 2020-08-28 07:10:58 UTC

Created attachment 1712916
ingress log with 3 endpoints

Comment 23 Yunfei Jiang 2020-08-28 07:12:34 UTC

Daneyon, ingress logs attached

Comment 27 Yunfei Jiang 2020-08-31 14:04:00 UTC

As I understand it, the Secrets are created automatically during installation, so there is no need to provide them manually. I checked the cloud-credential-operator (in the clusters from comment 24), and it is not in manual mode.

@Abhinav, Please help to confirm that do we need to provide Secrets for operators (e.g. ingress) manually when install a private cluster in us-gov region? Thanks! ( please refer to Daneyon's comment 25 )

Thanks!

Comment 28 Abhinav Dahiya 2020-09-01 22:34:14 UTC

> @Abhinav, Please help to confirm that do we need to provide Secrets for operators (e.g. ingress) manually when install a private cluster in us-gov region? Thanks! ( please refer to Daneyon's comment 25 )

Credential operator should provide secrets in GovCloud regions similar to any other commercial region. There should be no need for manual steps.

Comment 33 Daneyon Hansen 2020-09-04 20:56:35 UTC

Yunfei,

I successfully tested https://github.com/openshift/cluster-ingress-operator/pull/454 with and without us-gov-west-1 custom endpoints. Can you keep this cluster running until this PR merges? Since region is immutable, I was unable to test against the us-gov-east-1 region. Can you provide a us-gov-east-1 cluster so I can verify the PR works as expected in both regions?

Comment 36 Daneyon Hansen 2020-09-08 23:17:19 UTC

Yunfei,

https://github.com/openshift/cluster-ingress-operator/pull/454 merged, so please test when you have a moment and update this BZ with your findings.

Comment 38 Hongan Li 2020-09-09 08:40:49 UTC

Thanks @yunfei for having a cluster in a us-gov region ready.
Tested ingress related cases and all passed.
Will check the cluster with custom endpoints later. 

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-021653   True        False         3h25m   Cluster version is 4.6.0-0.nightly-2020-09-09-021653

spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws: {}
    type: AWS
status:
  <---snip--->
  platform: AWS
  platformStatus:
    aws:
      region: us-gov-west-1
    type: AWS

$ oc -n openshift-ingress-operator get dnsrecords/default-wildcard -o yaml
<---snip--->
spec:
  dnsName: '*.apps.yunjiang-debug5.qe.devcluster.openshift.com.'
  recordTTL: 30
  recordType: CNAME
  targets:
  - internal-ae1c79eb0900e41f6832a1f37c6fc8a1-1651558077.us-gov-west-1.elb.amazonaws.com
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-09-09T04:49:06Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed

Comment 40 Hongan Li 2020-09-09 10:06:21 UTC

Tested another cluster in a us-gov region with custom endpoints, and it passed as well, so moving to verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-021653   True        False         75m     Cluster version is 4.6.0-0.nightly-2020-09-09-021653

spec:
  cloudConfig:
    name: ""
  platformSpec:
    aws:
      serviceEndpoints:
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com
      - name: tagging
        url: https://tagging.us-gov-west-1.amazonaws.com
      - name: route53
        url: https://route53.us-gov.amazonaws.com
    type: AWS
status:
  <---snip--->
  platform: AWS
  platformStatus:
    aws:
      region: us-gov-west-1
      serviceEndpoints:
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.us-gov-west-1.amazonaws.com
      - name: route53
        url: https://route53.us-gov.amazonaws.com
      - name: tagging
        url: https://tagging.us-gov-west-1.amazonaws.com

spec:
  dnsName: '*.apps.yunjiang-debug7.qe.devcluster.openshift.com.'
  recordTTL: 30
  recordType: CNAME
  targets:
  - internal-ad61fd41e2ca545658d7a1c49de4782d-1800445331.us-gov-west-1.elb.amazonaws.com
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-09-09T08:21:51Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed

$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
<---snip--->
  - lastTransitionTime: "2020-09-09T08:22:26Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-09-09T08:28:53Z"
    status: "False"
    type: Degraded

Comment 44 Andrew McDermott 2020-09-10 11:56:21 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 47 Daneyon Hansen 2020-09-12 23:55:55 UTC

Changing back to POST due to https://github.com/openshift/cluster-ingress-operator/pull/457 (improved provider validation).

Comment 52 Daneyon Hansen 2020-09-16 16:41:02 UTC

Yunfei,

Thanks for testing. It appears that the installer is not properly handling the region requirements of the tagging client as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c50.

time="2020-09-16T09:39:14Z" level=info msg="Creating infrastructure resources..."
time="2020-09-16T09:39:14Z" level=debug msg="resolved AWS service tagging (us-gov-east-1) to \"https://tagging.us-gov-west-1.amazonaws.com\""
time="2020-09-16T09:39:14Z" level=debug msg="Tagging arn:aws-us-gov:ec2:us-gov-east-1:225746144451:subnet/subnet-063f1fb3fd3c529b2 with kubernetes.io/cluster/yunjiang-16i3-dnrfs: shared"
time="2020-09-16T09:39:14Z" level=debug msg="Tagging arn:aws-us-gov:ec2:us-gov-east-1:225746144451:subnet/subnet-0c959be9654a4867b with kubernetes.io/cluster/yunjiang-16i3-dnrfs: shared"
time="2020-09-16T09:39:14Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": InvalidSignatureException: Credential should be scoped to a valid region, not 'us-gov-east-1'. \n\tstatus code: 400, request id: 5ce26265-7cd9-4490-922b-f9c21543e351"

Although the region is "us-gov-east-1", the tagging client config should use "us-gov-west-1". Note: the route53 client has similar requirements (see https://bugzilla.redhat.com/show_bug.cgi?id=1866299#c50 for details). Reassigning to the installer team for further investigation.
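
The region rule stated here and in comment 2 can be sketched as below. This Python illustration assumes exactly what those two comments say: hosted-zone lookups through the tagging API are signed for us-east-1 in commercial regions and for us-gov-west-1 in GovCloud regions. The function name is hypothetical, and partitions not discussed in this bug are not handled.

```python
# Hypothetical sketch of the rule described in comments 2 and 52: the region
# used to sign tagging requests for Route 53 hosted-zone lookups can differ
# from the cluster's own region.
def hosted_zone_tagging_region(cluster_region):
    """Return the signing region for hosted-zone lookups via the tagging API."""
    if cluster_region.startswith("us-gov-"):
        # GovCloud: per comment 52, us-gov-west-1 even for us-gov-east-1 clusters.
        return "us-gov-west-1"
    # Commercial regions: per comment 2, route53 is tied to us-east-1.
    return "us-east-1"


for region in ("af-south-1", "us-gov-east-1", "us-gov-west-1"):
    print(region, "->", hosted_zone_tagging_region(region))
```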

Comment 53 Abhinav Dahiya 2020-09-16 17:25:02 UTC

The installer needs access to the resource tagging endpoint in the same region as the cluster to tag the subnets.

So this is invalid for us-gov-west-1

```
    serviceEndpoints:
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
    - name: tagging
      url: https://tagging.us-gov-west-1.amazonaws.com
    - name: route53
      url: https://route53.us-gov.amazonaws.com

```

I want to reiterate that GovCloud is not a custom region and does not require service endpoints. A user setting invalid service endpoints is an invalid configuration, not a bug.

Comment 54 Hongan Li 2020-09-17 06:01:09 UTC

So #Comment 34 and #Comment 42 are not true?

So for GovCloud, do we support only us-gov-east-1 and us-gov-west-1 without any custom service endpoints? Or is it just that the elb/tagging/route53 service endpoints are not required?

We already verified that it works in regions us-gov-east-1 and us-gov-west-1 without service endpoints. If this is the expected result, then we can move this bug to verified and file a new BZ to document that service endpoints are not required for GovCloud.

@Daneyon any thoughts?

Comment 58 Hongan Li 2020-09-24 02:40:24 UTC

Thanks Yunfei for launching a cluster in us-gov-east with serviceEndpoints. No issues found.

# oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.6.0-0.nightly-2020-09-23-022756   True        False         False      19h


  spec:
    cloudConfig:
      name: ""
    platformSpec:
      aws:
        serviceEndpoints:
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
        - name: tagging
          url: https://tagging.us-gov-east-1.amazonaws.com
        - name: route53
          url: https://route53.us-gov.amazonaws.com
      type: AWS
  status:
    <---snip--->
    platform: AWS
    platformStatus:
      aws:
        region: us-gov-east-1
        serviceEndpoints:
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.us-gov-east-1.amazonaws.com
        - name: route53
          url: https://route53.us-gov.amazonaws.com
        - name: tagging
          url: https://tagging.us-gov-east-1.amazonaws.com
      type: AWS

Comment 59 Hongan Li 2020-09-24 08:32:39 UTC

Tested in both us-gov-east and us-gov-west region and with/without serviceEndpoints, all four clusters are working well. 

Thank you Daneyon!

Comment 61 errata-xmlrpc 2020-10-27 16:24:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196