Bug 1666846 - [OCP 4.0]web console is not accessible: DNS record set problem?
Summary: [OCP 4.0]web console is not accessible: DNS record set problem?
Keywords:
Status: CLOSED DUPLICATE of bug 1666908
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-16 18:02 UTC by Hongkai Liu
Modified: 2022-08-04 22:20 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-24 12:50:40 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Hongkai Liu 2019-01-16 18:02:13 UTC
Description of problem:
Tried with several versions of the installer already.
After installation, visiting the console in a browser fails: the page cannot be opened.

The hostname cannot be resolved by DNS:
$ nslookup console-openshift-console.apps.hongkliu3.qe.devcluster.openshift.com
Server:		10.11.5.19
Address:	10.11.5.19#53

** server can't find console-openshift-console.apps.hongkliu3.qe.devcluster.openshift.com: NXDOMAIN
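
To rule out local resolver caching, the name could also be checked against the zone's authoritative name servers (a sketch; the NS host in the second command is a placeholder that must come from the first command's output):

$ dig +short NS qe.devcluster.openshift.com
$ dig @<NS-host-from-above> console-openshift-console.apps.hongkliu3.qe.devcluster.openshift.com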


$ ./openshift-install version
./openshift-install v0.10.0

$ oc get clusterversion
NAME      VERSION     AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.1   True        False         30m       Cluster version is 4.0.0-0.1

Not sure whether this is the cause or just related:
The Route 53 record count is mostly 7 or 8.
When it is 7 the wildcard record for the apps domain is missing; when it is 8 it is present.
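
For reference, the records can also be checked directly with the AWS CLI (a sketch; the dns-name below is a guess at the relevant hosted zone, and ZONE_ID is a placeholder for the id returned by the first command):

$ aws route53 list-hosted-zones-by-name --dns-name qe.devcluster.openshift.com
$ aws route53 list-resource-record-sets --hosted-zone-id <ZONE_ID> --query "ResourceRecordSets[?contains(Name, 'apps')]"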

Will upload logs and screenshots later.

Comment 2 Mike Fiedler 2019-01-16 20:55:46 UTC
Hit this as well. The first install with v0.10.0 was successful and routes worked OK. After running openshift-install destroy cluster and then re-installing, the wildcard DNS A record was not created; only the API server and etcd Route 53 records were created.

Comment 3 W. Trevor King 2019-01-17 07:29:29 UTC
For future reference, the best logs for debugging are in .openshift-install.log [1] (because they have timestamps too).  For this issue, that doesn't matter though.

The installer creates Kubernetes API and etcd records in Terraform as part of bringing up the cluster.  So failure to create those should be fatal, and Terraform will log some notes about what went wrong.

The installer does not create the console or other routes.  Those are created by the ingress operator [2], so checking its logs may shed light on the issue you're seeing here.  The installer does wait for the route to be created in Kubernetes, and it will log entries like:

  INFO Waiting up to 10m0s for the openshift-console route to be created...
  DEBUG Still waiting for the console route...
  DEBUG Still waiting for the console route: the server could not find the requested resource (get routes.route.openshift.io)
  DEBUG Still waiting for the console route...
  DEBUG Still waiting for the console route...
  DEBUG Still waiting for the console route...
  DEBUG Route found in openshift-console namespace: console
  DEBUG OpenShift console route is created
  INFO Install complete!

for that portion of the install.  But that's just waiting for the OpenShift route object; it's not checking whether the operator successfully published the corresponding record to Route 53.  There's an open issue about blocking until the console DNS entry resolves [3], but see the discussion in that issue about why that may not be desirable.

So to follow up on this, can you track down the ingress operator's log?  Grabbing a random CI run as an example, you should see something like:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1076/pull-ci-openshift-installer-master-e2e-aws/2976/artifacts/e2e-aws/pods/openshift-ingress-operator_ingress-operator-8476bffd8f-n747h_ingress-operator.log.gz | gunzip | grep -3 '\.apps\.'
  time="2019-01-17T05:13:48Z" level=info msg="updated router asset *v1.Deployment openshift-ingress/router-default"
  time="2019-01-17T05:13:48Z" level=info msg="using public zone /hostedzone/Z328TAU66YDJH7"
  time="2019-01-17T05:13:49Z" level=info msg="using private zone Z8ZM3Y24BCRRH"
  time="2019-01-17T05:13:49Z" level=info msg="updated DNS record in zone /hostedzone/Z328TAU66YDJH7, *.apps.ci-op-0nt07pvh-1d3f3.origin-ci-int-aws.dev.rhcloud.com -> aaa0d4c8e1a1611e9a52a1222ca93574-856474717.us-east-1.elb.amazonaws.com: {\n  ChangeInfo: {\n    Id: \"/change/C3RNOAZYEEDUBY\",\n    Status: \"PENDING\",\n    SubmittedAt: 2019-01-17 05:13:49.3 +0000 UTC\n  }\n}"
  time="2019-01-17T05:13:49Z" level=info msg="updated DNS record in zone Z8ZM3Y24BCRRH, *.apps.ci-op-0nt07pvh-1d3f3.origin-ci-int-aws.dev.rhcloud.com -> aaa0d4c8e1a1611e9a52a1222ca93574-856474717.us-east-1.elb.amazonaws.com: {\n  ChangeInfo: {\n    Id: \"/change/CPR65N8IDFB7D\",\n    Status: \"PENDING\",\n    SubmittedAt: 2019-01-17 05:13:49.383 +0000 UTC\n  }\n}"
  time="2019-01-17T05:13:49Z" level=info msg="reconciling request: reconcile.Request{NamespacedName:types.NamespacedName{Namespace:\"openshift-ingress\", Name:\"router-default\"}}"
  time="2019-01-17T05:13:49Z" level=info msg="updated router asset *v1.Namespace /openshift-ingress"
  time="2019-01-17T05:13:49Z" level=info msg="reconciling clusteringress v1alpha1.ClusterIngress{TypeMeta:v1.TypeMeta{Kind:\"ClusterIngress\", APIVersion:\"ingress.openshift.io/v1alpha1\"}, ObjectMeta:v1.ObjectMeta{Name:\"default\", GenerateName:\"\", Namespace:\"openshift-ingress-operator\", SelfLink:\"/apis/ingress.openshift.io/v1alpha1/namespaces/openshift-ingress-operator/clusteringresses/default\", UID:\"a929e0fa-1a16-11e9-a52a-1222ca935746\", ResourceVersion:\"6131\", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63683298824, loc:(*time.Location)(0x1fa28c0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string{\"ingress.openshift.io/default-cluster-ingress\"}, ClusterName:\"\"}, Spec:v1alpha1.ClusterIngressSpec{IngressDomain:(*string)(0xc420a077b0), NodePlacement:(*v1alpha1.NodePlacement)(0xc4207da6e0), DefaultCertificateSecret:(*string)(nil), NamespaceSelector:(*v1.LabelSelector)(nil), Replicas:1, RouteSelector:(*v1.LabelSelector)(nil), HighAvailability:(*v1alpha1.ClusterIngressHighAvailability)(0xc420a077a0), UnsupportedExtensions:(*[]string)(nil)}, Status:v1alpha1.ClusterIngressStatus{Replicas:0, Selector:\"app=router,router=router-default\"}}"

[1]: https://github.com/openshift/installer/blob/v0.10.0/docs/user/troubleshooting.md#installer-fails-to-create-resources
[2]: https://github.com/openshift/cluster-ingress-operator/
[3]: https://github.com/openshift/installer/issues/974

Comment 4 Hongkai Liu 2019-01-17 13:16:26 UTC
Thanks for the information.
Let me collect ingress operator logs today.

Comment 12 W. Trevor King 2019-01-18 01:14:05 UTC
This sounds a lot like https://github.com/openshift/installer/issues/1051 , in case folks make progress there first.

Comment 13 Hongkai Liu 2019-01-18 14:41:25 UTC
Update:

I tried to reproduce it this morning (with the same installer version, 0.10.0) and it turned out DNS is now working for the console.
Checked the Route 53 console too; all good.

Not sure what has changed in the background or whether something was done on the AWS account.

Also watching GitHub issue 1051.

Comment 14 Aleksandar Kostadinov 2019-01-18 14:49:11 UTC
I wonder if any Route 53 or LB issues can be found in the logs. I've very often seen the LB API rate limit being reached. Maybe installations failed at a time when the limits were exhausted.
If that is the case, then this probably needs to be made more obvious to the user.

Comment 15 W. Trevor King 2019-01-18 15:31:59 UTC
> Not sure what has been changed in the background or something is done on the aws account.

0.10.0 pins its update payload, so none of our source changed.  I'm 99% sure this is an AWS-context issue (high-population throttling, high-population paging, etc.) rather than frail ingress-operator code.

> I wonder if any route53 or lb issues can be found in logs.

That's what I'd expect.  See comment 3 for what a good run looks like in the logs.  If bad runs also contain "updated DNS record in zone...", and I've heard reports that they might, then we may need to improve the ingress logging before we can catch whatever's causing this bug.

> I've seen very often lb api rate limit reached. Maybe installations failed at a time with exhausted limits.  If this is the case, then probably this needs to be made more obvious to the user.

However this shakes out, this is a day-2 ingress issue that occurs after the installer has successfully launched the (pinned) update payload.  Making this error more obvious to the user would probably mean having the ingress operator report it to the cluster-version operator via its ClusterOperator object [1], which already exists:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1086/pull-ci-openshift-installer-master-e2e-aws/3013/artifacts/e2e-aws/clusteroperators.json | jq '.items[] | select(.metadata.name == "openshift-ingress-operator")'
  {
    "apiVersion": "config.openshift.io/v1",
    "kind": "ClusterOperator",
    "metadata": {
      "creationTimestamp": "2019-01-18T09:01:57Z",
      "generation": 1,
      "name": "openshift-ingress-operator",
      "namespace": "",
      "resourceVersion": "10545",
      "selfLink": "/apis/config.openshift.io/v1/clusteroperators/openshift-ingress-operator",
      "uid": "b5a67dfc-1aff-11e9-95f3-0ece53272666"
    },
    "spec": {},
    "status": {
      "conditions": [
        {
          "lastTransitionTime": "2019-01-18T09:01:58Z",
          "status": "False",
          "type": "Failing"
        },
        {
          "lastTransitionTime": "2019-01-18T09:01:58Z",
          "status": "False",
          "type": "Progressing"
        },
        {
          "lastTransitionTime": "2019-01-18T09:04:01Z",
          "status": "True",
          "type": "Available"
        }
      ],
      "extension": null,
      "version": "0.0.1"
    }
  }

My current impression is that when this breaks, the ingress operator is somehow not aware of the breakage.
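
On a live cluster, the same conditions could be checked directly; a sketch, assuming the ClusterOperator keeps the name shown in the CI example above:

  $ oc get clusteroperator openshift-ingress-operator -o json | jq '.status.conditions'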

[1]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md

Comment 16 W. Trevor King 2019-01-24 08:53:36 UTC
I'm moving this to the Routing component, because that's what's creating these wildcard records.

Comment 17 Dan Mace 2019-01-24 12:50:40 UTC
(In reply to Hongkai Liu from comment #5)
> Updated version and log files:
> http://file.rdu.redhat.com/~hongkliu/test_result/bz1666846/20190117.1/

This output indicates a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1666908 which is already resolved. The provided output also indicates a cluster running an outdated origin build which doesn't contain the fix.

Note that because the bug results in the cluster-ingress-operator spamming the AWS API in a hot loop, you can expect persistent API throttling that affects the cluster-ingress-operators in other clusters sharing the same account, and that can prevent even up-to-date versions from performing Route 53 API calls. So the cluster used to report this issue likely caused issues for other clusters and should be destroyed.
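
A quick way to check whether a given cluster is hitting that throttling (a sketch; the strings grepped for here are the usual AWS SDK throttling messages, not something the operator is guaranteed to log verbatim):

   oc logs -n openshift-ingress-operator deployments/ingress-operator | grep -iE 'throttl|rate exceeded'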

I'm going to mark this bug as a duplicate. If you suspect you've found some new issue, please provide supporting evidence in the form of the output of:

   oc logs -n openshift-ingress-operator deployments/ingress-operator

Thanks!

*** This bug has been marked as a duplicate of bug 1666908 ***

