Description of problem:

Tried with several installer versions already. After installation, visiting the console with a browser fails: the page cannot be opened because the hostname cannot be resolved by DNS.

$ nslookup console-openshift-console.apps.hongkliu3.qe.devcluster.openshift.com
Server:   10.11.5.19
Address:  10.11.5.19#53

** server can't find console-openshift-console.apps.hongkliu3.qe.devcluster.openshift.com: NXDOMAIN

$ ./openshift-install version
./openshift-install v0.10.0

$ oc get clusterversion
NAME      VERSION     AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.1   True        False         30m     Cluster version is 4.0.0-0.1

Not sure why it happens or whether it is related: the Route 53 record count for the hosted zone is mostly 7 or 8. The wildcard record for the apps domain is missing when the count is 7 and present when it is 8.

Will upload logs and screenshots later.
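For anyone checking the same thing, the presence or absence of the wildcard record can be confirmed directly in Route 53 with the AWS CLI. This is only a rough sketch; the zone ID below is a placeholder for whatever hosted zone your cluster's base domain lives in:

$ aws route53 list-hosted-zones-by-name --dns-name qe.devcluster.openshift.com
$ aws route53 list-resource-record-sets --hosted-zone-id ZEXAMPLE12345 | grep -A 3 'apps'

Note that Route 53 API output may show the wildcard name with the asterisk escaped as \052, e.g. \052.apps.hongkliu3.qe.devcluster.openshift.com.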
Hit this as well. The first install with v0.10.0 was successful and routes worked ok. After running openshift-install destroy cluster and then re-installing, the wildcard DNS A record was not created; only the API server and etcd Route 53 records were created.
For future reference, the best logs for debugging are in .openshift-install.log [1] (because they have timestamps too). For this issue that doesn't matter, though.

The installer creates the Kubernetes API and etcd records in Terraform as part of bringing up the cluster, so failure to create those should be fatal, and Terraform will log some notes about what went wrong. The installer does not create the console or other routes. Those are created by the ingress operator [2], so checking its logs may shed light on the issue you're seeing here.

The installer does wait for the route to be created in Kubernetes, and it will log entries like:

INFO Waiting up to 10m0s for the openshift-console route to be created...
DEBUG Still waiting for the console route...
DEBUG Still waiting for the console route: the server could not find the requested resource (get routes.route.openshift.io)
DEBUG Still waiting for the console route...
DEBUG Still waiting for the console route...
DEBUG Still waiting for the console route...
DEBUG Route found in openshift-console namespace: console
DEBUG OpenShift console route is created
INFO Install complete!

for that portion of the install. But that's just waiting for the OpenShift Route object; it does not check whether the operator successfully applied that object to Route 53. There's an open issue about blocking until the console DNS entry resolves [3], but see the discussion in that issue about why it may not be desirable.

So to follow up on this, can you track down the ingress operator's log? Grabbing a random CI run as an example, you should see something like:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1076/pull-ci-openshift-installer-master-e2e-aws/2976/artifacts/e2e-aws/pods/openshift-ingress-operator_ingress-operator-8476bffd8f-n747h_ingress-operator.log.gz | gunzip | grep -3 '\.apps\.'
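On a live cluster, the equivalent check (a sketch of the same grep, assuming the operator runs as the ingress-operator deployment in the openshift-ingress-operator namespace) would be:

$ oc logs -n openshift-ingress-operator deployments/ingress-operator | grep -3 '\.apps\.'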
time="2019-01-17T05:13:48Z" level=info msg="updated router asset *v1.Deployment openshift-ingress/router-default" time="2019-01-17T05:13:48Z" level=info msg="using public zone /hostedzone/Z328TAU66YDJH7" time="2019-01-17T05:13:49Z" level=info msg="using private zone Z8ZM3Y24BCRRH" time="2019-01-17T05:13:49Z" level=info msg="updated DNS record in zone /hostedzone/Z328TAU66YDJH7, *.apps.ci-op-0nt07pvh-1d3f3.origin-ci-int-aws.dev.rhcloud.com -> aaa0d4c8e1a1611e9a52a1222ca93574-856474717.us-east-1.elb.amazonaws.com: {\n ChangeInfo: {\n Id: \"/change/C3RNOAZYEEDUBY\",\n Status: \"PENDING\",\n SubmittedAt: 2019-01-17 05:13:49.3 +0000 UTC\n }\n}" time="2019-01-17T05:13:49Z" level=info msg="updated DNS record in zone Z8ZM3Y24BCRRH, *.apps.ci-op-0nt07pvh-1d3f3.origin-ci-int-aws.dev.rhcloud.com -> aaa0d4c8e1a1611e9a52a1222ca93574-856474717.us-east-1.elb.amazonaws.com: {\n ChangeInfo: {\n Id: \"/change/CPR65N8IDFB7D\",\n Status: \"PENDING\",\n SubmittedAt: 2019-01-17 05:13:49.383 +0000 UTC\n }\n}" time="2019-01-17T05:13:49Z" level=info msg="reconciling request: reconcile.Request{NamespacedName:types.NamespacedName{Namespace:\"openshift-ingress\", Name:\"router-default\"}}" time="2019-01-17T05:13:49Z" level=info msg="updated router asset *v1.Namespace /openshift-ingress" time="2019-01-17T05:13:49Z" level=info msg="reconciling clusteringress v1alpha1.ClusterIngress{TypeMeta:v1.TypeMeta{Kind:\"ClusterIngress\", APIVersion:\"ingress.openshift.io/v1alpha1\"}, ObjectMeta:v1.ObjectMeta{Name:\"default\", GenerateName:\"\", Namespace:\"openshift-ingress-operator\", SelfLink:\"/apis/ingress.openshift.io/v1alpha1/namespaces/openshift-ingress-operator/clusteringresses/default\", UID:\"a929e0fa-1a16-11e9-a52a-1222ca935746\", ResourceVersion:\"6131\", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63683298824, loc:(*time.Location)(0x1fa28c0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string{\"ingress.openshift.io/default-cluster-ingress\"}, ClusterName:\"\"}, Spec:v1alpha1.ClusterIngressSpec{IngressDomain:(*string)(0xc420a077b0), NodePlacement:(*v1alpha1.NodePlacement)(0xc4207da6e0), DefaultCertificateSecret:(*string)(nil), NamespaceSelector:(*v1.LabelSelector)(nil), Replicas:1, RouteSelector:(*v1.LabelSelector)(nil), HighAvailability:(*v1alpha1.ClusterIngressHighAvailability)(0xc420a077a0), UnsupportedExtensions:(*[]string)(nil)}, Status:v1alpha1.ClusterIngressStatus{Replicas:0, Selector:\"app=router,router=router-default\"}}" [1]: https://github.com/openshift/installer/blob/v0.10.0/docs/user/troubleshooting.md#installer-fails-to-create-resources [2]: https://github.com/openshift/cluster-ingress-operator/ [3]: https://github.com/openshift/installer/issues/974
Thanks for the information. Let me collect ingress operator logs today.
This sounds a lot like https://github.com/openshift/installer/issues/1051 , in case folks make progress there first.
Update: I tried to reproduce it this morning (with the same installer version, 0.10.0) and it turned out DNS is working for the console. I checked the Route 53 console too; all good. Not sure what has changed in the background or whether something was done on the AWS account. I'm also watching GitHub issue 1051.
I wonder if any Route 53 or load-balancer issues can be found in the logs. I've very often seen the LB API rate limit reached; maybe these installations failed at a time when the limits were exhausted. If that is the case, then it probably needs to be made more obvious to the user.
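A quick way to look for that in a given cluster (a sketch, assuming the AWS SDK's usual throttling error text surfaces in the operator log):

$ oc logs -n openshift-ingress-operator deployments/ingress-operator | grep -i -E 'throttl|rate exceeded'

If that turns up hits around the time of a failed install, rate limiting is a likely culprit.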
> Not sure what has changed in the background or whether something was done on the AWS account.

0.10.0 pins its update payload, so none of our source changed. I'm 99% sure this is an AWS-context issue (high-population throttling, high-population paging, etc.) rather than frail ingress-operator code.

> I wonder if any Route 53 or load-balancer issues can be found in the logs.

That's what I'd expect. See comment 3 for what a good run looks like in the logs. If bad runs also contain "updated DNS record in zone...", and I've heard reports that they might, then we may need to improve the ingress logging before we can catch whatever's causing this bug.

> I've very often seen the LB API rate limit reached; maybe these installations failed at a time when the limits were exhausted. If that is the case, then it probably needs to be made more obvious to the user.

However this shakes out, this is a day-2 ingress issue that occurs after the installer has successfully launched the (pinned) update payload. Making this error more obvious to the user would probably mean having the ingress operator report it to the cluster-version operator via its ClusterOperator object [1], which already exists:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1086/pull-ci-openshift-installer-master-e2e-aws/3013/artifacts/e2e-aws/clusteroperators.json | jq '.items[] | select(.metadata.name == "openshift-ingress-operator")'
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-01-18T09:01:57Z",
    "generation": 1,
    "name": "openshift-ingress-operator",
    "namespace": "",
    "resourceVersion": "10545",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/openshift-ingress-operator",
    "uid": "b5a67dfc-1aff-11e9-95f3-0ece53272666"
  },
  "spec": {},
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2019-01-18T09:01:58Z",
        "status": "False",
        "type": "Failing"
      },
      {
        "lastTransitionTime": "2019-01-18T09:01:58Z",
        "status": "False",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2019-01-18T09:04:01Z",
        "status": "True",
        "type": "Available"
      }
    ],
    "extension": null,
    "version": "0.0.1"
  }
}

My current impression is that when this breaks, the ingress operator is somehow not aware of the breakage.

[1]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md
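For what it's worth, the same conditions can also be pulled from a running cluster directly (a sketch; the ClusterOperator is named openshift-ingress-operator in the CI artifacts above, but the name could differ between builds):

$ oc get clusteroperator openshift-ingress-operator -o json | jq '.status.conditions'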
I'm moving this to the Routing component, because that's what's creating these wildcard records.
(In reply to Hongkai Liu from comment #5)
> Updated version and log files:
> http://file.rdu.redhat.com/~hongkliu/test_result/bz1666846/20190117.1/

This output indicates a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1666908, which is already resolved. The provided output also indicates a cluster running an outdated origin build which doesn't contain the fix.

Note that because the bug results in the cluster-ingress-operator spamming the AWS API in a hot loop, you can expect persistent API throttling which affects the cluster-ingress-operators in other clusters sharing the same account, and which can prevent even up-to-date versions from performing Route 53 API calls. So the cluster used to report this issue likely caused issues for other clusters and should be destroyed.

I'm going to mark this bug as a duplicate. If you suspect you've found some new issue, please provide supporting evidence in the form of the output of:

oc logs -n openshift-ingress-operator deployments/ingress-operator

Thanks!

*** This bug has been marked as a duplicate of bug 1666908 ***