Bug 1351645
Summary: | SkyDNS resolution intermittently fails when at least 1 master is down in an HA setup | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andy Goldstein <agoldste> |
Component: | Node | Assignee: | Andy Goldstein <agoldste> |
Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.1.0 | CC: | agoldste, akokshar, aos-bugs, bbennett, ccoleman, chezhang, danw, decarr, dma, erich, jkaur, jokerman, knakayam, marc.jadoul, mbarrett, misalunk, mmccomas, pep, rhowe, sdodson, steven, stwalter, twiest, whearn, xtian |
Target Milestone: | --- | ||
Target Release: | 3.2.1 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: In an HA environment with multiple masters, one or more of the masters goes down.
Consequence: DNS requests sent to the cluster nameserver running at kubernetes.default.svc.cluster.local can become slow, which can result in things such as builds taking significantly longer than usual if they perform several DNS lookups.
Fix: All the masters now coordinate to maintain an up-to-date list of endpoints for kubernetes.default.svc.cluster.local. If a master goes down, its endpoint is removed from the list; note that removal may take up to 20 seconds. When a master comes back up, its endpoint is reinserted into the list.
Result: DNS resolution returns to normal once the endpoints list is updated to remove the down master.
|
Story Points: | --- |
Clone Of: | 1300028 | Environment: | |
Last Closed: | 2016-07-20 19:37:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | 1300028 | ||
Bug Blocks: | 1303130, 1267746, 1286513 |
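The fix described in the Doc Text can be pictured as a lease-style reconciliation loop. The following is a toy model only, with assumed names (`EndpointReconciler`, `heartbeat`, `endpoints`) and not the actual OpenShift implementation; it just illustrates why a dead master's endpoint can linger for up to the ~20-second window before being pruned:

```python
class EndpointReconciler:
    """Toy model of lease-style endpoint reconciliation for
    kubernetes.default.svc.cluster.local -- NOT the actual OpenShift code.
    Each live master periodically re-registers ("heartbeats") its endpoint;
    endpoints whose heartbeat is older than the TTL (~20s per the fix notes)
    are pruned from the list."""

    def __init__(self, ttl=20.0):
        self.ttl = ttl
        self.heartbeats = {}  # master endpoint IP -> last heartbeat time

    def heartbeat(self, ip, now):
        # A live master refreshes its own entry on every reconciliation pass.
        self.heartbeats[ip] = now

    def endpoints(self, now):
        # Only masters that heartbeated within the TTL stay in the list.
        return sorted(ip for ip, t in self.heartbeats.items()
                      if now - t <= self.ttl)


r = EndpointReconciler()
r.heartbeat("10.0.0.1", now=0.0)
r.heartbeat("10.0.0.2", now=0.0)
print(r.endpoints(now=5.0))        # ['10.0.0.1', '10.0.0.2'] -- both alive
r.heartbeat("10.0.0.1", now=15.0)  # 10.0.0.2 is down and stops heartbeating
print(r.endpoints(now=25.0))       # ['10.0.0.1'] -- dead master pruned once its lease expires
```

The model shows the trade-off directly: DNS answers can briefly include a dead master, but only for at most one TTL window after it stops heartbeating.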
Comment 6
Zhang Cheng
2016-07-15 10:41:20 UTC
Could you please retest, and if it fails again, capture the information we need to debug:

- Where did you test (e.g. AWS, local, ...)?
- How many masters?
- How many nodes?
- What load balancer did you use? If haproxy, include its logs
- Logs from all atomic-openshift-master-api services
- Logs from all atomic-openshift-master-controller services
- Logs from all atomic-openshift-node services
- oc describe buildconfig/<name of build config>
- oc get events --all-namespaces
- Also, the master and node config files

Thanks!

@Andy Goldstein MTV2 is OpenStack, so you may not be able to access it. The attachment excludes master1-controller-service-log, which is about 93 MB and exceeds the attachment size limit. I will attach it to a mail and send it to you.

@Andy Goldstein Because master1-controller-service-log exceeds the email size limit, I put it on my Google Drive and shared it with you.

There are a few items to point out:

1) Until https://github.com/openshift/openshift-ansible/issues/1563 is resolved, you will have to manually configure /etc/origin/master/openshift-master.kubeconfig to point either to the load balancer or to kubernetes.default.svc.cluster.local. The controllers use this file to learn the URL and credentials for the masters, and out of the box it is not configured to talk to an HA endpoint. Given that the fix for this bug updates the endpoints for kubernetes.default.svc.cluster.local, I would recommend updating the config to point to that URL.

2) The controllers talk directly to etcd to attempt to acquire the lease to become the active controller. As long as the active controller can still talk to etcd, it remains active. If the active controller is configured to talk only to its colocated master, and not to the load balancer or the kubernetes service, it will happily continue being the active controller even after that master goes down.

3) As mentioned before, it may take 10 to 20 seconds before a now-dead master's endpoint is removed from the list of endpoints for the kubernetes service.

@Andy Goldstein Thanks for your clarification. Triggering deployments and builds succeeds after manually configuring /etc/origin/master/openshift-master.kubeconfig to point to the load balancer. I will mark the status as VERIFIED per the discussion above; https://github.com/openshift/openshift-ansible/issues/1563 tracks the remaining scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1466

*** Bug 1370610 has been marked as a duplicate of this bug. ***
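As a sketch of the manual workaround for openshift-master.kubeconfig described in item 1 above (all values here are illustrative assumptions; the cluster name, CA placeholder, and your existing context/user entries depend on the installation), the relevant cluster stanza would look roughly like:

```yaml
# Illustrative kubeconfig fragment only -- names and the CA placeholder are
# assumptions; keep your installation's existing credentials and contexts.
clusters:
- name: ha-masters
  cluster:
    certificate-authority-data: <existing base64 CA bundle>
    # Point at the clustered service (or the load balancer) instead of a
    # single master's URL, so the controllers survive that master going down:
    server: https://kubernetes.default.svc.cluster.local:443
```

The key change is only the `server:` line: any single-master URL there ties the controllers to one master, which is exactly the failure mode items 1 and 2 describe.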