Description of problem:
After https://github.com/openshift/openshift-ansible/pull/8345 was merged, the master cannot start because DNS lookups for the etcd hostname fail or time out.

Version-Release number of the following components:
openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch

# openshift version
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Always

Steps to Reproduce:
1. Trigger an installation.

Actual results:
The installer failed at the following step:

TASK [openshift_node_group : Apply the config] *********************************
Tuesday 15 May 2018 20:27:51 -0400 (0:00:11.231) 0:16:57.854 ***********
fatal: [qe-smoke310-master-etcd-1.0515-clo.qe.rhcloud.com]: FAILED! => {"changed": true, "cmd": "/usr/local/bin/oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-7ikVHY", "delta": "0:01:38.869169", "end": "2018-05-16 00:29:53.149868", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-05-16 00:28:14.280699", "stderr": "Unable to connect to the server: unexpected EOF\nUnable to connect to the server: net/http: TLS handshake timeout", "stderr_lines": ["Unable to connect to the server: unexpected EOF", "Unable to connect to the server: net/http: TLS handshake timeout"], "stdout": "imagestreamtag.image.openshift.io \"node:v3.10\" created\ndaemonset.apps \"sync\" created", "stdout_lines": ["imagestreamtag.image.openshift.io \"node:v3.10\" created", "daemonset.apps \"sync\" created"]}

On the master, the master API container is restarted again and again.

# docker ps -a|grep api
e92eea3e42d4 35628f394eff "/bin/bash -c '#!/..." About a minute ago Up About a minute k8s_api_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_98
04f0f4c9ec78 35628f394eff "/bin/bash -c '#!/..." 7 minutes ago Exited (255) 6 minutes ago k8s_api_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_97
5b8c78f6121d registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.46.0 "/usr/bin/pod" 6 hours ago Up 6 hours k8s_POD_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_0

# cat /etc/origin/master/master-config.yaml
<--snip-->
dnsConfig:
  bindAddress: 0.0.0.0:8053
  bindNetwork: tcp4
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://qe-smoke310-master-etcd-1:2379
etcdStorageConfig:
  kubernetesStoragePrefix: kubernetes.io
  kubernetesStorageVersion: v1
  openShiftStoragePrefix: openshift.io
  openShiftStorageVersion: v1
<--snip-->

# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local c.openshift-gce-devel.internal google.internal
nameserver 10.240.0.2    -> node itself where dnsmasq is running

# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1460
        inet 10.240.0.2 netmask 255.255.255.255 broadcast 10.240.0.2
        inet6 fe80::4001:aff:fef0:2 prefixlen 64 scopeid 0x20<link>
        ether 42:01:0a:f0:00:02 txqueuelen 1000 (Ethernet)
        RX packets 100417 bytes 716004345 (682.8 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 72747 bytes 11720760 (11.1 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

# cat /etc/dnsmasq.d/*
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config
server=169.254.169.254    -> gce instance's upstream DNS

# host -a qe-smoke310-master-etcd-1
Trying "qe-smoke310-master-etcd-1.cluster.local"
;; connection timed out; no servers could be reached

# host -a qe-smoke310-master-etcd-1 169.254.169.254
Trying "qe-smoke310-master-etcd-1.cluster.local"
Trying "qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal"
Using domain server:
Name: 169.254.169.254
Address: 169.254.169.254#53
Aliases:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61840
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal. IN ANY

;; ANSWER SECTION:
qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal. 30 IN A 10.240.0.2

Received 90 bytes from 169.254.169.254#53 in 16 ms

# time ping -c 1 qe-smoke310-master-etcd-1
PING qe-smoke310-master-etcd-1 (10.240.0.2) 56(84) bytes of data.
64 bytes from 10.240.0.2: icmp_seq=1 ttl=64 time=0.025 ms

--- qe-smoke310-master-etcd-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.025/0.025/0.025/0.000 ms

real 0m20.010s
user 0m0.001s
sys 0m0.004s

After commenting out the following lines in /etc/dnsmasq.d/node-dnsmasq.conf and trying again, the name resolves very quickly and the master starts successfully.

#server=/in-addr.arpa/127.0.0.1
#server=/cluster.local/127.0.0.1

# time ping -c 1 qe-smoke310-master-etcd-1
PING qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal (10.240.0.2) 56(84) bytes of data.
64 bytes from qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal (10.240.0.2): icmp_seq=1 ttl=64 time=0.023 ms

--- qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms

real 0m0.022s
user 0m0.002s
sys 0m0.002s

Expected results:
Installation passes.

Additional info:
This issue does not happen with openshift-ansible-3.10.0-0.41.0; the related PR is https://github.com/openshift/openshift-ansible/pull/8345. I understand that "server=/cluster.local/127.0.0.1" was added for cluster-internal DNS resolution. Surprisingly, on a cluster installed with openshift-ansible-3.10.0-0.41.0 there is no "server=/cluster.local/127.0.0.1" setting for dnsmasq, yet cluster-internal names still resolve. I do not know how this happens, so it seems the line is no longer needed.

# host docker-registry.default.svc
docker-registry.default.svc.cluster.local has address 172.30.210.50

# oc get svc
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
docker-registry   ClusterIP   172.30.210.50   <none>        5000/TCP   5h
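One plausible reading of the 20-second lookup in the description, sketched below in Python: the resolver expands the short hostname with the first search domain (cluster.local), and dnsmasq routes that name to 127.0.0.1 because of the domain-specific server line. If nothing answers there, the query has to time out before the next search domain is tried. The function names (parse_servers, search_candidates, pick_upstream) are hypothetical helpers written for this illustration, not part of dnsmasq or openshift-ansible, and the parser only handles the single-domain `server=/domain/address` form seen in this report.

```python
def parse_servers(config_text):
    """Split dnsmasq-style config text into domain routes and default upstreams."""
    routes = {}    # domain suffix -> upstream address
    defaults = []  # upstreams used when no domain-specific route matches
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # Form: server=/<domain>/<address> (single domain only in this sketch)
            _, domain, address = value.split("/", 2)
            routes[domain.lower()] = address
        else:
            defaults.append(value)
    return routes, defaults


def search_candidates(short_name, search_domains):
    """Names tried for a dotless hostname: each search domain in order."""
    return [f"{short_name}.{domain}" for domain in search_domains]


def pick_upstream(name, routes, defaults):
    """Choose the upstream for a name; the most specific domain route wins."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])  # longest suffix first
        if suffix in routes:
            return routes[suffix]
    return defaults[0] if defaults else None


# Config and search list copied from the report.
CONFIG = """
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
server=169.254.169.254
"""
SEARCH = ["cluster.local", "c.openshift-gce-devel.internal", "google.internal"]

routes, defaults = parse_servers(CONFIG)
for fqdn in search_candidates("qe-smoke310-master-etcd-1", SEARCH):
    print(fqdn, "->", pick_upstream(fqdn, routes, defaults))
```

With the two domain-specific lines commented out, parse_servers returns no routes, so every candidate goes straight to 169.254.169.254, which matches the fast lookup shown after the workaround.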
This blocks every cluster whose etcd URL uses a hostname that is resolved by the internal DNS server.
Reverted the change that's suspected to have broken this. https://github.com/openshift/openshift-ansible/pull/8409
@Scott, based on the testing in my initial report, do you know why cluster-internal DNS names can still be resolved even without the "server=/cluster.local/127.0.0.1" setting?
The node sends a message to dnsmasq via dbus when it starts and when it stops, dynamically changing the config. While the node is running, it forwards requests to the dnsRecursiveResolvConf defined in node-config.yaml.

Is the node not resolving reverse lookups properly?

# dig @127.0.0.1 -x 8.8.8.8
;; ANSWER SECTION:
8.8.8.8.in-addr.arpa. 53 IN PTR google-public-dns-a.google.com.
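A toy model of the behaviour described in this comment, in Python, may help explain why cluster.local names resolve without a static config line: routes pushed at runtime sit on top of whatever is in /etc/dnsmasq.d. The class and method names are hypothetical, no real dbus call is made, and the exact entries the node sends are an assumption borrowed from the two static lines quoted in the description.

```python
class DnsmasqModel:
    """Illustrative model only: static file config plus runtime-pushed routes."""

    def __init__(self, static_routes):
        self.static_routes = dict(static_routes)  # from /etc/dnsmasq.d/*
        self.dynamic_routes = {}                  # pushed at runtime

    def set_domain_servers(self, entries):
        """Replace the runtime routes; entries look like '/cluster.local/127.0.0.1'."""
        self.dynamic_routes = {}
        for entry in entries:
            _, domain, address = entry.split("/", 2)
            self.dynamic_routes[domain] = address

    def effective_routes(self):
        """Static routes overlaid with whatever was pushed at runtime."""
        merged = dict(self.static_routes)
        merged.update(self.dynamic_routes)
        return merged


# No static domain routes, as on the 3.10.0-0.41.0 cluster in the report.
dnsmasq = DnsmasqModel(static_routes={})

# Assumed payload sent when the node starts (hypothetical values).
dnsmasq.set_domain_servers(["/in-addr.arpa/127.0.0.1", "/cluster.local/127.0.0.1"])
print("node running:", dnsmasq.effective_routes())

# Assumed payload when the node stops: the runtime routes are cleared.
dnsmasq.set_domain_servers([])
print("node stopped:", dnsmasq.effective_routes())
```

Under these assumptions, a static `server=/cluster.local/127.0.0.1` line differs from the runtime route in one way: it stays in place even when nothing is listening on 127.0.0.1.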
@QA Contact, PR 8409 has been merged into 3.10.0-0.50.0. Please check the bug.
Verified this bug with openshift-ansible-3.10.0-0.50.0.git.0.bd68ade.el7.noarch: PASS.

Ran a system container install with AH on GCE; the installation completed successfully.

[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local c.openshift-gce-devel.internal google.internal
nameserver 10.240.0.48

# oc get nodes
NAME                                  STATUS   ROLES     AGE   VERSION
qe-jialiu310-master-etcd-1            Ready    master    3h    v1.10.0+b81c8f8
qe-jialiu310-node-registry-router-1   Ready    compute   3h    v1.10.0+b81c8f8

[root@qe-jialiu310-master-etcd-1 ~]# time ping -c 1 qe-jialiu310-node-registry-router-1
PING qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal (10.240.0.49) 56(84) bytes of data.
64 bytes from qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal (10.240.0.49): icmp_seq=1 ttl=64 time=1.26 ms

--- qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.263/1.263/1.263/0.000 ms

real 0m0.083s
user 0m0.002s
sys 0m0.005s

[root@qe-jialiu310-master-etcd-1 ~]# ls /etc/dnsmasq.d/
origin-dns.conf origin-upstream-dns.conf

[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/dnsmasq.d/*
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config
server=169.254.169.254

[root@qe-jialiu310-master-etcd-1 ~]# oc logs nodejs-mongodb-example-1-build -n install-test
<--snip-->
Pushing image docker-registry.default.svc:5000/install-test/nodejs-mongodb-example:latest ...
Pushed 5/6 layers, 91% complete
Pushed 6/6 layers, 100% complete
Push successful

[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.5
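The `time ping -c 1` check used in this bug mixes name resolution with the ICMP round trip. A minimal Python sketch that times only the lookup, using the standard library resolver call, could look like the following. The helper name timed_lookup is hypothetical, and the hostname in the example is simply the one from the verification run above.

```python
import socket
import time


def timed_lookup(hostname):
    """Resolve hostname via the system resolver; return (addresses, seconds)."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []  # resolution failed or timed out
    elapsed = time.perf_counter() - start
    return addresses, elapsed


if __name__ == "__main__":
    addresses, elapsed = timed_lookup("qe-jialiu310-node-registry-router-1")
    print(f"addresses={addresses} elapsed={elapsed:.3f}s")
```

An elapsed time of many seconds for a short hostname would point at the resolver path, as in the original report, rather than at the network between the hosts.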
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816