1578720 – master is restart again and again due to etcd dns resolve become unavailable or timeout


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	3.10.0
Assignee:	Scott Dodson
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-16 09:22 UTC by Johnny Liu
Modified:	2018-07-30 19:15 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2018-07-30 19:15:30 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1816	0	None	None	None	2018-07-30 19:15:54 UTC

Description Johnny Liu 2018-05-16 09:22:33 UTC

Description of problem:
After https://github.com/openshift/openshift-ansible/pull/8345 was merged, the master cannot start because DNS resolution of the etcd hostname fails or times out.

Version-Release number of the following components:
openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch
# openshift version
openshift v3.10.0-0.46.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16


How reproducible:
Always

Steps to Reproduce:
1. Trigger an installation.

Actual results:
Installer failed at the following step:
TASK [openshift_node_group : Apply the config] *********************************
Tuesday 15 May 2018  20:27:51 -0400 (0:00:11.231)       0:16:57.854 *********** 
fatal: [qe-smoke310-master-etcd-1.0515-clo.qe.rhcloud.com]: FAILED! => {"changed": true, "cmd": "/usr/local/bin/oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-7ikVHY", "delta": "0:01:38.869169", "end": "2018-05-16 00:29:53.149868", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-05-16 00:28:14.280699", "stderr": "Unable to connect to the server: unexpected EOF\nUnable to connect to the server: net/http: TLS handshake timeout", "stderr_lines": ["Unable to connect to the server: unexpected EOF", "Unable to connect to the server: net/http: TLS handshake timeout"], "stdout": "imagestreamtag.image.openshift.io \"node:v3.10\" created\ndaemonset.apps \"sync\" created", "stdout_lines": ["imagestreamtag.image.openshift.io \"node:v3.10\" created", "daemonset.apps \"sync\" created"]}

On the master, the master API container is restarted again and again.
# docker ps -a|grep api
e92eea3e42d4        35628f394eff                                                                                                    "/bin/bash -c '#!/..."   About a minute ago   Up About a minute                                k8s_api_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_98
04f0f4c9ec78        35628f394eff                                                                                                    "/bin/bash -c '#!/..."   7 minutes ago        Exited (255) 6 minutes ago                       k8s_api_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_97
5b8c78f6121d        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.46.0                                            "/usr/bin/pod"           6 hours ago          Up 6 hours                                       k8s_POD_master-api-qe-smoke310-master-etcd-1_kube-system_6a2accb687dabf002706f2f7d1d15266_0
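
The trailing number in the container names above (_97, _98) looks like the restart attempt counter. A minimal sketch for pulling it out, assuming the kubelet names Docker containers as k8s_<container>_<pod>_<namespace>_<pod-uid>_<attempt>; the function name is hypothetical and the naming scheme is an assumption, not something this report states.

```python
def parse_k8s_container_name(name):
    """Split a kubelet-style Docker container name into its fields.

    Assumes the layout k8s_<container>_<pod>_<namespace>_<uid>_<attempt>.
    Returns None when the name does not fit that layout.
    """
    parts = name.split("_")
    if len(parts) != 6 or parts[0] != "k8s" or not parts[5].isdigit():
        return None
    return {
        "container": parts[1],
        "pod": parts[2],
        "namespace": parts[3],
        "uid": parts[4],
        "attempt": int(parts[5]),
    }


info = parse_k8s_container_name(
    "k8s_api_master-api-qe-smoke310-master-etcd-1_kube-system_"
    "6a2accb687dabf002706f2f7d1d15266_98"
)
print(info["container"], info["attempt"])
```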


# cat /etc/origin/master/master-config.yaml
<--snp-->
dnsConfig:
  bindAddress: 0.0.0.0:8053
  bindNetwork: tcp4
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://qe-smoke310-master-etcd-1:2379
etcdStorageConfig:
  kubernetesStoragePrefix: kubernetes.io
  kubernetesStorageVersion: v1
  openShiftStoragePrefix: openshift.io
  openShiftStorageVersion: v1
<--snp-->
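
The etcd URL above uses a short hostname, so the master depends on the node's resolver to reach etcd. A minimal sketch for listing the hostnames that must resolve, given the values under etcdClientInfo.urls; the helper name is hypothetical.

```python
from urllib.parse import urlparse


def etcd_hosts(urls):
    """Return (hostname, port) pairs for each etcd client URL."""
    hosts = []
    for url in urls:
        parsed = urlparse(url)
        hosts.append((parsed.hostname, parsed.port))
    return hosts


print(etcd_hosts(["https://qe-smoke310-master-etcd-1:2379"]))
```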

# cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local c.openshift-gce-devel.internal google.internal
nameserver 10.240.0.2   -> node itself where dnsmasq is running

# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
        inet 10.240.0.2  netmask 255.255.255.255  broadcast 10.240.0.2
        inet6 fe80::4001:aff:fef0:2  prefixlen 64  scopeid 0x20<link>
        ether 42:01:0a:f0:00:02  txqueuelen 1000  (Ethernet)
        RX packets 100417  bytes 716004345 (682.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 72747  bytes 11720760 (11.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# cat /etc/dnsmasq.d/*
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config
server=169.254.169.254    -> gce instance's upstream DNS
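
A simplified model of how the server= lines above route queries, assuming dnsmasq sends a name to the upstream of the most specific matching domain and falls back to the plain server= entry otherwise. The function names are hypothetical, only the single-domain form server=/domain/upstream is handled, and this does not prove the root cause; it only shows which names would be forwarded to 127.0.0.1 under this configuration.

```python
def parse_servers(config_text):
    """Collect server= directives as (domain, upstream); '' means default."""
    rules = []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # Form: server=/domain/upstream
            _, domain, upstream = value.split("/", 2)
            rules.append((domain.lower(), upstream))
        else:
            rules.append(("", value))
    return rules


def pick_upstream(name, rules):
    """Pick the upstream whose domain is the longest suffix match of name."""
    name = name.rstrip(".").lower()
    best = None
    for domain, upstream in rules:
        matches = domain == "" or name == domain or name.endswith("." + domain)
        if matches and (best is None or len(domain) > len(best[0])):
            best = (domain, upstream)
    return best[1] if best else None


CONFIG = """\
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
server=169.254.169.254
"""

rules = parse_servers(CONFIG)
for query in (
    "qe-smoke310-master-etcd-1.cluster.local",
    "qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal",
):
    print(query, "->", pick_upstream(query, rules))
```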

# host -a qe-smoke310-master-etcd-1
Trying "qe-smoke310-master-etcd-1.cluster.local"
;; connection timed out; no servers could be reached

# host -a qe-smoke310-master-etcd-1 169.254.169.254
Trying "qe-smoke310-master-etcd-1.cluster.local"
Trying "qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal"
Using domain server:
Name: 169.254.169.254
Address: 169.254.169.254#53
Aliases: 

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61840
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal. IN ANY

;; ANSWER SECTION:
qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal. 30 IN	A 10.240.0.2

Received 90 bytes from 169.254.169.254#53 in 16 ms
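
The "Trying ..." lines above follow the search list in /etc/resolv.conf. A minimal sketch of that expansion, assuming the usual resolver rule that a name with fewer dots than ndots (default 1) is tried with each search domain first and as-is last; the function name is hypothetical and real resolvers have more corner cases.

```python
def candidate_names(name, search, ndots=1):
    """Return the query names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified, no expansion
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # try the name as-is first
    return expanded + [name]      # try the search domains first


SEARCH = ["cluster.local", "c.openshift-gce-devel.internal", "google.internal"]
for candidate in candidate_names("qe-smoke310-master-etcd-1", SEARCH):
    print(candidate)
```

With this search list the first candidate for the short etcd hostname ends in cluster.local, which matches the first "Trying" line in both host runs.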

# time ping -c 1 qe-smoke310-master-etcd-1
PING qe-smoke310-master-etcd-1 (10.240.0.2) 56(84) bytes of data.
64 bytes from 10.240.0.2: icmp_seq=1 ttl=64 time=0.025 ms

--- qe-smoke310-master-etcd-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.025/0.025/0.025/0.000 ms

real	0m20.010s
user	0m0.001s
sys	0m0.004s
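
The 20 seconds of wall-clock time above is spent before the first ping packet is sent, so it can be measured without ping. A minimal sketch that times a single name lookup through the system resolver; the function name is hypothetical, and the resolve parameter exists only so the lookup can be swapped out when testing.

```python
import socket
import time


def time_resolution(hostname, resolve=socket.gethostbyname):
    """Resolve hostname once; return (address or None, seconds elapsed)."""
    start = time.monotonic()
    try:
        address = resolve(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError; treat as "not resolved".
        address = None
    elapsed = time.monotonic() - start
    return address, elapsed


if __name__ == "__main__":
    addr, seconds = time_resolution("qe-smoke310-master-etcd-1")
    print(f"address={addr} elapsed={seconds:.3f}s")
```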

After commenting out the following lines in /etc/dnsmasq.d/node-dnsmasq.conf and trying again, the name resolves very quickly and the master starts successfully.

#server=/in-addr.arpa/127.0.0.1
#server=/cluster.local/127.0.0.1

# time ping -c 1 qe-smoke310-master-etcd-1
PING qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal (10.240.0.2) 56(84) bytes of data.
64 bytes from qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal (10.240.0.2): icmp_seq=1 ttl=64 time=0.023 ms

--- qe-smoke310-master-etcd-1.c.openshift-gce-devel.internal ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms

real	0m0.022s
user	0m0.002s
sys	0m0.002s
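
The workaround can be expressed as a text transformation on the dnsmasq config. A minimal sketch, assuming only the server=/domain/127.0.0.1 lines should be commented out; the function name is hypothetical, and it works on a string rather than editing /etc/dnsmasq.d/node-dnsmasq.conf in place.

```python
def comment_out_local_servers(config_text, upstream="127.0.0.1"):
    """Prefix '#' to every server=/domain/<upstream> line."""
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("server=/") and stripped.endswith("/" + upstream):
            out.append("#" + line)
        else:
            out.append(line)
    trailing = "\n" if config_text.endswith("\n") else ""
    return "\n".join(out) + trailing


before = (
    "server=/in-addr.arpa/127.0.0.1\n"
    "server=/cluster.local/127.0.0.1\n"
    "no-resolv\n"
)
print(comment_out_local_servers(before), end="")
```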


Expected results:
Installation succeeds.

Additional info:
This issue does not happen with openshift-ansible-3.10.0-0.41.0; the related PR is https://github.com/openshift/openshift-ansible/pull/8345.

I understand that "server=/cluster.local/127.0.0.1" was added for cluster-internal DNS resolution.

Surprisingly, on a cluster installed with openshift-ansible-3.10.0-0.41.0, dnsmasq has no "server=/cluster.local/127.0.0.1" setting, yet cluster-internal names still resolve. I do not know how this works, but it suggests the line is no longer needed.

# host docker-registry.default.svc
docker-registry.default.svc.cluster.local has address 172.30.210.50

# oc get svc
NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
docker-registry    ClusterIP   172.30.210.50    <none>        5000/TCP                  5h

Comment 1 Johnny Liu 2018-05-16 09:24:03 UTC

This blocks every cluster whose etcd URL uses a hostname resolved by the internal DNS server.

Comment 2 Scott Dodson 2018-05-17 13:08:53 UTC

Reverted the change that's suspected to have broken this.

https://github.com/openshift/openshift-ansible/pull/8409

Comment 3 Johnny Liu 2018-05-18 02:39:34 UTC

@Scott, based on the testing in my initial report, do you know why cluster-internal DNS names still resolve even without the "server=/cluster.local/127.0.0.1" setting?

Comment 4 Scott Dodson 2018-05-18 17:44:32 UTC

The node sends a message to dnsmasq via dbus when it starts and when it stops, dynamically changing the config. While the node is running, it forwards requests to the dnsRecursiveResolvConf defined in node-config.yaml.

Is the node not resolving reverse lookups properly?
# dig @127.0.0.1 -x 8.8.8.8
;; ANSWER SECTION:
8.8.8.8.in-addr.arpa.   53      IN      PTR     google-public-dns-a.google.com.
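
For reference, dig -x turns the address into a PTR query name under in-addr.arpa, which is the domain the reverted "server=/in-addr.arpa/127.0.0.1" line matched. A minimal sketch of that name construction using the Python standard library; the wrapper name is hypothetical.

```python
import ipaddress


def reverse_name(ip):
    """Return the PTR query name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(ip).reverse_pointer


print(reverse_name("8.8.8.8"))     # 8.8.8.8.in-addr.arpa
print(reverse_name("10.240.0.2"))  # 2.0.240.10.in-addr.arpa
```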

Comment 5 Wei Sun 2018-05-22 01:20:30 UTC

@QA Contact, PR 8409 has been merged into 3.10.0-0.50.0; please check the bug.

Comment 6 Johnny Liu 2018-05-22 06:47:18 UTC

Verified this bug with openshift-ansible-3.10.0-0.50.0.git.0.bd68ade.el7.noarch: PASS.

Ran a system container install with AH on GCE; the installation completed successfully.


[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local c.openshift-gce-devel.internal google.internal
nameserver 10.240.0.48

# oc get nodes
NAME                                  STATUS    ROLES     AGE       VERSION
qe-jialiu310-master-etcd-1            Ready     master    3h        v1.10.0+b81c8f8
qe-jialiu310-node-registry-router-1   Ready     compute   3h        v1.10.0+b81c8f8

[root@qe-jialiu310-master-etcd-1 ~]# time ping -c 1 qe-jialiu310-node-registry-router-1
PING qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal (10.240.0.49) 56(84) bytes of data.
64 bytes from qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal (10.240.0.49): icmp_seq=1 ttl=64 time=1.26 ms

--- qe-jialiu310-node-registry-router-1.c.openshift-gce-devel.internal ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.263/1.263/1.263/0.000 ms

real	0m0.083s
user	0m0.002s
sys	0m0.005s

[root@qe-jialiu310-master-etcd-1 ~]# ls /etc/dnsmasq.d/
origin-dns.conf  origin-upstream-dns.conf
[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/dnsmasq.d/*
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config
server=169.254.169.254

[root@qe-jialiu310-master-etcd-1 ~]# oc logs nodejs-mongodb-example-1-build -n install-test
<--snip-->
Pushing image docker-registry.default.svc:5000/install-test/nodejs-mongodb-example:latest ...
Pushed 5/6 layers, 91% complete
Pushed 6/6 layers, 100% complete
Push successful


[root@qe-jialiu310-master-etcd-1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Atomic Host release 7.5

Comment 8 errata-xmlrpc 2018-07-30 19:15:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
