1574677 – Liveness probes on etcd static pod does not align with etcd.conf configuration

Summary: Liveness probes on etcd static pod does not align with etcd.conf configuration

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.10.0
Assignee:	Vadim Rutkovsky
QA Contact:	Gan Huang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-03 20:15 UTC by Scott Dodson
Modified:	2018-07-30 19:14 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-07-30 19:14:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1816	0	None	None	None	2018-07-30 19:14:59 UTC

Description Scott Dodson 2018-05-03 20:15:46 UTC

Description of problem:
On a host with multiple interfaces, /etc/etcd/etcd.conf was configured to use the address of eth1, but the liveness probe is hitting the address of eth0. We need to make sure that /etc/etcd/etcd.conf is configured to match the address that would be used for the liveness probe.

Version-Release number of the following components:
master branch / 3.10

How reproducible:
unknown

Steps to Reproduce:
1. Provision host with two interfaces for use as a master
2. Install OpenShift
3. 

Actual results:
etcd static pod is killed due to liveness probe failures

Expected results:
etcd static pod runs successfully

Additional info:
Network config
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 0.0.0.0
        ether 02:42:23:ed:77:2d  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.121.138  netmask 255.255.255.0  broadcast 192.168.121.255
        inet6 fe80::5054:ff:fea3:a6e2  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:a3:a6:e2  txqueuelen 1000  (Ethernet)
        RX packets 170661  bytes 1001384775 (954.9 MiB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 138811  bytes 10348076 (9.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.120.4  netmask 255.255.255.0  broadcast 192.168.120.255
        inet6 fe80::5054:ff:fede:66fb  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:de:66:fb  txqueuelen 1000  (Ethernet)
        RX packets 1573  bytes 83138 (81.1 KiB)
        RX errors 0  dropped 292  overruns 0  frame 0
        TX packets 24  bytes 2812 (2.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 263009  bytes 116104090 (110.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 263009  bytes 116104090 (110.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Node logs showing liveness probe failure

May 03 19:48:22 localhost.localdomain atomic-openshift-node[3463]: I0503 19:48:22.785710    3463 prober.go:111] Liveness probe for "master-etcd-localhost.localdomain_kube-system(e41c2a57b9d15b28cb8a240c7780597f):etcd" failed (failure): dial tcp 192.168.121.138:2379: getsockopt: connection refused


etcd config and logs (from a different startup but same behavior)

ETCD_ADVERTISE_CLIENT_URLS=https://192.168.120.4:2379
ETCD_CERT_FILE=/etc/etcd/server.crt
ETCD_CLIENT_CERT_AUTH=true
ETCD_DATA_DIR=/var/lib/etcd/
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.120.4:2380
ETCD_INITIAL_CLUSTER=192.168.120.4.nip.io=https://192.168.120.4:2380
ETCD_LISTEN_CLIENT_URLS=https://192.168.120.4:2379
ETCD_LISTEN_PEER_URLS=https://192.168.120.4:2380
ETCD_NAME=192.168.120.4.nip.io
...

2018-05-03 19:37:36.722605 I | etcdserver: published {Name:192.168.120.4.nip.io ClientURLs:[https://192.168.120.4:2379]} to cluster f576e02791b30b8a
2018-05-03 19:37:36.722721 I | embed: ready to serve client requests
...
2018-05-03 19:37:52.994948 D | auth: found common name 192.168.120.4.nip.io
2018-05-03 19:37:53.089500 N | pkg/osutil: received terminated signal, shutting down...
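
The mismatch between the node log and the etcd config above can be checked mechanically. The following is a minimal sketch, not part of the installer: the helper names (parse_etcd_conf, listen_endpoints, probe_matches) are hypothetical, and it assumes etcd.conf consists of simple KEY=VALUE lines as in the excerpt. It does not handle wildcard listen addresses such as 0.0.0.0.

```python
from urllib.parse import urlsplit


def parse_etcd_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def listen_endpoints(conf):
    """Return the set of host:port pairs from ETCD_LISTEN_CLIENT_URLS."""
    urls = conf.get("ETCD_LISTEN_CLIENT_URLS", "")
    return {urlsplit(u.strip()).netloc for u in urls.split(",") if u.strip()}


def probe_matches(conf, probe_endpoint):
    """True if the probed host:port is one etcd is configured to listen on."""
    return probe_endpoint in listen_endpoints(conf)


# Values taken from the etcd config excerpt in this report.
ETCD_CONF = """\
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.120.4:2379
ETCD_LISTEN_CLIENT_URLS=https://192.168.120.4:2379
ETCD_LISTEN_PEER_URLS=https://192.168.120.4:2380
"""

conf = parse_etcd_conf(ETCD_CONF)
# Address the liveness probe dialed according to the node log (eth0).
print(probe_matches(conf, "192.168.121.138:2379"))
# Address etcd was actually configured with (eth1).
print(probe_matches(conf, "192.168.120.4:2379"))
```

With the values from this report, the first check fails and the second succeeds, which corresponds to the "connection refused" seen by the probe.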

Comment 1 Vadim Rutkovsky 2018-05-23 08:03:16 UTC

PR https://github.com/openshift/openshift-ansible/pull/8495

Comment 2 Vadim Rutkovsky 2018-05-24 07:37:44 UTC

Fix is available in openshift-ansible-3.10.0-0.51.0

Comment 3 Gan Huang 2018-05-29 06:29:50 UTC

Verified in openshift-ansible-3.10.0-0.53.0.git.0.53fe016.el7.noarch.rpm

1) Spin up two instances with two interfaces
# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:7b:08:8f brd ff:ff:ff:ff:ff:ff
    inet 172.16.120.67/24 brd 172.16.120.255 scope global noprefixroute dynamic eth0
       valid_lft 77221sec preferred_lft 77221sec
    inet6 fe80::f816:3eff:fe7b:88f/64 scope link 
       valid_lft forever preferred_lft forever

4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:d5:bf:9b brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.5/24 brd 192.168.33.255 scope global noprefixroute dynamic eth1
       valid_lft 79793sec preferred_lft 79793sec
    inet6 fe80::f816:3eff:fed5:bf9b/64 scope link 

2) Specify openshift_ip and openshift_hostname in the etcd host group to use the eth1 interface.

[etcd]
host-8-246-185.host.xx.redhat.com openshift_public_hostname=host-8-246-185.host.xx.redhat.com openshift_hostname=192.168.33.5 openshift_ip=192.168.33.5

3) etcd server was running successfully

# cat /etc/origin/node/pods/etcd.yaml
<--snip--> 
    livenessProbe:
      exec:
        command:
        - etcdctl
        - --cert-file
        - /etc/etcd/peer.crt
        - --key-file
        - /etc/etcd/peer.key
        - --ca-file
        - /etc/etcd/ca.crt
        - -C
        - https://192.168.33.5:2379
        - cluster-health
      initialDelaySeconds: 15
      timeoutSeconds: 10
<--snip-->

# cat /etc/etcd/etcd.conf
ETCD_NAME=192.168.33.5
ETCD_LISTEN_PEER_URLS=https://192.168.33.5:2380
ETCD_DATA_DIR=/var/lib/etcd/
#ETCD_WAL_DIR=
#ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500
ETCD_LISTEN_CLIENT_URLS=https://192.168.33.5:2379
#ETCD_MAX_SNAPSHOTS=5
#ETCD_MAX_WALS=5
#ETCD_CORS=


#[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.33.5:2380
ETCD_INITIAL_CLUSTER=192.168.33.5=https://192.168.33.5:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
#ETCD_DISCOVERY=
#ETCD_DISCOVERY_SRV=
#ETCD_DISCOVERY_FALLBACK=proxy
#ETCD_DISCOVERY_PROXY=
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.33.5:2379
<--snip-->
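
The two files above can be compared by pulling the endpoint out of the probe's etcdctl arguments. This is a rough sketch under stated assumptions: the probe command is supplied as an already-parsed argument list (no YAML parsing is done here), the endpoint is the value following -C as in the manifest above, and the helper name probe_endpoint is hypothetical.

```python
from urllib.parse import urlsplit


def probe_endpoint(command):
    """Return host:port of the URL that follows '-C' in an etcdctl argument list."""
    idx = command.index("-C")
    return urlsplit(command[idx + 1]).netloc


# Argument list copied from the livenessProbe in etcd.yaml above.
PROBE_COMMAND = [
    "etcdctl",
    "--cert-file", "/etc/etcd/peer.crt",
    "--key-file", "/etc/etcd/peer.key",
    "--ca-file", "/etc/etcd/ca.crt",
    "-C", "https://192.168.33.5:2379",
    "cluster-health",
]

# Value of ETCD_LISTEN_CLIENT_URLS from etcd.conf above.
LISTEN_CLIENT_URLS = "https://192.168.33.5:2379"

listen = {urlsplit(u.strip()).netloc for u in LISTEN_CLIENT_URLS.split(",")}
print(probe_endpoint(PROBE_COMMAND) in listen)
```

For the verified configuration, the probe endpoint and the listen URL both resolve to 192.168.33.5:2379, so the check passes.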

Comment 5 errata-xmlrpc 2018-07-30 19:14:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
