Bug 1458941
Summary: | etcd member is unable to start due to snapshot size | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Stefanie Forrester <dakini>
Component: | etcd | Assignee: | Jan Chaloupka <jchaloup>
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 7.3 | CC: | aos-bugs, jchaloup, jeder, jokerman, mifiedle, mmccomas, nbhatt, ypu
Target Milestone: | rc | Keywords: | Extras
Target Release: | 7.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | etcd-3.1.9-1.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-06-28 15:40:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Stefanie Forrester
2017-06-05 21:45:49 UTC
From the metrics data gathered, we have observed some excessive writes from the HPA controller doing status updates.

UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/47078

I talked with the upstream etcd guys and they have a fix for this. We'll need to do a build whenever it merges. https://github.com/coreos/etcd/pull/8074

I installed etcd-3.1.9-1.el7.x86_64.rpm and it appears to have resolved the issue! All 3 members are back online.

Tested with the following steps, based on the pull request:

1. Start an etcd cluster with 3 nodes and set the flag `--snapshot-count "90"`.
2. Write 1M-sized values to the etcd cluster until a snapshot is generated. With this step we get a snapshot file of around 600M, which is more than 512M.
3. Remove node2 from the cluster, then add it back with the `etcdctl member` command.
4. Remove the data directory on node2 and start it with `--initial-cluster-state existing`.

Expected results: after step 4, node2 can start up successfully.

Test results:

With etcd-3.1.3-1.el7.x86_64, node2 cannot start normally and always logs the following error messages:

```
2017-06-19 10:38:09.006667 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:09.006723 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.006787 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.030954 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)
2017-06-19 10:40:12.753053 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:12.753100 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.144:50676->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:12.753110 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:12.753965 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.423684 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:40:31.423710 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:31.424194 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.434851 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.536072 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:33.001830 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)
```

The other nodes always show node2 becoming active and inactive in a loop:

```
2017-06-19 10:38:08.871419 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:08.871564 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.699964 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.701580 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.818906 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:10.126762 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:10.243878 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481636 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481668 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.143:47818->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:14.481696 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:14.659095 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
```

With etcd-3.1.9-1.el7.x86_64, node2 starts normally, and the health check command output is:

```
[root@node1 ~]# etcdctl cluster-health
member 84dbbfd43e0516ba is healthy: got healthy result from http://10.73.3.189:2379
member 860416ed68d01504 is healthy: got healthy result from http://10.73.3.144:2379
member abc0058a97f9b420 is healthy: got healthy result from http://10.73.3.143:2379
```

So the pull request works as expected. Based on comment #c5 it also works in an OpenShift environment, so setting this to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1622
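The failure mode above hinges on one number: the snapshot the fresh node2 must receive over the raft transport is around 600M, while the stream reader in the old build effectively refuses any message larger than 512M, producing `rafthttp: failed to decode raft message (rafthttp: error limit exceeded)` forever. The sketch below is a hypothetical, simplified model of that size check — not etcd's actual source — using a length-prefixed framing and illustrative `OLD_LIMIT`/`NEW_LIMIT` constants to show why raising the cap lets the oversized snapshot through.

```python
import io
import struct

# Illustrative constants (assumptions, not etcd's real values): the old
# behavior rejected raft messages above ~512 MiB; the fix permits larger
# snapshot messages, modeled here as a bigger cap.
OLD_LIMIT = 512 * 1024 * 1024
NEW_LIMIT = 1024 * 1024 * 1024


def decode_message(stream: io.BufferedIOBase, limit: int) -> bytes:
    """Read one length-prefixed message, rejecting oversized payloads.

    Mirrors the spirit of the rafthttp decode step: the declared size is
    checked against a hard cap before the payload is read.
    """
    header = stream.read(8)
    (size,) = struct.unpack(">Q", header)
    if size > limit:
        # Analogue of "rafthttp: error limit exceeded" in the node2 log.
        raise ValueError("error limit exceeded")
    return stream.read(size)


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its 8-byte big-endian length."""
    return struct.pack(">Q", len(payload)) + payload


if __name__ == "__main__":
    # A small message decodes fine under either limit.
    msg = frame(b"snapshot-bytes")
    assert decode_message(io.BytesIO(msg), OLD_LIMIT) == b"snapshot-bytes"

    # Declare a ~600 MiB payload without allocating it: only the header
    # matters for the size check. The old cap rejects it on sight, which is
    # why node2 could never finish receiving the snapshot.
    big_header = struct.pack(">Q", 600 * 1024 * 1024)
    try:
        decode_message(io.BytesIO(big_header), OLD_LIMIT)
        raise AssertionError("expected the 600 MiB message to be rejected")
    except ValueError:
        pass
```

Under this model the retry loop in the logs also makes sense: the sender keeps re-establishing streams and re-offering the snapshot, and the receiver keeps failing the same size check, so node2 cycles between active and inactive until a build with the larger cap is installed.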