Bug 1458941 - etcd member is unable to start due to snapshot size
Summary: etcd member is unable to start due to snapshot size
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: etcd
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assignee: Jan Chaloupka
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-05 21:45 UTC by Stefanie Forrester
Modified: 2021-06-10 12:24 UTC
CC List: 8 users

Fixed In Version: etcd-3.1.9-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-28 15:40:26 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:1622 0 normal SHIPPED_LIVE etcd bug fix and enhancement update 2017-06-28 19:34:13 UTC

Description Stefanie Forrester 2017-06-05 21:45:49 UTC
Description of problem:

etcd is writing snapshots every 10 seconds. When one etcd host was rebooted, it was unable to rejoin the cluster due to this frequent snapshotting.

Any attempts at running etcd commands gave the following error on the bad member:

client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://ip-172-31-54-162.ec2.internal:2379 exceeded header timeout
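
As context: how often etcd snapshots is governed by its --snapshot-count setting (the number of applied entries between snapshots), and the size of the snapshots can be checked directly on a member. A minimal sketch, assuming the data directory lives under /var/lib/etcd (the real path depends on ETCD_DATA_DIR):

# List snapshot files and their sizes on the affected member
# (data-dir path is an assumption; adjust to your ETCD_DATA_DIR):
find /var/lib/etcd -name '*.snap' -exec ls -lh {} +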


Version-Release number of selected component (if applicable):

etcd-3.1.3-1.el7.x86_64
atomic-openshift-3.5.5.19-1.git.0.aef7e02.el7.x86_64

How reproducible:

Every time, but only on a large, busy cluster.

Steps to Reproduce:
1. Reboot one of the etcd hosts.

Actual results:

The host that was rebooted is unable to rejoin the cluster.

Expected results:

Etcd members should be able to rejoin the cluster after a reboot.

Additional info:

After a reboot, I was unable to get 172.31.54.162 back online (starter-us-east-1-master-25064). etcdctl commands failed to work from that bad member.

[root@starter-us-east-1-master-3a4e4 ~]# ops-etcdctl member list
2017-05-26 14:07:37.177458 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2621106c8b14e1c8: name=ip-172-31-55-199.ec2.internal peerURLs=https://172.31.55.199:2380 clientURLs=https://172.31.55.199:2379 isLeader=false
6423287aeccfcfd0: name=ip-172-31-54-162.ec2.internal peerURLs=https://172.31.54.162:2380 clientURLs=https://172.31.54.162:2379 isLeader=false
ace27f42a1052c35: name=ip-172-31-60-65.ec2.internal peerURLs=https://172.31.60.65:2380 clientURLs=https://172.31.60.65:2379 isLeader=true

[root@starter-us-east-1-master-25064 ~]# ops-etcdctl member list
2017-05-26 14:08:02.679785 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://ip-172-31-54-162.ec2.internal:2379 exceeded header timeout

[root@starter-us-east-1-master-51852 ~]# ops-etcdctl member list
2017-05-26 14:11:06.543141 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2621106c8b14e1c8: name=ip-172-31-55-199.ec2.internal peerURLs=https://172.31.55.199:2380 clientURLs=https://172.31.55.199:2379 isLeader=true
6423287aeccfcfd0: name=ip-172-31-54-162.ec2.internal peerURLs=https://172.31.54.162:2380 clientURLs=https://172.31.54.162:2379 isLeader=false
ace27f42a1052c35: name=ip-172-31-60-65.ec2.internal peerURLs=https://172.31.60.65:2380 clientURLs=https://172.31.60.65:2379 isLeader=false


Even after removing the bad member from the cluster and adding it as a new member, that member is unable to communicate with the other members of the cluster.
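
For reference, the remove/re-add sequence described above roughly corresponds to the following (a sketch in plain etcdctl v2 syntax, using the member ID and peer URL from the listings above; cert paths are placeholders and ops-etcdctl presumably wraps the TLS and endpoint options):

# Remove the bad member (ID from 'member list') and re-add it under its peer URL:
etcdctl --endpoints https://172.31.55.199:2379 \
        --ca-file /path/to/ca.crt --cert-file /path/to/client.crt --key-file /path/to/client.key \
        member remove 6423287aeccfcfd0
etcdctl --endpoints https://172.31.55.199:2379 \
        --ca-file /path/to/ca.crt --cert-file /path/to/client.crt --key-file /path/to/client.key \
        member add ip-172-31-54-162.ec2.internal https://172.31.54.162:2380

# On the re-added host: clear the old data directory and start etcd so it joins
# the existing cluster instead of bootstrapping a new one
# (ETCD_INITIAL_CLUSTER_STATE=existing / --initial-cluster-state existing).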

[root@starter-us-east-1-master-25064 ~]# journalctl -flu etcd
May 26 14:54:25 ip-172-31-54-162.ec2.internal etcd[118095]: failed to find member ace27f42a1052c35 in cluster c5f8e274769ca921
May 26 14:54:25 ip-172-31-54-162.ec2.internal etcd[118095]: failed to find member 2621106c8b14e1c8 in cluster c5f8e274769ca921

Testing connectivity via telnet showed that the etcd ports of the other members were reachable from the OS just fine, but etcd itself was unable to communicate with them.

# buggy member can talk to other members
[root@starter-us-east-1-master-25064 ~]# telnet 172.31.55.199 2380
Trying 172.31.55.199...
Connected to 172.31.55.199.
Escape character is '^]'.

[root@starter-us-east-1-master-25064 ~]# telnet 172.31.60.65 2380
Trying 172.31.60.65...
Connected to 172.31.60.65.
Escape character is '^]'.

[root@starter-us-east-1-master-25064 ~]# telnet 172.31.60.65 2379
Trying 172.31.60.65...
Connected to 172.31.60.65.
Escape character is '^]'.

[root@starter-us-east-1-master-25064 ~]# telnet 172.31.55.199 2379
Trying 172.31.55.199...
Connected to 172.31.55.199.


# On the leader, it shows the bad member connecting and disconnecting continually

May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: peer ffc3e7d2b8c3d7f7 became inactive
May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: peer ffc3e7d2b8c3d7f7 became active
May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: failed to dial ffc3e7d2b8c3d7f7 on stream MsgApp v2 (peer ffc3e7d2b8c3d7f7 failed to find local node 2621106c8b14e1c8)
May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: peer ffc3e7d2b8c3d7f7 became inactive
May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: peer ffc3e7d2b8c3d7f7 became active
May 26 16:25:22 ip-172-31-55-199.ec2.internal etcd[1882]: failed to dial ffc3e7d2b8c3d7f7 on stream Message (peer ffc3e7d2b8c3d7f7 failed to find local node 2621106c8b14e1c8)


# so the above log means that the bad member (ffc3e7d2b8c3d7f7) is unable to find the Leader (2621106c8b14e1c8)

# Leader can see everyone
[root@starter-us-east-1-master-51852 ~]# ops-etcdctl member list
2017-05-26 16:25:26.363579 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2621106c8b14e1c8: name=ip-172-31-55-199.ec2.internal peerURLs=https://172.31.55.199:2380 clientURLs=https://172.31.55.199:2379 isLeader=true
ace27f42a1052c35: name=ip-172-31-60-65.ec2.internal peerURLs=https://172.31.60.65:2380 clientURLs=https://172.31.60.65:2379 isLeader=false
ffc3e7d2b8c3d7f7: name=ip-172-31-54-162.ec2.internal peerURLs=https://172.31.54.162:2380 clientURLs=https://172.31.54.162:2379 isLeader=false

# bad member can't see anyone but itself
May 26 16:24:50 ip-172-31-54-162.ec2.internal etcd[57788]: failed to find member ace27f42a1052c35 in cluster c5f8e274769ca921
May 26 16:24:50 ip-172-31-54-162.ec2.internal etcd[57788]: failed to find member 2621106c8b14e1c8 in cluster c5f8e274769ca921
May 26 16:24:50 ip-172-31-54-162.ec2.internal etcd[57788]: failed to find member 2621106c8b14e1c8 in cluster c5f8e274769ca921
May 26 16:24:50 ip-172-31-54-162.ec2.internal etcd[57788]: failed to find member ace27f42a1052c35 in cluster c5f8e274769ca921

Comment 3 Derek Carr 2017-06-07 14:16:45 UTC
From the metrics data gathered, we have observed excessive writes from the HPA controller doing status updates.

UPSTREAM PR:
https://github.com/kubernetes/kubernetes/pull/47078
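
A rough way to confirm that this write pressure reaches etcd itself is to watch etcd's Prometheus metrics on the client port. A sketch, using one of this cluster's client URLs and placeholder cert paths:

# Proposal and WAL-fsync counters climb quickly under heavy write load:
curl -s --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
     https://172.31.55.199:2379/metrics \
     | grep -E 'etcd_server_proposals_(committed|applied)_total|etcd_disk_wal_fsync_duration_seconds_count'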

Comment 4 Stefanie Forrester 2017-06-09 17:24:22 UTC
I talked with the upstream etcd maintainers and they have a fix for this. We'll need to do a build once it merges.

https://github.com/coreos/etcd/pull/8074

Comment 5 Stefanie Forrester 2017-06-09 19:46:13 UTC
I installed etcd-3.1.9-1.el7.x86_64.rpm and it appears to have resolved the issue! All 3 members are back online.

Comment 9 Joy Pu 2017-06-19 07:01:36 UTC
Tested with the following steps, based on the pull request (a rough command sketch follows the step list below):

1. Start an etcd cluster with 3 nodes, setting the flag --snapshot-count "90".
2. Write 1M-sized values to the etcd cluster until a snapshot is generated. This yields a snapshot file of around 600M, which is more than 512M.
3. Remove node2 from the cluster, then add it back with the 'etcdctl member' command.
4. Remove the data directory on node2 and start it with '--initial-cluster-state existing'.
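
A minimal shell sketch of the procedure above (node-name-to-IP mapping, data-directory paths, and the payload size are illustrative assumptions; etcd/etcdctl v2-style syntax):

# 1. Start each member with a low snapshot count, e.g. for node2:
etcd --name node2 --snapshot-count 90 --data-dir /var/lib/etcd/node2.etcd \
     --listen-peer-urls http://10.73.3.144:2380 --initial-advertise-peer-urls http://10.73.3.144:2380 \
     --listen-client-urls http://10.73.3.144:2379 --advertise-client-urls http://10.73.3.144:2379 \
     --initial-cluster 'node1=http://10.73.3.143:2380,node2=http://10.73.3.144:2380,node3=http://10.73.3.189:2380'

# 2. Write ~1M values until a snapshot larger than 512M has been written
#    (the v2 HTTP API via curl avoids shell argument-length limits):
head -c 1000000 /dev/zero | tr '\0' 'x' > /tmp/value.txt
for i in $(seq 1 700); do
    curl -s -XPUT "http://10.73.3.143:2379/v2/keys/key$i" --data-urlencode value@/tmp/value.txt > /dev/null
done

# 3. Remove node2 and add it back (member ID taken from 'etcdctl member list'):
etcdctl --endpoints http://10.73.3.143:2379 member remove <node2-member-id>
etcdctl --endpoints http://10.73.3.143:2379 member add node2 http://10.73.3.144:2380

# 4. Wipe node2's data directory and restart it as part of the existing cluster:
rm -rf /var/lib/etcd/node2.etcd
etcd --name node2 --snapshot-count 90 --initial-cluster-state existing --data-dir /var/lib/etcd/node2.etcd \
     --listen-peer-urls http://10.73.3.144:2380 --initial-advertise-peer-urls http://10.73.3.144:2380 \
     --listen-client-urls http://10.73.3.144:2379 --advertise-client-urls http://10.73.3.144:2379 \
     --initial-cluster 'node1=http://10.73.3.143:2380,node2=http://10.73.3.144:2380,node3=http://10.73.3.189:2380'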

Expected results:
  After step 4, node2 can start up successfully.

Test results:

With etcd-3.1.3-1.el7.x86_64:

Node2 cannot start normally and always gets the following error messages:
2017-06-19 10:38:09.006667 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:09.006723 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.006787 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.030954 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)
2017-06-19 10:40:12.753053 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:12.753100 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.144:50676->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:12.753110 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:12.753965 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.423684 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:40:31.423710 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:31.424194 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.434851 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.536072 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:33.001830 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)


The other nodes always show node2 going active and inactive in a loop:
2017-06-19 10:38:08.871419 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:08.871564 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.699964 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.701580 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.818906 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:10.126762 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:10.243878 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481636 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481668 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.143:47818->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:14.481696 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:14.659095 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)

With etcd-3.1.9-1.el7.x86_64:
Node2 can start normally; the health check command output is:
[root@node1 ~]# etcdctl cluster-health
member 84dbbfd43e0516ba is healthy: got healthy result from http://10.73.3.189:2379
member 860416ed68d01504 is healthy: got healthy result from http://10.73.3.144:2379
member abc0058a97f9b420 is healthy: got healthy result from http://10.73.3.143:2379



So the pull request works as expected. Based on comment #5, it also works in an OpenShift environment, so setting this to VERIFIED.

Comment 11 errata-xmlrpc 2017-06-28 15:40:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1622

