Bug 1458941
Summary: | etcd member is unable to start due to snapshot size | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Stefanie Forrester <dakini>
Component: | etcd | Assignee: | Jan Chaloupka <jchaloup>
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 7.3 | CC: | aos-bugs, jchaloup, jeder, jokerman, mifiedle, mmccomas, nbhatt, ypu
Target Milestone: | rc | Keywords: | Extras
Target Release: | 7.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | etcd-3.1.9-1.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-06-28 15:40:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Stefanie Forrester
2017-06-05 21:45:49 UTC
From the metrics data gathered, we have observed some excessive writes from the HPA controller doing status updates.

UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/47078

I talked with the upstream etcd guys and they have a fix for this. We'll need to do a build whenever it merges. https://github.com/coreos/etcd/pull/8074

I installed etcd-3.1.9-1.el7.x86_64.rpm and it appears to have resolved the issue! All 3 members are back online.

Tested with the following steps, based on the pull request:

1. Start an etcd cluster with 3 nodes and set the flag `--snapshot-count "90"`.
2. Write 1M-sized values to the etcd cluster until a snapshot is generated. With this step we get a snapshot file of around 600M, which is more than 512M.
3. Remove node2 from the cluster, then add it back with the `etcdctl member` command.
4. Remove the data directory on node2 and start it with `--initial-cluster-state existing`.

Expected results: after step 4, node2 can start up successfully.

Test results:

With etcd-3.1.3-1.el7.x86_64, node2 cannot start normally and always logs the following error messages:

```
2017-06-19 10:38:09.006667 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:09.006723 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.006787 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.030954 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)
2017-06-19 10:40:12.753053 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:12.753100 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.144:50676->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:12.753110 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:12.753965 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.423684 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:40:31.423710 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:31.424194 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.434851 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:31.536072 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:40:33.001830 E | rafthttp: failed to decode raft message (rafthttp: error limit exceeded)
```

The other nodes always show node2 becoming active and inactive in a loop:

```
2017-06-19 10:38:08.871419 I | rafthttp: peer 84dbbfd43e0516ba became active
2017-06-19 10:38:08.871564 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.699964 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:09.701580 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:09.818906 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
2017-06-19 10:38:10.126762 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:38:10.243878 I | rafthttp: established a TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481636 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream MsgApp v2 reader)
2017-06-19 10:40:14.481668 E | rafthttp: failed to read 84dbbfd43e0516ba on stream MsgApp v2 (read tcp 10.73.3.143:47818->10.73.3.189:2380: i/o timeout)
2017-06-19 10:40:14.481696 I | rafthttp: peer 84dbbfd43e0516ba became inactive
2017-06-19 10:40:14.659095 W | rafthttp: lost the TCP streaming connection with peer 84dbbfd43e0516ba (stream Message reader)
```

With etcd-3.1.9-1.el7.x86_64, node2 starts normally, and the health check command output is:

```
[root@node1 ~]# etcdctl cluster-health
member 84dbbfd43e0516ba is healthy: got healthy result from http://10.73.3.189:2379
member 860416ed68d01504 is healthy: got healthy result from http://10.73.3.144:2379
member abc0058a97f9b420 is healthy: got healthy result from http://10.73.3.143:2379
```

So the pull request works as expected. Based on comment #c5 it also works in an OpenShift environment, so setting this to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1622
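The failure mode above hinges on one number: the snapshot the fresh node2 must receive over the raft transport is around 600M, while the stream reader in the old build effectively refuses any message larger than 512M, producing `rafthttp: failed to decode raft message (rafthttp: error limit exceeded)` forever. The sketch below is a hypothetical, simplified model of that size check — not etcd's actual source — using a length-prefixed framing and illustrative `OLD_LIMIT`/`NEW_LIMIT` constants to show why raising the cap lets the oversized snapshot through.

```python
import io
import struct

# Illustrative constants (assumptions, not etcd's real values): the old
# behavior rejected raft messages above ~512 MiB; the fix permits larger
# snapshot messages, modeled here as a bigger cap.
OLD_LIMIT = 512 * 1024 * 1024
NEW_LIMIT = 1024 * 1024 * 1024


def decode_message(stream: io.BufferedIOBase, limit: int) -> bytes:
    """Read one length-prefixed message, rejecting oversized payloads.

    Mirrors the spirit of the rafthttp decode step: the declared size is
    checked against a hard cap before the payload is read.
    """
    header = stream.read(8)
    (size,) = struct.unpack(">Q", header)
    if size > limit:
        # Analogue of "rafthttp: error limit exceeded" in the node2 log.
        raise ValueError("error limit exceeded")
    return stream.read(size)


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its 8-byte big-endian length."""
    return struct.pack(">Q", len(payload)) + payload


if __name__ == "__main__":
    # A small message decodes fine under either limit.
    msg = frame(b"snapshot-bytes")
    assert decode_message(io.BytesIO(msg), OLD_LIMIT) == b"snapshot-bytes"

    # Declare a ~600 MiB payload without allocating it: only the header
    # matters for the size check. The old cap rejects it on sight, which is
    # why node2 could never finish receiving the snapshot.
    big_header = struct.pack(">Q", 600 * 1024 * 1024)
    try:
        decode_message(io.BytesIO(big_header), OLD_LIMIT)
        raise AssertionError("expected the 600 MiB message to be rejected")
    except ValueError:
        pass
```

Under this model the retry loop in the logs also makes sense: the sender keeps re-establishing streams and re-offering the snapshot, and the receiver keeps failing the same size check, so node2 cycles between active and inactive until a build with the larger cap is installed.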