Bug 1328482

Summary: etcd adding new member fails
Product: Red Hat Enterprise Linux 7
Reporter: Alexander Koksharov <akokshar>
Component: etcd
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: atomic-bugs <atomic-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 7.2
CC: agoldste, akokshar, aos-bugs, asogukpi, erich, jamills, jcajka, jchaloup, jdetiber, jeder, jokerman, mmccomas, pep, rhowe, sghosh, sttts, tis, tstclair, ypu
Target Milestone: rc
Keywords: Extras, UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: etcd-2.2.5-2.el7
Doc Type: Bug Fix
Doc Text:
The etcd packages have been rebuilt with golang 1.6, which fixes a problem where adding a new member to an etcd cluster failed once the database size grew to roughly 700 MB or more, resulting in a loss of data.
Story Points: ---
Clone Of:
: 1350875 (view as bug list)
Environment:
Last Closed: 2016-06-23 16:20:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1303130, 1186913, 1267746, 1350875

Description Alexander Koksharov 2016-04-19 13:34:41 UTC
Description of problem:
The etcd database is over 700 MB.

It is possible to start a one-member cluster. However, when a second member is added, the following messages appear and no synchronization is performed.

Initial member:
2016-04-19 10:30:35.044306 D | raft: 5429af89680001 [firstindex: 57198439, commit: 57204158] sent snapshot[index: 57198438, term: 1389] to a7e9771a3b6aca97 [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-04-19 10:30:35.044337 D | raft: 5429af89680001 paused sending replication messages to a7e9771a3b6aca97 [next = 1, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 57198438]
2016-04-19 10:30:40.455394 E | rafthttp: failed to write a7e9771a3b6aca97 on pipeline (read tcp 192.168.100.131:2380: i/o timeout)
2016-04-19 10:30:40.455435 D | raft: 5429af89680001 failed to send message to a7e9771a3b6aca97 because it is unreachable [next = 1, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 57198438]
2016-04-19 10:30:40.455450 D | raft: 5429af89680001 snapshot failed, resumed sending replication messages to a7e9771a3b6aca97 [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]

New member:
2016-04-19 10:28:59.184800 E | etcdserver: publish error: etcdserver: request timed out
2016-04-19 10:29:01.519021 E | rafthttp: failed to read raft message (unexpected EOF)
2016-04-19 10:29:07.073515 E | rafthttp: failed to read raft message (unexpected EOF)

With an empty database, or with a backup made in the testing environment, everything runs fine on the same servers. So there are no issues with network connectivity or server configuration. Adding a member fails only with this particular database backup, which is more than 700 MB in size.
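
For reference, a minimal sketch of the member-add flow being exercised (the member name "etcd2" and the second node's address are placeholders, not values taken from this report):

On the existing member, register the new peer:

  # etcdctl member add etcd2 https://192.168.100.132:2380

On the new member, set ETCD_NAME, ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE="existing" in /etc/etcd/etcd.conf, then start it with an empty data directory:

  # rm -rf /var/lib/etcd/member
  # systemctl enable etcd
  # systemctl start etcd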

Version-Release number of selected component (if applicable):


How reproducible:
Take a backup and try to build a two-member cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Timothy St. Clair 2016-04-19 14:42:56 UTC
The use case here is slightly concerning.  

It sounds like they have an existing cluster and want to add a new member, but there are several steps involved (key creation, etc.) that need to be done.

I'm not certain whether etcd cluster addition is a vetted path in the installer.

Comment 5 Eric Rich 2016-04-19 15:15:04 UTC
(In reply to Timothy St. Clair from comment #4)
> xref:
> https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrading-restoring-etcd

The issue being discussed here seems to be with the following procedure: https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrade-bringing-openshift-services-back-online

That is, the process of re-importing the data. However, the errors above don't make sense unless https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrade-adding-addtl-etcd-members (step e) is being followed after this step.

Can you provide more details on the reproducer?

Comment 7 Eric Rich 2016-04-19 15:25:30 UTC
(In reply to Eric Rich from comment #5)

Can we confirm that the following has been run: 

>> Now start etcd on the new member:

>> # rm -rf /var/lib/etcd/member
>> # systemctl enable etcd
>> # systemctl start etcd

I ask as it's not explicitly mentioned in any of the reproducer notes.

Comment 8 Andy Goldstein 2016-04-19 19:25:30 UTC
I can reproduce this. And when I built etcd v2.3.1 by hand (and fixed a small bug in that version as well), I was able to add a new cluster member, whereas with v2.2.2 and v2.2.5 it fails.

Comment 9 Timothy St. Clair 2016-04-19 19:54:21 UTC
xref: https://github.com/coreos/etcd/releases/

Comment 10 Andy Goldstein 2016-04-19 20:05:21 UTC
A correction to my previous comment. It turns out that when I built both v2.2.5 and v2.3.1 with go 1.6, adding a new cluster member works. And when I built v2.2.5 with go 1.4.2 (which is what we use to build the etcd RPM), adding a new member failed.
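
For reference, a rough sketch of how such a rebuild can be done (not necessarily the exact commands used here; the /opt/go1.6 path is a placeholder for wherever the Go 1.6 toolchain is unpacked):

  # git clone https://github.com/coreos/etcd.git && cd etcd
  # git checkout v2.2.5
  # export GOROOT=/opt/go1.6
  # export PATH=/opt/go1.6/bin:$PATH
  # ./build
  # ./bin/etcd --version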

Comment 11 Andy Goldstein 2016-04-19 20:22:02 UTC
Jakub, any idea what changed between go 1.4.2 and 1.6 (yes, I know, it's a huge delta presumably) that would cause i/o timeouts with https connections in 1.4.2 and no issues in 1.6?

Comment 14 Andy Goldstein 2016-04-19 21:23:11 UTC
I've narrowed down that the fix is somewhere after go1.5.4 and before go1.6rc1.

Comment 23 Stefan Schimanski 2016-05-19 13:52:45 UTC
I am investigating how to restore a node from a backup (which presumably comes from the founder node during disaster recovery).

In general, this is described here for failed nodes (in contrast to a whole cluster which must be recovered):

  https://coreos.com/etcd/docs/2.3.4/admin_guide.html#member-migration

Using a backup for disaster recovery is discouraged, though.

At https://github.com/coreos/etcd/blame/master/contrib/systemd/etcd2-backup-coreos/README.md#L249 it is sketched that it is possible to re-use a copy of the founder's data-dir. This does not work out of the box, though, because the WAL in the founder's copy has the founder's node-id hard-coded; each non-founder would re-use that id, leading to a conflict.

It is not hard to add a feature to 'etcdctl backup' to set a specific node-id in the WAL. This allows nodes to come up with the founder's snapshot, but with their own node-id.
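
A rough sketch of how this could look from the command line (the --node-id flag name and the new-member host are hypothetical placeholders; the actual option is whatever the pull request below introduces):

  # etcdctl backup --data-dir /var/lib/etcd \
                   --backup-dir /tmp/member-seed \
                   --node-id <new-member-id>    # hypothetical flag, see the PR below
  # rsync -a /tmp/member-seed/ new-member:/var/lib/etcd/

The new member would then start from this seeded data-dir with its own node-id instead of the founder's.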

This is implemented experimentally here:

  https://github.com/coreos/etcd/pull/5397

Next step: test this with the customer data.

Comment 43 errata-xmlrpc 2016-06-23 16:20:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1233