Bug 1328482
Summary: | etcd adding new member fails | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Alexander Koksharov <akokshar>
Component: | etcd | Assignee: | Jan Chaloupka <jchaloup>
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 7.2 | CC: | agoldste, akokshar, aos-bugs, asogukpi, erich, jamills, jcajka, jchaloup, jdetiber, jeder, jokerman, mmccomas, pep, rhowe, sghosh, sttts, tis, tstclair, ypu
Target Milestone: | rc | Keywords: | Extras, UpcomingRelease
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | etcd-2.2.5-2.el7 | Doc Type: | Bug Fix
Doc Text: | The etcd packages have been rebuilt with golang 1.6, which fixes a problem where a member failed to be added to an etcd cluster once the database size grew beyond 700 MB, resulting in a loss of data. | Story Points: | ---
Clone Of: | | |
Clones: | 1350875 (view as bug list) | Environment: |
Last Closed: | 2016-06-23 16:20:44 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1303130, 1186913, 1267746, 1350875 | |
Description
Alexander Koksharov
2016-04-19 13:34:41 UTC
The use case here is slightly concerning. It sounds like they have an existing cluster and want to add a new member, but there are several steps involved in key creation, etc., that need to be done (a hedged sketch of the member-addition procedure follows these comments). I'm not certain that etcd cluster addition is a vetted path in the installer.

---

(In reply to Timothy St. Clair from comment #4)
> xref:
> https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrading-restoring-etcd

The issue being discussed here seems to be with the following procedure:
https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrade-bringing-openshift-services-back-online

This is the process of re-importing the data. However, the errors above don't make sense unless
https://docs.openshift.com/enterprise/latest/install_config/downgrade.html#downgrade-adding-addtl-etcd-members (e)
is being followed after this step. Can you provide more details on the reproducer?

---

(In reply to Eric Rich from comment #5)
> The issue being discussed here seems to be with the following procedure: [...]
> Can you provide more details on the reproducer?

Can we confirm that the following has been run:

>> Now start etcd on the new member:
>> # rm -rf /var/lib/etcd/member
>> # systemctl enable etcd
>> # systemctl start etcd

I ask as it's not explicitly mentioned in any of the reproducer notes.

---

I can reproduce this. When I built etcd v2.3.1 by hand (and fixed a small bug in that version as well), I was able to add a new cluster member, whereas with v2.2.2 and v2.2.5 it fails.

---

A correction to my previous comment. It turns out that when I built both v2.2.5 and v2.3.1 with go 1.6, adding a new cluster member works, and when I built v2.2.5 with go 1.4.2 (which is what we use to build the etcd RPM), adding a new member failed (a rebuild sketch follows these comments). Jakub, any idea what changed between go 1.4.2 and 1.6 (yes, I know, it's presumably a huge delta) that would cause i/o timeouts with https connections in 1.4.2 and no issues in 1.6?

---

I've narrowed down that the fix is somewhere after go1.5.4 and before go1.6rc1.

---

I am investigating how to restore a node from a backup (which probably comes from the founder node during disaster recovery). In general this is described here for failed nodes (in contrast to a whole cluster, which must be recovered):
https://coreos.com/etcd/docs/2.3.4/admin_guide.html#member-migration

Using a backup for disaster recovery is discouraged, though. At https://github.com/coreos/etcd/blame/master/contrib/systemd/etcd2-backup-coreos/README.md#L249 it is sketched that it is possible to re-use a copy of the founder's data-dir. This does not work out of the box, though, because the WAL in the founder's copy has its node-id hard-coded. Each non-founder would re-use that id, leading to a conflict.

It is not hard to add a feature to 'etcdctl backup' that sets a specific node-id in the WAL. This lets nodes come up with the founder's snapshot but with their own node-id. This is implemented experimentally here: https://github.com/coreos/etcd/pull/5397 (a backup sketch follows these comments). Next step: test this with the customer data.
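For reference, a minimal sketch of the member-addition path being exercised in this bug, assuming an existing three-node etcd v2 cluster; the node name `etcd4`, the IP addresses, and the peer URLs are illustrative values, not taken from the report:

```sh
# On any existing member: register the new node
# (etcd v2 runtime reconfiguration; name and peer URL are assumed).
etcdctl member add etcd4 https://10.0.0.4:2380

# On the new node: start with a clean data dir and join the existing cluster,
# mirroring the "rm -rf /var/lib/etcd/member" step quoted in the comments above.
rm -rf /var/lib/etcd/member
ETCD_NAME="etcd4" \
ETCD_INITIAL_CLUSTER="etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380,etcd4=https://10.0.0.4:2380" \
ETCD_INITIAL_CLUSTER_STATE="existing" \
etcd
```

In the reported environment etcd runs under systemd, so in practice these settings would live in the etcd configuration rather than on the command line.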
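A rough sketch of the kind of rebuild used above to bisect the Go toolchain; the download URL, install paths, and version tags are assumptions for illustration, not the exact steps from the report:

```sh
# Fetch and install a specific Go toolchain (version tag is illustrative).
curl -LO https://storage.googleapis.com/golang/go1.6.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.6.linux-amd64.tar.gz
export PATH=/usr/local/go/bin:$PATH

# Build etcd from a release tag with that toolchain;
# ./build is the build script shipped in the etcd repository.
git clone https://github.com/coreos/etcd
cd etcd && git checkout v2.2.5
./build
./bin/etcd --version
```

Repeating this with go1.4.2, go1.5.4, and go1.6rc1 and retrying the member-add is the bisection described in the comments.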
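The standard etcdctl v2 backup invocation looks like the sketch below. The node-id override discussed in the last comment is experimental (PR 5397) and its exact interface is not settled in this report, so it is shown only as an assumed placeholder:

```sh
# Take a backup of the founder's data dir, stripping node-specific metadata.
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /var/lib/etcd-backup

# The experimental change in PR 5397 would additionally let the backup be
# rewritten with a chosen node-id so non-founders can reuse the founder's
# snapshot without conflicting; the flag name below is an assumption:
# etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup --node-id <new-id>
```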
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1233