Bug 1367035 - [DOCS] Document etcd cluster recovery after node failure + installing new etcd nodes
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.2.0
Hardware: All
OS: Linux
Target Milestone: ---
: ---
Assignee: brice
QA Contact: Anping Li
Vikram Goyal
Whiteboard: aos-scalability-34
Duplicates: 1377487
Depends On: 1350875
Reported: 2016-08-15 10:56 UTC by Josep 'Pep' Turro Mauri
Modified: 2019-12-16 06:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-01-24 05:00:55 UTC
Target Upstream Version:
erich: needinfo-


System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1349712 high CLOSED [RFE] Support scale up playbook to add new etcd host 2020-10-14 00:28:05 UTC
Red Hat Knowledge Base (Article) 2650151 None None None 2016-09-22 19:28:27 UTC

Internal Links: 1349712

Description Josep 'Pep' Turro Mauri 2016-08-15 10:56:32 UTC
Document URL: 
This is a request for a new section. It might fit under https://docs.openshift.com/enterprise/3.2/admin_guide/high_availability.html or maybe another one, but maybe it's better to have a separate section under Cluster Administration.

Section Number and Name: 
The name could be "etcd HA cluster recovery" or similar

Describe the issue: 
There is no documentation on how to recover an HA etcd cluster if one [or more] of the nodes fails.

Suggestions for improvement: 
Add a section to document etcd node recovery

Additional information: 

Bug 1350875 addresses a technical issue with etcd recovery. Comment 31 of that bz outlines the process of cluster/node recovery using the approach made possible after the code change in that bz is released.

The steps listed there take an approach of forcing a new cluster creation and involve downtime. It would be great if we can explore equivalent steps to replace a failed node that would not involve downtime.

Comment 3 Josep 'Pep' Turro Mauri 2016-09-20 08:53:53 UTC
*** Bug 1377487 has been marked as a duplicate of this bug. ***

Comment 6 Ryan Howe 2016-09-22 19:28:27 UTC
Steps to add a new etcd member using the CA configured during the OpenShift install. 


Comment 11 brice 2016-10-14 04:19:18 UTC
Pep, Eric, I've included a section on recovering etcd hosts:


But there's so much going on in the above comments that I'm not actually sure if I'm going down the right path. Can I get your thoughts?

Also, there's a link to a KBase article about adding etcd hosts. How does that tie into this BZ? Should I be including it in the docs? Here perhaps:



Comment 14 brice 2016-10-27 06:17:15 UTC
Ryan, Eric,

Thanks for the comments. I was replicating Pep's suggestion above, but if the article Ryan pasted above is more accurate, then that might be a better source.

Can I ask for an ack that it's all there? Then I'll pass this to QE for a test.


Comment 18 brice 2016-11-15 01:26:00 UTC

Hmm I can't see any comments that you made. What did you comment about?

And really, all I'd be after would be that you think it fulfills this BZ. I think there's a lot that's being asked for here, and I want to know I'm aiming in the correct direction. Then, I can get it tested and checked out by devel, etc.

I don't expect you to test it out and tell me what I'm doing wrong, just that this is what you're after. If I'm headed in the wrong direction, it'd be great to get that under control before I involve anyone else.


Comment 19 brice 2016-11-17 01:23:53 UTC
Ryan has given the thumbs up in the PR. I'll put this onto QA.


The "Adding New etcd Hosts" section has been added with this BZ;


Do you have the capabilities to test the procedure? Please let me know if there is anything wrong.

Thanks much, all.

Comment 20 Anping Li 2016-11-22 04:47:52 UTC
I'd like to request two changes.
1)  yum install etcd -> yum install etcd iptables-services
Reason: iptables-services is not installed by default, so it is better to guide the customer to install it.

2)  Create the ${PREFIX} directory before using it.
    Add 'mkdir ${PREFIX}' before the subtitle "Create the server.csr and server.crt certificates:"

Reason: Without this directory, openssl fails with "No such file or directory".
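
A minimal sketch of the two requested changes, for illustration only. The PREFIX value, the CN, and the exact openssl invocation are assumptions standing in for whatever the documented procedure uses; only the package list and the mkdir step come from this comment.

```shell
#!/bin/sh
# Package step requested above (run as root on the new etcd host):
#   yum install etcd iptables-services

# Hypothetical scratch location; the documented procedure defines its own PREFIX.
PREFIX="$(mktemp -d)/etcd-certs"

# Create the directory first. Without it, openssl cannot write the key
# and CSR and reports "No such file or directory".
mkdir -p "${PREFIX}"

# Stand-in for the "Create the server.csr and server.crt certificates" step.
# The subject name is a placeholder, not a value from the docs.
if command -v openssl >/dev/null 2>&1; then
  openssl req -new -newkey rsa:2048 -nodes \
    -subj "/CN=etcd-new-host.example.com" \
    -keyout "${PREFIX}/server.key" \
    -out "${PREFIX}/server.csr"
fi
```

Using `mkdir -p` rather than plain `mkdir` keeps the step safe to re-run if the directory already exists.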

Comment 21 brice 2016-11-23 00:45:32 UTC
Thanks, Anping Li

I've made the changes you suggest, and I'll move this to peer review.

Comment 22 openshift-github-bot 2016-11-28 04:19:56 UTC
Commit pushed to master at https://github.com/openshift/openshift-docs

Merge pull request #3047 from bfallonf/etcd-1367035

Bug 1367035 added info on restoring etcd hosts

Comment 23 Kenjiro Nakayama 2016-12-02 08:36:08 UTC

An engineer told me that when we add new etcd nodes using the backup process, we should stop the etcd services on the other hosts because of the 700mb of data. (Please refer to [1])

Should we add this instruction to the doc and the article?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1398083#c12

Comment 25 brice 2016-12-05 04:11:33 UTC
Eric, Kenjiro,

I can see Eric's recommendation making it into the docs, and if that doesn't work for the reader, falling back to stopping the other etcd hosts. Something like:

"If the etcd backup is larger than 700mb, prune the resource (link to pruning docs). If the backup is still larger than 700mb, stop the other hosts before performing the steps in this topic."

My one question would be: Where in the section would this be? Where would be the best time to have the reader stop the etcd hosts?
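
The proposed wording could be sketched as a small size check. This is only an illustration: the function name `backup_action` and the backup path in the usage comment are hypothetical, and the 700 MB threshold is taken from the discussion above rather than from any etcd documentation.

```shell
#!/bin/sh
# Threshold from the thread; treated here as a whole number of megabytes.
LIMIT_MB=700

# backup_action SIZE_MB
# Prints the suggested next step for a backup of the given size.
backup_action() {
  if [ "$1" -gt "$LIMIT_MB" ]; then
    echo "prune first; if still over ${LIMIT_MB} MB, stop the other etcd hosts"
  else
    echo "proceed"
  fi
}

# Example usage (the path is an assumption, substitute the real backup directory):
#   size_mb=$(du -sm /path/to/etcd-backup | cut -f1)
#   backup_action "$size_mb"
```

The check is strict ("larger than"), so a backup of exactly 700 MB falls on the "proceed" side.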

Comment 27 brice 2016-12-08 03:47:33 UTC
Link to released docs:


However, I've sent an email to engineering lists about the conversation above, so I'll add something to it when I find something.

Comment 28 brice 2016-12-08 03:49:05 UTC
Actually, I'll put this back to modified in the meantime.

Comment 29 brice 2016-12-14 04:55:26 UTC
Submitted another PR for the extra info:


Comment 30 brice 2016-12-15 00:35:35 UTC
PR has merged.
