Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1367035

Summary:	[DOCS] Document etcd cluster recovery after node failure + installing new etcd nodes
Product:	OpenShift Container Platform	Reporter:	Josep 'Pep' Turro Mauri <pep>
Component:	Documentation	Assignee:	brice <bfallonf>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:	Vikram Goyal <vigoyal>
Priority:	high
Version:	3.2.0	CC:	aos-bugs, erich, gpei, jeder, jialiu, jokerman, knakayam, mmccomas, nschuetz, pdwyer, pep, rhowe, tstclair
Target Milestone:	---	Keywords:	Performance
Target Release:	---	Flags:	erich: needinfo- erich: needinfo-
Hardware:	All
OS:	Linux
Whiteboard:	aos-scalability-34
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-01-24 05:00:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1350875
Bug Blocks:

Description Josep 'Pep' Turro Mauri 2016-08-15 10:56:32 UTC

Document URL: 
This is a request for a new section. It might fit under https://docs.openshift.com/enterprise/3.2/admin_guide/high_availability.html or maybe another one, but maybe it's better to have a separate section under Cluster Administration.

Section Number and Name: 
The name could be "etcd HA cluster recovery" or similar

Describe the issue: 
There is no documentation on how to recover an HA etcd cluster if one [or more] of the nodes fails.

Suggestions for improvement: 
Add a section to document etcd node recovery

Additional information: 

Bug 1350875 addresses a technical issue with etcd recovery. Comment 31 of that bz outlines the process of cluster/node recovery using the approach made possible after the code change in that bz is released.

The steps listed there force the creation of a new cluster and involve downtime. It would be great if we could explore equivalent steps for replacing a failed node without downtime.

Comment 3 Josep 'Pep' Turro Mauri 2016-09-20 08:53:53 UTC

*** Bug 1377487 has been marked as a duplicate of this bug. ***

Comment 6 Ryan Howe 2016-09-22 19:28:27 UTC

Steps to add a new etcd member using the CA configured during the OpenShift install. 

https://access.redhat.com/articles/2650151

Comment 11 brice 2016-10-14 04:19:18 UTC

Pep, Eric, I've included a section on recovering etcd hosts:

https://github.com/openshift/openshift-docs/pull/3047

But there's so much going on in the comments above that I'm not sure I'm going down the right path. Can I get your thoughts?

Also, there's a link to a KBase article about adding etcd hosts. How does that tie into this BZ? Should I be including it in the docs? Here perhaps:

https://docs.openshift.com/enterprise/3.2/install_config/adding_hosts_to_existing_cluster.html

Thanks!

Comment 14 brice 2016-10-27 06:17:15 UTC

Ryan, Eric,

Thanks for the comments. I was replicating Pep's suggestion above, but if the article Ryan pasted above is more accurate, then that might be a better source.

Can I ask for an ack that it's all there? Then I'll pass this to QE for a test.

Thanks!

Comment 18 brice 2016-11-15 01:26:00 UTC

Pep,

Hmm I can't see any comments that you made. What did you comment about?

And really, all I'd be after would be that you think it fulfills this BZ. I think there's a lot that's being asked for here, and I want to know I'm aiming in the correct direction. Then, I can get it tested and checked out by devel, etc.

I don't expect you to test it out and tell me what I'm doing wrong, just that this is what you're after. If I'm headed in the wrong direction, it'd be great to get that under control before I involve anyone else.

Thanks.

Comment 19 brice 2016-11-17 01:23:53 UTC

Ryan has given the thumbs up in the PR. I'll put this onto QA.

Johnny,

The "Adding New etcd Hosts" section has been added with this BZ;

https://github.com/bfallonf/openshift-docs/blob/a0da5f21c0db6ba6a4a363198f8c15ff0844b7bf/admin_guide/backup_restore.adoc#backup-restore-adding-etcd-hosts

Do you have the capability to test the procedure? Please let me know if there is anything wrong.

Thanks much, all.

Comment 20 Anping Li 2016-11-22 04:47:52 UTC

I'd like to request two changes.
1)  yum install etcd -> yum install etcd iptables-services
Reason: iptables-services is not installed by default, so it is better to guide the customer to install it.

2)  Create the ${PREFIX} directory before using it.
    'mkdir ${PREFIX}' before the subtitle "Create the server.csr and server.crt certificates:"

Reason: Without this directory, openssl fails with "No such file or directory".
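
The two requested changes could look roughly like the sketch below. It rests on assumptions: the value given to PREFIX is a hypothetical placeholder rather than the path used in the docs, `mkdir -p` is used in place of a plain `mkdir` so that re-running the step does not fail, and the package installation is wrapped in a function that is not called, because it needs root on the new etcd host.

```shell
#!/bin/sh
# Sketch only: PREFIX is a hypothetical placeholder; the documented
# procedure defines its own value for this variable.
PREFIX="${PREFIX:-${TMPDIR:-/tmp}/etcd-generated-certs-example}"

# Change 1: install iptables-services together with etcd, because it is
# not installed by default. Run as root on the new etcd host.
# Defined here for illustration only; it is not invoked by this sketch.
install_packages() {
    yum install -y etcd iptables-services
}

# Change 2: create the directory before any openssl command writes into
# it; without it, openssl fails with "No such file or directory".
mkdir -p "${PREFIX}"
```

Placing the directory creation immediately before the certificate step keeps the failure described above from occurring on a freshly installed host.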

Comment 21 brice 2016-11-23 00:45:32 UTC

Thanks, Anping Li

I've made the changes you suggest, and I'll move this to peer review.

Comment 22 openshift-github-bot 2016-11-28 04:19:56 UTC

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/7d657a14ef5f5682d972b6b3576318ad644e5747
Merge pull request #3047 from bfallonf/etcd-1367035

Bug 1367035 added info on restoring etcd hosts

Comment 23 Kenjiro Nakayama 2016-12-02 08:36:08 UTC

Hi,

An engineer informed me that when we add new etcd nodes using the backup process with 700 MB of data, we should stop the etcd services on the other hosts. (Please refer to [1])

Should we add this instruction to the docs and the article?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1398083#c12

Comment 25 brice 2016-12-05 04:11:33 UTC

Eric, Kenjiro,

I can see Eric's recommendation making it into the docs, and if that doesn't work for the reader, falling back to stopping the other etcd hosts. Something like:

"If the etcd backup is larger than 700 MB, prune the resource (link to pruning docs). If the backup is still larger than 700 MB, stop the other hosts before performing the steps in this topic."

My one question would be: Where in the section would this be? Where would be the best time to have the reader stop the etcd hosts?
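
The size check implied by the proposed wording could be sketched as follows. This is an illustration only: BACKUP_DIR is a hypothetical location, the helper name `backup_exceeds_limit` is made up for this example, and measuring the backup with `du -sm` is an assumption about how the 700 MB figure would be checked.

```shell
#!/bin/sh
# Sketch only: BACKUP_DIR is a hypothetical path to the etcd backup.
BACKUP_DIR="${BACKUP_DIR:-/var/lib/etcd-backup-example}"
LIMIT_MB=700

# Succeeds (exit status 0) when directory $1 uses more than $2 megabytes.
backup_exceeds_limit() {
    size_mb=$(du -sm "$1" | cut -f1)
    [ "${size_mb}" -gt "$2" ]
}

if [ ! -d "${BACKUP_DIR}" ]; then
    echo "No backup directory found at ${BACKUP_DIR}."
elif backup_exceeds_limit "${BACKUP_DIR}" "${LIMIT_MB}"; then
    echo "Backup is larger than ${LIMIT_MB} MB: prune first; if it is still too large, stop the other etcd hosts."
else
    echo "Backup is within ${LIMIT_MB} MB."
fi
```

Running the same check again after pruning would show whether the fallback of stopping the other hosts is still needed.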

Comment 27 brice 2016-12-08 03:47:33 UTC

Link to released docs:

https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts

However, I've sent an email to the engineering lists about the conversation above, so I'll update the docs once I have an answer.

Comment 28 brice 2016-12-08 03:49:05 UTC

Actually, I'll put this back to modified in the meantime.

Comment 29 brice 2016-12-14 04:55:26 UTC

Submitted another PR for the extra info:

https://github.com/openshift/openshift-docs/pull/3385

Comment 30 brice 2016-12-15 00:35:35 UTC

PR has merged.

Comment 31 brice 2017-01-24 05:00:55 UTC

Link to released docs:

https://docs.openshift.com/container-platform/3.4/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts