Document URL: This is a request for a new section. It might fit under https://docs.openshift.com/enterprise/3.2/admin_guide/high_availability.html or another page, but it may be better as a separate section under Cluster Administration.
Section Number and Name: The name could be "etcd HA cluster recovery" or similar.
Describe the issue: There is no documentation on how to recover an HA etcd cluster if one or more of the nodes fail.
Suggestions for improvement: Add a section that documents etcd node recovery.
Additional information: Bug 1350875 addresses a technical issue with etcd recovery. Comment 31 of that BZ outlines the cluster/node recovery process that becomes possible once the code change in that BZ is released. The steps listed there force the creation of a new cluster and involve downtime. It would be great to explore equivalent steps for replacing a failed node without downtime.
*** Bug 1377487 has been marked as a duplicate of this bug. ***
Steps to add a new etcd member using the CA configured during the OpenShift install. https://access.redhat.com/articles/2650151
Pep, Eric, I've included a section on recovering etcd hosts: https://github.com/openshift/openshift-docs/pull/3047 But there's so much going on in the above comments that I'm not sure I'm going down the right path. Can I get your thoughts? Also, there's a link to a KBase article about adding etcd hosts. How does that tie into this BZ? Should I include it in the docs? Here, perhaps: https://docs.openshift.com/enterprise/3.2/install_config/adding_hosts_to_existing_cluster.html Thanks!
Ryan, Eric, Thanks for the comments. I was replicating Pep's suggestion above, but if the article Ryan pasted above is more accurate, then that might be a better source. Can I ask for an ack that it's all there? Then I'll pass this to QE for a test. Thanks!
Pep, Hmm I can't see any comments that you made. What did you comment about? And really, all I'd be after would be that you think it fulfills this BZ. I think there's a lot that's being asked for here, and I want to know I'm aiming in the correct direction. Then, I can get it tested and checked out by devel, etc. I don't expect you to test it out and tell me what I'm doing wrong, just that this is what you're after. If I'm headed in the wrong direction, it'd be great to get that under control before I involve anyone else. Thanks.
Ryan has given the thumbs up in the PR. I'll put this onto QA. Johnny, the "Adding New etcd Hosts" section has been added with this BZ: https://github.com/bfallonf/openshift-docs/blob/a0da5f21c0db6ba6a4a363198f8c15ff0844b7bf/admin_guide/backup_restore.adoc#backup-restore-adding-etcd-hosts Are you able to test the procedure? Please let me know if there is anything wrong. Thanks much, all.
I'd like to request two changes.
1) Change "yum install etcd" to "yum install etcd iptables-services". Reason: iptables-services is not installed by default, so it is better to guide the customer to install it.
2) Create the ${PREFIX} directory before using it: add 'mkdir ${PREFIX}' before the subtitle "Create the server.csr and server.crt certificates:". Reason: Without this directory, openssl fails with "No such file or directory".
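For illustration, the two requested changes could look like this on the new etcd host. This is only a sketch: the `PREFIX` default shown is a hypothetical placeholder rather than the path the documented procedure uses, `mkdir -p` is used instead of plain `mkdir` so that rerunning the step does not fail, and the `yum` line is commented out because it needs root and a configured repository.

```shell
#!/bin/sh
# Sketch of the two requested additions; values marked "hypothetical" are assumptions.

# 1) Install etcd together with iptables-services (not installed by default).
#    Commented out here because it requires root and repository access:
# yum install -y etcd iptables-services

# 2) Create the ${PREFIX} directory before the openssl steps write into it.
#    Hypothetical default; the real procedure defines its own PREFIX.
PREFIX="${PREFIX:-./generated_certs/etcd-example/}"
mkdir -p "${PREFIX}"

# openssl can now write server.csr and server.crt under ${PREFIX}
# without hitting "No such file or directory".
if [ -d "${PREFIX}" ]; then
    echo "certificate directory ready: ${PREFIX}"
fi
```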
Thanks, Anping Li. I've made the changes you suggested, and I'll move this to peer review.
Commit pushed to master at https://github.com/openshift/openshift-docs
https://github.com/openshift/openshift-docs/commit/7d657a14ef5f5682d972b6b3576318ad644e5747
Merge pull request #3047 from bfallonf/etcd-1367035
Bug 1367035 added info on restoring etcd hosts
Hi, an engineer told me that when we add new etcd nodes using the backup process and the data is large (700 MB in that case), we should stop the etcd services on the other hosts. (Please refer to [1].) Should we add this instruction to the docs and the article? [1] https://bugzilla.redhat.com/show_bug.cgi?id=1398083#c12
Eric, Kenjiro, I can see Eric's recommendation making it into the docs, with stopping the other etcd hosts as the fallback if that doesn't work for the reader. Something like: "If the etcd backup is larger than 700 MB, prune the resource (link to pruning docs). If the backup is still larger than 700 MB, stop the other hosts before performing the steps in this topic." My one question is: where in the section should this go? At what point should the reader stop the etcd hosts?
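The proposed guidance amounts to a simple size check, which could be sketched as below. The 700 MB threshold comes from the discussion above; the function names and the backup path are hypothetical, and `du -sm` is assumed to be available (it is on GNU coreutils).

```shell
#!/bin/sh
# Sketch of the proposed guidance. The 700 MB threshold comes from the
# discussion above; function names and the backup path are hypothetical.

THRESHOLD_MB=700

# Print the size of a directory in whole megabytes, as reported by du.
backup_size_mb() {
    du -sm "$1" | cut -f1
}

# Print the suggested next step for a backup of the given size in MB.
recommend_action() {
    if [ "$1" -gt "$THRESHOLD_MB" ]; then
        echo "prune first; if still over ${THRESHOLD_MB} MB, stop etcd on the other hosts"
    else
        echo "proceed with the procedure"
    fi
}

# Example usage against a hypothetical backup location:
# recommend_action "$(backup_size_mb /path/to/etcd/backup)"
```

The check only tells the reader which branch of the guidance applies; it does not prune anything or stop any service.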
Link to released docs: https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts However, I've sent an email to the engineering lists about the conversation above, so I'll update the docs once I learn more.
Actually, I'll put this back to modified in the meantime.
Submitted another PR for the extra info: https://github.com/openshift/openshift-docs/pull/3385
PR has merged.
Link to released docs: https://docs.openshift.com/container-platform/3.4/admin_guide/backup_restore.html#backup-restore-adding-etcd-hosts