1419670 – [DOCS] Incorrect etcd backup and restore procedure

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Ashley Hardin
QA Contact:	Anping Li
Docs Contact:	Vikram Goyal
URL:
Whiteboard:
Duplicates (1):	1421072
Depends On:
Blocks:	1405338

Reported:	2017-02-06 17:31 UTC by Jaspreet Kaur
Modified:	2020-05-14 15:37 UTC
CC List:	9 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-03-27 15:52:53 UTC
Target Upstream Version:
Embargoed:
Flags:	sttts: needinfo+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1421072	0	high	CLOSED	[DOCS] Document etcd cluster recovery after node failure + installing new etcd nodes	2021-02-22 00:41:40 UTC

Description Jaspreet Kaur 2017-02-06 17:31:09 UTC

Description of problem:
In a disaster recovery situation, the procedure provided in the documentation at

 https://docs.openshift.com/container-platform/3.4/admin_guide/backup_restore.html#cluster-backup 

or at  

https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#cluster-backup

is not useful: the etcd process won't start again because the db file (${ETCD_DATA_DIR}/member/snap/db) is missing from the backup.

We realize that etcd3 runs in a compatibility mode and that the procedure for restoring the v2 keys is the same, but it seems that etcd also needs that file. Backing up that file is impossible because the "snapshot" argument of the "etcdctl" command is not available, although it should be according to the CoreOS docs: https://coreos.com/etcd/docs/3.0.15/op-guide/recovery.html.

Etcd fails to start

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: etcd does not start, and the documented procedure is incorrect.


Expected results: etcd should start after the restore, and the documentation needs to be updated.

Additional info:

Comment 1 Timothy St. Clair 2017-02-06 22:20:49 UTC

This seems like two separate issues... 

1. Update the docs on disaster recovery 
2. Determine why your etcd instance is not starting.  

To #1, iirc before an upgrade a snapshot is saved.

Comment 26 Stefan Schimanski 2017-02-23 07:53:51 UTC

*** Bug 1421072 has been marked as a duplicate of this bug. ***

Comment 31 Ashley Hardin 2017-03-03 17:17:34 UTC

Updated pull request: https://github.com/openshift/openshift-docs/pull/3827

Comment 39 Anping Li 2017-03-16 01:27:41 UTC

(In reply to Anping Li from comment #36)
> For
> https://github.com/ahardin-rh/openshift-docs/blob/
> 5cfb6dc2ee7fcb1d15007ad85afb2998f81e6cdf/admin_guide/backup_restore.
> adoc#cluster-backup
> Should note 'cp "$ETCD_DATA_DIR"/member/snap/db "$HOTDIR"/member/snap/db'
> when use etcd 3.0.15.

This was discussed in comments 2, 10, and 15.
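
For illustration, the copy step quoted above could be wrapped as in the sketch below. The function name copy_snap_db is hypothetical. The sketch assumes that the backup directory ("$HOTDIR" in the quoted command) was already written by the documented backup step, and that etcd is not writing to the db file while it is copied.

```shell
# Hypothetical helper, not part of the documented procedure.
# Copies the db file that this bug reports as missing from the backup.
copy_snap_db() {
    data_dir=$1      # live etcd data directory (ETCD_DATA_DIR)
    backup_dir=$2    # existing backup directory (HOTDIR in the quoted command)
    db_file="$data_dir/member/snap/db"

    # Fail loudly instead of producing a backup without the db file.
    if [ ! -f "$db_file" ]; then
        echo "no db file at $db_file; nothing to copy" >&2
        return 1
    fi

    mkdir -p "$backup_dir/member/snap"
    cp "$db_file" "$backup_dir/member/snap/db"
}
```

It would be called as copy_snap_db "$ETCD_DATA_DIR" "$HOTDIR" right after the backup step.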

> For
> https://github.com/ahardin-rh/openshift-docs/blob/
> 5cfb6dc2ee7fcb1d15007ad85afb2998f81e6cdf/admin_guide/backup_restore.
> adoc#cluster-restore
> No step force-new-cluster and restart etcd service.  without them, for
> single-member etcd clusters, we  also need to see 
> https://docs.openshift.com/container-platform/3.4/install_config/downgrade.
> html#downgrading-restoring-embedded-etcd

There is a note: 'This restore operation only works for single-member etcd clusters. For multiple-member etcd clusters, see Restoring etcd.' In fact, the restore operation that follows the note is not complete: the steps to start etcd with force-new-cluster and to restart the etcd service are missing.

Either change the note or copy the force-new-cluster and restart etcd service steps here.
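
A minimal sketch of what the missing force-new-cluster step might look like, mirroring the sed-based removal shown later in this bug. The function name add_force_new_cluster is hypothetical, and the sketch assumes the ExecStart line in the unit file ends with a closing double quote, as in the example output in comment 41.

```shell
# Hypothetical helper: append --force-new-cluster to the etcd ExecStart line.
# Assumes the line ends with a closing double quote, e.g.
#   ExecStart=/bin/bash -c "GOMAXPROCS=$(nproc) /usr/bin/etcd"
add_force_new_cluster() {
    unit_file=$1

    # Do nothing if the option is already present.
    if grep -q -- '--force-new-cluster' "$unit_file"; then
        return 0
    fi

    # Replace the trailing quote with the option followed by the quote.
    sed -i '/ExecStart/s/"$/ --force-new-cluster"/' "$unit_file"
}
```

After running it against /usr/lib/systemd/system/etcd.service, systemd still has to pick up the change (systemctl daemon-reload) before the etcd service is restarted.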


> 4.c)
> mkdir $PREFIX before run openssl

Without the $PREFIX directory, the openssl command that follows will fail.
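
As a sketch of the suggested fix, the directory could be created and checked before openssl runs. The function name ensure_prefix_dir is hypothetical, and the sketch assumes $PREFIX names a directory path.

```shell
# Hypothetical helper: make sure the certificate output directory exists
# and is writable before any openssl command writes into it.
ensure_prefix_dir() {
    prefix=$1
    mkdir -p "$prefix" && [ -w "$prefix" ]
}
```

In the procedure this would be run as ensure_prefix_dir "$PREFIX" immediately before the openssl command in step 4.c.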

> 4.e) cp ca.crt ${PREFIX} -> cp ca/ca.crt ${PREFIX}
  This step is not necessary; drop it.

Comment 41 Anping Li 2017-03-20 01:25:29 UTC

1. https://github.com/ahardin-rh/openshift-docs/blob/240abad8bc6109fc349c6f5b76521e144f08119a/admin_guide/backup_restore.adoc#cluster-backup
# tar cf /tmp/certs-and-keys-$(hostname).tar *.key *.crt \
    master.proxy-client.crt \
    master.proxy-client.key \
    proxyca.crt \
    proxyca.key \
    master.server.crt \
    master.server.key \
    ca.crt \
    ca.key \
    master.etcd-client.crt \
    master.etcd-client.key \
    master.etcd-ca.crt

Should be 
# tar cf /tmp/certs-and-keys-$(hostname).tar *.key *.crt  

2. https://github.com/ahardin-rh/openshift-docs/blob/240abad8bc6109fc349c6f5b76521e144f08119a/admin_guide/backup_restore.adoc#cluster-restore-for-single-member-etcd-clusters
 
A step similar to step 4 of https://github.com/ahardin-rh/openshift-docs/blob/240abad8bc6109fc349c6f5b76521e144f08119a/admin_guide/backup_restore.adoc#external-etcd needs to be added.

For example:


Verify the etcd service started correctly, then re-edit the /usr/lib/systemd/system/etcd.service file and remove the --force-new-cluster option:

# sed -i '/ExecStart/s/ --force-new-cluster//' /usr/lib/systemd/system/etcd.service
# cat /usr/lib/systemd/system/etcd.service  | grep ExecStart

ExecStart=/bin/bash -c "GOMAXPROCS=$(nproc) /usr/bin/etcd"



Then restart the etcd service:

# systemctl daemon-reload
# systemctl start etcd



3. The other parts look good.

Comment 43 Anping Li 2017-03-21 10:05:29 UTC

https://github.com/ahardin-rh/openshift-docs/blob/240abad8bc6109fc349c6f5b76521e144f08119a/admin_guide/backup_restore.adoc#cluster-backup

 tar cf /tmp/certs-and-keys-$(hostname).tar *.key *.crt \
>     master.proxy-client.crt \
>     master.proxy-client.key \
>     proxyca.crt \
>     proxyca.key \
>     master.server.crt \
>     master.server.key \
>     ca.crt \
>     ca.key \
>     master.etcd-client.crt \
>     master.etcd-client.key \
>     master.etcd-ca.crt
tar: proxyca.crt: Cannot stat: No such file or directory
tar: proxyca.key: Cannot stat: No such file or directory


1) The names of the crt and key files can vary. For example, custom certificates can have different names. That is why I suggested using the command 'tar cf /tmp/certs-and-keys-$(hostname).tar *.key *.crt'.
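
A minimal sketch of the glob-based approach, written so that it archives whatever .key and .crt files actually exist rather than a fixed list of names. The function name archive_certs is hypothetical, and the archive path is assumed to be absolute because the function changes directory before calling tar.

```shell
# Hypothetical helper: archive every *.key and *.crt file in a directory,
# whatever the files are called, instead of listing fixed names.
archive_certs() {
    cert_dir=$1
    archive=$2    # assumed to be an absolute path
    (
        cd "$cert_dir" || exit 1

        # Collect only the files that really exist; an unmatched glob
        # stays as the literal pattern and is skipped by the -f test.
        set --
        for f in *.key *.crt; do
            if [ -f "$f" ]; then
                set -- "$@" "$f"
            fi
        done

        if [ "$#" -eq 0 ]; then
            echo "no .key or .crt files in $cert_dir" >&2
            exit 1
        fi

        tar cf "$archive" "$@"
    )
}
```

Because the file list is built from what is present, a missing proxyca.crt or proxyca.key no longer produces a "Cannot stat" error.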

Comment 45 Anping Li 2017-03-23 08:06:07 UTC

It looks good to me.

Comment 46 openshift-github-bot 2017-03-23 13:29:23 UTC

Commits pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/b38042de02d9780842dce95cfa0ef45d53b58bc6
Bug 1419670, Update backup and restore procedure

https://github.com/openshift/openshift-docs/commit/be0f62d5b5e30b5a56a061382cec07cba1909f94
Merge pull request #3827 from ahardin-rh/etcd-backup-restore

Bug 1419670, Update backup and restore procedure

Comment 47 Ashley Hardin 2017-03-27 15:52:53 UTC

Content is now published:
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.4/html-single/cluster_administration/#admin-guide-backup-and-restore
