Description of problem:

When etcd-snapshot-restore.sh is run with more than one member listed in INITIAL_CLUSTER, the script can fail:

[core@etcd-1 ~]$ export INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
[core@etcd-1 ~]$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/backup/snapshot.db $INITIAL_CLUSTER
...
Removing etcd data-dir /var/lib/etcd
Restoring etcd member etcd-1.example.com etcd-1.example.com from snapshot..
Error: snapshot restore requires exactly one argument

The workaround is to reorder INITIAL_CLUSTER so that the node you are running the commands on is listed *first*.

Additional info:

I filed this under the MCO component since that's where the etcd-snapshot-restore.sh script is shipped. I'm not sure whether this is the right decision, though.

- INITIAL_CLUSTER is parsed incorrectly in the bash function linked below. The regular expression is too greedy with the "*" and captures all node names at and before the ETCD_DNS_NAME.

https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L428-L436

For example:

$ export INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
$ ETCD_DNS_NAME=etcd-1.example.com
$ validate_etcd_name
etcd-0.example.com
etcd-1.example.com

I would expect only etcd-1.example.com to be listed, and so does the etcd snapshot restore command that this information is passed to.
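To illustrate the expected behaviour described above, here is a minimal sketch of an exact-match lookup written with plain shell parameter expansion. The function name find_member_name is hypothetical and this is not the shipped validate_etcd_name; it only assumes the <name>=<peer-url> comma-separated format shown in the report.

```shell
# Hypothetical helper (not the shipped validate_etcd_name): print the member
# name in an INITIAL_CLUSTER-style string that is exactly equal to the given
# DNS name, and return non-zero when no member matches.
find_member_name() {
  fmn_rest="$1,"          # trailing comma so the last entry is handled like the others
  fmn_dns_name="$2"
  while [ -n "$fmn_rest" ]; do
    fmn_entry="${fmn_rest%%,*}"    # first comma-separated entry
    fmn_rest="${fmn_rest#*,}"      # everything after the first comma
    fmn_name="${fmn_entry%%=*}"    # text before the first "=" is the member name
    if [ "$fmn_name" = "$fmn_dns_name" ]; then
      printf '%s\n' "$fmn_name"
      return 0
    fi
  done
  return 1
}

INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"

# Prints only: etcd-1.example.com
find_member_name "$INITIAL_CLUSTER" "etcd-1.example.com"
```

Because each entry is compared on its own, the position of the local node in INITIAL_CLUSTER does not affect the result in this sketch.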
```
export ETCD_DNS_NAME="etcd-0.example.com"
export ETCD_INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
echo "test 1"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=)[^,,\s]*(?==[^=]*${ETCD_DNS_NAME}\b)"
echo "test 2"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=,)${ETCD_DNS_NAME}(?==)"

test 1
etcd-0.example.com
test 2
```

The function takes ETCD_DNS_NAME, verifies that it is listed in ETCD_INITIAL_CLUSTER, and returns the name that matches that record. The format for INITIAL_CLUSTER is <name>=<peer-url>. Given that, you can see that test 1 matches and test 2 does not. But as Suresh said, the issue is not with that function.

> Error: snapshot restore requires exactly one argument

The following example shows how the error could happen. The command `etcdctl snapshot restore` takes a single argument, which is $SNAPSHOT_FILE [1]. Note the space in the path.

SNAPSHOT_FILE="/home/core/assets/backup dir/snapshot.db"
etcdctl snapshot restore $SNAPSHOT_FILE
Error: snapshot restore requires exactly one argument

Can you confirm your OCP version?

[1] https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L186
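The word-splitting effect in the SNAPSHOT_FILE example can be reproduced without etcdctl. The sketch below uses restore_stub, a hypothetical stand-in that only mimics the argument-count check and reuses the error text quoted above; it is not the real etcdctl code path.

```shell
# Hypothetical stand-in for a command that insists on exactly one argument.
# It only mimics the argument-count check; it does not restore anything.
restore_stub() {
  if [ "$#" -ne 1 ]; then
    echo "Error: snapshot restore requires exactly one argument"
    return 1
  fi
  echo "restoring from: $1"
}

SNAPSHOT_FILE="/home/core/assets/backup dir/snapshot.db"

# Unquoted: the shell splits the path at the space, so the stub sees two arguments.
restore_stub $SNAPSHOT_FILE || true

# Quoted: the path stays a single argument.
restore_stub "$SNAPSHOT_FILE"
```

The first call prints the error and the second prints the full path, which suggests that quoting the variable at the call site would be enough to rule out this particular cause.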
I will take the same test from c#7 and set ETCD_DNS_NAME=etcd-1.example.com. The command output is shared below and demonstrates the problem that customers have run into when following our docs [1]. The two names printed by "test 1" are both passed to the snapshot restore command, resulting in the "snapshot restore requires exactly one argument" error.

Perhaps my assumption that someone would configure ETCD_DNS_NAME and ETCD_INITIAL_CLUSTER in this way is not acceptable. Can you please verify? I will file a documentation bug if needed.

```
export ETCD_DNS_NAME="etcd-1.example.com"
export ETCD_INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
echo "test 1"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=)[^,,\s]*(?==[^=]*${ETCD_DNS_NAME}\b)"
echo "test 2"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=,)${ETCD_DNS_NAME}(?==)"

test 1
etcd-0.example.com
etcd-1.example.com
test 2
etcd-1.example.com
```

References:
[1] https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state
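As a sketch of how a script could detect this situation before calling the restore command, the hypothetical guard below checks that a lookup result contains exactly one name. The function name require_single_name and the messages are assumptions for illustration; nothing like it is claimed to exist in the shipped script.

```shell
# Hypothetical guard: succeed only when the lookup result is exactly one word.
require_single_name() {
  # Deliberately unquoted so the result is re-split on whitespace and newlines
  # into this function's positional parameters.
  set -- $1
  [ "$#" -eq 1 ]
}

# The two-line result from "test 1" above, and the single name that was expected.
multi_result="etcd-0.example.com
etcd-1.example.com"
single_result="etcd-1.example.com"

require_single_name "$multi_result" || echo "ambiguous member name, refusing to restore"
require_single_name "$single_result" && echo "ok: $single_result"
```

With the values above, the first check reports the ambiguous result and the second prints the single name.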
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This is a 4.3 bug; it no longer exists in 4.4 and later. Please don't change the target release. The fix is available and being reviewed: https://github.com/openshift/machine-config-operator/pull/1913
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.3.31 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3180