Bug 1826021 - [4.3] etcd-snapshot-restore.sh fails due to "Error: snapshot restore requires exactly one argument"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1826023
Reported: 2020-04-20 17:15 UTC by Robert Bost
Modified: 2020-08-05 10:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A regular expression for obtaining the etcd member name was defined incorrectly. Consequence: More than one etcd member name is returned. Workaround: List the member name first in INITIAL_CLUSTER. Fix: The greedy regular expression was fixed. Result: Exactly one etcd member name is matched.
Clone Of:
: 1826023 (view as bug list)
Environment:
Last Closed: 2020-08-05 10:54:06 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4999481 None None None 2020-04-20 17:48:18 UTC
Red Hat Product Errata RHBA-2020:3180 None None None 2020-08-05 10:54:34 UTC

Description Robert Bost 2020-04-20 17:15:37 UTC
Description of problem:

When trying to run etcd-snapshot-restore.sh with more than one member listed in INITIAL_CLUSTER, it is possible for the script to fail:

  [core@etcd-1 ~]$ export INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
  [core@etcd-1 ~]$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/backup/snapshot.db $INITIAL_CLUSTER
  ...
  Removing etcd data-dir /var/lib/etcd
  Restoring etcd member etcd-1.example.com
  etcd-1.example.com from snapshot..
  Error: snapshot restore requires exactly one argument

The workaround is to order INITIAL_CLUSTER so that the node you are executing commands on is listed *first* in INITIAL_CLUSTER.

Additional info:
I filed this under the MCO component since that's where the etcd-snapshot-restore.sh script is shipped. Not sure if this is the right decision though.

- The INITIAL_CLUSTER is parsed incorrectly in the bash function linked below. The regular expression is too greedy with the "*" and captures all node names at and before the ETCD_DNS_NAME.

  https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L428-L436

  For example:

    $ export INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
    $ ETCD_DNS_NAME=etcd-1.example.com
    $ validate_etcd_name 
    etcd-0.example.com
    etcd-1.example.com

  I would only expect etcd-1.example.com to be listed, and so does the etcd snapshot restore command that this information is passed to.
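
  As a rough illustration of the expected behavior (this is not the shipped function and not the eventual fix), the lookup could avoid a regular expression entirely by splitting INITIAL_CLUSTER on commas and comparing each entry's peer URL host exactly. The helper name `find_member_name` is hypothetical, and the sketch assumes every entry has the form <name>=<scheme>://<host>:<port> with no IPv6 literals:

```shell
# Hypothetical helper, not part of etcd-snapshot-restore.sh.
# Prints the member name whose peer URL host equals the given DNS name.
# The body runs in a subshell so IFS and "set -f" do not leak to the caller.
find_member_name() (
  dns_name=$1
  IFS=','
  set -f                      # disable globbing while splitting on commas
  for entry in $2; do
    name=${entry%%=*}         # text before the first "="
    url=${entry#*=}           # text after the first "="
    host=${url#*://}          # drop the scheme
    host=${host%%:*}          # drop the port
    if [ "$host" = "$dns_name" ]; then
      printf '%s\n' "$name"
      exit 0
    fi
  done
  exit 1                      # no entry matched
)

INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
find_member_name "etcd-1.example.com" "$INITIAL_CLUSTER"
```

  Because each entry is compared on its own, a match can never span a comma into the neighbouring entry, which is what the greedy pattern allows.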

Comment 7 Sam Batschelet 2020-05-22 18:44:17 UTC
```
  export ETCD_DNS_NAME="etcd-0.example.com"
  export ETCD_INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
  
  echo "test 1"
  echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=)[^,,\s]*(?==[^=]*${ETCD_DNS_NAME}\b)"
  echo "test 2"
  echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=,)${ETCD_DNS_NAME}(?==)"

test 1
etcd-0.example.com
test 2
```

The function takes ETCD_DNS_NAME, verifies that it is listed in ETCD_INITIAL_CLUSTER, and returns the name from the matching record. The format of each INITIAL_CLUSTER entry is <name>=<peer-url>. Given that, you can see that test 1 matches and test 2 does not. But as Suresh said, the issue is not with that function.


> Error: snapshot restore requires exactly one argument

The following example shows how the error could happen. The command `etcdctl snapshot restore` takes a single argument, which is $SNAPSHOT_FILE [1]. Notice the space in the path.

SNAPSHOT_FILE="/home/core/assets/backup dir/snapshot.db"

etcdctl snapshot restore $SNAPSHOT_FILE
Error: snapshot restore requires exactly one argument
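
The word splitting behind this can be seen without etcdctl. The sketch below uses a hypothetical `count_args` function that only reports how many arguments it received, and assumes the default IFS:

```shell
# Hypothetical stand-in for a command that insists on exactly one argument.
count_args() { echo "$#"; }

SNAPSHOT_FILE="/home/core/assets/backup dir/snapshot.db"

# Unquoted: the shell splits the value at the space into two words.
count_args $SNAPSHOT_FILE

# Quoted: the value is passed through as a single argument.
count_args "$SNAPSHOT_FILE"
```

The first call reports 2 and the second reports 1, so quoting the expansion is enough to avoid the error in this scenario.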

Can you confirm your ocp version?


[1]https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/usr-local-bin-openshift-recovery-tools-sh.yaml#L186

Comment 9 Robert Bost 2020-05-26 16:08:56 UTC
I will take the same test from c#7 and set ETCD_DNS_NAME=etcd-1.example.com. The command output is shared below and demonstrates the problem that customers have run into when following our docs [1]. The multi-line output of "test 1" is passed to the snapshot restore command, resulting in the "snapshot restore requires exactly one argument" error.

Perhaps my assumption that someone would be configuring ETCD_DNS_NAME and ETCD_INITIAL_CLUSTER in this way is not acceptable. Can you please verify and I will file a documentation bug if needed?


```
export ETCD_DNS_NAME="etcd-1.example.com"
export ETCD_INITIAL_CLUSTER="etcd-0.example.com=https://etcd-0.example.com:2380,etcd-1.example.com=https://etcd-1.example.com:2380,etcd-2.example.com=https://etcd-2.example.com:2380"
  
echo "test 1"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=)[^,,\s]*(?==[^=]*${ETCD_DNS_NAME}\b)"
echo "test 2"
echo ${ETCD_INITIAL_CLUSTER} | grep -oP "(?<=,)${ETCD_DNS_NAME}(?==)"


test 1
etcd-0.example.com
etcd-1.example.com
test 2
etcd-1.example.com
```
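
Independently of how the pattern is corrected, a caller could guard against this failure mode by checking that the lookup returned exactly one name before using it. The following is only a sketch; `require_single_match` is a hypothetical helper and is not part of the recovery scripts:

```shell
# Hypothetical guard: succeed and echo the name only if exactly one
# non-empty line was matched; otherwise report the count and fail.
require_single_match() {
  matches=$1
  # grep -c exits non-zero when it counts 0 lines, hence the "|| true".
  count=$(printf '%s\n' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one etcd member name, got $count" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# The two-line output of "test 1" above is rejected.
two_names=$(printf 'etcd-0.example.com\netcd-1.example.com')
require_single_match "$two_names" || echo "rejected"

# A single name passes through unchanged.
require_single_match "etcd-1.example.com"
```

Failing early with a message that names the real problem would be easier to diagnose than the argument-count error from the restore command.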

References:
[1] https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state

Comment 10 Sam Batschelet 2020-06-20 12:41:00 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 14 Suresh Kolichala 2020-07-09 13:48:21 UTC
This is a 4.3 bug, and the bug no longer exists in 4.4 and later. Please don't change the target release. The fix is available and being reviewed:

https://github.com/openshift/machine-config-operator/pull/1913

Comment 25 errata-xmlrpc 2020-08-05 10:54:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.3.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3180

