Bug 1576297 - etcd v3 migrate playbook does not have idempotency
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Scott Dodson
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-09 07:45 UTC by Kenjiro Nakayama
Modified: 2021-06-10 16:07 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-09 12:57:01 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1564098 0 unspecified CLOSED etcd3 migrate playbook fails with single master topology due to handler error 2021-02-22 00:41:40 UTC

Internal Links: 1564098

Description Kenjiro Nakayama 2018-05-09 07:45:46 UTC
Description of problem:
- If the etcd v3 migrate playbook fails partway through after some keys have already been migrated, re-running it fails at "TASK [etcd : Check if there are any v3 data]".


Version-Release number of the following components:
- openshift-ansible-3.7.44


How reproducible: 100%

Steps to Reproduce:
1. Run "ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml"
2. The playbook fails to complete (e.g., my customer's run failed due to bz#1564098).
3. Re-run "ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml"

Actual results:
- The re-run fails at the following task:

   TASK [etcd : Check if there are any v3 data] ******************************************************************************************************************************************************************
   task path: /usr/share/ansible/openshift-ansible/roles/etcd/tasks/migration/check.yml:19
   changed: [foo.example.com] => {"changed": true, "cmd": ["etcdctl", "--cert", "/etc/etcd/peer.crt", "--key", "/etc/etcd/peer.key", "--cacert", "/etc/etcd/ca.crt", "--endpoints", "https://xx.xx.xx.xx:2379", "get", "", "--from-key", "--keys-only", "-w", "json", "--limit", "1"], "delta": "0:00:00.035706", "end": "2018-05-09 12:16:57.670671", "failed": false, "rc": 0, "start": "2018-05-09 12:16:57.634965", "stderr": "", "stderr_lines": [], "stdout": "{\"header\"  ... snip ... ",\"create_revision\":11511764,\"mod_revision\":11511764,\"version\":1}],\"more\":true,\"count\":1637}"]}

Expected results:
- Re-running the playbook should pick up where the failed run stopped and complete the remaining tasks.


Additional info:
- The playbook can be made to continue by removing the following lines:

https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/etcd/tasks/migration/check.yml#L30-L32
```
- fail:
    msg: "The etcd has at least one v3 key"
  when: "'count' in (l_etcdctl_output.stdout | from_json) and (l_etcdctl_output.stdout | from_json).count != 0"
```
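
For readers less familiar with Jinja2 filters, the `when` condition above can be read as the following check on the JSON printed by `etcdctl`. This is only an illustrative sketch, not code from openshift-ansible: the function name `has_v3_keys` is hypothetical, and the sample inputs are trimmed-down stand-ins for the real response (the `count` of 1637 is taken from the task output shown above).

```python
import json


def has_v3_keys(etcdctl_stdout: str) -> bool:
    """Return True when the etcdctl JSON response reports at least one v3 key."""
    response = json.loads(etcdctl_stdout)
    # Same test as the playbook's `when`: 'count' must be present and non-zero.
    return "count" in response and response["count"] != 0


# Hypothetical, shortened responses (real output also carries header and kvs data).
print(has_v3_keys('{"header": {}, "more": true, "count": 1637}'))
print(has_v3_keys('{"header": {}}'))
```

With the first input the check is true, which is the situation in which the `fail` task stops the playbook.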

However, the docs mention that existing v3 data could be overwritten if the migration is run while v3 keys already exist.

https://coreos.com/etcd/docs/latest/op-guide/v2-migration.html
"Sometimes an etcd cluster will possibly have v3 data which should not be overwritten. In this case, the migration process may want to confirm no v3 data is committed before proceeding. One way to check the cluster has no v3 keys is to issue the following"

Comment 3 Scott Dodson 2018-05-09 12:57:01 UTC
This is a deliberate check to ensure that we never re-migrate a cluster. If at any point the migration fails you must restore from backup. It's not safe to re-migrate because the migration process does not properly reconcile changes made to either v2 or v3 keys.

If you're 100% certain that there's zero chance that any modification has taken place then you can disable that check by commenting it out and re-run the playbooks. But really the best thing to do is restore from backup and start over.

