Bug 1576297

Summary: etcd v3 migrate playbook does not have idempotency
Product: OpenShift Container Platform
Reporter: Kenjiro Nakayama <knakayam>
Component: Cluster Version Operator
Assignee: Scott Dodson <sdodson>
Status: CLOSED WONTFIX
QA Contact: Gaoyun Pei <gpei>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, jokerman, mmccomas, vrutkovs
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-09 12:57:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Kenjiro Nakayama 2018-05-09 07:45:46 UTC
Description of problem:
- If the etcd v3 migrate playbook fails partway through after some keys have already been migrated, re-running it fails at "TASK [etcd : Check if there are any v3 data]".


Version-Release number of the following components:
- openshift-ansible-3.7.44


How reproducible: 100%

Steps to Reproduce:
1. Run "ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml"
2. The playbook fails to complete (e.g., my customer's run failed due to bz#1564098).
3. Re-run "ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml"

Actual results:
- The following task failed:

   TASK [etcd : Check if there are any v3 data] ******************************************************************************************************************************************************************
   task path: /usr/share/ansible/openshift-ansible/roles/etcd/tasks/migration/check.yml:19
   changed: [foo.example.com] => {"changed": true, "cmd": ["etcdctl", "--cert", "/etc/etcd/peer.crt", "--key", "/etc/etcd/peer.key", "--cacert", "/etc/etcd/ca.crt", "--endpoints", "https://xx.xx.xx.xx:2379", "get", "", "--from-key", "--keys-only", "-w", "json", "--limit", "1"], "delta": "0:00:00.035706", "end": "2018-05-09 12:16:57.670671", "failed": false, "rc": 0, "start": "2018-05-09 12:16:57.634965", "stderr": "", "stderr_lines": [], "stdout": "{\"header\"  ... snip ... ",\"create_revision\":11511764,\"mod_revision\":11511764,\"version\":1}],\"more\":true,\"count\":1637}"]}

Expected results:
- Re-running the playbook should complete the migration, including the tasks that remained after the failure.


Additional info:
- The playbook can easily be continued by removing the following lines:

https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/etcd/tasks/migration/check.yml#L30-L32
```
- fail:
    msg: "The etcd has at least one v3 key"
  when: "'count' in (l_etcdctl_output.stdout | from_json) and (l_etcdctl_output.stdout | from_json).count != 0"
```

However, the etcd docs mention that existing v3 data could be overwritten if we migrate while v3 keys exist.

https://coreos.com/etcd/docs/latest/op-guide/v2-migration.html
"Sometimes an etcd cluster will possibly have v3 data which should not be overwritten. In this case, the migration process may want to confirm no v3 data is committed before proceeding. One way to check the cluster has no v3 keys is to issue the following"
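
For illustration, the condition in the `fail` task quoted above can be sketched in Python. This is only a minimal sketch of the same logic, not code from openshift-ansible: the function name `has_v3_keys` is hypothetical, and the sample payloads are trimmed stand-ins in which only `more` and `count` are taken from the task output above.

```python
import json


def has_v3_keys(etcdctl_stdout: str) -> bool:
    """Mirror the playbook's `when` condition on the etcdctl JSON output."""
    response = json.loads(etcdctl_stdout)
    # Same test as the playbook: 'count' must be present and non-zero.
    # A response without 'count' is treated as "no v3 keys".
    return "count" in response and response["count"] != 0


# Trimmed, illustrative payloads (hypothetical apart from 'more' and 'count').
partially_migrated = '{"more": true, "count": 1637}'
no_v3_data = '{"header": {}}'

print(has_v3_keys(partially_migrated))  # True: the playbook would fail here
print(has_v3_keys(no_v3_data))          # False: the migration may proceed
```

With the output from the failed re-run (`"count":1637`), the condition is true, which is why the playbook stops after a partial migration.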

Comment 3 Scott Dodson 2018-05-09 12:57:01 UTC
This is a deliberate check to ensure that we never re-migrate a cluster. If at any point the migration fails you must restore from backup. It's not safe to re-migrate because the migration process does not properly reconcile changes made to either v2 or v3 keys.

If you're 100% certain that there's zero chance that any modification has taken place, then you can disable that check by commenting it out and re-running the playbooks. But really the best thing to do is restore from backup and start over.