Description of problem:
As per https://bugzilla.redhat.com/show_bug.cgi?id=1419670, the etcd db file should be backed up.

Version-Release number of selected component (if applicable):
openshift-ansible-3.5.35
etcd-3.x

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 3.4
2. Upgrade to v3.5:
   ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml
3. Check the etcd backup

Actual results:
The member/snap/db file wasn't backed up.

Expected results:
member/snap/db should be backed up.

Additional info:
Can you please provide your inventory? I have two etcd backups created during upgrade via the playbook you've referenced, one prior to performing the etcd upgrade and one after, both of which are taken before upgrading the control plane. They're in /var/lib/origin/etcd-backup-pre-*
Scott,
The member directory was backed up both before and after the upgrade. However, the file /var/lib/origin/openshift.local.etcd/member/snap/db is not backed up by the 'etcdctl backup' command. I am not sure whether this file must be backed up, but without it the database can't be restored. For more detail, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1419670.
1) # ls /var/lib/origin/openshift.local.etcd/member/snap
0000000000000003-000000000013617c.snap
0000000000000003-000000000013888d.snap
0000000000000008-000000000013af9e.snap
0000000000000008-000000000013d6af.snap
0000000000000008-000000000013fdc0.snap
db
2) # ls /var/lib/origin/etcd-backup-pre-upgrade-20170405025113/member/snap/
0000000000000003-000000000013888d.snap
3) # ls /var/lib/origin/etcd-backup-post-3.0-20170405025510/member/snap
0000000000000003-000000000013888d.snap
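The manual comparison above can also be scripted. The following is only a sketch for checking a backup by hand, not part of the playbook: the function names are hypothetical, and the only thing taken from this report is the member/snap/db layout.

```python
from pathlib import Path


def backup_has_v3_store(backup_dir):
    """Return True if <backup_dir>/member/snap/db exists as a regular file."""
    return (Path(backup_dir) / "member" / "snap" / "db").is_file()


def missing_snap_entries(data_dir, backup_dir):
    """List names under <data_dir>/member/snap that the backup lacks.

    Hypothetical helper: it compares file names only, not file contents.
    """
    live = Path(data_dir) / "member" / "snap"
    saved = Path(backup_dir) / "member" / "snap"
    live_names = {p.name for p in live.iterdir()} if live.is_dir() else set()
    saved_names = {p.name for p in saved.iterdir()} if saved.is_dir() else set()
    return sorted(live_names - saved_names)
```

For the directories listed in this comment, one would pass /var/lib/origin/openshift.local.etcd as data_dir and one of the etcd-backup-* directories as backup_dir. A name-only comparison will also report older .snap files, so the db entry is the one to look for.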
Thanks. After reviewing this, the referenced BZ, and the comments there, I now understand what's up. Proposed fix is below. I'd like Jan to verify its sanity before we merge it, but feel free to test it. I walked through our documented restoration procedures and they seemed to work with this change.
https://github.com/openshift/openshift-ansible/pull/3860
The etcd backup failed for containerized etcd. The root cause is that command [1] stores the snapshot inside the container.
[1] "docker exec etcd_container etcdctl backup --data-dir=/var/lib/etcd/ --backup-dir=/var/lib/origin/etcd-backup-pre-upgrade-20170407055724"

TASK [Generate etcd backup] ****************************************************
changed: [openshift-222.lab.eng.nay.redhat.com]

TASK [Check for v3 data store] *************************************************
ok: [openshift-222.lab.eng.nay.redhat.com]

TASK [Copy etcd v3 data store] *************************************************
fatal: [openshift-222.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "cp",
        "-a",
        "/var/lib/etcd//member/snap",
        "/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/"
    ],
    "delta": "0:00:00.003152",
    "end": "2017-04-07 01:54:17.584685",
    "failed": true,
    "rc": 1,
    "start": "2017-04-07 01:54:17.581533",
    "warnings": []
}

STDERR:

cp: cannot create directory ‘/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/’: No such file or directory
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/upgrade_etcd.retry

PLAY RECAP *********************************************************************
localhost                            : ok=12 changed=0 unreachable=0 failed=0
openshift-210.lab.eng.nay.redhat.com : ok=1  changed=0 unreachable=0 failed=0
openshift-222.lab.eng.nay.redhat.com : ok=16 changed=2 unreachable=0 failed=1
openshift-223.lab.eng.nay.redhat.com : ok=1  changed=0 unreachable=0 failed=0
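The cp step fails because the backup's member/ directory does not exist on the host when 'etcdctl backup' has run inside the container. As an illustration of that failure mode only, here is a minimal sketch of a copy that creates the missing parent directory first. It is not the change made in any of the PRs on this bug; copy_snap_dir is a hypothetical helper, and dirs_exist_ok requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def copy_snap_dir(data_dir, backup_dir):
    """Copy <data_dir>/member/snap into <backup_dir>/member/snap.

    Hypothetical helper: creates <backup_dir>/member if it is missing,
    which is the directory the failing cp expected to find already.
    """
    src = Path(data_dir) / "member" / "snap"
    dst = Path(backup_dir) / "member" / "snap"
    # Create the parent that 'cp -a' could not find on the host.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # copytree uses copy2 by default, so timestamps and permission bits
    # are kept; unlike 'cp -a' it does not preserve file ownership.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

This only avoids the "No such file or directory" error. It does not address where 'etcdctl backup' itself writes its output for containerized etcd.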
https://github.com/openshift/openshift-ansible/pull/3878 additional fix, testing now.
PR from comment 6 tested and merged.
Scott, all data was lost after I restored etcd from the backup files. I think backing up only the db file is not enough; we must also back up the latest snap files. The snap files are generated by 'etcdctl backup' or by a service restart. I guess the 'etcdctl backup' command writes the in-memory state to a file on disk.
https://github.com/openshift/openshift-ansible/pull/3898 round 3 of proposed fixes
Anping, I'm sorry, I missed the key thing from comment 5 where you said it was storing the backup inside the container. I've refactored things considerably and I'll need to open up another PR to update the documentation as we're now storing the backup in /var/lib/etcd but the playbook outputs the path to the backup anyway. Hope this works now.
The fix works well for external RPM etcd, external containerized etcd, and embedded etcd. Moving the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0903
*** Bug 1402769 has been marked as a duplicate of this bug. ***