Bug 1433272
| Summary: | The etcd db file should be backed during upgrade | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> | |
| Component: | Cluster Version Operator | Assignee: | Scott Dodson <sdodson> | |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | high | |||
| Version: | 3.5.0 | CC: | anli, aos-bugs, bleanhar, jokerman, mmccomas, sdodson | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openshift-ansible-3.5.53-1.git.0.8ade9f2.el7 | Doc Type: | Bug Fix | |
| Doc Text: | If etcd 3.x or later is running on the host, the v3 snapshot database must be backed up as part of the backup process. If it is not included in the backup, etcd will fail to restore the backup even if v3 data was not used. The etcd backup steps have been amended to ensure that the v3 snapshot database is included in our backups. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1440296 1440299 1440303 | Environment: | ||
| Last Closed: | 2017-04-12 19:04:08 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1440296, 1440299, 1440303 | |||
|
Description
Anping Li
2017-03-17 09:25:32 UTC
Can you please provide your inventory? I have two etcd backups created during upgrade via the playbook you've referenced, one prior to performing the etcd upgrade and one after, both of which are taken before upgrading the control plane. It's in /var/lib/origin/etcd-backup-pre-*

Scott, the member directory was backed up both before and after the upgrade, but the file /var/lib/origin/openshift.local.etcd/member/snap/db could not be backed up by the command 'etcdctl backup'. I am not sure whether this file must be backed up, but without it the database can't be restored. For more detail, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1419670.

1) # ls /var/lib/origin/openshift.local.etcd/member/snap
0000000000000003-000000000013617c.snap
0000000000000003-000000000013888d.snap
0000000000000008-000000000013af9e.snap
0000000000000008-000000000013d6af.snap
0000000000000008-000000000013fdc0.snap
db

2) # ls /var/lib/origin/etcd-backup-pre-upgrade-20170405025113/member/snap/
0000000000000003-000000000013888d.snap

3) # ls /var/lib/origin/etcd-backup-post-3.0-20170405025510/member/snap
0000000000000003-000000000013888d.snap

Thanks, after reviewing this and the referenced BZ and the comments there, I understand now what's up.

Proposed fix here. I'd like Jan to verify the sanity before we merge it, but feel free to test it. I walked through our documented restoration procedures and they seemed to work with this change.
https://github.com/openshift/openshift-ansible/pull/3860

The etcd backup failed for containerized etcd. The root cause is that the command [1] stores the snapshot inside the container.
[1] "docker exec etcd_container etcdctl backup --data-dir=/var/lib/etcd/ --backup-dir=/var/lib/origin/etcd-backup-pre-upgrade-20170407055724"
TASK [Generate etcd backup] ****************************************************
changed: [openshift-222.lab.eng.nay.redhat.com]
TASK [Check for v3 data store] *************************************************
ok: [openshift-222.lab.eng.nay.redhat.com]
TASK [Copy etcd v3 data store] *************************************************
fatal: [openshift-222.lab.eng.nay.redhat.com]: FAILED! => {
"changed": true,
"cmd": [
"cp",
"-a",
"/var/lib/etcd//member/snap",
"/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/"
],
"delta": "0:00:00.003152",
"end": "2017-04-07 01:54:17.584685",
"failed": true,
"rc": 1,
"start": "2017-04-07 01:54:17.581533",
"warnings": []
}
STDERR:
cp: cannot create directory ‘/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/’: No such file or directory
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/upgrade_etcd.retry
PLAY RECAP *********************************************************************
localhost : ok=12 changed=0 unreachable=0 failed=0
openshift-210.lab.eng.nay.redhat.com : ok=1 changed=0 unreachable=0 failed=0
openshift-222.lab.eng.nay.redhat.com : ok=16 changed=2 unreachable=0 failed=1
openshift-223.lab.eng.nay.redhat.com : ok=1 changed=0 unreachable=0 failed=0
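
The `cp` failure in the log above is a missing destination directory on the host: the backup had been written inside the container, so `.../member/` did not exist where the copy ran. The sketch below only illustrates that copy step in Python. It is not the change made in the linked pull requests, and `copy_v3_store` and its two arguments are hypothetical names. Creating the directory avoids the `cp` error but does not by itself make the backup complete; the `etcdctl backup` output still has to land on a path the host can see.

```python
import shutil
from pathlib import Path


def copy_v3_store(data_dir: str, backup_dir: str) -> bool:
    """Copy <data_dir>/member/snap/db into <backup_dir>/member/snap/.

    Returns False when no v3 store exists, True after a successful copy.
    Hypothetical helper for illustration; not part of openshift-ansible.
    """
    src = Path(data_dir) / "member" / "snap" / "db"
    if not src.is_file():
        return False  # no v3 data store on this host, nothing to copy

    dest_dir = Path(backup_dir) / "member" / "snap"
    # Create the destination tree first; the plain `cp -a` in the log
    # failed because this directory did not exist on the host.
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest_dir / "db")  # copy2 also keeps timestamps and mode
    return True
```

Called as `copy_v3_store("/var/lib/etcd", backup_path)`, it would mirror the "Check for v3 data store" and "Copy etcd v3 data store" tasks shown in the log, under the assumption that `backup_path` is a host directory.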
https://github.com/openshift/openshift-ansible/pull/3878 additional fix, testing now.

Scott, all data was lost after I restored etcd from the backup files. I think backing up only the db file is not enough; we must also back up the latest snap files. The snap files are generated by 'etcdctl backup' or by a service restart. I guess the command 'etcdctl backup' writes the in-memory state to a file on disk.

https://github.com/openshift/openshift-ansible/pull/3898 round 3 of proposed fixes

Anping, I'm sorry, I missed the key thing from comment 5 where you said it was storing the backup inside the container. I've refactored things considerably and I'll need to open up another PR to update the documentation, as we're now storing the backup in /var/lib/etcd, but the playbook outputs the path to the backup anyway. Hope this works now.

The fix works well for the external rpm etcd, the external containerized etcd and the embedded etcd. Moving the bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:0903

*** Bug 1402769 has been marked as a duplicate of this bug. ***
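
The two gaps reported in this thread (a backup without `member/snap/db`, and a backup without the latest `.snap` file) can be spotted before attempting a restore. The Python sketch below is one rough way to do that. `check_backup` is a hypothetical helper, not something shipped with openshift-ansible, and it assumes the zero-padded `.snap` file names sort in age order. A clean result does not prove the backup restores; the verification in this bug was an actual restore.

```python
from pathlib import Path
from typing import List


def check_backup(data_dir: str, backup_dir: str) -> List[str]:
    """List files the backup lacks compared with the live member/snap.

    Only two things are checked: the v3 store `db` and the newest
    `.snap` file. Hypothetical helper for illustration only.
    """
    live = Path(data_dir) / "member" / "snap"
    saved = Path(backup_dir) / "member" / "snap"
    problems = []

    # The v3 store must be in the backup whenever the live member has one.
    if (live / "db").is_file() and not (saved / "db").is_file():
        problems.append("db")

    # Assumption: names are zero-padded hex, so the last name after
    # sorting is the most recent snapshot.
    snaps = sorted(p.name for p in live.glob("*.snap"))
    if snaps and not (saved / snaps[-1]).is_file():
        problems.append(snaps[-1])

    return problems
```

Applied to directories holding the files from listings 1) and 2) earlier in the thread, this would report `db` and `0000000000000008-000000000013fdc0.snap` as missing from the pre-upgrade backup.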