Bug 1433272 - The etcd db file should be backed up during upgrade
Summary: The etcd db file should be backed up during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
: ---
Assignee: Scott Dodson
QA Contact: Anping Li
URL:
Whiteboard:
Duplicates: 1402769
Depends On:
Blocks: 1440296 1440299 1440303
Reported: 2017-03-17 09:25 UTC by Anping Li
Modified: 2017-08-24 15:53 UTC (History)

Fixed In Version: openshift-ansible-3.5.53-1.git.0.8ade9f2.el7
Doc Type: Bug Fix
Doc Text:
If etcd 3.x or later was running on the host, the v3 snapshot database must be backed up as part of the backup process. If it is not included in the backup, etcd will fail to restore the backup even though v3 data was not used. The etcd backup steps have been amended to ensure that the v3 snapshot database is included in our backups.
Clone Of:
: 1440296 1440299 1440303 (view as bug list)
Environment:
Last Closed: 2017-04-12 19:04:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0903 0 normal SHIPPED_LIVE OpenShift Container Platform atomic-openshift-utils bug fix and enhancement 2017-04-12 22:45:42 UTC

Description Anping Li 2017-03-17 09:25:32 UTC
Description of problem:
As per https://bugzilla.redhat.com/show_bug.cgi?id=1419670, the etcd db file should be backed up.

Version-Release number of selected component (if applicable):
openshift-ansible-3.5.35
etcd-3.x

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 3.4
2. Upgrade to v3.5
   ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml
3. Check the etcd backup
 

Actual results:
The member/snap/db file wasn't backed up.

Expected results:
member/snap/db should be backed up.
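
The actual and expected results can be checked by looking for member/snap/db inside a backup directory. The following is a minimal sketch and is not part of the playbook; the function name backup_has_v3_db is hypothetical, and the argument is whichever backup directory the upgrade produced.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the backup directory passed as $1
# contains the etcd v3 data store file member/snap/db.
backup_has_v3_db() {
    local backup_dir="$1"
    if [ -f "${backup_dir}/member/snap/db" ]; then
        echo "OK: ${backup_dir}/member/snap/db is present"
    else
        echo "MISSING: ${backup_dir}/member/snap/db" >&2
        return 1
    fi
}
```

Calling it with the path of a backup directory returns a non-zero status when the db file is absent, so it can be used in a shell conditional after the upgrade.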

Additional info:

Comment 1 Scott Dodson 2017-04-04 13:31:13 UTC
Can you please provide your inventory? I have two etcd backups created during upgrade via the playbook you've referenced, one prior to performing the etcd upgrade and one after, both of which are taken before upgrading the control plane.

It's in /var/lib/origin/etcd-backup-pre-*

Comment 2 Anping Li 2017-04-05 05:16:07 UTC
Scott,
The member directory was backed up both before and after the upgrade. However, the file /var/lib/origin/openshift.local.etcd/member/snap/db is not backed up by the command 'etcdctl backup'. I am not sure whether this file must be backed up, but without it the database can't be restored.

For more detail, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1419670.

1) # ls /var/lib/origin/openshift.local.etcd/member/snap
0000000000000003-000000000013617c.snap  0000000000000003-000000000013888d.snap  0000000000000008-000000000013af9e.snap  0000000000000008-000000000013d6af.snap  0000000000000008-000000000013fdc0.snap  db


2) # ls /var/lib/origin/etcd-backup-pre-upgrade-20170405025113/member/snap/
0000000000000003-000000000013888d.snap

3) # ls /var/lib/origin/etcd-backup-post-3.0-20170405025510/member/snap
0000000000000003-000000000013888d.snap
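
The three listings above can also be compared mechanically. This is a sketch only: missing_from_backup is a hypothetical helper name, and it compares file names, not file contents.

```shell
#!/bin/bash
# Hypothetical helper: print the names found in the live snap directory ($1)
# that are absent from the backup's snap directory ($2).
# Only names are compared, not file contents.
missing_from_backup() {
    local live_snap="$1" backup_snap="$2" path name
    for path in "${live_snap}"/*; do
        [ -e "${path}" ] || continue   # live directory is empty
        name=$(basename "${path}")
        [ -e "${backup_snap}/${name}" ] || echo "${name}"
    done
}
```

Run against the directories in 1) and 2), it would print db along with every .snap name that is not present in the backup.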

Comment 3 Scott Dodson 2017-04-05 19:46:54 UTC
Thanks, after reviewing this and the reference BZ and the comments there I understand now what's up.

The proposed fix is linked below. I'd like Jan to verify its sanity before we merge it, but feel free to test it. I walked through our documented restoration procedures and they seemed to work with this change.

https://github.com/openshift/openshift-ansible/pull/3860

Comment 5 Anping Li 2017-04-07 06:17:08 UTC
The etcd backup failed for containerized etcd. The root cause is that command [1] stores the snapshot inside the container.

[1] "docker exec etcd_container etcdctl backup --data-dir=/var/lib/etcd/ --backup-dir=/var/lib/origin/etcd-backup-pre-upgrade-20170407055724"


TASK [Generate etcd backup] ****************************************************
changed: [openshift-222.lab.eng.nay.redhat.com]

TASK [Check for v3 data store] *************************************************
ok: [openshift-222.lab.eng.nay.redhat.com]

TASK [Copy etcd v3 data store] *************************************************
fatal: [openshift-222.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "cp", 
        "-a", 
        "/var/lib/etcd//member/snap", 
        "/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/"
    ], 
    "delta": "0:00:00.003152", 
    "end": "2017-04-07 01:54:17.584685", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-04-07 01:54:17.581533", 
    "warnings": []
}

STDERR:

cp: cannot create directory ‘/var/lib/origin/etcd-backup-pre-upgrade-20170407055413/member/’: No such file or directory
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/upgrade_etcd.retry

PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
openshift-210.lab.eng.nay.redhat.com : ok=1    changed=0    unreachable=0    failed=0   
openshift-222.lab.eng.nay.redhat.com : ok=16   changed=2    unreachable=0    failed=1   
openshift-223.lab.eng.nay.redhat.com : ok=1    changed=0    unreachable=0    failed=0
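
The cp failure above is consistent with the stated root cause: the backup was written inside the container, so the member/ target directory does not exist on the host. The sketch below shows a copy step that checks for this before running the same cp -a as in the log. It is illustrative only and is not the change made in the pull requests referenced in this bug; the function name copy_v3_store is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: copy the snap directory (including the v3 db file)
# from an etcd data dir ($1) into an existing backup dir ($2), failing
# early with a clear message when the backup is not visible on this host.
copy_v3_store() {
    local data_dir="$1" backup_dir="$2"
    # Nothing to do if no v3 data store exists.
    [ -f "${data_dir}/member/snap/db" ] || return 0
    if [ ! -d "${backup_dir}/member" ]; then
        echo "ERROR: ${backup_dir}/member not found; was the backup written inside the container?" >&2
        return 1
    fi
    # Same copy as in the failed task: merge snap/ into the backup's member/.
    cp -a "${data_dir}/member/snap" "${backup_dir}/member/"
}
```

The early check turns the opaque "No such file or directory" from cp into a message that points at the likely cause.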

Comment 6 Scott Dodson 2017-04-07 13:26:07 UTC
https://github.com/openshift/openshift-ansible/pull/3878 additional fix, testing now.

Comment 7 Scott Dodson 2017-04-07 19:35:02 UTC
PR from comment 6 tested and merged.

Comment 9 Anping Li 2017-04-10 03:26:57 UTC
Scott, all data was lost after I restored etcd from the backup files.
I think backing up only the db file is not enough; we must also back up the latest snap files. The snap files are generated by 'etcdctl backup' or by a service restart. I guess the command 'etcdctl backup' writes the in-memory state to a file on disk.

Comment 10 Scott Dodson 2017-04-10 17:42:53 UTC
https://github.com/openshift/openshift-ansible/pull/3898 round 3 of proposed fixes

Comment 11 Scott Dodson 2017-04-10 20:29:06 UTC
Anping,

I'm sorry, I missed the key point from comment 5, where you said it was storing the backup inside the container. I've refactored things considerably. I'll need to open another PR to update the documentation, since we're now storing the backup in /var/lib/etcd, but the playbook outputs the path to the backup anyway.

Hope this works now.
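
As a rough illustration of placing the backup under the etcd data directory, a path could be built as sketched below. The function name etcd_backup_path is hypothetical, and the directory naming simply mirrors the etcd-backup-pre-upgrade-<timestamp> pattern seen earlier in this bug; the exact layout the refactored playbook uses is an assumption here.

```shell
#!/bin/bash
# Hypothetical helper: build a backup path under the etcd data dir ($1)
# for a given tag ($2, e.g. pre-upgrade) and timestamp ($3).
# The naming pattern mirrors the directories shown earlier in this bug;
# the exact layout used by the playbook is an assumption.
etcd_backup_path() {
    local data_dir="${1%/}" tag="$2" stamp="$3"
    echo "${data_dir}/etcd-backup-${tag}-${stamp}"
}
```

Stripping the trailing slash from the data directory avoids doubled separators such as the /var/lib/etcd//member/snap seen in the failed task.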

Comment 12 Anping Li 2017-04-11 05:20:37 UTC
The fix works well for external RPM etcd, external containerized etcd, and embedded etcd. Moving the bug to verified.

Comment 14 errata-xmlrpc 2017-04-12 19:04:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0903

Comment 15 Brenton Leanhardt 2017-08-24 15:53:40 UTC
*** Bug 1402769 has been marked as a duplicate of this bug. ***

