Bug 1462169 - Etcd backup fail when use system etcd container
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.6.0
Severity: high
Assigned To: Jan Chaloupka
QA Contact: Anping Li
Reported: 2017-06-16 07:15 EDT by Anping Li
Modified: 2017-08-16 15 EDT
Last Closed: 2017-08-10 01:28:09 EDT
Type: Bug

Attachments:
The system container upgrade logs (94.51 KB, application/x-gzip), 2017-06-26 06:52 EDT, Anping Li

Description Anping Li 2017-06-16 07:15:05 EDT
Description of problem:

The system etcd container uses a new data_dir, /var/lib/etcd/etcd.etcd/etcd.etcd/. The etcd backup playbook fails: the backup directory is created, but no snap or db files are backed up.



# ls /var/lib/etcd/etcd.etcd/
etc  etcd.etcd  openshift-backup-etcd_backup_tag20170616100907
# ls /var/lib/etcd/etcd.etcd/etcd.etcd/member/snap/
0000000000000004-0000000000002711.snap  0000000000000007-0000000000004e22.snap  0000000000000007-0000000000007533.snap  000000000000000a-0000000000009c44.snap  000000000000000d-000000000000c355.snap  db
# ls /var/lib/etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170616100907/member/snap
#
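
The manual listing above can be turned into a repeatable check. What follows is a minimal sketch, not part of openshift-ansible: the helper name `check_etcd_backup` is hypothetical, and it assumes only the `member/snap` layout shown in the listing. It counts the `.snap` files in a backup directory and reports whether a `db` file is present.

```shell
# Hypothetical helper, not part of openshift-ansible.
# Usage: check_etcd_backup BACKUP_DIR
# Prints a one-line summary and returns non-zero when no .snap file is found.
check_etcd_backup() {
    snap_dir="${1%/}/member/snap"
    snap_count=0
    for f in "$snap_dir"/*.snap; do
        # An unmatched glob stays literal, so confirm the file exists.
        if [ -e "$f" ]; then
            snap_count=$((snap_count + 1))
        fi
    done
    if [ -e "$snap_dir/db" ]; then
        db_state=present
    else
        db_state=absent
    fi
    echo "snap files: $snap_count, db: $db_state"
    [ "$snap_count" -gt 0 ]
}
```

Run against the empty backup directory listed above, this would report zero snap files and return a non-zero status. The `db` state is only reported, not enforced, because whether that file is required depends on the etcd version in use.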


Version-Release number of selected component (if applicable):
openshift-ansible:3.6.110

How reproducible:
always

Steps to Reproduce:
1. Install OCP 3.6 with the system etcd container.
2. Run the upgrade playbook.

Actual results:

TASK [etcd_upgrade : Install latest etcd for embedded] *************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_upgrade/tasks/backup.yml:40
skipping: [ec2-54-196-73-42.compute-1.amazonaws.com] => {
    "changed": false, 
    "skip_reason": "Conditional check failed", 
    "skipped": true
}
skipping: [ec2-52-91-209-128.compute-1.amazonaws.com] => {
    "changed": false, 
    "skip_reason": "Conditional check failed", 
    "skipped": true
}
skipping: [ec2-34-204-78-175.compute-1.amazonaws.com] => {
    "changed": false, 
    "skip_reason": "Conditional check failed", 
    "skipped": true
}

TASK [etcd_upgrade : Generate etcd backup] *************************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_upgrade/tasks/backup.yml:48

fatal: [ec2-52-91-209-128.compute-1.amazonaws.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "runc", 
        "exec", 
        "etcd", 
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/etcd/", 
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ], 
    "delta": "0:00:00.109263", 
    "end": "2017-06-16 06:15:33.406480", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-06-16 06:15:33.297217", 
    "warnings": []
}

STDERR:

2017-06-16 10:15:33.404008 I | open /var/lib/etcd/member/snap: no such file or directory
fatal: [ec2-34-204-78-175.compute-1.amazonaws.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "runc", 
        "exec", 
        "etcd", 
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/etcd/", 
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ], 
    "delta": "0:00:00.106078", 
    "end": "2017-06-16 06:15:33.571094", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-06-16 06:15:33.465016", 
    "warnings": []
}

STDERR:

2017-06-16 10:15:33.569018 I | open /var/lib/etcd/member/snap: no such file or directory
fatal: [ec2-54-196-73-42.compute-1.amazonaws.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "runc", 
        "exec", 
        "etcd", 
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/etcd/", 
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ], 
    "delta": "0:00:00.245392", 
    "end": "2017-06-16 06:15:34.193560", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-06-16 06:15:33.948168", 
    "warnings": []
}

STDERR:

2017-06-16 10:15:34.188201 I | open /var/lib/etcd/member/snap: no such file or directory
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.retry

PLAY RECAP *********************************************************************
ec2-34-204-78-175.compute-1.amazonaws.com : ok=177  changed=10   unreachable=0    failed=1   
ec2-34-207-217-103.compute-1.amazonaws.com : ok=109  changed=8    unreachable=0    failed=0   
ec2-52-91-209-128.compute-1.amazonaws.com : ok=177  changed=10   unreachable=0    failed=1   
ec2-54-152-60-155.compute-1.amazonaws.com : ok=64   changed=2    unreachable=0    failed=0   
ec2-54-196-73-42.compute-1.amazonaws.com : ok=181  changed=10   unreachable=0    failed=1   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   
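
The failing tasks above pass `--data-dir=/var/lib/etcd/`, and the error shows etcdctl looking for `member/snap` directly under that path. A minimal sketch of a pre-flight check follows. The helper name `etcd_backup_cmd` is hypothetical, and the sketch assumes it is evaluated in the same filesystem view as etcdctl (for example, inside the container). It refuses to build the `etcdctl backup` command unless the data directory actually contains `member/snap`.

```shell
# Hypothetical helper, not part of openshift-ansible.
# Usage: etcd_backup_cmd DATA_DIR TAG
# Prints the etcdctl (v2) backup command for DATA_DIR, or fails when
# DATA_DIR/member/snap does not exist.
etcd_backup_cmd() {
    data_dir="${1%/}"   # drop one trailing slash to avoid "//" in paths
    tag="$2"
    if [ ! -d "$data_dir/member/snap" ]; then
        echo "error: $data_dir/member/snap not found; check --data-dir" >&2
        return 1
    fi
    echo "etcdctl backup --data-dir=$data_dir --backup-dir=$data_dir/openshift-backup-$tag"
}
```

The function only prints the command, so it can be inspected before anything is executed. Placing the backup directory under the data directory mirrors the paths in the log output; a different target may be preferable in practice.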


Expected results:


Additional info:
Comment 1 Jan Chaloupka 2017-06-20 06:26:49 EDT
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4505

Giuseppe, can you test it on AH?
Comment 2 Giuseppe Scrivano 2017-06-20 08:00:25 EDT
It solves the problem for me.
Comment 4 Anping Li 2017-06-26 04:13:41 EDT
The backup failed with the following messages.

TASK [etcd_common : Generate etcd backup] **************************************
fatal: [openshift-124.lab.sjc.redhat.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "docker", 
        "exec", 
        "etcd_container", 
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/etcd/", 
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170626075154"
    ], 
    "delta": "0:00:00.025456", 
    "end": "2017-06-26 07:52:02.745411", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-06-26 07:52:02.719955", 
    "warnings": []
}

STDERR:

Error response from daemon: No such container: etcd_container
Comment 5 Giuseppe Scrivano 2017-06-26 04:24:09 EDT
It looks like it is trying to back up the docker container. Have you specified `openshift_use_etcd_system_container=True` (it was recently renamed from `use_etcd_system_container`)?
Comment 6 Anping Li 2017-06-26 04:54:37 EDT
I used use_etcd_system_container=true.

#grep _system_container hosts
use_etcd_system_container=true
openshift_docker_use_system_container=true
Comment 7 Anping Li 2017-06-26 05:20:31 EDT
Please ignore Comment 4. When I use openshift_use_etcd_system_container=True, the database can be backed up, but there is no db file in the backup. If we back up the database with etcdctl2, this file should be backed up as well.

[root@openshift-153 etcd.etcd]# ls /var/lib/etcd/etcd.etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170626090158/member/snap/
0000000000000026-0000000000004e22.snap
Comment 8 Jan Chaloupka 2017-06-26 05:39:38 EDT
Are you saying that the upgrade (including the backup) succeeds, and only the db file is missing from the backup?
Comment 9 Anping Li 2017-06-26 06:52 EDT
Created attachment 1291943 [details]
The system container upgrade logs

The upgrade as a whole fails because of another issue. The etcd backup succeeds.
Comment 10 Giuseppe Scrivano 2017-06-26 08:01:29 EDT
The snap and the wal files are created. Is this bug verified, or is anything else missing?
Comment 11 Jan Chaloupka 2017-06-26 08:06:09 EDT
TASK [etcd_common : Display location of etcd backup] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_common/tasks/backup.yml:70
ok: [openshift-153.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953
ok: [openshift-124.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953
ok: [openshift-148.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953

AFAIK, the backup ended successfully. I see another error in the logs, but it is not related to this bug. For that reason, the fix is verified and the bug can be switched to VERIFIED.

Anping, is there anything else that prevents the bug from being switched to VERIFIED?
Comment 12 Jan Chaloupka 2017-06-26 08:07:40 EDT
Anping, what is your ansible version?
Comment 13 Giuseppe Scrivano 2017-06-26 08:14:37 EDT
FYI, the LooseVersion error is fixed by this PR:

https://github.com/openshift/openshift-ansible/pull/4583
Comment 14 Anping Li 2017-06-26 21:37:13 EDT
@Jan, ansible 2.2.3.0. Please note that the db file is missing from the backup.
Comment 15 Jan Chaloupka 2017-06-27 04:05:46 EDT
The db file is present only when the backup is generated for etcd3. In this case etcd2 is used so the file is not present. This is ok.
Comment 16 Anping Li 2017-06-27 06:57:52 EDT
Jan, we support the system container from v3.6, and etcd API 3 is enabled by default in v3.6. When we use etcd API 2 with etcd3.x packages, we cannot restore from a snapshot without the db file. The db file is unnecessary only when we use etcd API 2 with etcd2.x packages.
Comment 17 Jan Chaloupka 2017-06-29 11:25:34 EDT
Upstream PR for that missing db file: https://github.com/openshift/openshift-ansible/pull/4640
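
For illustration only, and not necessarily what the PR above does: one way to carry the db file into the backup is to copy it next to the snap files after `etcdctl backup` has run. The helper name `copy_etcd_db` is hypothetical, and the sketch assumes that copying the file at that point is acceptable (for example, because etcd is stopped or a slightly stale copy is tolerable).

```shell
# Hypothetical helper, not part of openshift-ansible.
# Usage: copy_etcd_db DATA_DIR BACKUP_DIR
# Copies DATA_DIR/member/snap/db into BACKUP_DIR/member/snap/ when it exists.
copy_etcd_db() {
    src="${1%/}/member/snap/db"
    dest_dir="${2%/}/member/snap"
    if [ ! -f "$src" ]; then
        # Data directories written by etcd 2.x have no db file.
        echo "no db file at $src; nothing to copy"
        return 0
    fi
    # -p preserves the file mode and timestamps of the original.
    mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/db"
}
```

A missing db file is treated as a no-op rather than an error, so the same step can run against both etcd 2.x and etcd 3.x data directories.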
Comment 18 Scott Dodson 2017-07-03 09:40:19 EDT
Additional changes in the latest build.
Comment 19 Anping Li 2017-07-04 02:48:25 EDT
The backup succeeds with openshift-ansible-3.6.132.

[root@openshift-131 ~]# ls -lah /var/lib/etcd/etcd.etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170704142858/member/snap
total 17M
drwx------. 2 root root  62 Jul  4 06:29 .
drwx------. 4 root root  29 Jul  4 06:29 ..
-rw-r--r--. 1 root root 94K Jul  4 06:29 0000000000000072-0000000000004e22.snap
-rw-------. 1 root root 17M Jul  4 06:29 db
Comment 21 errata-xmlrpc 2017-08-10 01:28:09 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
