Description of problem:
For the system etcd container, etcd uses the new data_dir /var/lib/etcd/etcd.etcd/etcd.etcd/. The etcd backup playbook fails: the backup directory is created, but no snap dbs are backed up.

# ls /var/lib/etcd/etcd.etcd/
etc  etcd.etcd  openshift-backup-etcd_backup_tag20170616100907
# ls /var/lib/etcd/etcd.etcd/etcd.etcd/member/snap/
0000000000000004-0000000000002711.snap
0000000000000007-0000000000004e22.snap
0000000000000007-0000000000007533.snap
000000000000000a-0000000000009c44.snap
000000000000000d-000000000000c355.snap
db
# ls /var/lib/etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170616100907/member/snap
#

Version-Release number of selected component (if applicable):
openshift-ansible:3.6.110

How reproducible:
always

Steps to Reproduce:
1. install OCP 3.6 with system etcd container
2. run upgrade playbook

Actual results:
TASK [etcd_upgrade : Install latest etcd for embedded] *************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_upgrade/tasks/backup.yml:40
skipping: [ec2-54-196-73-42.compute-1.amazonaws.com] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}
skipping: [ec2-52-91-209-128.compute-1.amazonaws.com] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}
skipping: [ec2-34-204-78-175.compute-1.amazonaws.com] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}

TASK [etcd_upgrade : Generate etcd backup] *************************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_upgrade/tasks/backup.yml:48
fatal: [ec2-52-91-209-128.compute-1.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "runc",
        "exec",
        "etcd",
        "etcdctl",
        "backup",
        "--data-dir=/var/lib/etcd/",
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ],
    "delta": "0:00:00.109263",
    "end": "2017-06-16 06:15:33.406480",
    "failed": true,
    "rc": 1,
    "start": "2017-06-16 06:15:33.297217",
    "warnings": []
}

STDERR:

2017-06-16 10:15:33.404008 I | open /var/lib/etcd/member/snap: no such file or directory

fatal: [ec2-34-204-78-175.compute-1.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "runc",
        "exec",
        "etcd",
        "etcdctl",
        "backup",
        "--data-dir=/var/lib/etcd/",
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ],
    "delta": "0:00:00.106078",
    "end": "2017-06-16 06:15:33.571094",
    "failed": true,
    "rc": 1,
    "start": "2017-06-16 06:15:33.465016",
    "warnings": []
}

STDERR:

2017-06-16 10:15:33.569018 I | open /var/lib/etcd/member/snap: no such file or directory

fatal: [ec2-54-196-73-42.compute-1.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "runc",
        "exec",
        "etcd",
        "etcdctl",
        "backup",
        "--data-dir=/var/lib/etcd/",
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170616100907"
    ],
    "delta": "0:00:00.245392",
    "end": "2017-06-16 06:15:34.193560",
    "failed": true,
    "rc": 1,
    "start": "2017-06-16 06:15:33.948168",
    "warnings": []
}

STDERR:

2017-06-16 10:15:34.188201 I | open /var/lib/etcd/member/snap: no such file or directory

        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.retry

PLAY RECAP *********************************************************************
ec2-34-204-78-175.compute-1.amazonaws.com  : ok=177  changed=10  unreachable=0  failed=1
ec2-34-207-217-103.compute-1.amazonaws.com : ok=109  changed=8   unreachable=0  failed=0
ec2-52-91-209-128.compute-1.amazonaws.com  : ok=177  changed=10  unreachable=0  failed=1
ec2-54-152-60-155.compute-1.amazonaws.com  : ok=64   changed=2   unreachable=0  failed=0
ec2-54-196-73-42.compute-1.amazonaws.com   : ok=181  changed=10  unreachable=0  failed=1
localhost                                  : ok=13   changed=0   unreachable=0  failed=0

Expected results:

Additional info:
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4505 Giuseppe, can you test it on AH?
It solves the problem for me.
The backup failed with the following messages.

TASK [etcd_common : Generate etcd backup] **************************************
fatal: [openshift-124.lab.sjc.redhat.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "docker",
        "exec",
        "etcd_container",
        "etcdctl",
        "backup",
        "--data-dir=/var/lib/etcd/",
        "--backup-dir=/var/lib/etcd//openshift-backup-etcd_backup_tag20170626075154"
    ],
    "delta": "0:00:00.025456",
    "end": "2017-06-26 07:52:02.745411",
    "failed": true,
    "rc": 1,
    "start": "2017-06-26 07:52:02.719955",
    "warnings": []
}

STDERR:

Error response from daemon: No such container: etcd_container
It looks like it is trying to back up the docker container rather than the system container. Have you specified `openshift_use_etcd_system_container=True` (it was recently renamed from `use_etcd_system_container`)?
I used use_etcd_system_container=true.

# grep _system_container hosts
use_etcd_system_container=true
openshift_docker_use_system_container=true
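The rename mentioned in the previous comment is easy to miss in an existing inventory. As a minimal sketch (not part of openshift-ansible), the helper below scans inventory text for the old variable name. The function name `find_renamed_vars` and the `RENAMED` table are hypothetical, and the only rename it knows about is the one stated in this report.

```python
import re

# Old -> new inventory variable name, as stated in this bug report.
# This table is an illustration only, not a list maintained by openshift-ansible.
RENAMED = {"use_etcd_system_container": "openshift_use_etcd_system_container"}


def find_renamed_vars(inventory_text):
    """Return (line_number, old_name, new_name) for each outdated variable."""
    hits = []
    for lineno, line in enumerate(inventory_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip comment lines in the INI-style inventory.
        if stripped.startswith("#") or stripped.startswith(";"):
            continue
        for old, new in RENAMED.items():
            # The lookbehind keeps the old name from matching inside the new
            # name, which ends with the same characters.
            pattern = r"(?<![A-Za-z0-9_])" + re.escape(old) + r"\s*="
            if re.search(pattern, stripped):
                hits.append((lineno, old, new))
    return hits


if __name__ == "__main__":
    sample = (
        "use_etcd_system_container=true\n"
        "openshift_docker_use_system_container=true\n"
    )
    for lineno, old, new in find_renamed_vars(sample):
        print("line %d: %s should be %s" % (lineno, old, new))
```

Run against the two inventory lines quoted above, this flags only the first line; `openshift_docker_use_system_container` is left alone.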
Please ignore Comment 4. When I use openshift_use_etcd_system_container=True, the database can be backed up, but there is no db file in the backup. If we back up the database with etcdctl2, this file should be backed up as well.

[root@openshift-153 etcd.etcd]# ls /var/lib/etcd/etcd.etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170626090158/member/snap/
0000000000000026-0000000000004e22.snap
Are you saying that the upgrade (including the backup) succeeds, and only the db file is missing from the backup?
Created attachment 1291943
The system container upgrade logs

The entire upgrade fails because of another issue. The etcd backup succeeds.
The snap and the wal files are created. Is this bug verified, or is there anything else missing?
TASK [etcd_common : Display location of etcd backup] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/etcd_common/tasks/backup.yml:70
ok: [openshift-153.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953

ok: [openshift-124.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953

ok: [openshift-148.lab.sjc.redhat.com] => {}

MSG:

Etcd backup created in /var/lib/etcd/etcd.etcd//openshift-backup-etcd_backup_tag20170626103953

AFAIK, the backup ended successfully. I see another error in the logs, but that error is not related to this bug. For that reason, the fix is verified and the bug can be switched to VERIFIED.

Anping, is there something else that prevents the bug from being switched to VERIFIED?
Anping, what is your ansible version?
FYI, the LooseVersion error is fixed by this PR: https://github.com/openshift/openshift-ansible/pull/4583
@Jan, ansible 2.2.3.0. Please note that the db file is missing from the backup.
The db file is present only when the backup is generated for etcd3. In this case etcd2 is used so the file is not present. This is ok.
Jan, we support the system container from v3.6, and etcd API 3 is enabled by default in v3.6. When we use etcd API 2 with etcd 3.x packages, we cannot restore from a snapshot without the db file. The db file is useless only when we use etcd API 2 with etcd 2.x packages.
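Since the question in this thread is whether a given backup contains the db file, a small check of the backup directory can make that explicit. The following is a minimal sketch that assumes only the layout visible in the listings in this report (a member/snap/ directory holding *.snap files and, optionally, a file named db). The function name `check_backup` is hypothetical, and the sketch does not validate the contents of any file.

```python
import os


def check_backup(backup_dir):
    """Report which snapshot files and whether a db file exist in a backup.

    Assumes the <backup_dir>/member/snap/ layout shown in this report.
    """
    snap_dir = os.path.join(backup_dir, "member", "snap")
    # A missing snap directory is reported as an empty backup, not an error.
    names = os.listdir(snap_dir) if os.path.isdir(snap_dir) else []
    return {
        "snap_files": sorted(n for n in names if n.endswith(".snap")),
        "has_db": "db" in names,
    }


if __name__ == "__main__":
    import sys

    result = check_backup(sys.argv[1])
    print("snap files:", ", ".join(result["snap_files"]) or "none")
    print("db file present:", result["has_db"])
```

Whether a missing db file is a problem depends on the etcd packages and API version in use, as discussed above; the sketch only reports what is on disk.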
Upstream PR for that missing db file: https://github.com/openshift/openshift-ansible/pull/4640
Additional changes in the latest build.
The backup succeeds with openshift-ansible-3.6.132.

[root@openshift-131 ~]# ls -lah /var/lib/etcd/etcd.etcd/etcd.etcd/openshift-backup-etcd_backup_tag20170704142858/member/snap
total 17M
drwx------. 2 root root  62 Jul  4 06:29 .
drwx------. 4 root root  29 Jul  4 06:29 ..
-rw-r--r--. 1 root root 94K Jul  4 06:29 0000000000000072-0000000000004e22.snap
-rw-------. 1 root root 17M Jul  4 06:29 db
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716