Description of problem:
Upgrading from v3.3.0.32 to 3.3.0.34 fails with the error below:
TASK [Generate etcd backup] ****************************************************
fatal: [master31.example.com]: FAILED! => {"changed": true, "cmd": ["etcdctl", "backup", "--data-dir=/var/lib/origin/openshift.local.etcd", "--backup-dir=/var/lib/origin/etcd-backup-20161007130217"], "delta": "0:00:00.478888", "end": "2016-10-07 13:02:18.483900", "failed": true, "rc": 1, "start": "2016-10-07 13:02:18.005012", "stderr": "2016-10-07 13:02:18.030053 W | snap: skipped unexpected non snapshot file db\n2016-10-07 13:02:18.121156 W | wal: ignored file 16.tmp in wal\n2016-10-07 13:02:18.481813 I | walpb: crc mismatch", "stdout": "", "stdout_lines": [], "warnings": []}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. yum update atomic-openshift-utils
2. Upgrade using the playbook: ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml
Actual results:
Failed
Expected results:
Should have upgraded without issue.
Additional info:
The etcd service can still be started after the backup has failed. I guess etcdctl doesn't include the etcdrepair feature. By the way, it is recommended to place /var/lib/openshift on a separate disk partition.
Is this embedded etcd, as in you don't have [etcd] hosts defined?
@scott, it can be either. The failure rate is 10% when I re-run the upgrade playbook. The good news is that I have never hit this issue on the first upgrade.
Got an exception when upgrading to v3.4:
TASK [Generate etcd backup] ****************************************************
fatal: [openshift-199.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "etcdctl",
        "backup",
        "--data-dir=/var/lib/origin/openshift.local.etcd",
        "--backup-dir=/var/lib/origin/etcd-backup-20161028050818"
    ],
    "delta": "0:00:01.229634",
    "end": "2016-10-28 05:08:18.920502",
    "failed": true,
    "rc": 2,
    "start": "2016-10-28 05:08:17.690868",
    "warnings": []
}
STDERR:
2016-10-28 05:08:17.904324 W | snap: skipped unexpected non snapshot file db
2016-10-28 05:08:18.011340 W | wal: ignored file 0.tmp in wal
panic: runtime error: makeslice: len out of range
goroutine 1 [running]:
panic(0xc52260, 0xc82033d310)
    /usr/lib/golang/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/wal.(*decoder).decode(0xc82011cea0, 0xc8201872c0, 0x0, 0x0)
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/decoder.go:55 +0x142
github.com/coreos/etcd/wal.(*WAL).ReadAll(0xc8200f61a0, 0xc820145400, 0x15, 0x20, 0x7, 0xeb6f55ae45478faf, 0x584b, 0x0, 0x0, 0x0, ...)
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/wal.go:237 +0x442
github.com/coreos/etcd/etcdctl/command.handleBackup(0xc82010f7a0)
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/command/backup_command.go:93 +0x90f
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.Command.Run(0xd94728, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe45c00, 0x18, 0x0, ...)
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/command.go:137 +0x1081
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.(*App).Run(0xc82010f560, 0xc82000a1c0, 0x4, 0x4, 0x0, 0x0)
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/app.go:175 +0xffa
main.main()
    /builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/main.go:69 +0x1aae
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade.retry
Searching through the etcd issues on GitHub, it looks like this can happen when the disk that holds the etcd storage is full. Can anyone confirm whether that's the case here?
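To check the disk-full hypothesis on an affected master, a minimal Python 3 sketch like the one below could report how full the filesystem holding the etcd data directory is. It is not part of openshift-ansible; the function names and the 90% threshold are arbitrary assumptions chosen for illustration.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of used space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """Return True when the filesystem holding `path` is at or above the threshold.

    The 90% default is an arbitrary value picked for this sketch.
    """
    return disk_usage_percent(path) >= threshold_percent


# Example call on a master (data dir taken from the failing etcdctl command):
# is_nearly_full("/var/lib/origin/openshift.local.etcd")
```

A result of True would support the disk-full theory; False would point toward another cause.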
Alternatively, https://github.com/coreos/etcd/pull/4952 seems to indicate that this happens when there's a file that's not named .snap in the wal directory. Can you stop atomic-openshift-master, create a backup of /var/lib/origin/openshift.local.etcd, then remove the '0.tmp' file and retry?
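The copy-then-remove part of that request could be sketched in Python as follows. This is only an illustration: the `member/wal` subdirectory is an assumption about the etcd 2.x data layout, the function names are hypothetical, and atomic-openshift-master must be stopped separately before running anything like this. Whether removing the file avoids the failure is not established here.

```python
import os
import shutil


def find_tmp_files(wal_dir):
    """Return the sorted names of regular files ending in .tmp inside `wal_dir`."""
    return sorted(
        name
        for name in os.listdir(wal_dir)
        if name.endswith(".tmp") and os.path.isfile(os.path.join(wal_dir, name))
    )


def backup_and_remove_tmp(data_dir, backup_dir,
                          wal_subdir=os.path.join("member", "wal")):
    """Copy `data_dir` to `backup_dir`, then delete .tmp files from the WAL dir.

    `wal_subdir` is an assumed layout, not something confirmed in this report.
    copytree raises if `backup_dir` already exists, so an earlier backup is
    never overwritten.
    """
    shutil.copytree(data_dir, backup_dir)
    wal_dir = os.path.join(data_dir, wal_subdir)
    removed = find_tmp_files(wal_dir)
    for name in removed:
        os.remove(os.path.join(wal_dir, name))
    return removed


# Example call (the backup directory name is a placeholder):
# backup_and_remove_tmp("/var/lib/origin/openshift.local.etcd",
#                       "/var/lib/origin/openshift.local.etcd.bak")
```

The copy is taken before anything is deleted, so the removed files remain available in the backup directory.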
*** Bug 1391236 has been marked as a duplicate of this bug. ***
Workaround for now in my case, after it fails, since it seems to have created a backup:
1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml
2. Modify the task:
- name: Generate etcd backup
  command: >
    etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
    --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
  tags:
  - backup_etcd
Notice that the task now contains a tag.
3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd
(In reply to Alfredo Quiroga from comment #10)
> Workaround for now in my case after it fails since it seems to have created
> a backup:
>
> 1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml
>
> 2. Modify the task:
>
> - name: Generate etcd backup
>   command: >
>     etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
>     --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
>   tags:
>   - backup_etcd
>
> Notice that the task now contain a tag.
>
> 3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd
We need to take a cold backup prior to the upgrade.
(In reply to Alfredo Quiroga from comment #10)
> Workaround for now in my case after it fails since it seems to have created
> a backup:
>
> 1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml
>
> 2. Modify the task:
>
> - name: Generate etcd backup
>   command: >
>     etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
>     --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
>   tags:
>   - backup_etcd
>
> Notice that the task now contain a tag.
>
> 3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd
New file path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml
Stopping atomic-openshift-master first doesn't fix the issue. However, upgrading to etcd3-3.0 does.
Proposed fix: https://github.com/openshift/openshift-ansible/pull/2745
I will run the upgrade for several days and then change the status.
Linking this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1393187, which I suspect is related to etcd3.
I didn't hit the etcd backup error in two days, so I am moving this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2778
There's a bug in this fix because the logic depended on an etcd change that was reverted. Cloning this bug into a new bug to address that. The error will be:
TASK [Install etcd3 (for etcdctl)] *********************************************
fatal: [master.example.com]: FAILED! => {
    "changed": true,
    "failed": true,
    "rc": 1,
    "results": [
        "Loaded plugins: search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package etcd3.x86_64 0:3.0.3-1.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version Repository Size\n================================================================================\nInstalling:\n etcd3 x86_64 3.0.3-1.el7 rhel-7-server-extras-rpms 9.4 M\n\nTransaction Summary\n================================================================================\nInstall 1 Package\n\nTotal download size: 9.4 M\nInstalled size: 45 M\nDownloading packages:\nRunning transaction check\nRunning transaction test\n"
    ]
}
MSG:
Transaction check error:
  file /usr/bin/etcd from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /etc/etcd/etcd.conf from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /usr/bin/etcdctl from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
Error Summary
Running `yum swap etcd etcd3 -y` manually before running the updater works
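For anyone scripting that manual step, a rough Python sketch could run the swap only when the etcd 2.x package is actually installed. This is an assumption-laden illustration, not the fix that shipped: the function name and the injectable `run` parameter are hypothetical, and the only commands used are `rpm -q etcd` and the `yum swap etcd etcd3 -y` command quoted above.

```python
import subprocess


def swap_etcd_if_needed(run=subprocess.call):
    """Run `yum swap etcd etcd3 -y` only if the `etcd` package is installed.

    `run` takes an argument list and returns an exit code; it defaults to
    subprocess.call and is injectable so the logic can be tried without root.
    Returns True if the swap ran, False if `etcd` was not installed.
    """
    # `rpm -q <name>` exits non-zero when no package with that name is installed.
    if run(["rpm", "-q", "etcd"]) != 0:
        return False
    rc = run(["yum", "swap", "etcd", "etcd3", "-y"])
    if rc != 0:
        raise RuntimeError("yum swap failed with exit code %d" % rc)
    return True
```

Calling `swap_etcd_if_needed()` as root before starting the upgrade playbook would mirror the manual workaround.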