Bug 1382634
| Summary: | Asynchronous errata upgrade to OSE 3.3.0.34 fails | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jaspreet Kaur <jkaur> | |
| Component: | Cluster Version Operator | Assignee: | Scott Dodson <sdodson> | |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 3.3.0 | CC: | anli, aos-bugs, aquiroga, chernand, jkaur, jokerman, jwesterl, mmccomas, nschuetz, tobias.genannt | |
| Target Milestone: | --- | |||
| Target Release: | 3.3.1 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: |
Previously, embedded etcd environments used etcd 2.x to back up the etcd data before performing an upgrade. However, etcd 2.x has a bug that prevents backups from working properly, which in turn prevents the upgrade playbooks from running to completion. For embedded etcd environments we now install etcd 3.0, which resolves the bug and allows upgrades to proceed normally. This bug only presents itself when using the embedded etcd service in single-master environments.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1396547 (view as bug list) | Environment: | ||
| Last Closed: | 2016-11-15 19:09:42 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1396547 | |||
The etcd service can still be started after the backup has failed. I guess etcdctl didn't include the etcdrepair feature. By the way, it is recommended to place /var/lib/openshift on a separate disk partition.

Is this embedded etcd, as in you don't have [etcd] hosts defined?

@scott, it can be any. The failure rate is 10% when I re-run the upgrade playbook. The good news is that I never hit this issue on the first upgrade.

I got an exception when upgrading to v3.4:
TASK [Generate etcd backup] ****************************************************
fatal: [openshift-199.lab.eng.nay.redhat.com]: FAILED! => {
"changed": true,
"cmd": [
"etcdctl",
"backup",
"--data-dir=/var/lib/origin/openshift.local.etcd",
"--backup-dir=/var/lib/origin/etcd-backup-20161028050818"
],
"delta": "0:00:01.229634",
"end": "2016-10-28 05:08:18.920502",
"failed": true,
"rc": 2,
"start": "2016-10-28 05:08:17.690868",
"warnings": []
}
STDERR:
2016-10-28 05:08:17.904324 W | snap: skipped unexpected non snapshot file db
2016-10-28 05:08:18.011340 W | wal: ignored file 0.tmp in wal
panic: runtime error: makeslice: len out of range
goroutine 1 [running]:
panic(0xc52260, 0xc82033d310)
/usr/lib/golang/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/wal.(*decoder).decode(0xc82011cea0, 0xc8201872c0, 0x0, 0x0)
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/decoder.go:55 +0x142
github.com/coreos/etcd/wal.(*WAL).ReadAll(0xc8200f61a0, 0xc820145400, 0x15, 0x20, 0x7, 0xeb6f55ae45478faf, 0x584b, 0x0, 0x0, 0x0, ...)
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/wal.go:237 +0x442
github.com/coreos/etcd/etcdctl/command.handleBackup(0xc82010f7a0)
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/command/backup_command.go:93 +0x90f
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.Command.Run(0xd94728, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe45c00, 0x18, 0x0, ...)
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/command.go:137 +0x1081
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.(*App).Run(0xc82010f560, 0xc82000a1c0, 0x4, 0x4, 0x0, 0x0)
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/app.go:175 +0xffa
main.main()
/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/main.go:69 +0x1aae
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade.retry
Searching through etcd issues on GitHub, it looks like this may happen when the disk that holds the etcd storage is full. Can anyone confirm if that's the case?

Or, https://github.com/coreos/etcd/pull/4952 seems to indicate that this happens when there's a file that's not named .snap in the wal directory. Can you stop atomic-openshift-master, create a backup of /var/lib/origin/openshift.local.etcd, then remove the '0.tmp' file and retry?

*** Bug 1391236 has been marked as a duplicate of this bug. ***

Workaround for now in my case, after it fails, since it seems to have created a backup:
1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml
2. Modify the task:
   - name: Generate etcd backup
     command: >
       etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
       --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
     tags:
     - backup_etcd
   Notice that the task now contains a tag.
3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd
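Because the workaround above skips the playbook's own backup step, it may be worth taking a plain file-level copy of the data directory first, as suggested earlier in this thread. The following is only a minimal sketch under stated assumptions: `stray_wal_files` and `cold_copy` are hypothetical helper names, the `etcd-cold-backup-` prefix is made up for illustration, and the master service is assumed to be stopped already. The sketch only lists non-.wal files; it does not delete anything.

```python
import shutil
import time
from pathlib import Path


def stray_wal_files(wal_dir):
    """List files in wal_dir whose names do not end in .wal (e.g. 0.tmp)."""
    return sorted(
        p.name
        for p in Path(wal_dir).iterdir()
        if p.is_file() and p.suffix != ".wal"
    )


def cold_copy(data_dir, dest_root):
    """Copy data_dir into a new timestamped directory under dest_root.

    Assumption: the service that owns data_dir has been stopped by the
    caller, so the files are not changing during the copy.
    """
    stamp = time.strftime("%Y%m%d%H%M%S")
    dest = Path(dest_root) / f"etcd-cold-backup-{stamp}"  # hypothetical name
    shutil.copytree(data_dir, dest)
    return dest
```

On the hosts in this report the data directory would be /var/lib/origin/openshift.local.etcd; the location of the wal directory inside it (commonly `member/wal`) is an assumption and should be checked on the actual host before use.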
(In reply to Alfredo Quiroga from comment #10)
We need a cold backup prior to the upgrade.

(In reply to Alfredo Quiroga from comment #10)
New file path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml

Stopping atomic-openshift-master first doesn't fix the issue. However, upgrading to etcd3-3.0 does.

I will run the upgrade for several days and then change the status.

Linking this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1393187, which I suspect involves etcd3.

I didn't hit the etcd backup error in two days, so moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2778

There's a bug in this fix because the logic depended on an etcd change that was reverted. Cloning this bug into a new bug to address that.
The error will be
TASK [Install etcd3 (for etcdctl)] *********************************************
fatal: [master.example.com]: FAILED! => {
"changed": true,
"failed": true,
"rc": 1,
"results": [
"Loaded plugins: search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package etcd3.x86_64 0:3.0.3-1.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package Arch Version Repository Size\n================================================================================\nInstalling:\n etcd3 x86_64 3.0.3-1.el7 rhel-7-server-extras-rpms 9.4 M\n\nTransaction Summary\n================================================================================\nInstall 1 Package\n\nTotal download size: 9.4 M\nInstalled size: 45 M\nDownloading packages:\nRunning transaction check\nRunning transaction test\n"
]
}
MSG:
Transaction check error:
file /usr/bin/etcd from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
file /etc/etcd/etcd.conf from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
file /usr/bin/etcdctl from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
Error Summary
Running `yum swap etcd etcd3 -y` manually before running the updater works.
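To spot this situation in a saved playbook log before deciding whether the manual swap is needed, the conflict lines in the transaction check error can be parsed. This is a rough sketch only: `parse_conflicts` is a hypothetical helper, and the pattern is written against the three lines shown above rather than any documented yum output format.

```python
import re

# Matches lines such as:
#   file /usr/bin/etcd from install of etcd3-... conflicts with file from package etcd-...
CONFLICT_RE = re.compile(
    r"file (?P<path>\S+) from install of (?P<new>\S+) "
    r"conflicts with file from package (?P<old>\S+)"
)


def parse_conflicts(msg):
    """Return (path, incoming_package, installed_package) for each conflict line."""
    return [(m["path"], m["new"], m["old"]) for m in CONFLICT_RE.finditer(msg)]
```

If every returned tuple pairs an etcd3 package with an installed etcd package, the log matches the failure described here.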
Description of problem:
Upgrading from v3.3.0.32 to 3.3.0.34 fails with the error below:

TASK [Generate etcd backup] ****************************************************
fatal: [master31.example.com]: FAILED! => {"changed": true, "cmd": ["etcdctl", "backup", "--data-dir=/var/lib/origin/openshift.local.etcd", "--backup-dir=/var/lib/origin/etcd-backup-20161007130217"], "delta": "0:00:00.478888", "end": "2016-10-07 13:02:18.483900", "failed": true, "rc": 1, "start": "2016-10-07 13:02:18.005012", "stderr": "2016-10-07 13:02:18.030053 W | snap: skipped unexpected non snapshot file db\n2016-10-07 13:02:18.121156 W | wal: ignored file 16.tmp in wal\n2016-10-07 13:02:18.481813 I | walpb: crc mismatch", "stdout": "", "stdout_lines": [], "warnings": []}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. yum update atomic-openshift-utils
2. Upgrade using the playbook: ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results:
Failed

Expected results:
Should have upgraded without issue.

Additional info:
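The single-line failure record in the description is hard to read by eye. As a small sketch, assuming the record has the shape `fatal: [host]: FAILED! => {json}` seen in this report, the JSON payload can be pulled apart to show the return code and the individual stderr lines. `parse_ansible_failure` is a hypothetical helper name, not part of Ansible.

```python
import json


def parse_ansible_failure(line):
    """Split 'fatal: [host]: FAILED! => {...}' into (host, result dict)."""
    head, sep, payload = line.partition(" => ")
    if not sep:
        raise ValueError("not an Ansible failure record")
    # The host name sits between the first pair of square brackets.
    host = head[head.index("[") + 1 : head.index("]")]
    return host, json.loads(payload)


def stderr_lines(result):
    """Return the stderr field of a result dict as a list of lines."""
    return result.get("stderr", "").splitlines()
```

Applied to the record above, the last stderr line is the `walpb: crc mismatch` message, which differs from the `makeslice: len out of range` panic seen in the v3.4 run.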