Bug 1382634 - Asynchronous errata upgrade to OSE 3.3.0.34 fails
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.3.1
Assigned To: Scott Dodson
QA Contact: Anping Li
Duplicates: 1391236
Depends On:
Blocks: 1396547
 
Reported: 2016-10-07 05:11 EDT by Jaspreet Kaur
Modified: 2016-11-21 15:47 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, in embedded etcd environments, etcd 2.x was used to back up the etcd data before performing an upgrade. However, etcd 2.x has a bug that prevents backups from working properly, which in turn prevents the upgrade playbooks from running to completion. For embedded etcd environments, etcd 3.0 is now installed, which resolves the bug and allows upgrades to proceed normally. This bug only presents itself when using the embedded etcd service in single-master environments.
Story Points: ---
Clone Of:
Clones: 1396547
Environment:
Last Closed: 2016-11-15 14:09:42 EST
Type: Bug


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2016:2778
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: atomic-openshift-utils security and bug fix update
Last Updated: 2016-11-15 19:08:29 EST

Description Jaspreet Kaur 2016-10-07 05:11:14 EDT
Description of problem: Upgrading from v3.3.0.32 to v3.3.0.34 fails with the error below:

TASK [Generate etcd backup] ****************************************************
fatal: [master31.example.com]: FAILED! => {"changed": true, "cmd": ["etcdctl", "backup", "--data-dir=/var/lib/origin/openshift.local.etcd", "--backup-dir=/var/lib/origin/etcd-backup-20161007130217"], "delta": "0:00:00.478888", "end": "2016-10-07 13:02:18.483900", "failed": true, "rc": 1, "start": "2016-10-07 13:02:18.005012", "stderr": "2016-10-07 13:02:18.030053 W | snap: skipped unexpected non snapshot file db\n2016-10-07 13:02:18.121156 W | wal: ignored file 16.tmp in wal\n2016-10-07 13:02:18.481813 I | walpb: crc mismatch", "stdout": "", "stdout_lines": [], "warnings": []}



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. yum update atomic-openshift-utils
2. Upgrade using the playbook:

 ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results: The upgrade failed.


Expected results: Should have upgraded without issue.


Additional info:
Comment 2 Anping Li 2016-10-21 01:01:11 EDT
The etcd service can still be started after the backup fails. I guess etcdctl doesn't include the etcd repair feature.
By the way, it is recommended to place /var/lib/openshift on a separate disk partition.
Comment 3 Scott Dodson 2016-10-27 10:25:27 EDT
Is this embedded etcd, as in you don't have [etcd] hosts defined?
Comment 4 Anping Li 2016-10-27 22:41:03 EDT
@scott, it can be either. The failure rate is 10% when I re-run the upgrade playbook. The good news is that I never hit this issue on the first upgrade.
Comment 5 Anping Li 2016-10-28 05:12:30 EDT
I got an exception when upgrading to v3.4.

TASK [Generate etcd backup] ****************************************************
fatal: [openshift-199.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/origin/openshift.local.etcd", 
        "--backup-dir=/var/lib/origin/etcd-backup-20161028050818"
    ], 
    "delta": "0:00:01.229634", 
    "end": "2016-10-28 05:08:18.920502", 
    "failed": true, 
    "rc": 2, 
    "start": "2016-10-28 05:08:17.690868", 
    "warnings": []
}

STDERR:

2016-10-28 05:08:17.904324 W | snap: skipped unexpected non snapshot file db
2016-10-28 05:08:18.011340 W | wal: ignored file 0.tmp in wal
panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
panic(0xc52260, 0xc82033d310)
	/usr/lib/golang/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/wal.(*decoder).decode(0xc82011cea0, 0xc8201872c0, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/decoder.go:55 +0x142
github.com/coreos/etcd/wal.(*WAL).ReadAll(0xc8200f61a0, 0xc820145400, 0x15, 0x20, 0x7, 0xeb6f55ae45478faf, 0x584b, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/wal.go:237 +0x442
github.com/coreos/etcd/etcdctl/command.handleBackup(0xc82010f7a0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/command/backup_command.go:93 +0x90f
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.Command.Run(0xd94728, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe45c00, 0x18, 0x0, ...)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/command.go:137 +0x1081
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.(*App).Run(0xc82010f560, 0xc82000a1c0, 0x4, 0x4, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/app.go:175 +0xffa
main.main()
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/main.go:69 +0x1aae
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade.retry
Comment 6 Scott Dodson 2016-11-01 10:08:07 EDT
Searching through the etcd issues on GitHub, it looks like this may happen when the disk that holds the etcd storage is full. Can anyone confirm whether that's the case?
Comment 7 Scott Dodson 2016-11-01 10:11:02 EDT
Or, https://github.com/coreos/etcd/pull/4952 seems to indicate that this happens when there's a file that's not named .snap in the wal directory. Can you stop atomic-openshift-master, create a backup of /var/lib/origin/openshift.local.etcd, then remove the '0.tmp' file and retry?
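
Before removing anything, it may help to list which files in the WAL directory look out of place. The following is only a rough sketch, not part of the playbooks: it assumes the etcd 2.x on-disk layout (WAL segments under member/wal, each ending in .wal), and `find_stray_wal_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_stray_wal_files(data_dir):
    """Return sorted names of files in <data_dir>/member/wal that lack a .wal suffix.

    Assumption: etcd 2.x keeps its write-ahead log under member/wal.
    Returns an empty list if that directory does not exist.
    """
    wal_dir = Path(data_dir) / "member" / "wal"
    if not wal_dir.is_dir():
        return []
    return sorted(
        p.name for p in wal_dir.iterdir() if p.is_file() and p.suffix != ".wal"
    )


if __name__ == "__main__":
    # Data directory taken from the failing task's --data-dir argument.
    for name in find_stray_wal_files("/var/lib/origin/openshift.local.etcd"):
        print(name)
```

Files such as the '0.tmp' and '16.tmp' named in the stderr output above would be reported by this check. The script only lists files and leaves removal to the operator.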
Comment 9 Scott Dodson 2016-11-03 10:15:29 EDT
*** Bug 1391236 has been marked as a duplicate of this bug. ***
Comment 10 Alfredo Quiroga 2016-11-03 11:27:25 EDT
My workaround for now, applied after the task fails (it seems to have created a backup anyway):

1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml

2. Modify the task:

  - name: Generate etcd backup
    command: >
      etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
      --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
    tags:
      - backup_etcd

Notice that the task now contains a tag.

3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd
Comment 12 Anping Li 2016-11-03 21:24:38 EDT
(In reply to Alfredo Quiroga from comment #10)
We need a cold backup prior to the upgrade.
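
A cold backup here means copying the data directory while the service is stopped. The following is only a minimal sketch of the copy step, under the assumption that atomic-openshift-master has already been stopped by the operator; `cold_backup` is a hypothetical helper, and the destination name simply mirrors the etcd-backup-<timestamp> pattern used by the playbook task.

```python
import shutil
from pathlib import Path


def cold_backup(data_dir, backup_root, timestamp):
    """Copy data_dir to <backup_root>/etcd-backup-<timestamp> and return that path.

    Assumption: the caller has stopped the service that owns data_dir,
    so the files are not changing during the copy.
    """
    src = Path(data_dir)
    dest = Path(backup_root) / f"etcd-backup-{timestamp}"
    # copytree refuses to overwrite: it raises FileExistsError if dest exists.
    shutil.copytree(src, dest)
    return dest
```

For example, `cold_backup("/var/lib/origin/openshift.local.etcd", "/var/lib/origin", "20161103")` would copy the embedded etcd data next to the original directory.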
Comment 13 Christian Hernandez 2016-11-04 17:35:16 EDT
(In reply to Alfredo Quiroga from comment #10)
The task has moved; the new file path is:

/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml
Comment 14 Scott Dodson 2016-11-07 08:01:36 EST
Stopping atomic-openshift-master first doesn't fix the issue. However, upgrading to etcd3-3.0 does.
Comment 15 Scott Dodson 2016-11-07 10:47:16 EST
Proposed fix: https://github.com/openshift/openshift-ansible/pull/2745
Comment 17 Anping Li 2016-11-08 23:24:25 EST
I will run upgrades for several days and then change the status.
Comment 18 Anping Li 2016-11-09 00:27:26 EST
Linking this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1393187, which I suspect is related to etcd3.
Comment 19 Anping Li 2016-11-10 05:46:25 EST
I didn't hit the etcd backup error in two days, so I am moving this to verified.
Comment 20 errata-xmlrpc 2016-11-15 14:09:42 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2778
Comment 23 Scott Dodson 2016-11-18 10:35:08 EST
There's a bug in this fix because the logic depended on an etcd change that was reverted. Cloning this bug into a new bug to address that. 

The error will be:

TASK [Install etcd3 (for etcdctl)] *********************************************
fatal: [master.example.com]: FAILED! => {
    "changed": true, 
    "failed": true, 
    "rc": 1, 
    "results": [
        "Loaded plugins: search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package etcd3.x86_64 0:3.0.3-1.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package    Arch        Version            Repository                      Size\n================================================================================\nInstalling:\n etcd3      x86_64      3.0.3-1.el7        rhel-7-server-extras-rpms      9.4 M\n\nTransaction Summary\n================================================================================\nInstall  1 Package\n\nTotal download size: 9.4 M\nInstalled size: 45 M\nDownloading packages:\nRunning transaction check\nRunning transaction test\n"
    ]
}

MSG:



Transaction check error:
  file /usr/bin/etcd from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /etc/etcd/etcd.conf from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /usr/bin/etcdctl from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64

Error Summary
Comment 24 Christian Hernandez 2016-11-21 15:47:01 EST
Running `yum swap etcd etcd3 -y` manually before running the updater works.
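
To confirm that a failed run hit this specific package conflict before applying the swap, the "Transaction check error" lines can be parsed. This is only an illustrative sketch: `conflicting_packages` is a hypothetical helper, and the regular expression assumes the message wording shown in comment 23.

```python
import re

# Matches lines such as:
#   file /usr/bin/etcd from install of etcd3-... conflicts with file from package etcd-...
CONFLICT_RE = re.compile(
    r"file (?P<path>\S+) from install of (?P<new>\S+) "
    r"conflicts with file from package (?P<old>\S+)"
)


def conflicting_packages(yum_output):
    """Return the set of (new_package, installed_package) pairs found in yum output."""
    return {
        (m.group("new"), m.group("old")) for m in CONFLICT_RE.finditer(yum_output)
    }
```

If the only pair reported is the etcd3 package against the installed etcd 2.x package, the failure matches the one described in comment 23.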
