1382634 – Asynchronous errata upgrade to OSE 3.3.0.34 fails

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.3.1
Assignee:	Scott Dodson
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1391236
Depends On:
Blocks:	1396547

Reported:	2016-10-07 09:11 UTC by Jaspreet Kaur
Modified:	2019-12-16 07:01 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, in embedded etcd environments, etcd 2.x was used to back up the etcd data before performing an upgrade. However, etcd 2.x has a bug that prevents backups from working properly, which in turn prevents the upgrade playbooks from running to completion. For embedded etcd environments, etcd 3.0 is now installed, which resolves the bug and allows upgrades to proceed normally. This bug only presents itself when using the embedded etcd service on single-master environments.
Clone Of:
Clones:	1396547
Environment:
Last Closed:	2016-11-15 19:09:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2016:2778	0	normal	SHIPPED_LIVE	Moderate: atomic-openshift-utils security and bug fix update	2016-11-16 00:08:29 UTC

Description Jaspreet Kaur 2016-10-07 09:11:14 UTC

Description of problem: Upgrading from v3.3.0.32 to v3.3.0.34 fails with the error below:

TASK [Generate etcd backup] ****************************************************
fatal: [master31.example.com]: FAILED! => {"changed": true, "cmd": ["etcdctl", "backup", "--data-dir=/var/lib/origin/openshift.local.etcd", "--backup-dir=/var/lib/origin/etcd-backup-20161007130217"], "delta": "0:00:00.478888", "end": "2016-10-07 13:02:18.483900", "failed": true, "rc": 1, "start": "2016-10-07 13:02:18.005012", "stderr": "2016-10-07 13:02:18.030053 W | snap: skipped unexpected non snapshot file db\n2016-10-07 13:02:18.121156 W | wal: ignored file 16.tmp in wal\n2016-10-07 13:02:18.481813 I | walpb: crc mismatch", "stdout": "", "stdout_lines": [], "warnings": []}



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. yum update atomic-openshift-utils
2. Upgrade using the playbook:

 ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results: Failed 


Expected results: Should have upgraded without issue.


Additional info:

Comment 2 Anping Li 2016-10-21 05:01:11 UTC

The etcd service can still be started after the backup fails. I guess etcdctl doesn't include the etcdrepair feature.
By the way, it is recommended to place /var/lib/openshift on a separate disk partition.

Comment 3 Scott Dodson 2016-10-27 14:25:27 UTC

Is this embedded etcd? As in, you don't have [etcd] hosts defined?

Comment 4 Anping Li 2016-10-28 02:41:03 UTC

@scott, it can be either. The failure rate is about 10% when I re-run the upgrade playbook. The good news is that I never hit this issue on the first upgrade.

Comment 5 Anping Li 2016-10-28 09:12:30 UTC

Got an exception when upgrading to v3.4:

TASK [Generate etcd backup] ****************************************************
fatal: [openshift-199.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "etcdctl", 
        "backup", 
        "--data-dir=/var/lib/origin/openshift.local.etcd", 
        "--backup-dir=/var/lib/origin/etcd-backup-20161028050818"
    ], 
    "delta": "0:00:01.229634", 
    "end": "2016-10-28 05:08:18.920502", 
    "failed": true, 
    "rc": 2, 
    "start": "2016-10-28 05:08:17.690868", 
    "warnings": []
}

STDERR:

2016-10-28 05:08:17.904324 W | snap: skipped unexpected non snapshot file db
2016-10-28 05:08:18.011340 W | wal: ignored file 0.tmp in wal
panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
panic(0xc52260, 0xc82033d310)
	/usr/lib/golang/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/wal.(*decoder).decode(0xc82011cea0, 0xc8201872c0, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/decoder.go:55 +0x142
github.com/coreos/etcd/wal.(*WAL).ReadAll(0xc8200f61a0, 0xc820145400, 0x15, 0x20, 0x7, 0xeb6f55ae45478faf, 0x584b, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/wal/wal.go:237 +0x442
github.com/coreos/etcd/etcdctl/command.handleBackup(0xc82010f7a0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/command/backup_command.go:93 +0x90f
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.Command.Run(0xd94728, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe45c00, 0x18, 0x0, ...)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/command.go:137 +0x1081
github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli.(*App).Run(0xc82010f560, 0xc82000a1c0, 0x4, 0x4, 0x0, 0x0)
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/codegangsta/cli/app.go:175 +0xffa
main.main()
	/builddir/build/BUILD/etcd-2.3.7/src/github.com/coreos/etcd/etcdctl/main.go:69 +0x1aae
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade.retry

Comment 6 Scott Dodson 2016-11-01 14:08:07 UTC

Searching through etcd issues on github, it looks like this happens when the disk that etcd storage is on may be full? Can anyone confirm if that's the case?

Comment 7 Scott Dodson 2016-11-01 14:11:02 UTC

Or, https://github.com/coreos/etcd/pull/4952 seems to indicate that this happens when there's a file that's not named .snap in the wal directory. Can you stop atomic-openshift-master, create a backup of /var/lib/origin/openshift.local.etcd, and then remove the '0.tmp' file and retry?
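
The check suggested in this comment can be sketched in shell. This is a minimal sketch under stated assumptions, not a verified fix (comment 14 later reports that stopping the master first does not resolve the failure). The function name list_stray_wal_files, the member/wal subdirectory layout, and the .bak copy destination are assumptions for illustration and do not come from this report.

```shell
#!/bin/sh
# Hypothetical helper: print every file in the WAL directory that does not
# end in .wal (for example a leftover 0.tmp).
# Assumption: WAL files live under <data-dir>/member/wal.
list_stray_wal_files() {
    wal_dir="$1/member/wal"
    [ -d "$wal_dir" ] || return 0
    for f in "$wal_dir"/*; do
        [ -e "$f" ] || continue      # directory is empty, glob did not expand
        case "$f" in
            *.wal) ;;                # regular WAL segment, nothing to report
            *) printf '%s\n' "$f" ;; # anything else is reported
        esac
    done
}

# Manual steps from the comment, left commented out on purpose.
# The .bak destination is a hypothetical path.
# systemctl stop atomic-openshift-master
# cp -a /var/lib/origin/openshift.local.etcd /var/lib/origin/openshift.local.etcd.bak
# list_stray_wal_files /var/lib/origin/openshift.local.etcd
```

The helper only lists candidates; removing a file is left to the operator, after the copy of the data directory has been made.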

Comment 9 Scott Dodson 2016-11-03 14:15:29 UTC

*** Bug 1391236 has been marked as a duplicate of this bug. ***

Comment 10 Alfredo Quiroga 2016-11-03 15:27:25 UTC

Workaround for now in my case after it fails since it seems to have created a backup:

1. vim /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre.yml

2. Modify the task:

  - name: Generate etcd backup
    command: >
      etcdctl backup --data-dir={{ openshift.etcd.etcd_data_dir }}
      --backup-dir={{ openshift.common.data_dir }}/etcd-backup-{{ timestamp }}
    tags:
      - backup_etcd

Notice that the task now contains a tag.

3. ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml --skip-tags=backup_etcd

Comment 12 Anping Li 2016-11-04 01:24:38 UTC

(In reply to Alfredo Quiroga from comment #10)

We need to take a cold backup prior to the upgrade.
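
A cold backup of the embedded etcd data could look roughly like the sketch below. The helper name cold_backup and the etcd-cold-backup-<timestamp> directory name are hypothetical. The data directory and service name are the ones mentioned earlier in this report, and stopping atomic-openshift-master first is assumed to be enough to quiesce embedded etcd.

```shell
#!/bin/sh
# Hypothetical helper: copy an etcd data directory into a timestamped
# directory under the given parent and print the destination path.
# The caller is expected to have stopped the owning service first.
cold_backup() {
    src="$1"
    dest="$2/etcd-cold-backup-$(date +%Y%m%d%H%M%S)"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    # "$src/." copies the directory contents, including hidden files.
    mkdir -p "$dest" && cp -a "$src/." "$dest/" && printf '%s\n' "$dest"
}

# Example use on an embedded-etcd master (commented out on purpose):
# systemctl stop atomic-openshift-master
# cold_backup /var/lib/origin/openshift.local.etcd /var/lib/origin
# systemctl start atomic-openshift-master
```

Unlike the playbook's etcdctl backup task, this is a plain file copy, so it does not depend on etcdctl being able to read the WAL.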

Comment 13 Christian Hernandez 2016-11-04 21:35:16 UTC

(In reply to Alfredo Quiroga from comment #10)

New file path:

/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml

Comment 14 Scott Dodson 2016-11-07 13:01:36 UTC

Stopping atomic-openshift-master first doesn't fix the issue. However, upgrading to etcd3-3.0 does.

Comment 15 Scott Dodson 2016-11-07 15:47:16 UTC

Proposed fix https://github.com/openshift/openshift-ansible/pull/2745

Comment 17 Anping Li 2016-11-09 04:24:25 UTC

I will run upgrades for several days and then change the status.

Comment 18 Anping Li 2016-11-09 05:27:26 UTC

Linking this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1393187, which I suspect is related to etcd3.

Comment 19 Anping Li 2016-11-10 10:46:25 UTC

I didn't hit the etcd backup error in two days, so moving to verified.

Comment 20 errata-xmlrpc 2016-11-15 19:09:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2778

Comment 23 Scott Dodson 2016-11-18 15:35:08 UTC

There's a bug in this fix because the logic depended on an etcd change that was reverted. Cloning this bug into a new bug to address that. 

The error will be:

TASK [Install etcd3 (for etcdctl)] *********************************************
fatal: [master.example.com]: FAILED! => {
    "changed": true, 
    "failed": true, 
    "rc": 1, 
    "results": [
        "Loaded plugins: search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package etcd3.x86_64 0:3.0.3-1.el7 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package    Arch        Version            Repository                      Size\n================================================================================\nInstalling:\n etcd3      x86_64      3.0.3-1.el7        rhel-7-server-extras-rpms      9.4 M\n\nTransaction Summary\n================================================================================\nInstall  1 Package\n\nTotal download size: 9.4 M\nInstalled size: 45 M\nDownloading packages:\nRunning transaction check\nRunning transaction test\n"
    ]
}

MSG:



Transaction check error:
  file /usr/bin/etcd from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /etc/etcd/etcd.conf from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64
  file /usr/bin/etcdctl from install of etcd3-3.0.3-1.el7.x86_64 conflicts with file from package etcd-2.3.7-4.el7.x86_64

Error Summary

Comment 24 Christian Hernandez 2016-11-21 20:47:01 UTC

Running `yum swap etcd etcd3 -y` manually before running the updater works
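
The manual swap can be guarded so that it only runs when the conflicting etcd 2.x package is actually installed. This is a minimal sketch: the helper name is_etcd2_package is hypothetical, and it assumes the package string has the form shown in the transaction error above (etcd-2.3.7-4.el7.x86_64).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the given package string, as
# printed by `rpm -q etcd`, names an etcd 2.x build such as
# etcd-2.3.7-4.el7.x86_64.
is_etcd2_package() {
    case "$1" in
        etcd-2.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use before the upgrade playbook (commented out on purpose):
# if is_etcd2_package "$(rpm -q etcd)"; then
#     yum swap etcd etcd3 -y
# fi
```

If etcd is not installed at all, the rpm query output does not match the pattern and the swap is skipped.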
