1393187 – etcd cluster is unavailable or misconfigured during upgrade

Summary: etcd cluster is unavailable or misconfigured during upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Scott Dodson
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-11-09 04:18 UTC by Anping Li
Modified:	2017-05-04 08:29 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, the upgrade playbook would inadvertently upgrade etcd when it should not have. If this triggered an upgrade to etcd3, the upgrade would fail because etcd became unavailable. etcd is no longer upgraded when it is not necessary, so upgrades proceed successfully.
Clone Of:
Environment:
Last Closed:	2017-01-18 12:51:02 UTC
Target Upstream Version:
Embargoed:

Attachments
ansible logs (117.26 KB, text/plain) 2016-11-09 04:18 UTC, Anping Li	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0066	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.4 RPM Release Advisory	2017-01-18 17:23:26 UTC

Description Anping Li 2016-11-09 04:18:34 UTC

Created attachment 1218784 [details]
ansible logs

Description of problem:
The upgrade failed to evacuate some pods.

Version-Release number of selected component (if applicable):



How reproducible:
once

Steps to Reproduce:
1. Install OCP-3.2
2. Upgrade to OCP-3.3
  ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml


Actual results:
NAME                            READY     STATUS    RESTARTS   AGE
cakephp-mysql-example-1-75dy2   1/1       Running   0          19d
mysql-1-fass1                   1/1       Running   0          19d


STDERR:


Migrating these pods on node: openshift-181.lab.eng.nay.redhat.com

E1108 22:12:43.117701   18552 evacuate.go:135] Unable to delete a pod: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:cakephp-mysql-example-1-build GenerateName: Namespace:cakephp SelfLink:/api/v1/namespaces/cakephp/pods/cakephp-mysql-example-1-build UID:d946d280-969f-11e6-93a6-fa163e309439 ResourceVersion:1075 Generation:0 CreationTimestamp:{Time:2016-10-20 04:33:17 -0400 EDT} DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[openshift.io/build.name:cakephp-mysql-example-1] Annotations:map[openshift.io/build.name:cakephp-mysql-example-1 openshift.io/scc:privileged] OwnerReferences:[] Finalizers:[]} Spec:{Volumes:[{Name:docker-socket VolumeSource:{HostPath:0xc8202c4e70 EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:<nil> NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-dockercfg-1px3r-push VolumeSource:{HostPath:<nil> EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8210 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-token-fn0ss VolumeSource:{HostPath:<nil> EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8240 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}}] InitContainers:[] Containers:[{Name:sti-build Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 Command:[] Args:[--loglevel=2] WorkingDir: Ports:[] Env:[{Name:BUILD 
Value:{"kind":"Build","apiVersion":"v1","metadata":{"name":"cakephp-mysql-example-1","namespace":"cakephp","selfLink":"/oapi/v1/namespaces/cakephp/builds/cakephp-mysql-example-1","uid":"d91d0fbe-969f-11e6-93a6-fa163e309439","resourceVersion":"954","creationTimestamp":"2016-10-20T08:33:16Z","labels":{"app":"cakephp-mysql-example","buildconfig":"cakephp-mysql-example","openshift.io/build-config.name":"cakephp-mysql-example","template":"cakephp-mysql-example"},"annotations":{"openshift.io/build-config.name":"cakephp-mysql-example","openshift.io/build.number":"1"}},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/openshift/cakephp-ex.git"},"secrets":null},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/rhscl/php-56-rhel7@sha256:743108b04515500100a0b3d170f23474fadb7ed94497d5556e48691f931bb619"},"env":[{"name":"COMPOSER_MIRROR"}]}},"output":{"to":{"kind":"DockerImage","name":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest"},"pushSecret":{"name":"builder-dockercfg-1px3r"}},"resources":{},"postCommit":{}},"status":{"phase":"New","outputDockerImageReference":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest","config":{"kind":"BuildConfig","namespace":"cakephp","name":"cakephp-mysql-example"}}}
 ValueFrom:<nil>} {Name:BUILD_LOGLEVEL Value:2 ValueFrom:<nil>} {Name:SOURCE_REPOSITORY Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} {Name:SOURCE_URI Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} {Name:ORIGIN_VERSION Value:v3.2.1.15 ValueFrom:<nil>} {Name:ALLOWED_UIDS Value:1- ValueFrom:<nil>} {Name:DROP_CAPS Value:KILL,MKNOD,SETGID,SETUID,SYS_CHROOT ValueFrom:<nil>} {Name:PUSH_DOCKERCFG_PATH Value:/var/run/secrets/openshift.io/push ValueFrom:<nil>}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:docker-socket ReadOnly:false MountPath:/var/run/docker.sock SubPath:} {Name:builder-dockercfg-1px3r-push ReadOnly:true MountPath:/var/run/secrets/openshift.io/push SubPath:} {Name:builder-token-fn0ss ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:<nil> ReadinessProbe:<nil> Lifecycle:<nil> TerminationMessagePath:/dev/termination-log ImagePullPolicy:IfNotPresent SecurityContext:0xc8205c8270 Stdin:false StdinOnce:false TTY:false}] RestartPolicy:Never TerminationGracePeriodSeconds:0xc820353b80 ActiveDeadlineSeconds:<nil> DNSPolicy:ClusterFirst NodeSelector:map[] ServiceAccountName:builder NodeName:openshift-181.lab.eng.nay.redhat.com SecurityContext:0xc8207ae8c0 ImagePullSecrets:[{Name:builder-dockercfg-1px3r}] Hostname: Subdomain:} Status:{Phase:Succeeded Conditions:[{Type:Ready Status:False LastProbeTime:{Time:0001-01-01 00:00:00 +0000 UTC} LastTransitionTime:{Time:2016-10-20 04:36:55 -0400 EDT} Reason:PodCompleted Message:}] Message: Reason: HostIP:10.66.147.181 PodIP:10.1.0.3 StartTime:2016-10-20T04:33:17-04:00 InitContainerStatuses:[] ContainerStatuses:[{Name:sti-build State:{Waiting:<nil> Running:<nil> Terminated:0xc82042e620} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 
ImageID:docker://sha256:bedb99b947a662e46c35b18e798300bf714b71b92a839ed5e15cc129dd352300 ContainerID:docker://da7641ffc2599c60d3cdd7757532761590a6f40d65655b1500885e07c71f666e}]}}, error: client: etcd cluster is unavailable or misconfigured
Error from server: client: etcd cluster is unavailable or misconfigured

NO MORE HOSTS LEFT *************************************************************
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.retry

PLAY RECAP *********************************************************************
localhost                  : ok=29   changed=14   unreachable=0    failed=0   
openshift-181.lab.eng.nay.redhat.com : ok=264  changed=48   unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=59   changed=4    unreachable=0    failed=0   


Expected results:


Additional info:
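The PLAY RECAP in the actual results above can be scanned for the hosts that failed. The following is only a minimal sketch; `failed_hosts` is a hypothetical helper name and is not part of openshift-ansible. It assumes the recap lines keep the `host : ok=N ... failed=N` layout shown in this report.

```shell
# Hypothetical helper (not part of openshift-ansible): print the hosts whose
# PLAY RECAP line reports at least one failed task. Reads the log on stdin.
failed_hosts() {
    # Match only recap-style lines (they carry ok=N) with a non-zero failed count.
    awk '/ok=[0-9]+/ && /failed=[1-9]/ { print $1 }'
}
```

For example, `failed_hosts < ansible.log` (the file name is an assumption) would list one host per line; for the recap in this report that is openshift-181.lab.eng.nay.redhat.com.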

Comment 2 liujia 2016-11-09 05:09:32 UTC

I hit the same error: client: etcd cluster is unavailable or misconfigured

Description of problem:
When upgrading an existing 3.3 installation to the latest 3.3 with the 3.4 quick installer, the upgrade fails at the [restart master] handler with: cacher.go:220] unexpected ListAndWatch error: pkg/storage/cacher.go:163: Failed to list *api.Group: client: etcd cluster is unavailable or misconfigured.

The master service status is "activating", but the service does not work.
# oc get node
The connection to the server 192.168.2.184:8443 was refused - did you specify the right host or port?

Restarting the master service manually also fails.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.17-1.git.0.4698b0c.el7.noarch
openshift-ansible-playbooks-3.4.17-1.git.0.4698b0c.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP 3.3 with the 3.3 quick installer
2. Run the upgrade with the 3.4 quick installer
# atomic-openshift-installer -d -c /tmp/installer.cfg.yml upgrade

This tool will help you upgrade your existing OpenShift installation.
        Currently running: openshift-enterprise 3.3

(1) Update to latest 3.3
(2) Upgrade to next release: 3.4

Choose an option from above:

3. Choose 1
4. The installer continues by running the 3.3 upgrade playbook.
installer - DEBUG - Going to subprocess out to ansible now with these args: ansible-playbook --inventory-file=/tmp/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results:
The 3.3 minor upgrade fails.
RUNNING HANDLER [restart master] ***********************************************
fatal: [openshift-151.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.
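
When the restart handler times out like this, it may help to confirm whether the master journal carries the etcd client error quoted in this bug. A minimal sketch, assuming the error string appears verbatim in the log; `has_etcd_error` is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 when the log on stdin contains the etcd client
# error quoted in this bug, non-zero otherwise.
has_etcd_error() {
    grep -q 'etcd cluster is unavailable or misconfigured'
}

# Possible usage on a master host (not executed here):
#   journalctl -u atomic-openshift-master --no-pager | has_etcd_error \
#       && echo "etcd client error found in master journal"
```

A match only shows that the master could not reach etcd; it does not by itself show that etcd was upgraded.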

Comment 4 Anping Li 2016-11-09 05:23:00 UTC

Just a comment: etcd had already been upgraded to etcd3 before the error appeared. I have hit this issue twice so far, but it does not happen every time.

Comment 5 Scott Dodson 2016-11-09 20:33:07 UTC

Yeah, I'm pretty sure this is happening because the backup step currently upgrades etcd when we don't really intend to do so. We're only installing it for backup purposes on embedded etcd environments where it wouldn't already be installed. The reason you're seeing it sometimes but not others is likely because you've got some hosts with RHEL 7.3 GA repos but some with 7.3.1 repos. In RHEL 7.3.1 etcd3 now obsoletes etcd so it would be seen as an upgrade.

So for now, stop upgrading etcd when backing up etcd.

https://github.com/openshift/openshift-ansible/pull/2773
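
The intent described in this comment (install etcd for backup purposes only when it is missing, and never upgrade an installed copy) can be illustrated roughly as follows. This is a sketch in shell, not the change made in the pull request above, which lives in the Ansible playbooks. `etcd_backup_action` and the `PKG_QUERY` variable are hypothetical names introduced here; `rpm -q` is the only real command assumed.

```shell
# Hypothetical helper, not taken from openshift-ansible: decide what a backup
# step should do with the etcd package. PKG_QUERY is an assumed override for
# the installed-package check; it defaults to `rpm -q`.
etcd_backup_action() {
    query=${PKG_QUERY:-rpm -q}
    # $query is intentionally unquoted so "rpm -q" splits into command + flag.
    if $query etcd >/dev/null 2>&1; then
        echo "skip"      # already installed: leave the installed version alone
    else
        echo "install"   # missing (e.g. embedded etcd host): install for backup
    fi
}
```

The point of the design is that an already installed etcd is never passed to the package manager again, so a repository in which etcd3 obsoletes etcd cannot turn the backup step into an upgrade.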

Comment 6 Scott Dodson 2016-11-09 20:38:13 UTC

Also, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1382634 stopped upgrading etcd during backups for non embedded installs too. You can see those changes here and I believe they would've fixed the issue of upgrading from 3.2 to 3.3 as well. 

https://github.com/openshift/openshift-ansible/pull/2745

Given that I believe that issue is already fixed in 3.3, I'm moving this to 3.4 and providing the fix in comment 5.

Comment 7 openshift-github-bot 2016-11-09 21:49:53 UTC

Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/bd120d5cc460fa0c0d42c388dda00c6f15ee76cd
Don't upgrade etcd on backup operations

Fixes Bug 1393187
Fixes BZ1393187

Comment 9 Scott Dodson 2016-11-10 01:05:34 UTC

*** Bug 1391935 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2017-01-18 12:51:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
