Bug 1393187 - etcd cluster is unavailable or misconfigured during upgrade
Summary: etcd cluster is unavailable or misconfigured during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Scott Dodson
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-09 04:18 UTC by Anping Li
Modified: 2017-05-04 08:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the upgrade playbook would inadvertently upgrade etcd when it should not have. If this triggered an upgrade to etcd3, the OpenShift upgrade would fail because etcd became unavailable. We no longer upgrade etcd when it is not necessary, ensuring that upgrades proceed successfully.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:51:02 UTC
Target Upstream Version:


Attachments
ansible logs (117.26 KB, text/plain)
2016-11-09 04:18 UTC, Anping Li


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Anping Li 2016-11-09 04:18:34 UTC
Created attachment 1218784 [details]
ansible logs

Description of problem:
The upgrade failed to evacuate some pods.

Version-Release number of selected component (if applicable):



How reproducible:
once

Steps to Reproduce:
1. Install OCP 3.2
2. Upgrade to OCP 3.3
  ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml


Actual results:
NAME                            READY     STATUS    RESTARTS   AGE
cakephp-mysql-example-1-75dy2   1/1       Running   0          19d
mysql-1-fass1                   1/1       Running   0          19d


STDERR:


Migrating these pods on node: openshift-181.lab.eng.nay.redhat.com

E1108 22:12:43.117701   18552 evacuate.go:135] Unable to delete a pod: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:cakephp-mysql-example-1-build GenerateName: Namespace:cakephp SelfLink:/api/v1/namespaces/cakephp/pods/cakephp-mysql-example-1-build UID:d946d280-969f-11e6-93a6-fa163e309439 ResourceVersion:1075 Generation:0 CreationTimestamp:{Time:2016-10-20 04:33:17 -0400 EDT} DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[openshift.io/build.name:cakephp-mysql-example-1] Annotations:map[openshift.io/build.name:cakephp-mysql-example-1 openshift.io/scc:privileged] OwnerReferences:[] Finalizers:[]} Spec:{Volumes:[{Name:docker-socket VolumeSource:{HostPath:0xc8202c4e70 EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:<nil> NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-dockercfg-1px3r-push VolumeSource:{HostPath:<nil> EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8210 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-token-fn0ss VolumeSource:{HostPath:<nil> EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8240 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}}] InitContainers:[] Containers:[{Name:sti-build Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 Command:[] Args:[--loglevel=2] WorkingDir: Ports:[] Env:[{Name:BUILD 
Value:{"kind":"Build","apiVersion":"v1","metadata":{"name":"cakephp-mysql-example-1","namespace":"cakephp","selfLink":"/oapi/v1/namespaces/cakephp/builds/cakephp-mysql-example-1","uid":"d91d0fbe-969f-11e6-93a6-fa163e309439","resourceVersion":"954","creationTimestamp":"2016-10-20T08:33:16Z","labels":{"app":"cakephp-mysql-example","buildconfig":"cakephp-mysql-example","openshift.io/build-config.name":"cakephp-mysql-example","template":"cakephp-mysql-example"},"annotations":{"openshift.io/build-config.name":"cakephp-mysql-example","openshift.io/build.number":"1"}},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/openshift/cakephp-ex.git"},"secrets":null},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/rhscl/php-56-rhel7@sha256:743108b04515500100a0b3d170f23474fadb7ed94497d5556e48691f931bb619"},"env":[{"name":"COMPOSER_MIRROR"}]}},"output":{"to":{"kind":"DockerImage","name":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest"},"pushSecret":{"name":"builder-dockercfg-1px3r"}},"resources":{},"postCommit":{}},"status":{"phase":"New","outputDockerImageReference":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest","config":{"kind":"BuildConfig","namespace":"cakephp","name":"cakephp-mysql-example"}}}
 ValueFrom:<nil>} {Name:BUILD_LOGLEVEL Value:2 ValueFrom:<nil>} {Name:SOURCE_REPOSITORY Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} {Name:SOURCE_URI Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} {Name:ORIGIN_VERSION Value:v3.2.1.15 ValueFrom:<nil>} {Name:ALLOWED_UIDS Value:1- ValueFrom:<nil>} {Name:DROP_CAPS Value:KILL,MKNOD,SETGID,SETUID,SYS_CHROOT ValueFrom:<nil>} {Name:PUSH_DOCKERCFG_PATH Value:/var/run/secrets/openshift.io/push ValueFrom:<nil>}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:docker-socket ReadOnly:false MountPath:/var/run/docker.sock SubPath:} {Name:builder-dockercfg-1px3r-push ReadOnly:true MountPath:/var/run/secrets/openshift.io/push SubPath:} {Name:builder-token-fn0ss ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:<nil> ReadinessProbe:<nil> Lifecycle:<nil> TerminationMessagePath:/dev/termination-log ImagePullPolicy:IfNotPresent SecurityContext:0xc8205c8270 Stdin:false StdinOnce:false TTY:false}] RestartPolicy:Never TerminationGracePeriodSeconds:0xc820353b80 ActiveDeadlineSeconds:<nil> DNSPolicy:ClusterFirst NodeSelector:map[] ServiceAccountName:builder NodeName:openshift-181.lab.eng.nay.redhat.com SecurityContext:0xc8207ae8c0 ImagePullSecrets:[{Name:builder-dockercfg-1px3r}] Hostname: Subdomain:} Status:{Phase:Succeeded Conditions:[{Type:Ready Status:False LastProbeTime:{Time:0001-01-01 00:00:00 +0000 UTC} LastTransitionTime:{Time:2016-10-20 04:36:55 -0400 EDT} Reason:PodCompleted Message:}] Message: Reason: HostIP:10.66.147.181 PodIP:10.1.0.3 StartTime:2016-10-20T04:33:17-04:00 InitContainerStatuses:[] ContainerStatuses:[{Name:sti-build State:{Waiting:<nil> Running:<nil> Terminated:0xc82042e620} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 
ImageID:docker://sha256:bedb99b947a662e46c35b18e798300bf714b71b92a839ed5e15cc129dd352300 ContainerID:docker://da7641ffc2599c60d3cdd7757532761590a6f40d65655b1500885e07c71f666e}]}}, error: client: etcd cluster is unavailable or misconfigured
Error from server: client: etcd cluster is unavailable or misconfigured

NO MORE HOSTS LEFT *************************************************************
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.retry

PLAY RECAP *********************************************************************
localhost                  : ok=29   changed=14   unreachable=0    failed=0   
openshift-181.lab.eng.nay.redhat.com : ok=264  changed=48   unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=59   changed=4    unreachable=0    failed=0   


Expected results:


Additional info:

Comment 2 liujia 2016-11-09 05:09:32 UTC
I hit the same error: "client: etcd cluster is unavailable or misconfigured".

Description of problem:
When upgrading an existing 3.3 installation to the latest 3.3 using the 3.4 quick installer, the upgrade fails at the [restart master] handler. The master log shows: cacher.go:220] unexpected ListAndWatch error: pkg/storage/cacher.go:163: Failed to list *api.Group: client: etcd cluster is unavailable or misconfigured.

The master service status is "activating", but the service never becomes usable:
# oc get node
The connection to the server 192.168.2.184:8443 was refused - did you specify the right host or port?

Restarting the master service manually also fails.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.17-1.git.0.4698b0c.el7.noarch
openshift-ansible-playbooks-3.4.17-1.git.0.4698b0c.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP 3.3 with the 3.3 quick installer.
2. Run the upgrade with the 3.4 quick installer:
# atomic-openshift-installer -d -c /tmp/installer.cfg.yml upgrade

This tool will help you upgrade your existing OpenShift installation.
        Currently running: openshift-enterprise 3.3

(1) Update to latest 3.3
(2) Upgrade to next release: 3.4

Choose an option from above:

3. Choose 1.
4. The installer then runs the 3.3 upgrade playbook:
installer - DEBUG - Going to subprocess out to ansible now with these args: ansible-playbook --inventory-file=/tmp/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results:
The 3.3 minor upgrade fails.
RUNNING HANDLER [restart master] ***********************************************
fatal: [openshift-151.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.
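
To confirm that a failed master restart like the one above is caused by the etcd client error rather than something else, the master journal can be searched for that message. This is a minimal sketch, not something taken from the report: the function name count_etcd_errors is hypothetical, and it assumes the master runs as the atomic-openshift-master systemd unit named in the error message.

```shell
#!/bin/sh
# Hypothetical helper: count log lines on stdin that contain the
# etcd client error quoted in this bug report.
count_etcd_errors() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'etcd cluster is unavailable or misconfigured'
}
```

On an affected master it could be used as: journalctl -u atomic-openshift-master --no-pager | count_etcd_errors. A non-zero count suggests the master cannot reach etcd, which matches the symptom reported here.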

Comment 4 Anping Li 2016-11-09 05:23:00 UTC
Just a comment: etcd had already been upgraded to etcd3 before the error appeared. I have hit this issue twice so far, but it does not happen every time.

Comment 5 Scott Dodson 2016-11-09 20:33:07 UTC
Yeah, I'm pretty sure this is happening because the backup step currently upgrades etcd when we don't really intend to do so. We're only installing it for backup purposes on embedded etcd environments, where it wouldn't already be installed. The reason you're seeing it sometimes but not others is likely that some of your hosts have RHEL 7.3 GA repos and some have 7.3.1 repos. In RHEL 7.3.1, etcd3 now obsoletes etcd, so it would be seen as an upgrade.

So for now, stop upgrading etcd when backing up etcd.

https://github.com/openshift/openshift-ansible/pull/2773
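
One way to check the explanation in this comment is to compare which etcd package each host ended up with. The sketch below is only illustrative and is not part of the fix: the function name classify_etcd_pkg is hypothetical, and it assumes the package names etcd and etcd3 mentioned above and the usual name-version-release output of rpm -q.

```shell
#!/bin/sh
# Hypothetical helper: classify an "rpm -q" style package string
# (name-version-release.arch) by etcd package name and major version.
classify_etcd_pkg() {
    case "$1" in
        etcd3-*)  echo "etcd3 package" ;;   # the separate etcd3 package
        etcd-3.*) echo "etcd 3.x" ;;        # etcd package at major version 3
        etcd-2.*) echo "etcd 2.x" ;;        # etcd package at major version 2
        *)        echo "unknown" ;;         # not installed or unexpected name
    esac
}
```

Run on each host as, for example, classify_etcd_pkg "$(rpm -q etcd3 || rpm -q etcd)". Hosts that report different results after the same upgrade run would be consistent with the mixed-repository explanation.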

Comment 6 Scott Dodson 2016-11-09 20:38:13 UTC
Also, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1382634 stopped upgrading etcd during backups for non-embedded installs too. You can see those changes here, and I believe they would have fixed the issue when upgrading from 3.2 to 3.3 as well.

https://github.com/openshift/openshift-ansible/pull/2745

Since I believe that issue is already fixed in 3.3, I'm moving this bug to 3.4 and providing the fix referenced in comment 5.

Comment 7 openshift-github-bot 2016-11-09 21:49:53 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/bd120d5cc460fa0c0d42c388dda00c6f15ee76cd
Don't upgrade etcd on backup operations

Fixes Bug 1393187
Fixes BZ1393187

Comment 9 Scott Dodson 2016-11-10 01:05:34 UTC
*** Bug 1391935 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2017-01-18 12:51:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

