Created attachment 1218784 [details]
ansible logs

Description of problem:
The upgrade failed to evacuate some pods.

Version-Release number of selected component (if applicable):

How reproducible:
onetime

Steps to Reproduce:
1. Install OCP 3.2.
2. Upgrade to OCP 3.3:
   ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results:
NAME                            READY     STATUS    RESTARTS   AGE
cakephp-mysql-example-1-75dy2   1/1       Running   0          19d
mysql-1-fass1                   1/1       Running   0          19d

STDERR: Migrating these pods on node: openshift-181.lab.eng.nay.redhat.com
E1108 22:12:43.117701 18552 evacuate.go:135] Unable to delete a pod: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:cakephp-mysql-example-1-build GenerateName: Namespace:cakephp SelfLink:/api/v1/namespaces/cakephp/pods/cakephp-mysql-example-1-build UID:d946d280-969f-11e6-93a6-fa163e309439 ResourceVersion:1075 Generation:0 CreationTimestamp:{Time:2016-10-20 04:33:17 -0400 EDT} DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[openshift.io/build.name:cakephp-mysql-example-1] Annotations:map[openshift.io/build.name:cakephp-mysql-example-1 openshift.io/scc:privileged] OwnerReferences:[] Finalizers:[]} Spec:{Volumes:[{Name:docker-socket VolumeSource:{HostPath:0xc8202c4e70 EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:<nil> NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-dockercfg-1px3r-push VolumeSource:{HostPath:<nil> EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8210 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}} {Name:builder-token-fn0ss VolumeSource:{HostPath:<nil>
EmptyDir:<nil> GCEPersistentDisk:<nil> AWSElasticBlockStore:<nil> GitRepo:<nil> Secret:0xc8205c8240 NFS:<nil> ISCSI:<nil> Glusterfs:<nil> PersistentVolumeClaim:<nil> RBD:<nil> FlexVolume:<nil> Cinder:<nil> CephFS:<nil> Flocker:<nil> DownwardAPI:<nil> FC:<nil> AzureFile:<nil> ConfigMap:<nil> VsphereVolume:<nil>}}] InitContainers:[] Containers:[{Name:sti-build Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 Command:[] Args:[--loglevel=2] WorkingDir: Ports:[] Env:[{Name:BUILD Value:{"kind":"Build","apiVersion":"v1","metadata":{"name":"cakephp-mysql-example-1","namespace":"cakephp","selfLink":"/oapi/v1/namespaces/cakephp/builds/cakephp-mysql-example-1","uid":"d91d0fbe-969f-11e6-93a6-fa163e309439","resourceVersion":"954","creationTimestamp":"2016-10-20T08:33:16Z","labels":{"app":"cakephp-mysql-example","buildconfig":"cakephp-mysql-example","openshift.io/build-config.name":"cakephp-mysql-example","template":"cakephp-mysql-example"},"annotations":{"openshift.io/build-config.name":"cakephp-mysql-example","openshift.io/build.number":"1"}},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/openshift/cakephp-ex.git"},"secrets":null},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/rhscl/php-56-rhel7@sha256:743108b04515500100a0b3d170f23474fadb7ed94497d5556e48691f931bb619"},"env":[{"name":"COMPOSER_MIRROR"}]}},"output":{"to":{"kind":"DockerImage","name":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest"},"pushSecret":{"name":"builder-dockercfg-1px3r"}},"resources":{},"postCommit":{}},"status":{"phase":"New","outputDockerImageReference":"172.30.60.12:5000/cakephp/cakephp-mysql-example:latest","config":{"kind":"BuildConfig","namespace":"cakephp","name":"cakephp-mysql-example"}}} ValueFrom:<nil>} {Name:BUILD_LOGLEVEL Value:2 ValueFrom:<nil>} {Name:SOURCE_REPOSITORY Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} 
{Name:SOURCE_URI Value:https://github.com/openshift/cakephp-ex.git ValueFrom:<nil>} {Name:ORIGIN_VERSION Value:v3.2.1.15 ValueFrom:<nil>} {Name:ALLOWED_UIDS Value:1- ValueFrom:<nil>} {Name:DROP_CAPS Value:KILL,MKNOD,SETGID,SETUID,SYS_CHROOT ValueFrom:<nil>} {Name:PUSH_DOCKERCFG_PATH Value:/var/run/secrets/openshift.io/push ValueFrom:<nil>}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:docker-socket ReadOnly:false MountPath:/var/run/docker.sock SubPath:} {Name:builder-dockercfg-1px3r-push ReadOnly:true MountPath:/var/run/secrets/openshift.io/push SubPath:} {Name:builder-token-fn0ss ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:<nil> ReadinessProbe:<nil> Lifecycle:<nil> TerminationMessagePath:/dev/termination-log ImagePullPolicy:IfNotPresent SecurityContext:0xc8205c8270 Stdin:false StdinOnce:false TTY:false}] RestartPolicy:Never TerminationGracePeriodSeconds:0xc820353b80 ActiveDeadlineSeconds:<nil> DNSPolicy:ClusterFirst NodeSelector:map[] ServiceAccountName:builder NodeName:openshift-181.lab.eng.nay.redhat.com SecurityContext:0xc8207ae8c0 ImagePullSecrets:[{Name:builder-dockercfg-1px3r}] Hostname: Subdomain:} Status:{Phase:Succeeded Conditions:[{Type:Ready Status:False LastProbeTime:{Time:0001-01-01 00:00:00 +0000 UTC} LastTransitionTime:{Time:2016-10-20 04:36:55 -0400 EDT} Reason:PodCompleted Message:}] Message: Reason: HostIP:10.66.147.181 PodIP:10.1.0.3 StartTime:2016-10-20T04:33:17-04:00 InitContainerStatuses:[] ContainerStatuses:[{Name:sti-build State:{Waiting:<nil> Running:<nil> Terminated:0xc82042e620} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-sti-builder:v3.2.1.15 ImageID:docker://sha256:bedb99b947a662e46c35b18e798300bf714b71b92a839ed5e15cc129dd352300 ContainerID:docker://da7641ffc2599c60d3cdd7757532761590a6f40d65655b1500885e07c71f666e}]}}, error: client: etcd cluster is 
unavailable or misconfigured
Error from server: client: etcd cluster is unavailable or misconfigured

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.retry

PLAY RECAP *********************************************************************
localhost                            : ok=29   changed=14   unreachable=0   failed=0
openshift-181.lab.eng.nay.redhat.com : ok=264  changed=48   unreachable=0   failed=1
openshift-182.lab.eng.nay.redhat.com : ok=59   changed=4    unreachable=0   failed=0

Expected results:

Additional info:
The same error: client: etcd cluster is unavailable or misconfigured

Description of problem:
When upgrading the current 3.3 to the latest 3.3 with the 3.4 quick installer, the upgrade fails on [restart master] with:

cacher.go:220] unexpected ListAndWatch error: pkg/storage/cacher.go:163: Failed to list *api.Group: client: etcd cluster is unavailable or misconfigured

The master service status is "activating", but it does not work:

# oc get node
The connection to the server 192.168.2.184:8443 was refused - did you specify the right host or port?

Restarting the master service manually also fails.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.17-1.git.0.4698b0c.el7.noarch
openshift-ansible-playbooks-3.4.17-1.git.0.4698b0c.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP 3.3 with the 3.3 quick installer.
2. Run the upgrade with the 3.4 quick installer:
   # atomic-openshift-installer -d -c /tmp/installer.cfg.yml upgrade
   This tool will help you upgrade your existing OpenShift installation.
   Currently running: openshift-enterprise 3.3
   (1) Update to latest 3.3
   (2) Upgrade to next release: 3.4
   Choose an option from above:
3. Choose 1.
4. It continues with the 3.3 upgrade playbook:
   installer - DEBUG - Going to subprocess out to ansible now with these args: ansible-playbook --inventory-file=/tmp/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml

Actual results:
The 3.3 minor upgrade fails.

RUNNING HANDLER [restart master] ***********************************************
fatal: [openshift-151.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false,
    "failed": true
}

MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.
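For anyone else hitting this, a rough triage sketch for the failure above. The endpoint, port, and unit name are taken from this environment and the etcd2-era `etcdctl` syntax; adjust for your hosts. Each step is guarded so the script can be read or dry-run on a machine without the tooling installed:

```shell
#!/bin/sh
# Hedged triage sketch for "etcd cluster is unavailable or misconfigured".
# Endpoint 192.168.2.184:4001 and the unit name are assumptions from this
# bug report, not universal defaults.

run() {
  # Run a command only when its binary exists; otherwise note the skip.
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped: $* (command not found)"
  fi
}

# 1. Ask etcd itself whether the cluster is reachable (etcdctl v2 syntax).
run etcdctl --endpoints https://192.168.2.184:4001 cluster-health

# 2. Inspect the master service that timed out on restart.
run systemctl status atomic-openshift-master.service

# 3. Check which etcd package is installed; an unexpected etcd3 here means
#    something pulled in the obsoleting RPM from the RHEL 7.3.1 repos.
run rpm -q etcd etcd3

echo "triage complete"
```

If step 3 shows etcd3 installed where you expected etcd, that matches the upgrade-during-backup behavior discussed below in this bug.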
Just a comment: etcd had been upgraded to etcd3 before the error appeared. I have hit this issue twice so far, but not every time.
Yeah, I'm pretty sure this is happening because the backup step currently upgrades etcd when we don't really intend to do so. We're only installing it for backup purposes on embedded-etcd environments where it wouldn't already be installed. The reason you're seeing it sometimes but not others is likely that some of your hosts have RHEL 7.3 GA repos while others have 7.3.1 repos. In RHEL 7.3.1, etcd3 now obsoletes etcd, so installing it would be seen as an upgrade. So for now, stop upgrading etcd when backing up etcd. https://github.com/openshift/openshift-ansible/pull/2773
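For context, the shape of that change is roughly the following. This is a minimal sketch of the idea only (the linked PR is the authoritative diff); the task name is illustrative and the sketch assumes the generic Ansible package module:

```yaml
# Sketch: ensure etcdctl is available for the backup without upgrading etcd.
# state: present leaves an already-installed etcd alone, so the etcd3 RPM
# that Obsoletes etcd in RHEL 7.3.1 is never pulled in mid-upgrade.
- name: Install etcd for backup purposes only
  package:
    name: etcd
    state: present   # state: latest would trigger the obsoleting etcd3 swap
```

The key point is `present` vs `latest`: with `latest`, yum resolves the Obsoletes and replaces etcd with etcd3, which is exactly the unintended upgrade seen here.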
Also, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1382634 stopped upgrading etcd during backups for non-embedded installs too. You can see those changes here, and I believe they would also have fixed this issue when upgrading from 3.2 to 3.3: https://github.com/openshift/openshift-ansible/pull/2745 Given that I believe that issue is already fixed in 3.3, I'm moving this to 3.4 and providing the fix in comment 5.
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/bd120d5cc460fa0c0d42c388dda00c6f15ee76cd
Don't upgrade etcd on backup operations

Fixes Bug 1393187
*** Bug 1391935 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066