Description of problem:
Upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4 failed with a controller version mismatch for rendered-master, and the masters are stuck on Kubernetes 1.16.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-18-102708

How reproducible:
Always

Steps to Reproduce:
1. Upgrade a cluster from 4.1 -> 4.2 -> 4.3 -> 4.4 (see the sketch after this section).
2. Add resources to etcd on 4.1 and after each upgrade.

Actual results:
Upgrade from 4.3 -> 4.4 fails.

Expected results:
Upgrade should be successful.

Additional info:
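For reference, a chained upgrade like the one in step 1 can be driven from the CLI roughly as follows. This is a minimal sketch: the channel name and the CI registry path for the nightly payload are assumptions, not taken from this report.

# Move the cluster to the next minor's channel, then upgrade (hypothetical channel name):
$ oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.2"}}'
$ oc adm upgrade --to-latest=true
# For a nightly payload, upgrade directly to the release image (assumed registry path):
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-18-102708 --force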
Your masters are degraded:

'Node ip-10-0-158-46.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3", Node ip-10-0-170-63.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3", Node ip-10-0-131-102.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3"'

It looks like whatever changes you made for your etcd resources in step 2 caused a problem:

2020-03-18T21:37:07.172587369Z E0318 21:37:07.172513 127376 daemon.go:1354] content mismatch for file /usr/local/bin/etcd-snapshot-restore.sh: #!/usr/bin/env bash

What exactly did you do, and how did you do it? It seems that /usr/local/bin/etcd-snapshot-restore.sh was changed, and I don't know why that file would have been touched at all.
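To see what the machine-config daemon expected versus what is actually on disk, something like the following can be used. This is a sketch; the jq path into the rendered config's Ignition file entry is an assumption about its layout.

# Extract the expected contents of the file from the rendered MachineConfig:
$ oc get machineconfig rendered-master-c6d0551802212ac9ccd21c1aa37cecf3 -o json \
    | jq -r '.spec.config.storage.files[]
             | select(.path == "/usr/local/bin/etcd-snapshot-restore.sh")
             | .contents.source'
# Compare the decoded contents with the copy on one of the degraded nodes:
$ oc debug node/ip-10-0-158-46.us-west-2.compute.internal -- chroot /host cat /usr/local/bin/etcd-snapshot-restore.sh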
There seem to be some errors in the cluster-etcd-operator:

  message: |
    onn.go:577] ClientConn switching balancer to "pick_first"
    I0318 21:24:13.487355 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc002f47450, CONNECTING
    I0318 21:24:13.494538 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc002f47450, READY
    I0318 21:24:13.495305 1 client.go:361] parsed scheme: "passthrough"
    I0318 21:24:13.495329 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.170.63:2379 0 <nil>}] <nil>}
    I0318 21:24:13.495335 1 clientconn.go:577] ClientConn switching balancer to "pick_first"
    I0318 21:24:13.495364 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
    I0318 21:24:13.495373 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003382370, CONNECTING
    I0318 21:24:13.501195 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003382370, READY
    I0318 21:24:13.501905 1 client.go:361] parsed scheme: "passthrough"
    I0318 21:24:13.501931 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.131.102:2379 0 <nil>}] <nil>}
    I0318 21:24:13.501937 1 clientconn.go:577] ClientConn switching balancer to "pick_first"
    I0318 21:24:13.501969 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003293cd0, CONNECTING
    I0318 21:24:13.501985 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
    I0318 21:24:13.511589 1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003293cd0, READY
    I0318 21:24:13.513119 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
    I0318 21:24:13.622046 1 etcdcli.go:95] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
    F0318 21:24:13.623791 1 leaderelection.go:67] leaderelection lost
  reason: Error

Also see:

- lastTransitionTime: "2020-03-19T09:46:15Z"
  message: ip-10-0-158-46.us-west-2.compute.internal,ip-10-0-170-63.us-west-2.compute.internal,ip-10-0-131-102.us-west-2.compute.internal members are available, have not started, are unhealthy, are unknown
  reason: EtcdQuorate
  status: "True"
  type: EtcdMembersAvailable

Whatever you did in step 2 ("adding resources to etcd") seems to have caused problems. I will pass this to the etcd team to take a look once you provide further information so they can troubleshoot.
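To check the annotation the operator is complaining about, something along these lines can be used. This is a sketch; the openshift-etcd namespace for the host-etcd-2 service is an assumption.

# Inspect the annotations on the service the operator reads (assumed namespace):
$ oc get service host-etcd-2 -n openshift-etcd -o jsonpath='{.metadata.annotations}'
# If alpha.installer.openshift.io/etcd-bootstrap is absent, the etcdcli.go:95 message above is logged.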
I ran the following yaml with cluster-loader.py from https://github.com/openshift/svt/tree/master/openshift_scalability. I ran the same after each upgrade, so there are 150 such projects on that cluster now.

# cat upgrade.yaml
projects:
  - num: 50
    basename: svt-43-
    templates:
      - num: 6
        file: ./content/build-template.json
      - num: 10
        file: ./content/image-stream-template.json
      - num: 2
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 10
        file: ./content/ssh-secret-template.json
      - num: 5
        file: ./content/route-template.json
      - num: 10
        file: ./content/configmap-template.json
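For anyone reproducing this, the invocation looks roughly like the following. This is a sketch; the -f flag for passing the config file is an assumption based on the svt repo's usual usage.

$ git clone https://github.com/openshift/svt.git
$ cd svt/openshift_scalability
# Run once per release in the upgrade chain to accumulate objects in etcd:
$ python cluster-loader.py -f upgrade.yaml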
*** Bug 1815468 has been marked as a duplicate of this bug. ***
See bug 1815468 for more details on the same issue
I also hit this issue when upgrading from 4.3.0-0.nightly-2020-03-18-011727 to 4.4.0-0.nightly-2020-03-18-102708 on a disconnected UPI-on-baremetal install.
Verified on baremetal: an upgrade from 4.3.5 to 4.4.0-0.nightly-2020-03-23-010639 and then to 4.5.0-0.nightly-2020-03-23-124202 succeeded. Resources were added to etcd before each upgrade, and all components were verified to work well after the upgrade.
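For reference, the post-upgrade verification amounts to checks along these lines (a sketch of typical commands, not the exact steps used here):

$ oc get clusterversion      # desired version reported, upgrade Completed
$ oc get nodes               # all masters and workers Ready on the new kubelet version
$ oc get clusteroperators    # no operator Degraded or stuck Progressing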
I get the following errors after upgrading on Azure and AWS from 4.3.21 to 4.4.5:

2020-05-29 14:02:01.781098 I | embed: rejected connection from "10.0.147.100:39738" (error "EOF", ServerName "")
2020-05-29 14:02:01.781168 I | embed: rejected connection from "10.0.147.100:39740" (error "EOF", ServerName "")
2020-05-29 14:03:01.745965 I | embed: rejected connection from "10.0.147.100:40426" (error "EOF", ServerName "")
2020-05-29 14:04:01.614448 I | embed: rejected connection from "10.0.147.100:41044" (error "EOF", ServerName "")
2020-05-29 14:04:01.614600 I | embed: rejected connection from "10.0.147.100:1058" (error "EOF", ServerName "")
2020-05-29 14:05:01.756987 I | embed: rejected connection from "10.0.147.100:41762" (error "EOF", ServerName "")

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-167-77.eu-north-1.compute.internal is unhealthy"

I0529 14:14:07.605023 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.147.100:2379 0 <nil>} {https://10.0.167.77:2379 0 <nil>} {https://10.0.142.180:2379 0 <nil>}]
I0529 14:14:07.621794 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:07.621986 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:07.622105 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:10.608506 1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:13.616833 1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:13.617328 1 client.go:361] parsed scheme: "endpoint"
I0529 14:14:13.617384 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.142.180:2379 0 <nil>} {https://10.0.147.100:2379 0 <nil>} {https://10.0.167.77:2379 0 <nil>}]
I0529 14:14:13.631203 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:13.631221 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:13.631211 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.628449 1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:16.629401 1 client.go:361] parsed scheme: "endpoint"
I0529 14:14:16.629456 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.167.77:2379 0 <nil>} {https://10.0.142.180:2379 0 <nil>} {https://10.0.147.100:2379 0 <nil>}]
I0529 14:14:16.651758 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.651854 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.651924 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
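To confirm whether a member is actually unhealthy rather than just noisy client-balancer logging, the endpoints can be checked directly. This is a sketch; the etcd pod name (derived from the node name) and the etcdctl environment inside the pod are assumptions.

$ oc get pods -n openshift-etcd -l app=etcd
$ oc rsh -n openshift-etcd etcd-ip-10-0-167-77.eu-north-1.compute.internal   # hypothetical pod name
sh-4.4# etcdctl endpoint health --cluster -w table
sh-4.4# etcdctl member list -w table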
> 2020-05-29 14:02:01.781098 I | embed: rejected connection from "10.0.147.100:39738" (error "EOF", ServerName "")

This is info-level logging ("I |"), not an error; it is the result of teardown of the etcd client balancer. We are working on cleaning up these logs.

> Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-167-77.eu-north-1.compute.internal is unhealthy"

This was a bug [1] that we patched; the fix should be part of 4.4.7. The bug resulted in invalid health checks against etcd members. The cluster has an etcdMemberDown alert that will fire if a member is actually down.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1838781
Think I hit this for an OSD cluster upgraded from 4.3.25 to 4.4.11 today. No alerts, no degraded clusteroperators. 3 workers were not updated, one with the following machineconfiguration annotations:

$ oc get node ip-10-0-132-179.ca-central-1.compute.internal -o json | jq -r '.metadata.annotations' | grep machineconfig
  "machineconfiguration.openshift.io/currentConfig": "rendered-worker-e510f353841a3261398328ba745b7969",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-0a077ea3b17f66f31929fe0e2608b77c",
  "machineconfiguration.openshift.io/reason": "unexpected on-disk state validating against rendered-worker-0a077ea3b17f66f31929fe0e2608b77c",
  "machineconfiguration.openshift.io/state": "Degraded",

All other nodes are in state="Done". I'll post the must-gather once it's uploaded. I will be replacing this worker by deleting the related Machine and letting MCC do its thing.
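The replacement flow described above looks roughly like this. A sketch: the machine name is hypothetical and must first be matched to the degraded node.

# Find the Machine backing the degraded node (the node carries a machine annotation):
$ oc get node ip-10-0-132-179.ca-central-1.compute.internal \
    -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
# Delete it; the MachineSet controller provisions a replacement that boots
# directly into the desired rendered worker config:
$ oc delete machine -n openshift-machine-api <machine-name-from-above>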
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475