Bug 1815203

Summary: upgrade from 4.1->4.2->4.3->4.4 fails: unexpected on-disk state validating against rendered-master
Product: OpenShift Container Platform
Component: Etcd Operator
Version: 4.4
Target Release: 4.5.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Roshni <rpattath>
Assignee: Sam Batschelet <sbatsche>
QA Contact: Roshni <rpattath>
CC: ccoleman, dmace, geliu, jialiu, kborup, kgarriso, mifiedle, minmli, nmalik
Keywords: ServiceDeliveryImpact
Doc Type: No Doc Update
Type: Bug
Regression: ---
Bug Blocks: 1815539
Last Closed: 2020-08-04 18:06:11 UTC

Description Roshni 2020-03-19 18:43:27 UTC
Description of problem:
upgrade from 4.1->4.2->4.3->4.4 failed with a controller version mismatch for rendered-master; the masters are stuck on Kubernetes 1.16

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-18-102708

How reproducible:
always

Steps to Reproduce:
1. Upgrade a cluster from 4.1->4.2->4.3->4.4 (a sketch of the upgrade invocations is under Additional info below).
2. Add resources to etcd on 4.1 and again after each upgrade.

Actual results:
Upgrade from 4.3 -> 4.4 fails

Expected results:
Upgrade should be successful

Additional info:
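A minimal sketch of the upgrade invocations implied by the steps above; the version placeholders and the flags for the nightly hop are assumptions, since the exact commands were not captured here:

# Hop through released channels one minor version at a time (versions are placeholders).
oc adm upgrade --to <4.2.z>
oc adm upgrade --to <4.3.z>

# The final hop targets a nightly payload, so an explicit release image is needed
# (pullspec is a placeholder; --allow-explicit-upgrade and --force are assumed to be
# required because nightly images are not served from the configured channel).
oc adm upgrade --to-image <4.4-nightly-release-pullspec> --allow-explicit-upgrade --force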

Comment 2 Kirsten Garrison 2020-03-19 19:20:31 UTC
Your masters are degraded:

'Node ip-10-0-158-46.us-west-2.compute.internal is reporting: "unexpected
      on-disk state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3",
      Node ip-10-0-170-63.us-west-2.compute.internal is reporting: "unexpected on-disk
      state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3",
      Node ip-10-0-131-102.us-west-2.compute.internal is reporting: "unexpected on-disk
      state validating against rendered-master-c6d0551802212ac9ccd21c1aa37cecf3"'


It looks like whatever changes you made for your etcd resources in step 2 caused a problem:
2020-03-18T21:37:07.172587369Z E0318 21:37:07.172513  127376 daemon.go:1354] content mismatch for file /usr/local/bin/etcd-snapshot-restore.sh: #!/usr/bin/env bash

What exactly did you do, and how did you do it? It seems that /usr/local/bin/etcd-snapshot-restore.sh was changed, and I don't know why that file would have been touched at all.
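A minimal sketch of how the on-disk script can be compared with what the rendered MachineConfig expects (the jq path assumes the Ignition 2.x layout these releases use, and the decode step assumes the URL-encoded "data:," form rather than base64):

# Extract the expected contents of the script from the rendered MachineConfig.
oc get machineconfig rendered-master-c6d0551802212ac9ccd21c1aa37cecf3 -o json \
  | jq -r '.spec.config.storage.files[]
           | select(.path == "/usr/local/bin/etcd-snapshot-restore.sh")
           | .contents.source' \
  | sed 's/^data:,//' \
  | python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))' \
  > /tmp/expected.sh

# Grab what is actually on the node via a debug pod.
oc debug node/ip-10-0-158-46.us-west-2.compute.internal -- \
  chroot /host cat /usr/local/bin/etcd-snapshot-restore.sh > /tmp/on-disk.sh

# Any difference here is what the machine-config-daemon is complaining about.
diff -u /tmp/expected.sh /tmp/on-disk.sh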

Comment 3 Kirsten Garrison 2020-03-19 19:27:38 UTC
There seem to be some errors in the cluster-etcd-operator:

        message: |
          onn.go:577] ClientConn switching balancer to "pick_first"
          I0318 21:24:13.487355       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc002f47450, CONNECTING
          I0318 21:24:13.494538       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc002f47450, READY
          I0318 21:24:13.495305       1 client.go:361] parsed scheme: "passthrough"
          I0318 21:24:13.495329       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.170.63:2379 0  <nil>}] <nil>}
          I0318 21:24:13.495335       1 clientconn.go:577] ClientConn switching balancer to "pick_first"
          I0318 21:24:13.495364       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
          I0318 21:24:13.495373       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003382370, CONNECTING
          I0318 21:24:13.501195       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003382370, READY
          I0318 21:24:13.501905       1 client.go:361] parsed scheme: "passthrough"
          I0318 21:24:13.501931       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.131.102:2379 0  <nil>}] <nil>}
          I0318 21:24:13.501937       1 clientconn.go:577] ClientConn switching balancer to "pick_first"
          I0318 21:24:13.501969       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003293cd0, CONNECTING
          I0318 21:24:13.501985       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
          I0318 21:24:13.511589       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc003293cd0, READY
          I0318 21:24:13.513119       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
          I0318 21:24:13.622046       1 etcdcli.go:95] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
          F0318 21:24:13.623791       1 leaderelection.go:67] leaderelection lost
        reason: Error

Also see:
- lastTransitionTime: "2020-03-19T09:46:15Z"
    message: ip-10-0-158-46.us-west-2.compute.internal,ip-10-0-170-63.us-west-2.compute.internal,ip-10-0-131-102.us-west-2.compute.internal
      members are available,  have not started,  are unhealthy,  are unknown
    reason: EtcdQuorate
    status: "True"
    type: EtcdMembersAvailable

Whatever you did in step 2 ("adding resources to etcd") seems to have caused problems. I will pass this to the etcd team to take a look once you provide further information so they can troubleshoot.
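A minimal sketch of how the conditions quoted above can be checked directly; the openshift-etcd namespace, the pod naming, and the etcdctl sidecar container are assumptions based on how these operands are normally laid out in 4.4:

# Inspect the etcd clusteroperator conditions (EtcdMembersAvailable, Degraded, ...).
oc get clusteroperator etcd -o yaml

# Check whether the annotation the operator logs complain about is really missing.
oc -n openshift-etcd get service host-etcd-2 -o jsonpath='{.metadata.annotations}{"\n"}'

# Ask etcd itself about member health (container name assumed to be etcdctl).
oc -n openshift-etcd rsh -c etcdctl etcd-ip-10-0-158-46.us-west-2.compute.internal \
  etcdctl endpoint health --cluster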

Comment 4 Roshni 2020-03-19 19:49:30 UTC
I ran the following YAML with cluster-loader.py from https://github.com/openshift/svt/tree/master/openshift_scalability (an invocation sketch follows the config below). I ran the same config after each upgrade, so there are 150 such projects on the cluster now.

# cat upgrade.yaml 
projects:
  - num: 50
    basename: svt-43-
    templates:
      -
        num: 6
        file: ./content/build-template.json
      -
        num: 10
        file: ./content/image-stream-template.json
      -
        num: 2
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e
8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 10
        file: ./content/ssh-secret-template.json
      -
        num: 5
        file: ./content/route-template.json
      -
        num: 10
        file: ./content/configmap-template.json
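
For completeness, a sketch of how a config like this is typically fed to cluster-loader.py; the -f flag is assumed from the examples in the svt repo, so check its README for the authoritative usage:

git clone https://github.com/openshift/svt
cd svt/openshift_scalability
# Point cluster-loader at the config above; the template paths in the config
# are relative to this directory.
python cluster-loader.py -f upgrade.yaml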

Comment 5 MinLi 2020-03-20 11:20:01 UTC
*** Bug 1815468 has been marked as a duplicate of this bug. ***

Comment 6 Mike Fiedler 2020-03-20 11:21:36 UTC
See bug 1815468 for more details on the same issue

Comment 7 Johnny Liu 2020-03-20 12:03:13 UTC
I also hit this issue when upgrading from 4.3.0-0.nightly-2020-03-18-011727 to 4.4.0-0.nightly-2020-03-18-102708 in a disconnected UPI-on-baremetal install.

Comment 10 ge liu 2020-03-23 16:37:47 UTC
Verified on baremetal: an upgrade from 4.3.5 to 4.4.0-0.nightly-2020-03-23-010639 and then to 4.5.0-0.nightly-2020-03-23-124202 succeeded. I added some resources to etcd before the upgrade and verified that all components work well afterwards.

Comment 11 Kim Borup 2020-05-29 14:14:52 UTC
I get the following errors after upgrading from 4.3.21 to 4.4.5 on Azure and AWS:

2020-05-29 14:02:01.781098 I | embed: rejected connection from "10.0.147.100:39738" (error "EOF", ServerName "")
2020-05-29 14:02:01.781168 I | embed: rejected connection from "10.0.147.100:39740" (error "EOF", ServerName "")
2020-05-29 14:03:01.745965 I | embed: rejected connection from "10.0.147.100:40426" (error "EOF", ServerName "")
2020-05-29 14:04:01.614448 I | embed: rejected connection from "10.0.147.100:41044" (error "EOF", ServerName "")
2020-05-29 14:04:01.614600 I | embed: rejected connection from "10.0.147.100:1058" (error "EOF", ServerName "")
2020-05-29 14:05:01.756987 I | embed: rejected connection from "10.0.147.100:41762" (error "EOF", ServerName "")

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-167-77.eu-north-1.compute.internal is unhealthy"


I0529 14:14:07.605023       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.147.100:2379 0  <nil>} {https://10.0.167.77:2379 0  <nil>} {https://10.0.142.180:2379 0  <nil>}]
I0529 14:14:07.621794       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:07.621986       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:07.622105       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:10.608506       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:13.616833       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:13.617328       1 client.go:361] parsed scheme: "endpoint"
I0529 14:14:13.617384       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.142.180:2379 0  <nil>} {https://10.0.147.100:2379 0  <nil>} {https://10.0.167.77:2379 0  <nil>}]
I0529 14:14:13.631203       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:13.631221       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:13.631211       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.628449       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0529 14:14:16.629401       1 client.go:361] parsed scheme: "endpoint"
I0529 14:14:16.629456       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.167.77:2379 0  <nil>} {https://10.0.142.180:2379 0  <nil>} {https://10.0.147.100:2379 0  <nil>}]
I0529 14:14:16.651758       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.651854       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0529 14:14:16.651924       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

Comment 12 Sam Batschelet 2020-05-29 17:04:12 UTC
> 2020-05-29 14:02:01.781098 I | embed: rejected connection from "10.0.147.100:39738" (error "EOF", ServerName "")

This is info-level logging ("I |") rather than an error; it is the result of the teardown of the etcd client balancer. We are working on cleaning up these logs.


> Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-167-77.eu-north-1.compute.internal is unhealthy"

This was a bug [1] that we patched; the fix should be part of 4.4.7. The bug results in invalid health checks against etcd members. The cluster has an etcdMemberDown alert that will fire if a member is actually down.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1838781
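
A minimal sketch of how to confirm whether etcdMemberDown is actually firing, as opposed to the spurious Degraded message from the bug above; the thanos-querier route and the token handling are assumptions, and any Prometheus client pointed at the cluster monitoring stack works the same way:

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# Query the ALERTS metric for a firing etcdMemberDown alert.
curl -sk -G -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=ALERTS{alertname="etcdMemberDown",alertstate="firing"}' \
  "https://$HOST/api/v1/query"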

Comment 14 Naveen Malik 2020-07-28 18:23:36 UTC
I think I hit this for an OSD cluster upgraded from 4.3.25 to 4.4.11 today. No alerts, no degraded clusteroperators. Three workers were not updated, one with the following machineconfiguration annotations:

$ oc get node ip-10-0-132-179.ca-central-1.compute.internal -o json | jq -r '.metadata.annotations' | grep machineconfig
  "machineconfiguration.openshift.io/currentConfig": "rendered-worker-e510f353841a3261398328ba745b7969",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-0a077ea3b17f66f31929fe0e2608b77c",
  "machineconfiguration.openshift.io/reason": "unexpected on-disk state validating against rendered-worker-0a077ea3b17f66f31929fe0e2608b77c",
  "machineconfiguration.openshift.io/state": "Degraded",

All other nodes are in state="Done".

I'll post the must-gather once it's uploaded. I will be replacing this worker by deleting the related Machine and letting MCC do its thing; a sketch of that step is below.
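A sketch of that replacement step; the node-to-Machine annotation is the usual machine-api convention, and the node name is the one quoted above:

# Find the Machine backing the degraded node via its machine-api annotation
# (value has the form openshift-machine-api/<machine-name>).
MACHINE=$(oc get node ip-10-0-132-179.ca-central-1.compute.internal \
  -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')

# Deleting the Machine lets the owning MachineSet provision a replacement node
# that boots directly onto the desired rendered-worker config.
oc -n openshift-machine-api delete machine "${MACHINE#*/}"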

Comment 16 errata-xmlrpc 2020-08-04 18:06:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 17 W. Trevor King 2021-04-05 17:46:22 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475