Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1861017

Summary: Upgrade from 4.4.6 to 4.5.3 using stable-4.5 channel is failing (250 nodes)
Product: OpenShift Container Platform
Reporter: Naga Ravi Chaitanya Elluri <nelluri>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.4
CC: amurdaca, bleanhar, jeder, kgarriso, lmohanty, nelluri, sbatsche, scuppett, sdodson, skolicha, wking, wlewis, yprokule
Keywords: Upgrades
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Linux
Whiteboard: aos-scalability-45
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-08-20 23:54:18 UTC

Description Naga Ravi Chaitanya Elluri 2020-07-27 16:18:05 UTC
Description of problem:
We tried upgrading a large-scale cluster (250 nodes with 4000 projects/60k pods) built with 4.4.6 bits to 4.5.3 using the stable-4.5 channel, and the upgrade has been stuck for about 48 hours. All of the cluster operators were upgraded, but only 3 of the nodes moved to kube v1.18. Operators including apiserver, dns, network, etc. are in a degraded state because a couple of nodes are stuck in NotReady. One of the master nodes has been upgraded, but scheduling on it remains disabled, causing some of the operators (apiserver) to degrade (the current apiserver replica count does not match the expected count). MCO stopped progressing after updating a couple of nodes. We also observed a number of etcd request timeouts during the process, as well as requests to the apiserver being rejected/throttled (from the node journal logs).
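For reference, a minimal sketch of the kind of oc commands used to observe this state (assuming cluster-admin access; exact output will vary):

```
# Overall upgrade status and the version the CVO is reconciling toward
oc get clusterversion

# Nodes stuck in NotReady and their kubelet versions
oc get nodes -o wide | grep NotReady

# Cluster operators reporting Degraded (apiserver, dns, network, ...)
oc get clusteroperators

# Machine config pool progress (MCO stopped progressing here)
oc get machineconfigpools
```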

Logs, including must-gather output, failed operator events, and journal logs from the NotReady nodes, are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/upgrades/.

Version-Release number of selected component (if applicable):
4.4.6

How reproducible:


Steps to Reproduce:
1. Install a large-scale cluster using a 4.4.6 build and scale it up to a high node count (250 in this case).
2. Load the cluster with a large number of objects.
3. Update the upgrade channel to stable-4.5.
4. Trigger the upgrade of the cluster to 4.5.3 (see the sketch after this list).
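
A minimal sketch of steps 3 and 4 with oc (the channel and target version are from this report; this is standard oc usage, not necessarily the exact commands used):

```
# Step 3: point the ClusterVersion at the stable-4.5 channel
oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.5"}}'

# Step 4: request the upgrade to 4.5.3
oc adm upgrade --to=4.5.3
```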

Actual results:
The upgrade is stuck, with only 3 nodes upgraded in ~48 hours. Scheduling is disabled on one of the master nodes after its upgrade. A couple of cluster operators are in a degraded state (mostly related to nodes being NotReady and to scheduling being disabled on that master node).
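
If the upgraded master is merely cordoned, a hypothetical first remediation step is to re-enable scheduling (node name below is a placeholder):

```
# Re-enable scheduling on the cordoned master (placeholder node name)
oc adm uncordon <master-node-name>

# Confirm the node and the apiserver operator recover
oc get node <master-node-name>
oc get clusteroperator openshift-apiserver
```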

Expected results:
Cluster upgraded successfully to 4.5.3.

Comment 1 Kirsten Garrison 2020-07-27 16:25:31 UTC
Looking at the must-gather, I'm seeing lots of issues throughout critical operators; this isn't striking me as mainly an MCO issue:

```
2020-07-25T20:15:37.601156014Z E0725 20:15:37.601147       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: Unauthorized
2020-07-25T20:15:37.601186114Z E0725 20:15:37.601169       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
2020-07-25T20:15:37.601209525Z E0725 20:15:37.601172       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
```

```
2020-07-26T03:13:42.007781716Z I0726 03:13:42.007738       1 etcdcli.go:96] configmaps/etcd-endpoints is missing annotation alpha.installer.openshift.io/etcd-bootstrap
2020-07-26T03:13:45.015900115Z I0726 03:13:45.015856       1 etcdcli.go:96] configmaps/etcd-endpoints is missing annotation alpha.installer.openshift.io/etcd-bootstrap
```
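
One way to check for that annotation directly (a sketch; assumes the configmap lives in the openshift-etcd namespace):

```
oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.metadata.annotations}'
```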

```
APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
```

```
2020-07-25T20:16:52.651204404Z time="2020-07-25T20:16:52Z" level=error msg="failed to reconcile request /default: failed to ensure dns namespace: failed to ensure dns cluster role for openshift-dns: rpc error: code = Unavailable desc = transport is closing"
```

```
2020-07-26T03:51:06.95512586Z 2020/07/26 03:51:06 configmap 'openshift-config/initial-kube-apiserver-server-ca' name differs from trustedCA of proxy 'cluster' or trustedCA not set; reconciliation will be skipped
```
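
To list which operators are behind messages like these (a sketch; assumes jq is available):

```
# Cluster operators currently reporting Degraded=True
oc get clusteroperators -o json \
  | jq -r '.items[] | select(any(.status.conditions[]; .type == "Degraded" and .status == "True")) | .metadata.name'
```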

Comment 4 Scott Dodson 2020-07-31 12:51:30 UTC
We're asking the following questions to evaluate whether this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 8 Sam Batschelet 2020-08-20 23:54:18 UTC

*** This bug has been marked as a duplicate of bug 1852047 ***

Comment 9 Red Hat Bugzilla 2023-09-14 06:04:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days