Bug 1861017
| Summary: | Upgrade from 4.4.6 to 4.5.3 using stable-4.5 channel is failing (250 nodes) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | amurdaca, bleanhar, jeder, kgarriso, lmohanty, nelluri, sbatsche, scuppett, sdodson, skolicha, wking, wlewis, yprokule |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-45 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-20 23:54:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Naga Ravi Chaitanya Elluri
2020-07-27 16:18:05 UTC
Looking at the must-gather, I'm seeing lots of issues throughout critical operators; this isn't striking me as primarily an MCO issue:
```
2020-07-25T20:15:37.601156014Z E0725 20:15:37.601147 1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: Unauthorized
2020-07-25T20:15:37.601186114Z E0725 20:15:37.601169 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
2020-07-25T20:15:37.601209525Z E0725 20:15:37.601172 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
```
```
2020-07-26T03:13:42.007781716Z I0726 03:13:42.007738 1 etcdcli.go:96] configmaps/etcd-endpoints is missing annotation alpha.installer.openshift.io/etcd-bootstrap
2020-07-26T03:13:45.015900115Z I0726 03:13:45.015856 1 etcdcli.go:96] configmaps/etcd-endpoints is missing annotation alpha.installer.openshift.io/etcd-bootstrap
```
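The missing-annotation messages above come from the cluster-etcd-operator. As a hedged diagnostic sketch (the `openshift-etcd` namespace and the exact annotation key are inferred from the log line and the component field; verify both against your cluster), an admin could check whether the configmap actually carries the annotation:

```shell
# Look for the bootstrap annotation the operator reports as missing.
# grep exits non-zero (and we print a note) when the annotation is absent.
oc -n openshift-etcd get configmap etcd-endpoints -o yaml \
  | grep 'alpha.installer.openshift.io/etcd-bootstrap' \
  || echo "annotation not found"
```

If the annotation is genuinely absent, that matches the operator log and points at the same root cause tracked in the duplicate bug.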
```
APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
```
```
2020-07-25T20:16:52.651204404Z time="2020-07-25T20:16:52Z" level=error msg="failed to reconcile request /default: failed to ensure dns namespace: failed to ensure dns cluster role for openshift-dns: rpc error: code = Unavailable desc = transport is closing"
```
```
2020-07-26T03:51:06.95512586Z 2020/07/26 03:51:06 configmap 'openshift-config/initial-kube-apiserver-server-ca' name differs from trustedCA of proxy 'cluster' or trustedCA not set; reconciliation will be skipped
```
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this; we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

*** This bug has been marked as a duplicate of bug 1852047 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days