Bug 1952618
| Summary: | 4.7.4->4.7.8 Upgrade Caused OpenShift-Apiserver Outage | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steve Kuznetsov <skuznets> |
| Component: | openshift-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | aos-bugs, jerzhang, mfojtik, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:02:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Steve Kuznetsov
2021-04-22 16:52:43 UTC
During an upgrade from 4.7.4 to 4.7.8, clients failed to hit the OpenShift API: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/1902/pull-ci-openshift-ci-tools-master-images/1385240490412085248#1:build-log.txt%3A187

Must-gather is here: https://coreos.slack.com/archives/C01UQNJA31D/p1619108789012400

The client failures as linked were:

```
WARN[2021-04-22T15:13:15Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:45Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:51Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
```

So around that time is when we must have had connectivity/uptime issues.

Adding some MCO timeline:

1st master:
- starts: 15:06:13
- drain complete: 15:07:30
- update successful: 15:10:34

2nd master:
- start: 15:10:40
- drain start: 15:10:42
- pdb issue: 15:10:49 - 15:11:09 (or 15:11:24)
- update successful: 15:15:15

3rd master:
- starts: 15:15:21
- draining: 15:15:23
- pdb error: 15:15:33 - 15:15:58 (25s)
- successfully finish: 15:19:43

The pdb issues above are due to etcd-quorum-guard not draining (maybe the replacement pod for the updated master hasn't started yet); a sketch of how such a budget blocks drains follows this comment. Otherwise the MCO master upgrade was successful in ~15 minutes with no errors.
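For reference, the etcd-quorum-guard PodDisruptionBudget that produces these drain stalls looks roughly like the sketch below (illustrative and reconstructed from memory; the namespace and labels are assumptions and may differ from what this cluster actually ran). With three masters and maxUnavailable: 1, eviction of the guard pod on the next master is refused until the guard pod on the previously updated master is Ready again, which matches the 20-25s pdb windows in the timeline above.

```yaml
# Illustrative sketch of an etcd-quorum-guard disruption budget
# (policy/v1beta1 was current on 4.7); namespace/labels are assumptions.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-etcd
spec:
  maxUnavailable: 1          # with 3 guard replicas, only one may be down
  selector:
    matchLabels:
      name: etcd-quorum-guard
```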
Relevant events:

1. draining kills pods at 15:45Z
2. pod stopped listening at 15:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z
```yaml
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:45Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:10:45Z"
  message: Stopping container openshift-apiserver
  metadata:
    creationTimestamp: "2021-04-22T15:10:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:45Z"
    name: apiserver-c64dc5678-ss7p4.167836bba15e41ab
    namespace: openshift-apiserver
    resourceVersion: "649113517"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bba15e41ab
    uid: 6a0f865f-7061-4a0b-a3f6-a77c96e206d5
  reason: Killing
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:46Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:11:06Z"
  message: 'Liveness probe failed: Get "https://10.130.64.73:8443/healthz": dial tcp 10.130.64.73:8443: connect: connection refused'
  metadata:
    creationTimestamp: "2021-04-22T15:10:46Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:46Z"
    name: apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    namespace: openshift-apiserver
    resourceVersion: "649115334"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    uid: 0d22e470-c018-4897-beab-afaa0c9e94ac
  reason: Unhealthy
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Warning
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:12:45Z"
  involvedObject:
    apiVersion: apps/v1
    kind: Deployment
    name: openshift-apiserver-operator
    namespace: openshift-apiserver-operator
    uid: f5777cc0-0b49-4c34-a2ca-420081e3ef08
  kind: Event
  lastTimestamp: "2021-04-22T15:13:51Z"
  message: '"apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)'
  metadata:
    creationTimestamp: "2021-04-22T15:12:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
        f:type: {}
      manager: cluster-openshift-apiserver-operator
      operation: Update
      time: "2021-04-22T15:12:45Z"
    name: openshift-apiserver-operator.167836d78d46a7a8
    namespace: openshift-apiserver-operator
    resourceVersion: "649125103"
    selfLink: /api/v1/namespaces/openshift-apiserver-operator/events/openshift-apiserver-operator.167836d78d46a7a8
    uid: 23339b7b-f955-43cf-907f-5eb652fb7bb6
  reason: OpenShiftAPICheckFailed
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: openshift-apiserver-operator-apiservice-openshift-apiserver-controller-apiservicecontroller_openshift-apiserver
  type: Warning
```
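The Unhealthy events above come from the kubelet's liveness probe against the openshift-apiserver container on :8443. The probe is roughly of the following shape (an illustrative sketch matching the path and port in the event; the timing values are assumptions, not the shipped pod spec):

```yaml
# Sketch of the kind of probe that produced "Liveness probe failed ...
# connection refused" once the container had already stopped listening.
livenessProbe:
  httpGet:
    scheme: HTTPS
    path: /healthz
    port: 8443
  periodSeconds: 10       # assumed
  failureThreshold: 3     # assumed
```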
Missed the minutes. This is correct:

1. draining kills pods at 15:10:45Z
2. pod stopped listening at 15:10:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z

This should be fixed through https://github.com/openshift/openshift-apiserver/pull/198 in 4.8 (a sketch of the general graceful-termination pattern follows the errata note at the end of this report).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
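I have not inspected the contents of PR 198, but the standard remedy for this failure mode is for the apiserver to keep accepting connections for a short window after SIGTERM, so that endpoints and the SDN have time to take the pod out of rotation before it stops listening. kube-apiserver exposes this via the generic apiserver library's --shutdown-delay-duration flag; below is a minimal, hypothetical pod-spec sketch of the pattern (the flag name is real for kube-apiserver, but the values and the wiring onto openshift-apiserver are assumptions):

```yaml
# Hypothetical sketch of graceful apiserver termination, not the actual fix.
spec:
  terminationGracePeriodSeconds: 90   # must comfortably exceed the delay below
  containers:
  - name: openshift-apiserver
    args:
    - --shutdown-delay-duration=15s   # keep serving after SIGTERM (assumed value)
```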