Description of problem:

During an upgrade from 4.7.4 to 4.7.8, clients failed to reach the OpenShift API:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/1902/pull-ci-openshift-ci-tools-master-images/1385240490412085248#1:build-log.txt%3A187

Must-gather is here: https://coreos.slack.com/archives/C01UQNJA31D/p1619108789012400

Version-Release number of selected component (if applicable):
4.7.4, upgrading to 4.7.8

How reproducible:
Observed once, in the CI upgrade job linked above.

Steps to Reproduce:
1. Upgrade a cluster from 4.7.4 to 4.7.8.
2. Issue requests against an aggregated API (e.g. builds.build.openshift.io) while the masters drain.

Actual results:
Requests fail with "the server is currently unable to handle the request" (HTTP 503).

Expected results:
The OpenShift API stays available throughout the upgrade.

Additional info:
The client failures as linked were:

WARN[2021-04-22T15:13:15Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:45Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:51Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)

So that is the window in which we must have had connectivity/uptime issues; the aggregation wiring behind these 503s is sketched below.
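For context on why clients saw 503s: builds.build.openshift.io is served by the aggregated openshift-apiserver, not by the kube-apiserver itself, so the aggregation layer answers 503 for the whole group while the backing service has no reachable endpoints. A rough sketch of the APIService wiring involved, with field values from memory of a stock cluster rather than from this must-gather:

# APIService registering build.openshift.io/v1 with the aggregation layer.
# While the backing Service has no reachable endpoints (e.g. mid-rollout),
# every "get builds.build.openshift.io" returns 503.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.build.openshift.io
spec:
  group: build.openshift.io
  version: v1
  groupPriorityMinimum: 9900      # illustrative; not checked against 4.7
  versionPriority: 15
  service:
    name: api                     # backed by the openshift-apiserver pods
    namespace: openshift-apiserver
    port: 443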
Adding some MCO timeline:

1st master:
- starts: 15:06:13
- drain complete: 15:07:30
- update successful: 15:10:34

2nd master:
- start: 15:10:40
- drain start: 15:10:42
- pdb issue: 15:10:49 - 15:11:09 (or 15:11:24)
- update successful: 15:15:15

3rd master:
- starts: 15:15:21
- draining: 15:15:23
- pdb error: 15:15:33 - 15:15:58 (25s)
- successfully finished: 15:19:43

The pdb issues above are due to etcd-quorum-guard not draining (possibly because the replacement pod for the updated master had not started yet); see the sketch after this timeline. Otherwise the MCO master upgrade was successful in ~15 minutes with no errors.
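For reference, the pdb blocking is expected drain semantics: with three etcd-quorum-guard replicas and a budget that tolerates only one disruption, draining the next master is refused until the guard pod from the previously updated master is Ready again. A hypothetical PDB of that shape (name, namespace, and labels are assumptions, not copied from the cluster):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard        # assumed name
  namespace: openshift-etcd      # assumed namespace; varies by release
spec:
  maxUnavailable: 1              # only one guard pod may be evicted at a time
  selector:
    matchLabels:
      name: etcd-quorum-guard

With a budget like this, eviction during drain fails (the ~25s of pdb errors above) until all other guard pods report Ready.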
Relevant events:

1. draining kills pods at 15:45Z
2. pod stopped listening at 15:46Z (super early, SDN certainly had no chance to react; see the probe sketch after the events)
3. operator notices APIService is (still?) down at 15:12:45Z

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:45Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:10:45Z"
  message: Stopping container openshift-apiserver
  metadata:
    creationTimestamp: "2021-04-22T15:10:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:45Z"
    name: apiserver-c64dc5678-ss7p4.167836bba15e41ab
    namespace: openshift-apiserver
    resourceVersion: "649113517"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bba15e41ab
    uid: 6a0f865f-7061-4a0b-a3f6-a77c96e206d5
  reason: Killing
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:46Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:11:06Z"
  message: 'Liveness probe failed: Get "https://10.130.64.73:8443/healthz": dial tcp 10.130.64.73:8443: connect: connection refused'
  metadata:
    creationTimestamp: "2021-04-22T15:10:46Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:46Z"
    name: apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    namespace: openshift-apiserver
    resourceVersion: "649115334"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    uid: 0d22e470-c018-4897-beab-afaa0c9e94ac
  reason: Unhealthy
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Warning
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:12:45Z"
  involvedObject:
    apiVersion: apps/v1
    kind: Deployment
    name: openshift-apiserver-operator
    namespace: openshift-apiserver-operator
    uid: f5777cc0-0b49-4c34-a2ca-420081e3ef08
  kind: Event
  lastTimestamp: "2021-04-22T15:13:51Z"
  message: '"apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)'
  metadata:
    creationTimestamp: "2021-04-22T15:12:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
        f:type: {}
      manager: cluster-openshift-apiserver-operator
      operation: Update
      time: "2021-04-22T15:12:45Z"
    name: openshift-apiserver-operator.167836d78d46a7a8
    namespace: openshift-apiserver-operator
    resourceVersion: "649125103"
    selfLink: /api/v1/namespaces/openshift-apiserver-operator/events/openshift-apiserver-operator.167836d78d46a7a8
    uid: 23339b7b-f955-43cf-907f-5eb652fb7bb6
  reason: OpenShiftAPICheckFailed
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: openshift-apiserver-operator-apiservice-openshift-apiserver-controller-apiservicecontroller_openshift-apiserver
  type: Warning
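The Unhealthy events show the kubelet's liveness probe failing with "connection refused" within a second of the Killing event, i.e. the container closed its listener immediately rather than draining gracefully. For illustration, a probe of roughly the shape that produces this failure mode; the values are assumptions, not read from the actual pod spec:

# Illustrative liveness probe on the openshift-apiserver container. Once the
# process stops listening on 8443, each probe attempt fails with
# "dial tcp ...:8443: connect: connection refused".
livenessProbe:
  httpGet:
    scheme: HTTPS
    port: 8443
    path: /healthz
  initialDelaySeconds: 30
  timeoutSeconds: 10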
Missed the minutes. This is correct:

1. draining kills pods at 15:10:45Z
2. pod stopped listening at 15:10:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z
This should be fixed through https://github.com/openshift/openshift-apiserver/pull/198 in 4.8.
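For anyone hitting this class of problem on older releases: the usual shape of such a fix is a lame-duck period, keeping the old pod serving after SIGTERM so endpoint removal can propagate to the SDN and the aggregator before the listener closes. The linked PR presumably addresses this inside the server's shutdown sequence; below is only a sketch of the general pattern expressed at the pod level, with assumed values:

# Lame-duck sketch: delay shutdown so traffic stops arriving before the
# listener closes. Values are illustrative, not taken from the actual fix.
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: openshift-apiserver
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "30"]   # keep serving while endpoints update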
Verified in https://bugzilla.redhat.com/show_bug.cgi?id=1912820#c14
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438