Bug 1952618
| Summary: | 4.7.4->4.7.8 Upgrade Caused OpenShift-Apiserver Outage | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steve Kuznetsov <skuznets> |
| Component: | openshift-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | aos-bugs, jerzhang, mfojtik, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:02:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Steve Kuznetsov
2021-04-22 16:52:43 UTC
During an upgrade from 4.7.4 to 4.7.8, clients failed to hit the OpenShift API: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/1902/pull-ci-openshift-ci-tools-master-images/1385240490412085248#1:build-log.txt%3A187

Must-gather is here: https://coreos.slack.com/archives/C01UQNJA31D/p1619108789012400

The client failures as linked were:

```
WARN[2021-04-22T15:13:15Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:45Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:51Z] Failed to get build e2e-bin. error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
```

So around that time is when we must have had connectivity/uptime issues.

Adding some MCO timeline:

1st master:
- starts: 15:06:13
- drain complete: 15:07:30
- update successful: 15:10:34

2nd master:
- start: 15:10:40
- drain start: 15:10:42
- pdb issue: 15:10:49 - 15:11:09 (or 15:11:24)
- update successful: 15:15:15

3rd master:
- starts: 15:15:21
- draining: 15:15:23
- pdb error: 15:15:33 - 15:15:58 (25s)
- successfully finish: 15:19:43

The pdb issues above are due to etcd-quorum-guard not draining (maybe the replacement pod for the updated master hasn't started yet); a sketch of how such a budget blocks drains follows this comment. Otherwise the MCO master upgrade was successful in ~15 minutes with no errors.
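For reference, the etcd-quorum-guard PodDisruptionBudget that produces these drain stalls looks roughly like the sketch below (illustrative and reconstructed from memory; the namespace and labels are assumptions and may differ from what this cluster actually ran). With three masters and maxUnavailable: 1, eviction of the guard pod on the next master is refused until the guard pod on the previously updated master is Ready again, which matches the 20-25s pdb windows in the timeline above.

```yaml
# Illustrative sketch of an etcd-quorum-guard disruption budget
# (policy/v1beta1 was current on 4.7); namespace/labels are assumptions.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-etcd
spec:
  maxUnavailable: 1          # with 3 guard replicas, only one may be down
  selector:
    matchLabels:
      name: etcd-quorum-guard
```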
Relevant events:

1. draining kills pods at 15:45Z
2. pod stopped listening at 15:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z
```yaml
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:45Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:10:45Z"
  message: Stopping container openshift-apiserver
  metadata:
    creationTimestamp: "2021-04-22T15:10:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:45Z"
    name: apiserver-c64dc5678-ss7p4.167836bba15e41ab
    namespace: openshift-apiserver
    resourceVersion: "649113517"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bba15e41ab
    uid: 6a0f865f-7061-4a0b-a3f6-a77c96e206d5
  reason: Killing
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:46Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:11:06Z"
  message: 'Liveness probe failed: Get "https://10.130.64.73:8443/healthz": dial tcp 10.130.64.73:8443: connect: connection refused'
  metadata:
    creationTimestamp: "2021-04-22T15:10:46Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:46Z"
    name: apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    namespace: openshift-apiserver
    resourceVersion: "649115334"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    uid: 0d22e470-c018-4897-beab-afaa0c9e94ac
  reason: Unhealthy
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Warning
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:12:45Z"
  involvedObject:
    apiVersion: apps/v1
    kind: Deployment
    name: openshift-apiserver-operator
    namespace: openshift-apiserver-operator
    uid: f5777cc0-0b49-4c34-a2ca-420081e3ef08
  kind: Event
  lastTimestamp: "2021-04-22T15:13:51Z"
  message: '"apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)'
  metadata:
    creationTimestamp: "2021-04-22T15:12:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
        f:type: {}
      manager: cluster-openshift-apiserver-operator
      operation: Update
      time: "2021-04-22T15:12:45Z"
    name: openshift-apiserver-operator.167836d78d46a7a8
    namespace: openshift-apiserver-operator
    resourceVersion: "649125103"
    selfLink: /api/v1/namespaces/openshift-apiserver-operator/events/openshift-apiserver-operator.167836d78d46a7a8
    uid: 23339b7b-f955-43cf-907f-5eb652fb7bb6
  reason: OpenShiftAPICheckFailed
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: openshift-apiserver-operator-apiservice-openshift-apiserver-controller-apiservicecontroller_openshift-apiserver
  type: Warning
```
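The Unhealthy events above come from the kubelet's liveness probe against the openshift-apiserver container on :8443. The probe is roughly of the following shape (an illustrative sketch matching the path and port in the event; the timing values are assumptions, not the shipped pod spec):

```yaml
# Sketch of the kind of probe that produced "Liveness probe failed ...
# connection refused" once the container had already stopped listening.
livenessProbe:
  httpGet:
    scheme: HTTPS
    path: /healthz
    port: 8443
  periodSeconds: 10       # assumed
  failureThreshold: 3     # assumed
```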
Missed the minutes. This is correct:

1. draining kills pods at 15:10:45Z
2. pod stopped listening at 15:10:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z

This should be fixed through https://github.com/openshift/openshift-apiserver/pull/198 in 4.8 (a sketch of the general graceful-termination pattern follows the errata note at the end of this report).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
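I have not inspected the contents of PR 198, but the standard remedy for this failure mode is for the apiserver to keep accepting connections for a short window after SIGTERM, so that endpoints and the SDN have time to take the pod out of rotation before it stops listening. kube-apiserver exposes this via the generic apiserver library's --shutdown-delay-duration flag; below is a minimal, hypothetical pod-spec sketch of the pattern (the flag name is real for kube-apiserver, but the values and the wiring onto openshift-apiserver are assumptions):

```yaml
# Hypothetical sketch of graceful apiserver termination, not the actual fix.
spec:
  terminationGracePeriodSeconds: 90   # must comfortably exceed the delay below
  containers:
  - name: openshift-apiserver
    args:
    - --shutdown-delay-duration=15s   # keep serving after SIGTERM (assumed value)
```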