Bug 1952618 - 4.7.4->4.7.8 Upgrade Caused OpenShift-Apiserver Outage
Summary: 4.7.4->4.7.8 Upgrade Caused OpenShift-Apiserver Outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-22 16:52 UTC by Steve Kuznetsov
Modified: 2021-07-27 23:03 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:02:56 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:03:12 UTC)

Description Steve Kuznetsov 2021-04-22 16:52:43 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Steve Kuznetsov 2021-04-22 18:30:33 UTC
The linked client failures were:

 WARN[2021-04-22T15:13:15Z] Failed to get build e2e-bin.                  error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:45Z] Failed to get build e2e-bin.                  error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin)
WARN[2021-04-22T15:13:51Z] Failed to get build e2e-bin.                  error=the server is currently unable to handle the request (get builds.build.openshift.io e2e-bin) 

So that is around the time when we must have had connectivity/uptime issues.
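
These 503s come from the kube-apiserver aggregation layer: builds.build.openshift.io is served by an aggregated APIService that proxies to the openshift-apiserver pods, so while those pods are unreachable the aggregator answers 503 for the whole API group. A rough sketch of such a registration (names and values are illustrative, not copied from this cluster):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.build.openshift.io
spec:
  group: build.openshift.io
  version: v1
  service:
    namespace: openshift-apiserver   # requests for builds are proxied to these pods
    name: api
    port: 443
  groupPriorityMinimum: 9900
  versionPriority: 15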

Comment 3 Yu Qi Zhang 2021-04-22 19:49:11 UTC
Adding some MCO timeline:

1st master:
starts: 15:06:13
drain complete: 15:07:30
update successful: 15:10:34

2nd master:
starts: 15:10:40
drain start: 15:10:42
pdb issue: 15:10:49 - 15:11:09 (or 15:11:24)
update successful: 15:15:15

3rd master:
starts: 15:15:21
drain start: 15:15:23
pdb issue: 15:15:33 - 15:15:58 (25s)
update successful: 15:19:43

The pdb issues above are due to etcd-quorum-guard not draining (possibly because the replacement pod for the updated master had not started yet). Otherwise, the MCO master upgrade completed successfully in ~15 minutes with no errors.
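
For context on the PDB blocking: etcd-quorum-guard is covered by a PodDisruptionBudget, so drain keeps retrying eviction until enough guard replicas are ready on the other masters. A rough sketch of such a budget (values are illustrative, not copied from the shipped manifest):

apiVersion: policy/v1beta1           # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-etcd
spec:
  minAvailable: 2                    # eviction is refused while readiness would drop below this
  selector:
    matchLabels:
      k8s-app: etcd-quorum-guard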

Comment 5 Stefan Schimanski 2021-04-23 07:36:15 UTC
Relevant events:

1. draining kills pods at 15:45Z
2. pod stopped listening at 15:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:45Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:10:45Z"
  message: Stopping container openshift-apiserver
  metadata:
    creationTimestamp: "2021-04-22T15:10:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:45Z"
    name: apiserver-c64dc5678-ss7p4.167836bba15e41ab
    namespace: openshift-apiserver
    resourceVersion: "649113517"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bba15e41ab
    uid: 6a0f865f-7061-4a0b-a3f6-a77c96e206d5
  reason: Killing
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:10:46Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{openshift-apiserver}
    kind: Pod
    name: apiserver-c64dc5678-ss7p4
    namespace: openshift-apiserver
    resourceVersion: "648042934"
    uid: c237112e-e9f1-45e9-91df-9d9a6965e1ea
  kind: Event
  lastTimestamp: "2021-04-22T15:11:06Z"
  message: 'Liveness probe failed: Get "https://10.130.64.73:8443/healthz": dial tcp 10.130.64.73:8443: connect: connection refused'
  metadata:
    creationTimestamp: "2021-04-22T15:10:46Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:fieldPath: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
      manager: kubelet
      operation: Update
      time: "2021-04-22T15:10:46Z"
    name: apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    namespace: openshift-apiserver
    resourceVersion: "649115334"
    selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-c64dc5678-ss7p4.167836bbcefcf2a1
    uid: 0d22e470-c018-4897-beab-afaa0c9e94ac
  reason: Unhealthy
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: ip-10-0-140-81.ec2.internal
  type: Warning
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2021-04-22T15:12:45Z"
  involvedObject:
    apiVersion: apps/v1
    kind: Deployment
    name: openshift-apiserver-operator
    namespace: openshift-apiserver-operator
    uid: f5777cc0-0b49-4c34-a2ca-420081e3ef08
  kind: Event
  lastTimestamp: "2021-04-22T15:13:51Z"
  message: '"apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)'
  metadata:
    creationTimestamp: "2021-04-22T15:12:45Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:uid: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
        f:type: {}
      manager: cluster-openshift-apiserver-operator
      operation: Update
      time: "2021-04-22T15:12:45Z"
    name: openshift-apiserver-operator.167836d78d46a7a8
    namespace: openshift-apiserver-operator
    resourceVersion: "649125103"
    selfLink: /api/v1/namespaces/openshift-apiserver-operator/events/openshift-apiserver-operator.167836d78d46a7a8
    uid: 23339b7b-f955-43cf-907f-5eb652fb7bb6
  reason: OpenShiftAPICheckFailed
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: openshift-apiserver-operator-apiservice-openshift-apiserver-controller-apiservicecontroller_openshift-apiserver
  type: Warning

Comment 6 Stefan Schimanski 2021-04-23 07:39:45 UTC
I got the minutes wrong in comment 5; this is the correct timeline:

1. draining kills pods at 15:10:45Z
2. pod stopped listening at 15:10:46Z (super early, SDN certainly had no chance to react)
3. operator notices APIService is (still?) down at 15:12:45Z
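
The underlying problem is step 2: the container stopped listening essentially immediately, before endpoint removal could propagate through the SDN, so clients were still being routed to a dead pod. The usual mitigation is to keep the terminating pod serving (while failing readiness) for a short delay before it actually stops accepting connections. At the pod level that pattern looks roughly like the following; the values are purely illustrative and this is not the actual openshift-apiserver manifest or the eventual 4.8 fix:

spec:
  terminationGracePeriodSeconds: 90        # must cover the preStop delay plus graceful shutdown
  containers:
  - name: openshift-apiserver
    readinessProbe:                        # endpoints are withdrawn once this starts failing
      httpGet:
        path: /readyz
        port: 8443
        scheme: HTTPS
    lifecycle:
      preStop:
        exec:
          # keep serving while the endpoint/SDN update propagates,
          # then let the process shut down gracefully on SIGTERM
          command: ["sleep", "30"]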

Comment 7 Stefan Schimanski 2021-05-18 08:35:06 UTC
This should be fixed through https://github.com/openshift/openshift-apiserver/pull/198 in 4.8.

Comment 9 Xingxing Xia 2021-05-25 12:04:50 UTC
Verified in https://bugzilla.redhat.com/show_bug.cgi?id=1912820#c14

Comment 12 errata-xmlrpc 2021-07-27 23:02:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

