Bug 1838716 - Improve APIServerError condition name many degraded clusters report
Summary: Improve APIServerError condition name many degraded clusters report
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1842560 1842561
 
Reported: 2020-05-21 16:10 UTC by Michal Fojtik
Modified: 2020-07-13 17:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Samples operator file system errors were incorrectly reported as API server errors in the clusteroperator reason field, and the reason reported for actual API server errors (raised while manipulating API server objects) did not identify the exact type of failure.
Consequence: Analysis of degraded samples operators reported via OTA/insights was unnecessarily hindered.
Fix: File system errors are now reported as file system errors in the degraded reason field, and API server errors reported in the degraded reason field include the specific error type.
Result: Degraded samples operator conditions around API server errors and file system errors are more easily triaged.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:40:55 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 275 0 None closed Bug 1838716: improve reason text on degraded condition 2020-06-24 03:13:10 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:41:14 UTC

Description Michal Fojtik 2020-05-21 16:10:34 UTC
Description of problem:

Many clusters that upgraded from 4.3 to 4.4 are reporting a degraded samples-operator with "APIServerError" as the condition name.

I believe we should make this condition name more specific. I propose:

APIServerTimeoutError
APIServerConnectionRefusedError
APIServerNoRouteToHostError

UnknownError (?) if we cannot tell what kind of error it is
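
One way the proposed names could be derived is sketched below in Go, using only the standard library. This is an illustration of the proposal, not the samples operator's actual code: the helper classifyAPIServerError and the sampleErrors inputs are hypothetical, and the mapping from low-level errors to names is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classifyAPIServerError maps a low-level error to one of the condition
// names proposed in this bug. Hypothetical helper for illustration only.
func classifyAPIServerError(err error) string {
	if err == nil {
		return ""
	}
	var netErr net.Error
	switch {
	case errors.Is(err, syscall.ECONNREFUSED):
		return "APIServerConnectionRefusedError"
	case errors.Is(err, syscall.EHOSTUNREACH):
		return "APIServerNoRouteToHostError"
	case errors.Is(err, context.DeadlineExceeded),
		errors.As(err, &netErr) && netErr.Timeout():
		return "APIServerTimeoutError"
	default:
		return "UnknownError"
	}
}

// sampleErrors are made-up inputs used by the demo in main.
var sampleErrors = []error{
	&net.OpError{Op: "dial", Net: "tcp", Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)},
	&net.OpError{Op: "dial", Net: "tcp", Err: os.NewSyscallError("connect", syscall.EHOSTUNREACH)},
	fmt.Errorf("list imagestreams: %w", context.DeadlineExceeded),
	errors.New("something else went wrong"),
}

func main() {
	for _, err := range sampleErrors {
		fmt.Printf("%-45v -> %s\n", err, classifyAPIServerError(err))
	}
}
```

The connection-refused and no-route checks come before the timeout check because syscall.Errno also satisfies net.Error, so the more specific matches should win.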

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Gabe Montero 2020-05-21 16:17:25 UTC
Some highlights from the discussion in https://coreos.slack.com/archives/CB48XQ4KZ/p1590074085229800

1) Michal encountered instances of https://bugzilla.redhat.com/show_bug.cgi?id=1835995 when looking at upgrades to 4.4.4
2) the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1835995 is not out yet in 4.4.x (still in verified as of this posting)
3) the fact that we even cited an APIServer error was part of the problem with https://bugzilla.redhat.com/show_bug.cgi?id=1835995
4) the problem stemmed from accessing content in the payload

All that said, for "real" issues on API server related calls, I'll look into changes to augment the samples operator reason code
along the lines Michal has articulated.

Comment 2 Gabe Montero 2020-05-21 16:25:13 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1832344 is the other upgrade bug that was leading to what Michal saw

Comment 4 Gabe Montero 2020-05-21 17:28:45 UTC
set priority vs. severity in my haste

Comment 8 Gabe Montero 2020-05-28 14:57:30 UTC
Verification for this will be tricky @XiuJuan

Causing a disruption to the api server while samples tries to install would be needed.

Not 100% confident this will fly, but:
1) mark samples operator removed
2) scale down / kill the 3 openshift api server pods
3) mark samples operator managed ASAP after 2) and see if errors occur specific to trying to create imagestreams / templates
4) then catch the openshift-samples clusteroperator being in degraded status and see what the reason is
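
Step 4 above can be sketched as a small Go check that pulls the Degraded reason out of the clusteroperator's JSON, for example as produced by `oc get clusteroperator openshift-samples -o json`. This is a minimal sketch under assumptions: the type and function names are hypothetical, only the fields needed here are modelled, and sampleJSON is a trimmed stand-in for real output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition models only the clusteroperator condition fields used here.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// clusterOperator models only status.conditions of the object.
type clusterOperator struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// degradedReason reports whether the operator is Degraded=True and,
// if so, the reason attached to that condition. Hypothetical helper.
func degradedReason(raw []byte) (reason string, degraded bool, err error) {
	var co clusterOperator
	if err := json.Unmarshal(raw, &co); err != nil {
		return "", false, err
	}
	for _, c := range co.Status.Conditions {
		if c.Type == "Degraded" && c.Status == "True" {
			return c.Reason, true, nil
		}
	}
	return "", false, nil
}

// sampleJSON is a trimmed, illustrative stand-in for real output.
const sampleJSON = `{
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "True", "reason": "APIServerServiceUnavailableError"},
      {"type": "Available", "status": "False"},
      {"type": "Progressing", "status": "True"}
    ]
  }
}`

func main() {
	reason, degraded, err := degradedReason([]byte(sampleJSON))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("degraded=%v reason=%q\n", degraded, reason)
}
```

Because the degraded window may be short, a check like this would likely need to be run in a polling loop right after the API server pods are killed.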

My thinking is to take a pass or two at that and see what results.

Otherwise, claim due diligence and mark as verified.

Comment 9 XiuJuan Wang 2020-06-01 10:05:08 UTC
I got the APIServerServiceUnavailableError reason in the openshift-samples clusteroperator after deleting the three openshift-apiserver pods.

Gabe, we could mark this as verified against 4.5.0-0.nightly-2020-05-31-230932 version.

Following comment #8
A.
1) mark samples operator removed
2) scale down / kill the 3 openshift api server pods
3) then catch the openshift-samples clusteroperator being in degraded status and see what the reason is

The openshift-samples clusteroperator reports APIServerServiceUnavailableError when deleting a template or imagestream fails while interacting with the apiserver

 conditions:
  - lastTransitionTime: "2020-06-01T09:28:58Z"
    message: The error the server is currently unable to handle the request (delete
      templates.template.openshift.io jws31-tomcat7-postgresql-s2i) during openshift
      namespace cleanup has left the samples in an unknown state;
    reason: APIServerServiceUnavailableError
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-06-01T09:28:58Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-06-01T09:28:58Z"
    message: 'Samples installation in error at 4.5.0-0.nightly-2020-05-31-230932:
      APIServerServiceUnavailableError'
    status: "True"
    type: Progressing

B.
1) mark samples operator removed
2) wait until the samples are removed, then mark the samples operator as Managed
3) kill the 3 openshift api server pods
4) then catch the openshift-samples clusteroperator being in degraded status and see what the reason is

The openshift-samples clusteroperator reports APIServerServiceUnavailableError when creating a template or imagestream fails while interacting with the apiserver


status:
  conditions:
  - lastTransitionTime: "2020-06-01T09:57:05Z"
    message: 'error creating samples: the server is currently unable to handle the
      request (put imagestreams.image.openshift.io fis-karaf-openshift);imagestream
      update error: the server is currently unable to handle the request (put imagestreams.image.openshift.io
      fis-karaf-openshift);'
    reason: APIServerServiceUnavailableError
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-06-01T09:56:55Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-06-01T09:56:55Z"
    message: 'Samples installation in error at 4.5.0-0.nightly-2020-05-31-230932:
      APIServerServiceUnavailableError'
    status: "True"
    type: Progressing
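
The message "the server is currently unable to handle the request" is the text Kubernetes clients attach to an HTTP 503 response, whose status reason is ServiceUnavailable. The Go sketch below shows how a reason such as APIServerServiceUnavailableError could be composed from a status code. The "APIServer" + reason + "Error" format is inferred from the output above, and reasonForStatus, the table, and the "Unknown" fallback are hypothetical; this is not the operator's actual code.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusReasons mirrors a few Kubernetes status reason names for
// common HTTP status codes. Deliberately incomplete.
var statusReasons = map[int]string{
	http.StatusUnauthorized:       "Unauthorized",
	http.StatusForbidden:          "Forbidden",
	http.StatusNotFound:           "NotFound",
	http.StatusConflict:           "Conflict",
	http.StatusTooManyRequests:    "TooManyRequests",
	http.StatusServiceUnavailable: "ServiceUnavailable",
	http.StatusGatewayTimeout:     "Timeout",
}

// reasonForStatus builds a degraded-condition reason from an HTTP
// status code. The "Unknown" fallback is an assumption of this sketch.
func reasonForStatus(code int) string {
	name, ok := statusReasons[code]
	if !ok {
		name = "Unknown"
	}
	return "APIServer" + name + "Error"
}

func main() {
	for _, code := range []int{503, 504, 409, 418} {
		fmt.Printf("%d -> %s\n", code, reasonForStatus(code))
	}
}
```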

Comment 10 errata-xmlrpc 2020-07-13 17:40:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

