1698155 – Failing status should require multiple consecutive failures

Summary: Failing status should require multiple consecutive failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Luis Sanchez
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-09 17:35 UTC by David Eads
Modified:	2019-06-04 10:47 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:47:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:47:25 UTC

Description David Eads 2019-04-09 17:35:20 UTC

To avoid listing our operator as failed early, we can add a `count` to our failing status conditions on our low-level status for the number of consecutive Trues.  Since we update status via a helper, we can set that count without having to touch every control loop.

Our union of conditions can then require a count of at least X.  I think we should start by doing this for the Failing condition only.

Comment 1 David Eads 2019-04-09 18:01:07 UTC

Michal pointed out that we can simply do this based on how long we've been failing at the lower level.  That makes just as much sense and is easier and more consistent.

Comment 2 Luis Sanchez 2019-04-11 19:44:26 UTC

PR: https://github.com/openshift/library-go/pull/338

Comment 3 Michal Fojtik 2019-04-16 08:59:51 UTC

This merged, moving to QA.

Comment 4 Xingxing Xia 2019-04-22 06:45:08 UTC

Taking co/openshift-apiserver as an example, I tried this in a 4.1.0-0.nightly-2019-04-20-080532 environment:
In terminal A, run `watch -n 1 oc get co openshift-apiserver` to monitor the "AVAILABLE   PROGRESSING   FAILING" columns.
In another terminal B, run `while true; do oc delete ds/apiserver -n openshift-apiserver; done`.
Then, watching terminal A, I found that "AVAILABLE" changed from True to False _immediately_ (within 1 second):
NAME                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
openshift-apiserver   4.1.0-0.nightly-2019-04-20-080532   False       True                    2s

Per https://github.com/openshift/library-go/pull/338, it seems the status should only change from True to False after 1 minute rather than _immediately_ (within 1 second)?

BTW, the FAILING column is empty for some COs, as shown below (it was not empty in prior payloads). Is this expected?
oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
cloud-credential                     4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
cluster-autoscaler                   4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
console                              4.1.0-0.nightly-2019-04-20-080532   True        True          True      21h
dns                                  4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
image-registry                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
ingress                              4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
kube-apiserver                       4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
kube-controller-manager              4.1.0-0.nightly-2019-04-20-080532   True        False                   21h
kube-scheduler                       4.1.0-0.nightly-2019-04-20-080532   True        False                   21h
machine-api                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
machine-config                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
marketplace                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
monitoring                           4.1.0-0.nightly-2019-04-20-080532   False       True          True      2m50s
network                              4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
node-tuning                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
openshift-apiserver                  4.1.0-0.nightly-2019-04-20-080532   False       False                   4m11s
openshift-controller-manager         4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
openshift-samples                    4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
operator-lifecycle-manager           4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
operator-lifecycle-manager-catalog   4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
service-ca                           4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
service-catalog-apiserver            4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
service-catalog-controller-manager   4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
storage                              4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h

Comment 5 Luis Sanchez 2019-04-22 16:43:44 UTC

@xxia, only "Failing" status is affected by the PR.

Comment 6 Xingxing Xia 2019-04-23 01:59:42 UTC

(In reply to Luis Sanchez from comment #5)
> @xxia, only "Failing" status is affected by the PR.

I intended to check Failing per the PR code, but as the result above shows, the 6 clusteroperators kube-apiserver, kube-controller-manager, kube-scheduler, network, openshift-apiserver and openshift-controller-manager don't have a "Failing" status in their YAML, and they show an empty "Failing" in `oc get`. Is this expected, or should it be fixed before verifying this bug?

(In reply to Xingxing Xia from comment #4)
> BTW, some COs' FAILING is empty as below (they were not empty in prior
> payloads), is it expected?

Comment 7 Luis Sanchez 2019-04-23 18:08:30 UTC

@xxia The blanks are not related to this bug. 
"FAILING" status is being changed to "DEGRADED" status. Use a newer version of oc to see a "DEGRADED" column instead of "FAILING" (and the blanks will be "flipped"). 
The ClusterOperators which have a blank "FAILING" status have already migrated to reporting "DEGRADED" status instead.

Comment 8 Xingxing Xia 2019-04-24 09:32:11 UTC

Yes, I tested it in a 4.1.0-0.nightly-2019-04-24-014305 environment, and DEGRADED is now shown:
oc get co openshift-apiserver kube-apiserver kube-controller-manager
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver       4.1.0-0.nightly-2019-04-24-014305   True        False         False      2m
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver            4.1.0-0.nightly-2019-04-24-014305   True        False         False      3h26m
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-controller-manager   4.1.0-0.nightly-2019-04-24-014305   True        False         False      3h25m

But I then tried the verification steps from comment 4:
In terminal A, run `watch -n 1 oc get co openshift-apiserver` to monitor the output columns.
In another terminal B, run `while true; do oc delete ds/apiserver -n openshift-apiserver; done`.
Then, watching terminal A, I found that DEGRADED never changed from False to True, even after 2 minutes had elapsed (BTW, "AVAILABLE" still changed from True to False immediately, within 1 second):
oc get co openshift-apiserver
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.1.0-0.nightly-2019-04-24-014305   False       True          False      2m

Are the verification steps right, or does the fix still have a problem?

Comment 9 Luis Sanchez 2019-04-24 20:11:26 UTC

@xxia deleting ds/apiserver will not result in DEGRADED=true if the resource is simply re-created (by the operator) successfully.

Edit APIServer/cluster to have more than 10 named certificates (the referenced secrets don't need to exist), for example:

spec:
  servingCerts:
    namedCertificates:
      - servingCertificate:
          name: s01
      - servingCertificate:
          name: s02
      - servingCertificate:
          name: s03
      - servingCertificate:
          name: s04
      - servingCertificate:
          name: s05
      - servingCertificate:
          name: s06
      - servingCertificate:
          name: s07
      - servingCertificate:
          name: s08
      - servingCertificate:
          name: s09
      - servingCertificate:
          name: s10
      - servingCertificate:
          name: s11


You can watch for events: oc -n openshift-kube-apiserver-operator get events -w
And you should see the kube-apiserver clusteroperator report Degraded=true after one minute.

Comment 10 Xingxing Xia 2019-04-25 05:41:39 UTC

Luis, thank you for your kind help. Yes, I now get the expected result following your suggestion. Verified in 4.1.0-0.nightly-2019-04-25-002910.

Comment 12 errata-xmlrpc 2019-06-04 10:47:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
