To avoid reporting our operator as failing too early, we can add a `count` to the failing conditions on our low-level status that tracks the number of consecutive True evaluations. Since status updates go through a helper, we can maintain that count without having to touch every control loop. The union of conditions can then require a count of at least X before reporting failure. I think we should do this just for the failing condition, to start.
Michal pointed out that we can simply do this based on how long we have been failing at the lower level. That makes just as much sense and is easier and more consistent.
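A minimal sketch of that time-based idea, assuming a hypothetical helper (the type and function names below are illustrative, not the actual library-go API): the top-level condition only reports failure once the low-level failing condition has been True continuously for longer than a grace period.

package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for an operator status condition;
// the names here are illustrative, not library-go's types.
type condition struct {
	status             string // "True" or "False"
	lastTransitionTime time.Time
}

// failingWithInertia reports "True" only once the low-level condition has
// been failing continuously for at least minFailing, so short-lived errors
// never flip the top-level status.
func failingWithInertia(c condition, now time.Time, minFailing time.Duration) string {
	if c.status != "True" {
		return "False"
	}
	if now.Sub(c.lastTransitionTime) < minFailing {
		return "False" // failing, but not long enough to surface yet
	}
	return "True"
}

func main() {
	now := time.Now()
	briefFailure := condition{status: "True", lastTransitionTime: now.Add(-30 * time.Second)}
	longFailure := condition{status: "True", lastTransitionTime: now.Add(-2 * time.Minute)}

	fmt.Println(failingWithInertia(briefFailure, now, time.Minute)) // False
	fmt.Println(failingWithInertia(longFailure, now, time.Minute))  // True
}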
PR: https://github.com/openshift/library-go/pull/338
This merged, moving to QA.
Take the co/openshift-apiserver example, tried in a 4.1.0-0.nightly-2019-04-20-080532 env:

In terminal A, run `watch -n 1 oc get co openshift-apiserver` to monitor the AVAILABLE, PROGRESSING and FAILING columns. In another terminal B, run `while true; do oc delete ds/apiserver -n openshift-apiserver; done`.

Observing terminal A, AVAILABLE changed from True to False _immediately_ (within 1 second):

NAME                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
openshift-apiserver   4.1.0-0.nightly-2019-04-20-080532   False       True                    2s

Per https://github.com/openshift/library-go/pull/338 , it seems it should only change from True to False after 1 minute instead of _immediately_ (within 1 second)?

BTW, some COs' FAILING column is empty as below (they were not empty in prior payloads); is that expected?

oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
cloud-credential                     4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
cluster-autoscaler                   4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
console                              4.1.0-0.nightly-2019-04-20-080532   True        True          True      21h
dns                                  4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
image-registry                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
ingress                              4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
kube-apiserver                       4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
kube-controller-manager              4.1.0-0.nightly-2019-04-20-080532   True        False                   21h
kube-scheduler                       4.1.0-0.nightly-2019-04-20-080532   True        False                   21h
machine-api                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
machine-config                       4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
marketplace                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
monitoring                           4.1.0-0.nightly-2019-04-20-080532   False       True          True      2m50s
network                              4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
node-tuning                          4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
openshift-apiserver                  4.1.0-0.nightly-2019-04-20-080532   False       False                   4m11s
openshift-controller-manager         4.1.0-0.nightly-2019-04-20-080532   True        False                   22h
openshift-samples                    4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
operator-lifecycle-manager           4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
operator-lifecycle-manager-catalog   4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
service-ca                           4.1.0-0.nightly-2019-04-20-080532   True        False         False     22h
service-catalog-apiserver            4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
service-catalog-controller-manager   4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
storage                              4.1.0-0.nightly-2019-04-20-080532   True        False         False     21h
@xxia, only "Failing" status is affected by the PR.
(In reply to Luis Sanchez from comment #5)
> @xxia, only "Failing" status is affected by the PR.

I intended to check Failing per the PR code, but as the result above shows, six clusteroperators (kube-apiserver, kube-controller-manager, kube-scheduler, network, openshift-apiserver and openshift-controller-manager) don't have a "Failing" status in their YAML; they show an empty "Failing" column in `oc get`. Is this expected, or should it be fixed before verifying this bug?

(In reply to Xingxing Xia from comment #4)
> BTW, some COs' FAILING is empty as below (they were not empty in prior
> payloads), is it expected?
@xxia The blanks are not related to this bug. "FAILING" status is being changed to "DEGRADED" status. Use a newer version of oc to see a "DEGRADED" column instead of "FAILING" (and the blanks will be "flipped"). The ClusterOperators which have a blank "FAILING" status have already migrated to reporting "DEGRADED" status instead.
Yes, tested in a 4.1.0-0.nightly-2019-04-24-014305 env; DEGRADED is now shown:

oc get co openshift-apiserver kube-apiserver kube-controller-manager
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver       4.1.0-0.nightly-2019-04-24-014305   True        False         False      2m
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver            4.1.0-0.nightly-2019-04-24-014305   True        False         False      3h26m
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-controller-manager   4.1.0-0.nightly-2019-04-24-014305   True        False         False      3h25m

But I tried the verification steps from comment 4: in terminal A, run `watch -n 1 oc get co openshift-apiserver` to monitor the output columns; in another terminal B, run `while true; do oc delete ds/apiserver -n openshift-apiserver; done`. Observing terminal A, DEGRADED never changed from False to True, even after 2 minutes elapsed (BTW, AVAILABLE still changed from True to False immediately, within 1 second):

oc get co openshift-apiserver
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.1.0-0.nightly-2019-04-24-014305   False       True          False      2m

Are the verification steps right, or does the fix still have a problem?
@xxia deleting ds/apiserver will not result in DEGRADED=true if the resource is simply re-created (by the operator) successfully. Edit APIServer/cluster to have more than 10 named certificates (the referenced secrets don't need to exist), for example:

spec:
  servingCerts:
    namedCertificates:
    - servingCertificate:
        name: s01
    - servingCertificate:
        name: s02
    - servingCertificate:
        name: s03
    - servingCertificate:
        name: s04
    - servingCertificate:
        name: s05
    - servingCertificate:
        name: s06
    - servingCertificate:
        name: s07
    - servingCertificate:
        name: s08
    - servingCertificate:
        name: s09
    - servingCertificate:
        name: s10
    - servingCertificate:
        name: s11

You can watch for events:

oc -n openshift-kube-apiserver-operator get events -w

And you should see the kube-apiserver clusteroperator report Degraded=true after one minute.
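For context, a minimal, illustrative sketch (not the operator's actual code) of why this reproduction works: a sync-time validation keeps failing while the config lists more named certificates than the limit, and the time-based logic described earlier only surfaces that as Degraded once the failure has persisted for a minute. The 10-certificate limit is taken from the suggestion above; the function names are hypothetical.

package main

import (
	"fmt"
	"time"
)

// maxNamedCertificates mirrors the limit referenced above; illustrative only.
const maxNamedCertificates = 10

// validateNamedCertificates mimics a per-sync check that fails persistently
// while the APIServer config lists too many named certificates.
func validateNamedCertificates(names []string) error {
	if len(names) > maxNamedCertificates {
		return fmt.Errorf("too many named certificates: %d > %d", len(names), maxNamedCertificates)
	}
	return nil
}

func main() {
	certs := []string{"s01", "s02", "s03", "s04", "s05", "s06", "s07", "s08", "s09", "s10", "s11"}

	// Pretend the validation has already been failing for 90 seconds.
	firstFailure := time.Now().Add(-90 * time.Second)

	if err := validateNamedCertificates(certs); err != nil {
		// The low-level failing condition is True right away, but Degraded
		// only surfaces once the failure has persisted past the grace period.
		degraded := time.Since(firstFailure) >= time.Minute
		fmt.Printf("%v (Degraded=%v)\n", err, degraded)
	}
}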
Luis, thank you for kindly helping. Yes, I got the expected result per your suggestion. Verified in 4.1.0-0.nightly-2019-04-25-002910.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758