Bug 1993840
Summary: | openshift-samples should not change condition Degraded/Available (upgrades) | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup>
Component: | Samples | Assignee: | Gabe Montero <gmontero>
Status: | CLOSED ERRATA | QA Contact: | Jitendar Singh <jitsingh>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.9 | CC: | adam.kaplan, aos-bugs, dperaza, gmontero, shbose, wking, xiuwang
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
* Before this update, if the Cluster Samples Operator encountered an `APIServerConflictError`, it reported `openshift-samples` as having `Degraded` status until it recovered. Momentary errors of this type are not unusual during upgrades, and they caused undue concern for administrators monitoring the Operator status. With this update, if the Operator encounters a momentary error, it no longer marks `openshift-samples` as `Degraded` and instead retries the connection to the API server. Momentary shifts to `Degraded` status should no longer occur. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1993840[BZ#1993840])
|
Story Points: | --- | |
Clone Of: | | Environment: |
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade=all
Last Closed: | 2021-10-18 17:46:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Jan Chaloupka, 2021-08-16 09:00:09 UTC
Thanks for the detailed triage on this, Jan. Saved me some cycles. That said, I'm not 100% sure I agree with the course of action of essentially batching a bit on encountering errors like this, in case we recover from them, when dealing with marking degraded. Before I make that stance "official", I'll spend some cycles revisiting the API doc and CVO repo doc for guidance, and to see if there is some precedent for such things. If you have some quick pointers to official guidance for doing such things, please post some links if you have the chance. I've also cc:ed some folks who have collaborated with me on samples in the past, to see if they are aligned with my initial take on this, or are leaning toward what you are prescribing.

> If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.
I did not follow any official documentation. My goal is to minimize the occurrence of the Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve it. I am also debugging other operators that switch into the Degraded state, to categorize all the instances and find something in common which we might either use to extend the logic to allow "short" changes into the Degraded state to be ignored, or use as an argument to tune our synthetic tests to ignore these cases.
Relevant API docs [1]:

> Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service. The period of time may vary by component, but a Degraded state represents persistent observation of a condition. As a result, a component should not oscillate in and out of Degraded state.

"25s of modification conflicts" doesn't seem like a QoS degradation big enough to be worth complaining about, to me. For Available=False, the API docs are [2]. I suspect we don't want to go Available=False on short modification conflicts either, but I'm just guessing that's the reason, and I definitely consider it a bug to go Available=False without setting a reason and message (per the build-log excerpts, it appears the operator is setting neither reason nor message for this case today).

[1]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L161
[2]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L146-L150

Thanks for the various refs, Trevor. So minimally we shore up as needed wrt Available=False and the reason/message fields. And I see my opinion shifting on the "batch/Degraded" thread. Unless they post here specifically beforehand, I'll reach out to the samples stakeholders I've cc:ed here for consensus, and we'll go from there.

I am +1 on being less aggressive when going degraded (e.g. by retrying over some period of time before marking ourselves degraded), especially for something like this (updating a sample imagestream) where the user impact is minimal. As Jan noted, this is something we're pursuing across all our operators, as right now we are generating too much "noise", especially during cluster upgrades, that makes admins think the product is not stable when in reality everything is fine.
(In reply to Jan Chaloupka from comment #2)

> > If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.
>
> I did not follow any official documentation. My goal is to minimize the occurrence of the Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve this. I am debugging other operators which switch into the Degraded state as well, to categorize all the instances to find something in common which we might either use to extend the logic to allow "short" changes into the Degraded state to be ignored, or as an argument to tune our synthetic tests to ignore these cases.

Likewise Jan, thanks for the additional context.

@Jitendar - for verification, running CI searches of the 4.9/master branch e2e upgrade jobs, like those noted in this BZ's description, for the string `clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError` for a few days to a week after this change shows up in the associated payloads, is sufficient verification. If you need help performing such CI searches, let me know. Thanks.
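For reproducibility, the CI search described above can be built programmatically rather than hand-encoding the query. A minimal Go sketch (the helper name `samplesDegradedSearchURL` and the parameter subset are choices made here, mirroring the search.ci.openshift.org links pasted elsewhere in this bug):

```go
package main

import (
	"fmt"
	"net/url"
)

// samplesDegradedSearchURL builds a search.ci.openshift.org query for the
// Degraded/APIServerConflictError condition string used to verify this fix.
// (Hypothetical helper; only a subset of the site's query parameters is set.)
func samplesDegradedSearchURL() string {
	q := url.Values{}
	q.Set("search", "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError")
	q.Set("maxAge", "48h")
	q.Set("type", "build-log")
	u := url.URL{
		Scheme:   "https",
		Host:     "search.ci.openshift.org",
		Path:     "/",
		RawQuery: q.Encode(), // percent-encodes '/' and joins words with '+'
	}
	return u.String()
}

func main() {
	fmt.Println(samplesDegradedSearchURL())
}
```

`url.Values.Encode` handles the `%2F` escaping seen in the links in this bug, so the condition string can be edited in plain text when searching for other operators or reasons.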
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

There are lots of failures such as "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io **". In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade/1434237913092067328, "[bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded" passed. Hi Gabe, given the above info, we could mark it verified, right?

Not sure if you have a typo with "we could mark it verified, right?" XiuJuan, but your CI search results show I missed a spot with my fix for this, and we should *NOT* mark this verified: https://github.com/openshift/cluster-samples-operator/blob/9d2c950cc66d7edd4a7aa0ea186b6fed33a8c7ba/pkg/stub/handler.go#L913-L921

Moving this back to the assigned state; will have a new PR up shortly.

Note, the final 4.9 freeze just passed. Given this is a medium severity bug, I believe it won't meet the stop-ship level of scrutiny 4.9 is now under, so I'm changing the target to 4.10.

https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

There are still some samples Degraded occurrences with reason APIServerConflictError.

Actually XiuJuan, when I look at that query in more detail, the conflicts only show up when the upgrades are dealing with 4.9, 4.8, or 4.7. There was one job that had a final landing spot of 4.10, after upgrades from 4.7, 4.8, and 4.9, but those messages only occurred when we were at the older levels.
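The "Operation cannot be fulfilled on imagestreams.image.openshift.io" messages above are standard Kubernetes optimistic-concurrency conflicts (HTTP 409): an update was submitted with a stale resourceVersion. The usual remedy is to re-read and retry, which client-go packages as `retry.RetryOnConflict`. A dependency-free sketch of that pattern (the names `retryOnConflict`, `isConflict`, and `errConflict` are invented stand-ins for the client-go/apimachinery equivalents):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the 409 Conflict errors returned by the API
// server ("Operation cannot be fulfilled on imagestreams.image.openshift.io").
var errConflict = errors.New("operation cannot be fulfilled: object has been modified")

// isConflict stands in for apierrors.IsConflict from k8s.io/apimachinery.
func isConflict(err error) bool { return errors.Is(err, errConflict) }

// retryOnConflict re-runs fn up to attempts times while it keeps failing
// with a conflict, mirroring client-go's retry.RetryOnConflict helper.
// Non-conflict errors are returned immediately.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || !isConflict(err) {
			return err
		}
	}
	return err // still conflicting after all attempts
}

func main() {
	calls := 0
	err := retryOnConflict(5, func() error {
		calls++
		if calls < 3 {
			return errConflict // simulated stale-resourceVersion update
		}
		return nil // simulated successful update after a fresh read
	})
	fmt.Printf("calls=%d err=%v\n", calls, err) // prints calls=3 err=<nil>
}
```

Combined with a grace period before reporting Degraded, transient conflicts like these resolve themselves on the next sync without ever surfacing in the ClusterOperator conditions.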
https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-from-stable-4.8-e2e-aws-upgrade/1437137590653292544/build-log.txt was such an instance.

When I look at the final 4.10 logs/status, it is clean. Marking this verified; we are OK here for 4.10.

Given the severity, I'm not going to backport this at this time, but of course we can always reassess that as we move forward.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759