From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1425688122829574144: ``` Aug 12 06:20:57.082 - 25s E clusteroperator/openshift-samples condition/Available status/False reason/ 1 tests failed during this blip (2021-08-12 06:20:57.082878042 +0000 UTC to 2021-08-12 06:20:57.082878042 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] ``` From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1425688122829574144/build-log.txt: ``` openshift-samples <new> Samples installation successful at 4.9.0-0.ci-2021-08-12-051754 Aug 12 06:20:56.994 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-webserver54-openjdk11-tomcat9-openshift-ubi8": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-webserver54-openjdk11-tomcat9-openshift-ubi8]; Aug 12 06:20:56.994 - 25s E clusteroperator/openshift-samples condition/Degraded status/True reason/Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-webserver54-openjdk11-tomcat9-openshift-ubi8": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-webserver54-openjdk11-tomcat9-openshift-ubi8]; Aug 12 06:20:57.082 E clusteroperator/openshift-samples condition/Available status/False changed: Aug 12 06:20:57.082 - 25s E clusteroperator/openshift-samples condition/Available status/False reason/ Aug 12 06:20:57.129 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-11-220110: APIServerConflictError Aug 12 06:20:57.129 - 25s W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-11-220110: APIServerConflictError Aug 12 06:21:22.319 W clusteroperator/openshift-samples condition/Degraded status/False changed: Aug 12 06:21:22.345 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-12-051754 Aug 12 06:21:22.345 I clusteroperator/openshift-samples versions: operator 4.9.0-0.ci-2021-08-11-220110 -> 4.9.0-0.ci-2021-08-12-051754 Aug 12 06:21:22.384 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-12-051754 [bz-Samples] clusteroperator/openshift-samples should not change condition/Available [bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded [bz-Samples] clusteroperator/openshift-samples should not change condition/Available [bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded ``` The samples operator changed the state to Degraded because of APIServerConflictError. The object gets eventually updated and the conditions go back to Degraded=False and Available=True. Based on https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221628632800000%22,%221628805599000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade%2F1425688122829574144%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22cluster-samples-operator-57fc848467-gbsnk%5C%22%22%7D%5D: ``` time="2021-08-12T06:20:56Z" level=info msg="updated imagestream jboss-webserver54-openjdk11-tomcat9-openshift-ubi8" time="2021-08-12T06:20:56Z" level=info msg="Updating imagestreamtag-to-image configmap to version 4.9.0-0.ci-2021-08-11-220110" ... time="2021-08-12T06:20:57Z" level=info msg="ENTERING UPSERT / STEADY STATE PATH ExistTrue false ImageInProgressFalse true VersionOK true ConfigChanged false ManagementStateChanged true" ... (no unique labels) time="2021-08-12T06:20:58Z" level=info msg="There are no more errors or image imports in flight for imagestream jboss-webserver54-openjdk11-tomcat9-openshift-ubi8" ``` Which signifies the imagestream stream got updated ~1s after the APIServerConflictError. If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for few seconds/iterations and check again before it switches to Degraded=True. Checking other jobs: # https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1427113993536802816 ``` Aug 16 04:45:43.612 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap72-openshift": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-eap72-openshift]; Aug 16 04:45:43.612 - 25s E clusteroperator/openshift-samples condition/Degraded status/True reason/Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap72-openshift": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-eap72-openshift]; Aug 16 04:45:43.764 E clusteroperator/openshift-samples condition/Available status/False changed: Aug 16 04:45:43.764 - 25s E clusteroperator/openshift-samples condition/Available status/False reason/ Aug 16 04:45:43.943 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-15-234811: APIServerConflictError Aug 16 04:45:43.943 - 25s W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-15-234811: APIServerConflictError Aug 16 04:46:09.539 W clusteroperator/openshift-samples condition/Degraded status/False changed: Aug 16 04:46:09.567 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-16-033516 Aug 16 04:46:09.567 I clusteroperator/openshift-samples versions: operator 4.9.0-0.ci-2021-08-15-234811 -> 4.9.0-0.ci-2021-08-16-033516 Aug 16 04:46:09.588 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-16-033516 ``` # https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/1426947950688342016 ``` Aug 15 18:22:14.640 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap73-runtime-openshift": the object has been modified; please apply your changes to the latest version and try again; Aug 15 18:22:14.640 - 13s E clusteroperator/openshift-samples condition/Degraded status/True reason/error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap73-runtime-openshift": the object has been modified; please apply your changes to the latest version and try again; Aug 15 18:22:14.900 E clusteroperator/openshift-samples condition/Available status/False changed: Aug 15 18:22:14.900 - 12s E clusteroperator/openshift-samples condition/Available status/False reason/ Aug 15 18:22:15.042 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-15-163830: APIServerConflictError Aug 15 18:22:15.042 - 12s W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-15-163830: APIServerConflictError Aug 15 18:22:27.731 W clusteroperator/openshift-samples condition/Degraded status/False changed: Aug 15 18:22:27.837 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-15-163830 Aug 15 18:22:27.837 I clusteroperator/openshift-samples versions: operator 4.8.5 -> 4.9.0-0.ci-2021-08-15-163830 Aug 15 18:22:27.872 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-15-163830 ``` Other occurrences in https://search.ci.openshift.org/?search=APIServerConflictError&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Thanks for the detailed triage on this Jan. Saved me some cycles. That said, I'm not 100% sure I agree with the course of action of essentially batching a bit on encountering errors like this, in case we recover from them, when dealing with marking degraded. Before I make that stance "official", I'll spend some cycles revisiting the API doc and CVO repo doc for guidance, and to see if there is some precedence for such things. If you have some quick pointers around official guidance for doing such things, please post some links if you have the chance. I've also cc:ed some folks who have collaborated with me on samples in the past, to see if they are aligned with my initial take on this, or are leaning toward what you are proscribing.
> If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for few seconds/iterations and check again before it switches to Degraded=True. I did not follow any official documentation. My goal is to minimize the occurrence of Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve this. I am debugging other operators which switch into Degraded state as well to categorize all the instances to find something in common which we might either use to extend the logic to allow "short" changes into Degraded state to be ignored or as an argument to tune our synthetic tests to ignore these cases.
Relevant API docs [1]: Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service. The period of time may vary by component, but a Degraded state represents persistent observation of a condition. As a result, a component should not oscillate in and out of Degraded state. "25s of modification conflicts" doesn't seems like a QoS degradation big enough to be worth complaining about to me. For Available=False, API docs are [2]. I suspect we don't want to go Available=False on short modification conflicts either, but I'm just guessing that's the reason, and I definitely consider it a bug to go Available=False without setting a reason and message (per the build-log excerpts, it appears the operator is setting neither reason nor message for this case today). [1]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L161 [2]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L146-L150
Thanks for the various refs Trevor So minimally shore up as needed wrt Available=False and reason/message field. And I see my opinion shifting on the "batch/Degraded" thread. Unless they post here specifically beforehand, I'll reach out to the samples stakeholders I've cc:ed here for consensus, and we'll go from there.
I am +1 on being less aggressive when going degraded (e.g. by retrying over some period of time before marking ourselves degraded). Especially for something like this(updating a sample imagestream) where the user impact is minimal. As Jan noted, this is something we're pursuing across all our operators as right now we are generating too much "noise", especially during cluster upgrades, that make admins think the product is not stable when in reality everything is fine.
(In reply to Jan Chaloupka from comment #2) > > If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for few seconds/iterations and check again before it switches to Degraded=True. > > I did not follow any official documentation. My goal is to minimize the > occurrence of Degraded state in the operator. If what I suggest is not the > right way to go, let's keep discussing this to find the proper way to > resolve this. I am debugging other operators which switch into Degraded > state as well to categorize all the instances to find something in common > which we might either use to extend the logic to allow "short" changes into > Degraded state to be ignored or as an argument to tune our synthetic tests > to ignore these cases. Likewise Jan thanks for the additional context.
@Jitendar - for verification, CI searches of the 4.9/master branch e2e upgrade jobs, like noted in this BZs description, for the string clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError for a few days to a week after the this change shows up in the associated payloads, is sufficient verification. If you need help performing such CI searches, let me know. Thanks.
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job There are lots of fail such as "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io **" In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade/1434237913092067328 [bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded passed Hi Gabe, as above info, we could mark it verified, right?
Not sure if you have a typo with " we could mark it verified, right?" XiuJuan, but your CI search results show I missed a spot with my fix for this, and we should *NOT* mark this verified. https://github.com/openshift/cluster-samples-operator/blob/9d2c950cc66d7edd4a7aa0ea186b6fed33a8c7ba/pkg/stub/handler.go#L913-L921 Moving this back to assigned state, will have a new PR up shortly.
Note, the final 4.9 freeze just passed. Given this is a medium severity bug, I believe this won't meet the stop ship level of scrutiny 4.9 is now under. So I'm changing the target to 4.10.
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job There are still some samples degarade on reason APIServerConflictError.
Actually XiuJuan, when I look at that query in more details, the conflicts only show up in when the upgrades are dealing with 4.9, 4.8, 4.7 There was one job that had a final landing spot of 4.10, after upgrades from 4.7, 4.8, 4.9, but those messages only occurred when we were at the older levels. https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-from-stable-4.8-e2e-aws-upgrade/1437137590653292544/build-log.txt was such an instance When I look at the final 4.10 logs / status, it is clean. Marking this verified, we are OK here for 4.10. Given severity, I'm not going to backport this at this time, but of course, we can always reassess that as we move forward.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759