Bug 1993840 - openshift-samples should not change condition Degraded/Available (upgrades)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Gabe Montero
QA Contact: Jitendar Singh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-16 09:00 UTC by Jan Chaloupka
Modified: 2022-02-15 22:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Cluster Samples Operator encountered an `APIServerConflictError` error, it reported `sample-operator` as having `Degraded status` until it recovered. Momentary errors of this type aren't unusual during upgrades and were causing undue concern for administrators monitoring the Operator status. With this update, if the Operator encounters a momentary error, it no longer indicates `openshift-samples` as having `Degraded status` and retries to connect to the API server. Momentary shifts to `Degraded status` should no longer occur. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1993840[BZ#1993840])
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade=all
Last Closed: 2021-10-18 17:46:20 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 387 0 None None None 2021-08-23 15:46:46 UTC
Github openshift cluster-samples-operator pull 391 0 None None None 2021-09-07 13:54:48 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:46:34 UTC

Description Jan Chaloupka 2021-08-16 09:00:09 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1425688122829574144:
```
Aug 12 06:20:57.082 - 25s   E clusteroperator/openshift-samples condition/Available status/False reason/
1 tests failed during this blip (2021-08-12 06:20:57.082878042 +0000 UTC to 2021-08-12 06:20:57.082878042 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1425688122829574144/build-log.txt:
```
openshift-samples                              <new>   Samples installation successful at 4.9.0-0.ci-2021-08-12-051754
Aug 12 06:20:56.994 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-webserver54-openjdk11-tomcat9-openshift-ubi8": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-webserver54-openjdk11-tomcat9-openshift-ubi8];
Aug 12 06:20:56.994 - 25s   E clusteroperator/openshift-samples condition/Degraded status/True reason/Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-webserver54-openjdk11-tomcat9-openshift-ubi8": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-webserver54-openjdk11-tomcat9-openshift-ubi8];
Aug 12 06:20:57.082 E clusteroperator/openshift-samples condition/Available status/False changed: 
Aug 12 06:20:57.082 - 25s   E clusteroperator/openshift-samples condition/Available status/False reason/
Aug 12 06:20:57.129 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-11-220110: APIServerConflictError
Aug 12 06:20:57.129 - 25s   W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-11-220110: APIServerConflictError
Aug 12 06:21:22.319 W clusteroperator/openshift-samples condition/Degraded status/False changed: 
Aug 12 06:21:22.345 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-12-051754
Aug 12 06:21:22.345 I clusteroperator/openshift-samples versions: operator 4.9.0-0.ci-2021-08-11-220110 -> 4.9.0-0.ci-2021-08-12-051754
Aug 12 06:21:22.384 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-12-051754
[bz-Samples] clusteroperator/openshift-samples should not change condition/Available
[bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded
[bz-Samples] clusteroperator/openshift-samples should not change condition/Available
[bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded
```

The samples operator changed the state to Degraded because of APIServerConflictError. The object gets eventually updated and the conditions go back to Degraded=False and Available=True.
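
To quantify blips like the one above, the `changed:` events in the build log can be paired up per condition. The following is a minimal sketch, not part of the CI tooling: the regular expression, the `condition_blips` helper, the abbreviated sample lines, and the assumed year (the log timestamps omit it) are all illustrative.

```python
import re
from datetime import datetime

# Matches point-in-time "changed:" events. Interval lines ("- 25s ...") do not
# match because the timestamp is not immediately followed by the level letter.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) [EWI] "
    r"clusteroperator/(?P<op>\S+) condition/(?P<cond>\w+) "
    r"status/(?P<status>True|False)\b.* changed:"
)


def condition_blips(lines, condition, bad_status, year=2021):
    """Return (start, end, seconds) for each span the condition held bad_status."""
    blips, start = [], None
    for line in lines:
        m = EVENT_RE.match(line)
        if not m or m["cond"] != condition:
            continue
        # The year is an assumption; the log lines only carry month and day.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        if m["status"] == bad_status and start is None:
            start = ts
        elif m["status"] != bad_status and start is not None:
            blips.append((start, ts, (ts - start).total_seconds()))
            start = None
    return blips


# Abbreviated copies of lines from the excerpt above (messages shortened).
LOG = [
    "Aug 12 06:20:56.994 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled",
    "Aug 12 06:20:56.994 - 25s   E clusteroperator/openshift-samples condition/Degraded status/True reason/Operation cannot be fulfilled",
    "Aug 12 06:20:57.082 E clusteroperator/openshift-samples condition/Available status/False changed:",
    "Aug 12 06:21:22.319 W clusteroperator/openshift-samples condition/Degraded status/False changed:",
    "Aug 12 06:21:22.345 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful",
]

for cond, bad in (("Degraded", "True"), ("Available", "False")):
    for start, end, seconds in condition_blips(LOG, cond, bad):
        print(f"{cond}={bad} for {seconds:.3f}s")
```

Pairing only the point-in-time events keeps the sketch independent of the monitor's own rounded interval lines.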

Based on https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221628632800000%22,%221628805599000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade%2F1425688122829574144%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22cluster-samples-operator-57fc848467-gbsnk%5C%22%22%7D%5D:

```
time="2021-08-12T06:20:56Z" level=info msg="updated imagestream jboss-webserver54-openjdk11-tomcat9-openshift-ubi8"
time="2021-08-12T06:20:56Z" level=info msg="Updating imagestreamtag-to-image configmap to version 4.9.0-0.ci-2021-08-11-220110"
...
time="2021-08-12T06:20:57Z" level=info msg="ENTERING UPSERT / STEADY STATE PATH ExistTrue false ImageInProgressFalse true VersionOK true ConfigChanged false ManagementStateChanged true"
...
(no unique labels)
time="2021-08-12T06:20:58Z" level=info msg="There are no more errors or image imports in flight for imagestream jboss-webserver54-openjdk11-tomcat9-openshift-ubi8"
```

This indicates that the imagestream was updated ~1s after the APIServerConflictError.

If that's the case, it is not practical to have the operator switch to Degraded right away when there's an APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.
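
The "check again for a few iterations" idea can be sketched as a retry loop around the conflicting update. This is only an illustration and not the change made in the operator: `ConflictError`, `update_with_retry`, and `FakeStore` are hypothetical names, and the attempt count and delay are arbitrary. It is similar in spirit to re-reading the object and re-applying the change after an "object has been modified" response.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an 'object has been modified' API error."""


def update_with_retry(fetch, mutate, update, attempts=5, delay=0.5, sleep=time.sleep):
    """Re-read the latest object and re-apply the change when the update conflicts.

    Returns (True, None) on success, or (False, last_error) when every attempt
    conflicted; only in the second case would the caller report Degraded=True.
    """
    last_error = None
    for attempt in range(attempts):
        obj = fetch()          # always start from the latest version
        mutate(obj)
        try:
            update(obj)
            return True, None
        except ConflictError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))   # simple exponential backoff
    return False, last_error


class FakeStore:
    """In-memory store that rejects updates made from a stale version."""

    def __init__(self, external_writes=1):
        self.obj = {"resourceVersion": 1, "tag": "old"}
        self.external_writes = external_writes   # simulated concurrent writers

    def fetch(self):
        snapshot = dict(self.obj)
        if self.external_writes:
            # Another client modifies the object right after we read it.
            self.external_writes -= 1
            self.obj["resourceVersion"] += 1
        return snapshot

    def update(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise ConflictError("the object has been modified")
        self.obj = dict(obj, resourceVersion=obj["resourceVersion"] + 1)


store = FakeStore()
ok, err = update_with_retry(
    store.fetch, lambda o: o.update(tag="new"), store.update, sleep=lambda s: None
)
print(ok, store.obj)
```

With one simulated concurrent writer, the first attempt conflicts and the second succeeds, so no error would need to surface in the operator's conditions.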

Checking other jobs:
# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1427113993536802816
```
Aug 16 04:45:43.612 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap72-openshift": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-eap72-openshift];
Aug 16 04:45:43.612 - 25s   E clusteroperator/openshift-samples condition/Degraded status/True reason/Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap72-openshift": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jboss-eap72-openshift];
Aug 16 04:45:43.764 E clusteroperator/openshift-samples condition/Available status/False changed: 
Aug 16 04:45:43.764 - 25s   E clusteroperator/openshift-samples condition/Available status/False reason/
Aug 16 04:45:43.943 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-15-234811: APIServerConflictError
Aug 16 04:45:43.943 - 25s   W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-15-234811: APIServerConflictError
Aug 16 04:46:09.539 W clusteroperator/openshift-samples condition/Degraded status/False changed: 
Aug 16 04:46:09.567 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-16-033516
Aug 16 04:46:09.567 I clusteroperator/openshift-samples versions: operator 4.9.0-0.ci-2021-08-15-234811 -> 4.9.0-0.ci-2021-08-16-033516
Aug 16 04:46:09.588 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-16-033516
```

# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/1426947950688342016
```
Aug 15 18:22:14.640 E clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap73-runtime-openshift": the object has been modified; please apply your changes to the latest version and try again;
Aug 15 18:22:14.640 - 13s   E clusteroperator/openshift-samples condition/Degraded status/True reason/error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io "jboss-eap73-runtime-openshift": the object has been modified; please apply your changes to the latest version and try again;
Aug 15 18:22:14.900 E clusteroperator/openshift-samples condition/Available status/False changed: 
Aug 15 18:22:14.900 - 12s   E clusteroperator/openshift-samples condition/Available status/False reason/
Aug 15 18:22:15.042 W clusteroperator/openshift-samples condition/Progressing status/True changed: Samples installation in error at 4.9.0-0.ci-2021-08-15-163830: APIServerConflictError
Aug 15 18:22:15.042 - 12s   W clusteroperator/openshift-samples condition/Progressing status/True reason/Samples installation in error at 4.9.0-0.ci-2021-08-15-163830: APIServerConflictError
Aug 15 18:22:27.731 W clusteroperator/openshift-samples condition/Degraded status/False changed: 
Aug 15 18:22:27.837 W clusteroperator/openshift-samples condition/Available status/True changed: Samples installation successful at 4.9.0-0.ci-2021-08-15-163830
Aug 15 18:22:27.837 I clusteroperator/openshift-samples versions: operator 4.8.5 -> 4.9.0-0.ci-2021-08-15-163830
Aug 15 18:22:27.872 W clusteroperator/openshift-samples condition/Progressing status/False changed: Samples installation successful at 4.9.0-0.ci-2021-08-15-163830
```

Other occurrences in https://search.ci.openshift.org/?search=APIServerConflictError&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Gabe Montero 2021-08-16 13:14:23 UTC
Thanks for the detailed triage on this, Jan.  Saved me some cycles.

That said, I'm not 100% sure I agree with the proposed course of action, which is essentially to hold off on marking Degraded when we encounter errors like this, in case we recover from them.

Before I make that stance "official", I'll spend some cycles revisiting the API doc and CVO repo doc for guidance, and to see if there is some precedent for such things.

If you have some quick pointers around official guidance for doing such things, please post some links if you have the chance.

I've also cc:ed some folks who have collaborated with me on samples in the past, to see if they are aligned with my initial take on this, or are leaning toward what you are proposing.

Comment 2 Jan Chaloupka 2021-08-16 13:36:51 UTC
> If that's the case, it is not practical to have the operator switch to Degraded right away when there's an APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.

I did not follow any official documentation. My goal is to minimize the occurrence of Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve this. I am debugging other operators which switch into Degraded state as well to categorize all the instances to find something in common which we might either use to extend the logic to allow "short" changes into Degraded state to be ignored or as an argument to tune our synthetic tests to ignore these cases.

Comment 3 W. Trevor King 2021-08-16 16:49:25 UTC
Relevant API docs [1]:

  Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service.  The period of time may vary by component, but a Degraded state represents persistent observation of a condition.  As a result, a component should not oscillate in and out of Degraded state.

"25s of modification conflicts" doesn't seem like a QoS degradation big enough to be worth complaining about to me.

For Available=False, API docs are [2].  I suspect we don't want to go Available=False on short modification conflicts either, but I'm just guessing that's the reason, and I definitely consider it a bug to go Available=False without setting a reason and message (per the build-log excerpts, it appears the operator is setting neither reason nor message for this case today).
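
One way to read "persistent observation of a condition" is as a time threshold: an error only counts once it has been seen continuously for some period. The sketch below illustrates that reading; the `DegradedTracker` class and the 120-second threshold are assumptions for illustration, not values taken from the API docs or the operator.

```python
class DegradedTracker:
    """Report Degraded only once an error has persisted for `threshold` seconds."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.first_error_at = None   # time the current error streak began

    def observe(self, now, error):
        """Record one sync result; return True when Degraded should be reported."""
        if error is None:
            self.first_error_at = None   # recovery resets the streak
            return False
        if self.first_error_at is None:
            self.first_error_at = now
        return now - self.first_error_at >= self.threshold


# Hypothetical timeline in seconds: a conflict that clears after 25s.
tracker = DegradedTracker(threshold=120)
for now, error in [(0, "APIServerConflictError"), (25, None)]:
    print(now, tracker.observe(now, error))
```

Under this scheme a 25-second conflict never reaches the threshold, so the condition would not oscillate; the same tracker could be applied to Available=False.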

[1]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L161
[2]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L146-L150
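
The point about a missing reason and message can be checked mechanically. As a rough sketch, assuming conditions are represented as dictionaries with the `type`, `status`, `reason`, and `message` fields of a ClusterOperator status condition, a hypothetical helper `missing_explanations` could flag abnormal conditions that explain nothing:

```python
def missing_explanations(conditions):
    """Return the types of abnormal conditions lacking a reason or message."""
    # Status values that indicate a problem for each condition type.
    abnormal = {"Available": "False", "Degraded": "True"}
    return [
        c["type"]
        for c in conditions
        if abnormal.get(c["type"]) == c["status"]
        and not (c.get("reason") and c.get("message"))
    ]


# Conditions modeled on the build-log excerpt in the description.
conditions = [
    {
        "type": "Degraded",
        "status": "True",
        "reason": "APIServerConflictError",
        "message": "Operation cannot be fulfilled on imagestreams.image.openshift.io",
    },
    {"type": "Available", "status": "False", "reason": "", "message": ""},
]
print(missing_explanations(conditions))
```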

Comment 4 Gabe Montero 2021-08-16 17:58:52 UTC
Thanks for the various refs, Trevor.

So, at a minimum, I'll shore things up as needed with respect to Available=False and the reason/message fields.

And I see my opinion shifting on the "batch/Degraded" thread.

Unless they post here specifically beforehand, I'll reach out to the samples stakeholders I've cc:ed here for consensus, and we'll go from there.

Comment 5 Ben Parees 2021-08-16 18:10:07 UTC
I am +1 on being less aggressive when going degraded (e.g. by retrying over some period of time before marking ourselves degraded), especially for something like this (updating a sample imagestream) where the user impact is minimal.  As Jan noted, this is something we're pursuing across all our operators, as right now we are generating too much "noise", especially during cluster upgrades, which makes admins think the product is not stable when in reality everything is fine.

Comment 6 Gabe Montero 2021-08-16 18:32:31 UTC
(In reply to Jan Chaloupka from comment #2)
> > If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for few seconds/iterations and check again before it switches to Degraded=True.
> 
> I did not follow any official documentation. My goal is to minimize the
> occurrence of Degraded state in the operator. If what I suggest is not the
> right way to go, let's keep discussing this to find the proper way to
> resolve this. I am debugging other operators which switch into Degraded
> state as well to categorize all the instances to find something in common
> which we might either use to extend the logic to allow "short" changes into
> Degraded state to be ignored or as an argument to tune our synthetic tests
> to ignore these cases.

Likewise, Jan, thanks for the additional context.

Comment 7 Gabe Montero 2021-08-24 23:19:29 UTC
@Jitendar - for verification, running CI searches of the 4.9/master branch e2e upgrade jobs, as noted in this BZ's description, for the string

clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError

for a few days to a week after this change shows up in the associated payloads is sufficient verification.

If you need help performing such CI searches, let me know.  Thanks.
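
For anyone repeating this check, the search link can be assembled programmatically. The sketch below reuses the query parameters that appear in the search.ci.openshift.org link in the description (with `type` switched to `build-log`); the `ci_search_url` helper itself is hypothetical, and the parameter semantics are assumed from that link rather than from any documented API.

```python
from urllib.parse import urlencode

SEARCH = (
    "clusteroperator/openshift-samples condition/Degraded "
    "status/True reason/APIServerConflictError"
)


def ci_search_url(search, max_age="48h", result_type="build-log"):
    """Build a search.ci.openshift.org query URL for the given search string."""
    params = {
        "search": search,
        "maxAge": max_age,
        "context": 1,
        "type": result_type,
        "name": "",
        "excludeName": "",
        "maxMatches": 5,
        "maxBytes": 20971520,
        "groupBy": "job",
    }
    # urlencode percent-encodes "/" and turns spaces into "+".
    return "https://search.ci.openshift.org/?" + urlencode(params)


print(ci_search_url(SEARCH))
```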

Comment 9 XiuJuan Wang 2021-09-06 10:33:25 UTC
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

There are lots of failures such as "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io **"

In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade/1434237913092067328

[bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded passed

Hi Gabe, given the info above, we could mark it verified, right?

Comment 11 Gabe Montero 2021-09-07 13:30:53 UTC
Not sure if you have a typo with "we could mark it verified, right?", XiuJuan, but your CI search results show I missed a spot with my fix for this, and we should *NOT* mark this verified.

https://github.com/openshift/cluster-samples-operator/blob/9d2c950cc66d7edd4a7aa0ea186b6fed33a8c7ba/pkg/stub/handler.go#L913-L921

Moving this back to assigned state, will have a new PR up shortly.

Comment 12 Gabe Montero 2021-09-07 13:41:01 UTC
Note, the final 4.9 freeze just passed.

Given this is a medium severity bug, I believe this won't meet the stop-ship level of scrutiny that 4.9 is now under.

So I'm changing the target to 4.10.

Comment 44 Gabe Montero 2021-09-13 11:37:01 UTC
Actually, XiuJuan, when I look at that query in more detail, the conflicts only show up when the upgrades are dealing with 4.9, 4.8, or 4.7.

There was one job that had a final landing spot of 4.10, after upgrades from 4.7, 4.8, 4.9, but those messages only occurred when we were at the older levels.

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-from-stable-4.8-e2e-aws-upgrade/1437137590653292544/build-log.txt was such an instance

When I look at the final 4.10 logs / status, it is clean.

Marking this verified, we are OK here for 4.10.

Given severity, I'm not going to backport this at this time, but of course, we can always reassess that as we move forward.

Comment 46 errata-xmlrpc 2021-10-18 17:46:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

