Bug 1993840
Summary: | openshift-samples should not change condition Degraded/Available (upgrades) | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup>
Component: | Samples | Assignee: | Gabe Montero <gmontero>
Status: | CLOSED ERRATA | QA Contact: | Jitendar Singh <jitsingh>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.9 | CC: | adam.kaplan, aos-bugs, dperaza, gmontero, shbose, wking, xiuwang
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
* Before this update, if the Cluster Samples Operator encountered an `APIServerConflictError`, it reported `openshift-samples` as having `Degraded` status until it recovered. Momentary errors of this type are not unusual during upgrades, and they caused undue concern for administrators monitoring the Operator status. With this update, if the Operator encounters a momentary error, it no longer marks `openshift-samples` as `Degraded` and instead retries the connection to the API server. Momentary shifts to `Degraded` status should no longer occur. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1993840[BZ#1993840])
|
Story Points: | --- | |
Clone Of: | | Environment: |
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade=all
Last Closed: | 2021-10-18 17:46:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Jan Chaloupka, 2021-08-16 09:00:09 UTC
Thanks for the detailed triage on this, Jan. Saved me some cycles. That said, I'm not 100% sure I agree with the course of action of essentially batching a bit on encountering errors like this, in case we recover from them, when dealing with marking degraded. Before I make that stance "official", I'll spend some cycles revisiting the API doc and CVO repo doc for guidance, and to see if there is some precedent for such things. If you have some quick pointers to official guidance for doing such things, please post some links if you have the chance. I've also cc:ed some folks who have collaborated with me on samples in the past, to see if they are aligned with my initial take on this, or are leaning toward what you are prescribing.

> If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.
I did not follow any official documentation. My goal is to minimize the occurrence of the Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve it. I am also debugging other operators that switch into the Degraded state, to categorize all the instances and find something in common which we might either use to extend the logic to allow "short" changes into the Degraded state to be ignored, or use as an argument to tune our synthetic tests to ignore these cases.
Relevant API docs [1]:

> Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service. The period of time may vary by component, but a Degraded state represents persistent observation of a condition. As a result, a component should not oscillate in and out of Degraded state.

"25s of modification conflicts" doesn't seem like a QoS degradation big enough to be worth complaining about, to me. For Available=False, the API docs are [2]. I suspect we don't want to go Available=False on short modification conflicts either, but I'm just guessing that's the reason, and I definitely consider it a bug to go Available=False without setting a reason and message (per the build-log excerpts, it appears the operator is setting neither reason nor message for this case today).

[1]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L161
[2]: https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_cluster_operator.go#L146-L150

Thanks for the various refs, Trevor. So minimally we shore up as needed wrt Available=False and the reason/message fields. And I see my opinion shifting on the "batch/Degraded" thread. Unless they post here specifically beforehand, I'll reach out to the samples stakeholders I've cc:ed here for consensus, and we'll go from there.

I am +1 on being less aggressive when going degraded (e.g. by retrying over some period of time before marking ourselves degraded), especially for something like this (updating a sample imagestream) where the user impact is minimal. As Jan noted, this is something we're pursuing across all our operators, as right now we are generating too much "noise", especially during cluster upgrades, that makes admins think the product is not stable when in reality everything is fine.
(In reply to Jan Chaloupka from comment #2)

> > If that's the case, it is not practical to have the operator switch to Degraded right away when there's APIServerConflictError. Instead, the operator can wait for a few seconds/iterations and check again before it switches to Degraded=True.
>
> I did not follow any official documentation. My goal is to minimize the occurrence of the Degraded state in the operator. If what I suggest is not the right way to go, let's keep discussing this to find the proper way to resolve this. I am debugging other operators which switch into the Degraded state as well, to categorize all the instances to find something in common which we might either use to extend the logic to allow "short" changes into the Degraded state to be ignored, or as an argument to tune our synthetic tests to ignore these cases.

Likewise Jan, thanks for the additional context.

@Jitendar - for verification, running CI searches of the 4.9/master branch e2e upgrade jobs, like those noted in this BZ's description, for the string `clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError` for a few days to a week after this change shows up in the associated payloads, is sufficient verification. If you need help performing such CI searches, let me know. Thanks.
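For reproducibility, the CI search described above can be built programmatically rather than hand-encoding the query. A minimal Go sketch (the helper name `samplesDegradedSearchURL` and the parameter subset are choices made here, mirroring the search.ci.openshift.org links pasted elsewhere in this bug):

```go
package main

import (
	"fmt"
	"net/url"
)

// samplesDegradedSearchURL builds a search.ci.openshift.org query for the
// Degraded/APIServerConflictError condition string used to verify this fix.
// (Hypothetical helper; only a subset of the site's query parameters is set.)
func samplesDegradedSearchURL() string {
	q := url.Values{}
	q.Set("search", "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError")
	q.Set("maxAge", "48h")
	q.Set("type", "build-log")
	u := url.URL{
		Scheme:   "https",
		Host:     "search.ci.openshift.org",
		Path:     "/",
		RawQuery: q.Encode(), // percent-encodes '/' and joins words with '+'
	}
	return u.String()
}

func main() {
	fmt.Println(samplesDegradedSearchURL())
}
```

`url.Values.Encode` handles the `%2F` escaping seen in the links in this bug, so the condition string can be edited in plain text when searching for other operators or reasons.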
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

There are lots of failures such as "clusteroperator/openshift-samples condition/Degraded status/True reason/APIServerConflictError changed: Operation cannot be fulfilled on imagestreams.image.openshift.io **". In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade/1434237913092067328, "[bz-Samples] clusteroperator/openshift-samples should not change condition/Degraded" passed. Hi Gabe, given the above info, we could mark it verified, right?

Not sure if you have a typo with "we could mark it verified, right?" XiuJuan, but your CI search results show I missed a spot with my fix for this, and we should *NOT* mark this verified: https://github.com/openshift/cluster-samples-operator/blob/9d2c950cc66d7edd4a7aa0ea186b6fed33a8c7ba/pkg/stub/handler.go#L913-L921

Moving this back to the assigned state; will have a new PR up shortly.

Note, the final 4.9 freeze just passed. Given this is a medium severity bug, I believe it won't meet the stop-ship level of scrutiny 4.9 is now under, so I'm changing the target to 4.10.

https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-samples+condition%2FDegraded+status%2FTrue+reason%2FAPIServerConflictError&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

There are still some samples Degraded occurrences with reason APIServerConflictError.

Actually XiuJuan, when I look at that query in more detail, the conflicts only show up when the upgrades are dealing with 4.9, 4.8, or 4.7. There was one job that had a final landing spot of 4.10, after upgrades from 4.7, 4.8, and 4.9, but those messages only occurred when we were at the older levels.
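The "Operation cannot be fulfilled on imagestreams.image.openshift.io" messages above are standard Kubernetes optimistic-concurrency conflicts (HTTP 409): an update was submitted with a stale resourceVersion. The usual remedy is to re-read and retry, which client-go packages as `retry.RetryOnConflict`. A dependency-free sketch of that pattern (the names `retryOnConflict`, `isConflict`, and `errConflict` are invented stand-ins for the client-go/apimachinery equivalents):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the 409 Conflict errors returned by the API
// server ("Operation cannot be fulfilled on imagestreams.image.openshift.io").
var errConflict = errors.New("operation cannot be fulfilled: object has been modified")

// isConflict stands in for apierrors.IsConflict from k8s.io/apimachinery.
func isConflict(err error) bool { return errors.Is(err, errConflict) }

// retryOnConflict re-runs fn up to attempts times while it keeps failing
// with a conflict, mirroring client-go's retry.RetryOnConflict helper.
// Non-conflict errors are returned immediately.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || !isConflict(err) {
			return err
		}
	}
	return err // still conflicting after all attempts
}

func main() {
	calls := 0
	err := retryOnConflict(5, func() error {
		calls++
		if calls < 3 {
			return errConflict // simulated stale-resourceVersion update
		}
		return nil // simulated successful update after a fresh read
	})
	fmt.Printf("calls=%d err=%v\n", calls, err) // prints calls=3 err=<nil>
}
```

Combined with a grace period before reporting Degraded, transient conflicts like these resolve themselves on the next sync without ever surfacing in the ClusterOperator conditions.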
https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-from-stable-4.8-e2e-aws-upgrade/1437137590653292544/build-log.txt was such an instance.

When I look at the final 4.10 logs/status, it is clean. Marking this verified; we are OK here for 4.10.

Given the severity, I'm not going to backport this at this time, but of course we can always reassess that as we move forward.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759