Bug 1759612 - Operator upgrade gating
Summary: Operator upgrade gating
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 2.4.0
Assignee: Nahshon Unna-Tsameret
QA Contact: Irina Gulina
Docs Contact: Pan Ousley
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-08 16:50 UTC by Ryan Hallisey
Modified: 2020-05-06 12:47 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: We need to be able to track state in order to guarantee that OLM is only informed it is safe to upgrade when that is actually the case, and we can't guarantee state with 100% accuracy using a declarative API. Consequence: There is a chance that OLM can interrupt an upgrade. Workaround (if any): None. Result: OLM interrupts CNV during an upgrade and causes the upgrade to fail. However, this is not likely to happen with 'Automatic upgrades', because most of the interruption risk is at the beginning of an upgrade, before the operators have had a chance to complete a reconcile loop. Therefore, the majority of the risk is negated as long as multiple releases aren't published at the same time on the same upgrade graph.
Clone Of:
Environment:
Last Closed: 2020-05-06 12:47:02 UTC
Target Upstream Version:



Description Ryan Hallisey 2019-10-08 16:50:27 UTC
Background Info:
The only method OLM provides for an operator to signal that an upgrade is occurring is the Readiness probe. Operators set their Readiness to false whenever they are doing work that shouldn't be interrupted by OLM. However, using a Readiness probe has negative side effects, such as logs and metrics not being gathered (https://github.com/operator-framework/operator-lifecycle-manager/issues/922). Therefore, it made sense to push these problems onto CNV's user-facing operator, HCO, in order to avoid the side effects on the component operators.
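The gating mechanism described above can be sketched as follows. This is a hedged illustration, not the actual HCO code: the `upgrading` flag and `readyStatus` helper are assumptions standing in for whatever internal state a real operator uses to back its /readyz probe handler.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// upgrading is set while the operator is doing work that OLM must not
// interrupt. In this sketch it stands in for the operator's internal
// upgrade-in-progress state.
var upgrading atomic.Bool

// readyStatus returns the HTTP status a /readyz probe handler would
// serve: 503 marks the pod NotReady, which is the only signal OLM
// honors before replacing the operator during an upgrade.
func readyStatus() int {
	if upgrading.Load() {
		return 503 // NotReady: asks OLM to hold off
	}
	return 200 // Ready: OLM may proceed with the upgrade
}

func main() {
	fmt.Println(readyStatus()) // ready by default
	upgrading.Store(true)
	fmt.Println(readyStatus()) // NotReady while protected work runs
}
```

The side effect noted above follows directly from this design: while `upgrading` is set, the pod is NotReady for every consumer of the probe, not just OLM, which is why log and metric collection also stops.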

Problem:
With the conditions now on the HCO and component operators, we can't say with 100% certainty that they will always be able to gate upgrades from OLM. For example, if all the operators are Running 1/1 but one of them has not yet started reporting conditions in its CR, the HCO will see that operator as Ready and will report Upgradeable to OLM when it is not.

The root of this problem is tracking state in Kubernetes. We need a way to communicate state with 100% accuracy while remaining "Kubernetes-like": declarative and distributed.
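The race above can be illustrated with a small sketch of the aggregation logic. The `Condition` type, the `ReconcileComplete` condition name, and the `upgradeable` helper are assumptions for this sketch, not the real CNV API; the point is that a component which has not yet written any conditions is indistinguishable from one that finished reconciling.

```go
package main

import "fmt"

// Condition loosely mirrors a status condition on a component
// operator's CR; the field names are illustrative.
type Condition struct {
	Type   string
	Status bool
}

// upgradeable aggregates component conditions declaratively: the
// upgrade is blocked only if some component explicitly reports
// ReconcileComplete=false. A component that has not reported any
// conditions yet falls through and is treated as ready -- the false
// positive this bug describes.
func upgradeable(components map[string][]Condition) bool {
	for _, conds := range components {
		for _, c := range conds {
			if c.Type == "ReconcileComplete" && !c.Status {
				return false
			}
		}
	}
	return true
}

func main() {
	// kubevirt is mid-reconcile but hasn't written conditions yet,
	// so HCO would wrongly report Upgradeable to OLM.
	fmt.Println(upgradeable(map[string][]Condition{"kubevirt": nil})) // prints "true"
}
```

Because the API is declarative, there is no way for the aggregator to distinguish "everything is fine" from "nobody has said anything yet" without an explicit handshake, which is what the long-term solutions below aim to provide.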

TLDR: We can't 100% guarantee state using a declarative API, so there's a chance OLM can interrupt an upgrade. However, it's important to note that this is not likely to happen with 'Automatic upgrades', because most of the interruption risk is at the beginning of an upgrade, before the operators have had a chance to complete a reconcile loop. Therefore, the majority of the risk is negated as long as multiple releases aren't published at the same time on the same upgrade graph.

Long term solutions:
 1) OLM provides a better method for upgrade gating (https://github.com/openshift/enhancements/pull/28)
 2) Component operators support multiple upgrades (one-to-many) across operator versions.

Comment 2 Fabian Deutsch 2019-10-10 07:58:34 UTC
Michael mentioned on an issue that, instead of gating, operators should tolerate being replaced by another operator.
To some degree I understand this perspective, as containers in general can always be interrupted. This touches slightly on the idea of eventually consistent systems.

Comment 3 Ryan Hallisey 2019-10-10 11:05:01 UTC
Point 2) would be the implementation of that concept.
> 2) Component operators support multiple upgrades (one-to-many) across operator versions.

Comment 7 Nelly Credi 2019-11-11 12:30:17 UTC
Please add the Fixed In Version.

Comment 8 Ryan Hallisey 2019-11-11 18:13:57 UTC
This bug tracks a documented release note.  The fix will be in 2.3.

Comment 9 Fabian Deutsch 2020-02-20 08:30:52 UTC
The fix will not be in 2.3, we might actually not fix it in this way. Deferring for now.

Comment 10 Fabian Deutsch 2020-05-06 12:47:02 UTC
I'm proposing to close this bug.

It is not a bug per se. It's an enhancement.

https://issues.redhat.com/browse/CNV-474 is the right place to work on this.

