Bug 1972075 - Race condition in OperatorCondition reconcilation
Summary: Race condition in OperatorCondition reconcilation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.7.z
Assignee: Vu Dinh
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On: 1927340
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-06-15 08:22 UTC by Vu Dinh
Modified: 2021-07-26 17:35 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-26 17:35:21 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 2206 0 None open [release-4.7] Bug 1972075: Add OperatorCondition status sync and update operator upgradeable check 2021-06-28 22:35:13 UTC
Red Hat Product Errata RHBA-2021:2762 0 None None None 2021-07-26 17:35:37 UTC

Description Vu Dinh 2021-06-15 08:22:19 UTC
This bug was initially created as a copy of Bug #1927340

I am copying this bug because: 



Description of problem:

There is a potential race when using the `OperatorCondition` Upgradeable supported condition that can do the opposite of what the user expects.

What should happen on the happy path:

 - operator writes operatorcondition with upgradeable=false
 - olm sees operatorcondition and sets pending = true
 - operator sets pod ready = true
 - olm waits for the operator to set operatorcondition upgradeable=true

But OLM is not gauranteed to see the updated `OperatorCondition` value that the operator has set before it sees that the operator has set `ready=True` on its pod. 

In that case, this is what actually happens:

- operator writes operatorcondition with upgradeable=false
- operator sets pod ready = true
- olm sees pod status first and sets success = true
- olm upgrades operator to the next one
- olm gets a bug report from an user or operator author that `upgradeable=false` didn't prevent the upgrade


Version-Release number of selected component (if applicable): 4.7


There are a couple of ways to investigate fixing this:

1. When OLM queries for OperatorCondition status, hit the API directly instead of using the cache. This would fix the issue but will increase API load, especially during rollout of an upgrade, so it's not the ideal.
2. Use a generation / some other mechanism so that OLM can ack that it has seen the latest updates on OperatorCondition. The operator would wait to see the ack (i.e. ObservedGeneration = Generation) before setting pod readiness = true.

Comment 2 Jian Zhang 2021-07-01 08:13:10 UTC
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-07-01-052427   True        False         96m     Cluster version is 4.7.0-0.nightly-2021-07-01-052427

[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-5644cf4978-jmgdq -- olm --version
OLM version: 0.17.0
git commit: a5f086a0cec32be5577b55c89d58adc40b677eb1

1, Install etcd operator v0.9.2
[cloud-user@preserve-olm-env jian]$ oc project default
Now using project "default" on server "https://api.ci-ln-bdpflpb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443".
[cloud-user@preserve-olm-env jian]$ oc get og
NAME         AGE
default-og   4s
[cloud-user@preserve-olm-env jian]$ cat sub-etcd.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: default
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Manual
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
[cloud-user@preserve-olm-env jian]$ oc create -f sub-etcd.yaml 
subscription.operators.coreos.com/etcd created

[cloud-user@preserve-olm-env jian]$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
[cloud-user@preserve-olm-env jian]$ oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-pghlp   etcdoperator.v0.9.2   Manual     false
[cloud-user@preserve-olm-env jian]$ oc get csv
No resources found in default namespace.
[cloud-user@preserve-olm-env jian]$ oc edit ip install-pghlp
installplan.operators.coreos.com/install-pghlp edited
[cloud-user@preserve-olm-env jian]$ 
[cloud-user@preserve-olm-env jian]$ oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-kpn9l   etcdoperator.v0.9.4   Manual     false
install-pghlp   etcdoperator.v0.9.2   Manual     true
[cloud-user@preserve-olm-env jian]$ oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.2   etcd      0.9.2     etcdoperator.v0.9.0   Succeeded
[cloud-user@preserve-olm-env jian]$ oc get operatorcondition
NAME                  AGE
etcdoperator.v0.9.2   58s

2, Patch the `spec.conditions[0].Upgradeable` of operatorcondition to False. Looks good.

[cloud-user@preserve-olm-env jian]$ oc patch operatorcondition etcdoperator.v0.9.2 -p '{"spec":{"conditions":[{"type":"Upgradeable", "observedGeneration":1,"status":"False","reason":"bug","message":"not ready","lastUpdateTime":"2021-06-16T16:56:44Z","lastTransitionTime":"2021-06-16T16:56:44Z"}]}}' --type=merge
operatorcondition.operators.coreos.com/etcdoperator.v0.9.2 patched
[cloud-user@preserve-olm-env jian]$ oc get operatorcondition etcdoperator.v0.9.2 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2021-07-01T08:07:22Z"
  generation: 2
  labels:
    operators.coreos.com/etcd.default: ""
  name: etcdoperator.v0.9.2
  namespace: default
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: etcdoperator.v0.9.2
    uid: 36254607-312f-4b32-931c-20021747ed48
  resourceVersion: "59966"
  selfLink: /apis/operators.coreos.com/v2/namespaces/default/operatorconditions/etcdoperator.v0.9.2
  uid: 2ddc5d28-ee38-466f-93f0-f2a1748b964e
spec:
  conditions:
  - lastTransitionTime: "2021-06-16T16:56:44Z"
    message: not ready
    observedGeneration: 1
    reason: bug
    status: "False"
    type: Upgradeable
  deployments:
  - etcd-operator
  serviceAccounts:
  - etcd-operator
status:
  conditions:
  - lastTransitionTime: "2021-06-16T16:56:44Z"
    message: not ready
    observedGeneration: 2
    reason: bug
    status: "False"
    type: Upgradeable

3, Approve the etcdoperator.v0.9.4.

[cloud-user@preserve-olm-env jian]$ oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-kpn9l   etcdoperator.v0.9.4   Manual     false
install-pghlp   etcdoperator.v0.9.2   Manual     true
[cloud-user@preserve-olm-env jian]$ oc edit ip install-kpn9l
installplan.operators.coreos.com/install-kpn9l edited
[cloud-user@preserve-olm-env jian]$ 
[cloud-user@preserve-olm-env jian]$ oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.2   etcd      0.9.2     etcdoperator.v0.9.0   Replacing
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Pending

[cloud-user@preserve-olm-env jian]$ oc get csv etcdoperator.v0.9.4 -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2021-07-01T08:10:28Z"
    lastUpdateTime: "2021-07-01T08:10:28Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: "2021-07-01T08:10:28Z"
    lastUpdateTime: "2021-07-01T08:10:28Z"
    message: 'operator is not upgradeable: The operator is not upgradeable: not ready'
    phase: Pending
    reason: OperatorConditionNotUpgradeable
  lastTransitionTime: "2021-07-01T08:10:28Z"
  lastUpdateTime: "2021-07-01T08:10:28Z"
  message: 'operator is not upgradeable: The operator is not upgradeable: not ready'
  phase: Pending
  reason: OperatorConditionNotUpgradeable

4, Patch the `spec.conditions[0].Upgradeable` of operatorcondition to True. etcdoperator.v0.9.2 can be upgraded to etcdoperator.v0.9.4, looks good. Verify it.
[cloud-user@preserve-olm-env jian]$ oc patch operatorcondition etcdoperator.v0.9.2 -p '{"spec":{"conditions":[{"type":"Upgradeable", "observedGeneration":1,"status":"True","reason":"bug","message":"now ready","lastUpdateTime":"2021-06-16T16:56:44Z","lastTransitionTime":"2021-06-16T16:56:44Z"}]}}' --type=merge
operatorcondition.operators.coreos.com/etcdoperator.v0.9.2 patched
[cloud-user@preserve-olm-env jian]$ oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded
[cloud-user@preserve-olm-env jian]$ oc get operatorcondition
NAME                  AGE
etcdoperator.v0.9.4   2m3s

Comment 5 Siddharth Sharma 2021-07-09 21:07:46 UTC
OCP engineering has decided to not ship 4.7.20 due to a blocker. This bug will be shipped as part of next z-stream release 4.7.21 planned on July 27th

Comment 8 errata-xmlrpc 2021-07-26 17:35:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.21 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2762


Note You need to log in before you can comment on or make changes to this bug.