Bug 1908596 - Race condition on operator installation
Summary: Race condition on operator installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Evan Cordell
QA Contact: Bruno Andrade
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-17 06:38 UTC by Nahshon Unna-Tsameret
Modified: 2021-02-24 15:46 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:45:40 UTC
Target Upstream Version:
Embargoed:
nunnatsa: needinfo-


Attachments
test and OLM logs (445.75 KB, application/octet-stream) - 2020-12-21 10:16 UTC, Nahshon Unna-Tsameret
OLM debug logs (514.32 KB, application/x-xz) - 2020-12-22 17:48 UTC, Simone Tiraboschi


Links
Github: operator-framework/operator-lifecycle-manager pull 1933 (closed) - Bug 1908596: Use correct caBundle for 'olmcahash' annotation - last updated 2021-01-12 09:19:08 UTC
Red Hat Product Errata: RHSA-2020:5633 - last updated 2021-02-24 15:46:15 UTC

Description Nahshon Unna-Tsameret 2020-12-17 06:38:45 UTC
Description of problem:
During the HCO openshift-ci tests there are many failures because the installation never becomes ready, but this does not happen on every test run.

It looks like a race condition, because the whole log is full of these lines, repeated over and over:
> {"level":"error","ts":1608141247.4510381,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/kubevirt-hyperconverged.kubevirt-hyperconverged","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"kubevirt-hyperconverged.kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/build/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.
> (*OperatorReconciler).Reconcile\n\t/build/pkg/controller/operators/operator_controller.go:163\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

How reproducible:
Run HCO openshift-ci tests

Actual results:
The tests do not pass.

Expected results:
The tests run after a successful installation of HCO.

Additional info:
Example: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1002/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1339253678091014144/artifacts/hco-e2e-upgrade-prev-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-5f57f4899c-r2qbw_olm-operator.log

Comment 1 Kevin Rizza 2020-12-18 15:43:28 UTC
Those logs are a red herring. They just mean that OLM's internal cache was older than what was currently available in the API server, and OLM then retried. You'll find those messages all over the OLM operator and Catalog operator logs; they are not indicative of a real error condition.

It looks like the errors near the bottom of your log are more interesting:

E1216 18:30:03.362584       1 queueinformer_operator.go:290] sync {"update" "kubevirt-hyperconverged/kubevirt-hyperconverged-operator.v1.2.0"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "kubevirt-hyperconverged-operator.v1.2.0": the object has been modified; please apply your changes to the latest version and try again

That appears to be a valid error. You will need to look at the status of the CSV for the operator you're attempting to install to understand why it is not installing; OLM is saying that it cannot finish the install.

I'm closing this as NOTABUG. Please take a look at your CSV, InstallPlan, and Subscription to determine what is preventing the install from continuing. If there's something there that appears to be a valid bug, feel free to reopen with those details.
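
For anyone landing here, a minimal sketch of how to inspect those objects with plain oc commands (the namespace and CSV name are just the ones mentioned in this report, so treat them as placeholders):

# CSV phase and conditions
oc get csv -n kubevirt-hyperconverged
oc get csv kubevirt-hyperconverged-operator.v1.2.0 -n kubevirt-hyperconverged -o jsonpath='{.status.phase}{"\n"}'
oc describe csv kubevirt-hyperconverged-operator.v1.2.0 -n kubevirt-hyperconverged

# Subscription and InstallPlan state
oc get subscription,installplan -n kubevirt-hyperconverged
oc describe subscription -n kubevirt-hyperconverged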

Comment 2 Nahshon Unna-Tsameret 2020-12-21 10:16:30 UTC
Created attachment 1740922 [details]
test and OLM logs

Comment 3 Nahshon Unna-Tsameret 2020-12-21 10:29:55 UTC
@krizza - I'm reopening this bug because it is not solved and it does seem related to OLM.

Following your comment, we modified the test to check the CSV status before the upgrade. The test now:

1. Subscribes to the *old* CSV.
2. Waits for the deployment to finish by polling the CSV status:
> Msg "Read the CSV to make sure the deployment is done"
> ./hack/retry.sh 30 10 "${CMD} get ClusterServiceVersion  -n ${HCO_NAMESPACE} kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -o jsonpath='{ .status.phase }' | grep 'Succeeded'"

3. After the deployment has successfully completed, the test patches the subscription with the *new* channel (a condensed command sketch of steps 2-3 follows this list).
4. The test waits for the upgrade to complete, but the upgrade *never even starts*.
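
A condensed sketch of steps 2 and 3 as plain oc commands, outside the test harness (the subscription name and the new channel value are placeholders, not the exact values used by the test):

# Step 2: poll until the old CSV reports phase Succeeded
oc get csv kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -n ${HCO_NAMESPACE} -o jsonpath='{.status.phase}'

# Step 3: point the Subscription at the new channel to trigger the upgrade
oc patch subscription <hco-subscription> -n ${HCO_NAMESPACE} --type merge -p '{"spec":{"channel":"<new-channel>"}}'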

The test prints the CSVs after it fails. The *old* CSV was modified after it had already succeeded. Is that what prevents the upgrade from starting? It also seems, just a guess, that the old CSV is modified *as a result* of the upgrade request (patching the subscription). Here are the CSV conditions after the test failed:
>    - lastTransitionTime: "2020-12-21T09:05:03Z"
>      lastUpdateTime: "2020-12-21T09:05:03Z"
>      message: install strategy completed with no errors
>      phase: Succeeded
>      reason: InstallSucceeded
>    - lastTransitionTime: "2020-12-21T09:05:21Z"
>      lastUpdateTime: "2020-12-21T09:05:21Z"
>      message: |
>        installing: waiting for deployment hco-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
>      phase: Failed
>      reason: ComponentUnhealthy
>    - lastTransitionTime: "2020-12-21T09:05:22Z"
>      lastUpdateTime: "2020-12-21T09:05:22Z"
>      message: requirements not met
>      phase: Pending
>      reason: RequirementsNotMet
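
For reference, a one-liner sketch to dump just those condition fields (same CSV name and namespace as in step 2 above):

oc get csv kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -n ${HCO_NAMESPACE} -o jsonpath='{range .status.conditions[*]}{.phase}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'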


The test files are here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600

This is not happening every time; sometimes the upgrade does succeed (but fails later because of the certificate issue), so something is not stable. It smells like a race condition, but maybe it's another issue.

The test log and the OLM logs are attached here as olm_issue.tar.gz.

Comment 4 Simone Tiraboschi 2020-12-21 10:46:04 UTC
Quick hint (not sure if relevant):

one of the operator deployments in the CSV has:

      - name: cluster-network-addons-operator
        spec:
          replicas: 1
          selector:
            matchLabels:
              name: cluster-network-addons-operator
          strategy:
            type: Recreate

I'm just wondering if the issue is there.

I mean, strategy=Recreate causes all existing pods to be killed before new ones are created.
I'm just wondering if this can cause a race condition where OLM detects that the initial CSV is missing one of its deployments, flags it as Pending, and blocks the upgrade.

I also want to add that the issue is not 100% systematic, which is why I suspect something racy there.

In our case we are trying this upgrade:
kubevirt-hyperconverged-operator.v1.3.0 -> kubevirt-hyperconverged-operator.v100.0.0

but sometimes (pretty often but not always) we are getting stuck in:

oc get clusterserviceversions -n kubevirt-hyperconverged
NAME                                        DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.3.0     KubeVirt HyperConverged Cluster Operator   1.3.0     kubevirt-hyperconverged-operator.v1.2.0   Pending
kubevirt-hyperconverged-operator.v100.0.0   KubeVirt HyperConverged Cluster Operator   100.0.0   kubevirt-hyperconverged-operator.v1.3.0   Pending

where the initial CSV is reverted back to Pending in the middle of the upgrade.
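
A minimal sketch for checking which deployments in the old CSV declare a Recreate strategy, and for watching the phase flip while the upgrade runs (the jsonpath assumes the usual CSV layout under .spec.install.spec.deployments, which matches the snippet above):

# list each deployment defined in the CSV together with its rollout strategy
oc get csv kubevirt-hyperconverged-operator.v1.3.0 -n kubevirt-hyperconverged -o jsonpath='{range .spec.install.spec.deployments[*]}{.name}{"\t"}{.spec.strategy.type}{"\n"}{end}'

# watch the CSV phases during the upgrade to catch the Succeeded -> Failed -> Pending transition
oc get csv -n kubevirt-hyperconverged -w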

Comment 5 Zvi Cahana 2020-12-21 16:10:24 UTC
(In reply to Kevin Rizza from comment #1)
> Those logs are a red herring. They just mean that OLM's internal cache was
> older than what was currently available in the API server, and OLM then
> retried. You'll find those messages all over the OLM operator and Catalog
> operator logs; they are not indicative of a real error condition.

In our case, the "Operation cannot be fulfilled" errors seem to repeat consistently and "indefinitely" (for about 40 minutes of the test run).
E.g., see the logs below and search for "Operation cannot be fulfilled":
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600/artifacts/hco-e2e-upgrade-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-7bdcb8db56-m9t8s_olm-operator.log

Could this indicate something other than a stale cache on the OLM side?

Comment 9 Simone Tiraboschi 2020-12-22 17:48:02 UTC
Created attachment 1741407 [details]
OLM debug logs

Comment 12 Zvi Cahana 2020-12-28 09:31:59 UTC
Attempted a fix in https://github.com/operator-framework/operator-lifecycle-manager/pull/1933

Comment 25 errata-xmlrpc 2021-02-24 15:45:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

