Bug 1908596 - Race condition on operator installation
Summary: Race condition on operator installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Evan Cordell
QA Contact: Bruno Andrade
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-17 06:38 UTC by Nahshon Unna-Tsameret
Modified: 2021-02-24 15:46 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:45:40 UTC
Target Upstream Version:
Embargoed:
nunnatsa: needinfo-


Attachments
test and OLM logs (445.75 KB, application/octet-stream) - 2020-12-21 10:16 UTC, Nahshon Unna-Tsameret
OLM debug logs (514.32 KB, application/x-xz) - 2020-12-22 17:48 UTC, Simone Tiraboschi


Links
Github: operator-framework/operator-lifecycle-manager pull 1933 (closed) - Bug 1908596: Use correct caBundle for 'olmcahash' annotation - last updated 2021-01-12 09:19:08 UTC
Red Hat Product Errata: RHSA-2020:5633 - last updated 2021-02-24 15:46:15 UTC

Description Nahshon Unna-Tsameret 2020-12-17 06:38:45 UTC
Description of problem:
During the HCO openshift-ci tests there are many failures because the installation never becomes ready, but this does not happen on every test run.

It looks like a race condition, because the whole log is full of these lines, repeated over and over:
> {"level":"error","ts":1608141247.4510381,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/kubevirt-hyperconverged.kubevirt-hyperconverged","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"kubevirt-hyperconverged.kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/build/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.
> (*OperatorReconciler).Reconcile\n\t/build/pkg/controller/operators/operator_controller.go:163\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

How reproducible:
Run HCO openshift-ci tests

Actual results:
The tests do not pass.

Expected results:
The tests run after a successful installation of HCO.

Additional info:
Example: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1002/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1339253678091014144/artifacts/hco-e2e-upgrade-prev-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-5f57f4899c-r2qbw_olm-operator.log

Comment 1 Kevin Rizza 2020-12-18 15:43:28 UTC
Those logs are a red herring. They just mean that OLM's internal cache was older than what was currently available in the API server, and OLM then retried. You'll find those messages all over the OLM operator and Catalog operator logs; they are not indicative of a real error condition.

It looks like the errors near the bottom of your log are more interesting:

E1216 18:30:03.362584       1 queueinformer_operator.go:290] sync {"update" "kubevirt-hyperconverged/kubevirt-hyperconverged-operator.v1.2.0"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "kubevirt-hyperconverged-operator.v1.2.0": the object has been modified; please apply your changes to the latest version and try again

That appears to be a valid error. You will need to look at the status of the CSV for the operator you're attempting to install to understand why it is not installing; OLM is saying that it cannot finish the install.

I'm closing this as NOTABUG. Please take a look at your CSV, InstallPlan, and Subscription to determine what is preventing the install from continuing. If there's something there that appears to be a valid bug, feel free to reopen with those details.
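
For anyone landing here, a minimal sketch of how to inspect those objects with plain oc commands (the namespace and CSV name are just the ones mentioned in this report, so treat them as placeholders):

# CSV phase and conditions
oc get csv -n kubevirt-hyperconverged
oc get csv kubevirt-hyperconverged-operator.v1.2.0 -n kubevirt-hyperconverged -o jsonpath='{.status.phase}{"\n"}'
oc describe csv kubevirt-hyperconverged-operator.v1.2.0 -n kubevirt-hyperconverged

# Subscription and InstallPlan state
oc get subscription,installplan -n kubevirt-hyperconverged
oc describe subscription -n kubevirt-hyperconverged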

Comment 2 Nahshon Unna-Tsameret 2020-12-21 10:16:30 UTC
Created attachment 1740922 [details]
test and OLM logs

Comment 3 Nahshon Unna-Tsameret 2020-12-21 10:29:55 UTC
@krizza - I'm reopening this bug because it is not solved and it does seem related to OLM.

Following your comment, we modified the test to check the CSV status before the upgrade. The test now:

1. Subscribes to the *old* CSV.
2. Waits for the deployment to finish by polling the CSV status:
> Msg "Read the CSV to make sure the deployment is done"
> ./hack/retry.sh 30 10 "${CMD} get ClusterServiceVersion  -n ${HCO_NAMESPACE} kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -o jsonpath='{ .status.phase }' | grep 'Succeeded'"

3. After the deployment has successfully completed, the test patches the subscription with the *new* channel (a condensed command sketch of steps 2-3 follows this list).
4. The test waits for the upgrade to complete, but the upgrade *never even starts*.
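
A condensed sketch of steps 2 and 3 as plain oc commands, outside the test harness (the subscription name and the new channel value are placeholders, not the exact values used by the test):

# Step 2: poll until the old CSV reports phase Succeeded
oc get csv kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -n ${HCO_NAMESPACE} -o jsonpath='{.status.phase}'

# Step 3: point the Subscription at the new channel to trigger the upgrade
oc patch subscription <hco-subscription> -n ${HCO_NAMESPACE} --type merge -p '{"spec":{"channel":"<new-channel>"}}'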

The test prints the CSVs after it fails. The *old* CSV was modified after it had already succeeded. Is that what prevents the upgrade from starting? It also seems, just a guess, that the old CSV is modified *as a result* of the upgrade request (patching the subscription). Here are the CSV conditions after the test failed:
>    - lastTransitionTime: "2020-12-21T09:05:03Z"
>      lastUpdateTime: "2020-12-21T09:05:03Z"
>      message: install strategy completed with no errors
>      phase: Succeeded
>      reason: InstallSucceeded
>    - lastTransitionTime: "2020-12-21T09:05:21Z"
>      lastUpdateTime: "2020-12-21T09:05:21Z"
>      message: |
>        installing: waiting for deployment hco-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
>      phase: Failed
>      reason: ComponentUnhealthy
>    - lastTransitionTime: "2020-12-21T09:05:22Z"
>      lastUpdateTime: "2020-12-21T09:05:22Z"
>      message: requirements not met
>      phase: Pending
>      reason: RequirementsNotMet
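
For reference, a one-liner sketch to dump just those condition fields (same CSV name and namespace as in step 2 above):

oc get csv kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -n ${HCO_NAMESPACE} -o jsonpath='{range .status.conditions[*]}{.phase}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'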


The test files are here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600

This is not happening every time; sometimes the upgrade does succeed (but fails later because of the certificate issue), so something is not stable. It smells like a race condition, but maybe it's another issue.

The test log and the OLM logs are attached here as olm_issue.tar.gz.

Comment 4 Simone Tiraboschi 2020-12-21 10:46:04 UTC
Quick hint (not sure if relevant):

one of the operator deployments in the CSV has:

      - name: cluster-network-addons-operator
        spec:
          replicas: 1
          selector:
            matchLabels:
              name: cluster-network-addons-operator
          strategy:
            type: Recreate

I'm just wondering if the issue is there.

I mean, strategy=Recreate causes all existing pods to be killed before new ones are created.
I'm just wondering if this can cause a race condition where OLM detects that the initial CSV is missing one of its deployments, flags it as Pending, and blocks the upgrade.

I also want to add that the issue is not 100% systematic, which is why I suspect something racy there.

In our case we are trying this upgrade:
kubevirt-hyperconverged-operator.v1.3.0 -> kubevirt-hyperconverged-operator.v100.0.0

but sometimes (pretty often but not always) we are getting stuck in:

oc get clusterserviceversions -n kubevirt-hyperconverged
NAME                                        DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.3.0     KubeVirt HyperConverged Cluster Operator   1.3.0     kubevirt-hyperconverged-operator.v1.2.0   Pending
kubevirt-hyperconverged-operator.v100.0.0   KubeVirt HyperConverged Cluster Operator   100.0.0   kubevirt-hyperconverged-operator.v1.3.0   Pending

where the initial CSV is reverted back to Pending in the middle of the upgrade.
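
A minimal sketch for checking which deployments in the old CSV declare a Recreate strategy, and for watching the phase flip while the upgrade runs (the jsonpath assumes the usual CSV layout under .spec.install.spec.deployments, which matches the snippet above):

# list each deployment defined in the CSV together with its rollout strategy
oc get csv kubevirt-hyperconverged-operator.v1.3.0 -n kubevirt-hyperconverged -o jsonpath='{range .spec.install.spec.deployments[*]}{.name}{"\t"}{.spec.strategy.type}{"\n"}{end}'

# watch the CSV phases during the upgrade to catch the Succeeded -> Failed -> Pending transition
oc get csv -n kubevirt-hyperconverged -w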

Comment 5 Zvi Cahana 2020-12-21 16:10:24 UTC
(In reply to Kevin Rizza from comment #1)
> Those logs are a red herring. They just mean that OLM's internal cache was
> older than what was currently available in the API server, and OLM then
> retried. You'll find those messages all over the OLM operator and Catalog
> operator logs; they are not indicative of a real error condition.

In our case, the "Operation cannot be fulfilled" errors seem to repeat consistently and "indefinitely" (for about 40 minutes of the test run).
E.g., see the logs below and search for "Operation cannot be fulfilled":
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600/artifacts/hco-e2e-upgrade-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-7bdcb8db56-m9t8s_olm-operator.log

Could this indicate something other than a stale cache on the OLM side?

Comment 9 Simone Tiraboschi 2020-12-22 17:48:02 UTC
Created attachment 1741407 [details]
OLM debug logs

Comment 12 Zvi Cahana 2020-12-28 09:31:59 UTC
Attempted a fix in https://github.com/operator-framework/operator-lifecycle-manager/pull/1933

Comment 25 errata-xmlrpc 2021-02-24 15:45:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

