Description of problem:
During HCO openshift-ci tests, there are many failures because the installation never becomes ready, but it does not happen on every run. It looks like a race condition, because the whole log is filled with these lines over and over again:

> {"level":"error","ts":1608141247.4510381,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/kubevirt-hyperconverged.kubevirt-hyperconverged","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"kubevirt-hyperconverged.kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/build/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile\n\t/build/pkg/controller/operators/operator_controller.go:163\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

How reproducible:
Run the HCO openshift-ci tests.

Actual results:
The tests do not pass.

Expected results:
The tests run after a successful installation of HCO.

Additional info:
Example: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1002/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1339253678091014144/artifacts/hco-e2e-upgrade-prev-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-5f57f4899c-r2qbw_olm-operator.log
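The message above is the standard Kubernetes optimistic-concurrency conflict on an update. To get a feel for how often it occurs in a failed run, and whether the HCO CSV ever reaches the Succeeded phase, something like the following can be used (a rough sketch; the log file name is just the artifact saved from the link above, and the namespace is taken from this report):

    # Count the conflict errors in the downloaded olm-operator log
    grep -c 'Operation cannot be fulfilled' olm-operator.log

    # Check the phase of every CSV in the HCO namespace
    oc get clusterserviceversions -n kubevirt-hyperconverged \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'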
Those logs are a red herring. That just means OLM's internal cache was older than what was currently available in the API server, and OLM then retried. You'll find those messages in the OLM operator and Catalog operator logs all over the place; they are not indicative of a real error condition.

The errors near the bottom of your log look more interesting:

E1216 18:30:03.362584 1 queueinformer_operator.go:290] sync {"update" "kubevirt-hyperconverged/kubevirt-hyperconverged-operator.v1.2.0"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "kubevirt-hyperconverged-operator.v1.2.0": the object has been modified; please apply your changes to the latest version and try again

That looks like a valid error. You will need to look at the status of the CSV object of the operator you're attempting to install to understand why it is not installing; OLM is implying that it cannot finish the install.

I'm closing this as NOTABUG. Please take a look at your CSV, InstallPlan, and Subscription to determine what is preventing the install from continuing. If there's something there that appears to be a valid bug, feel free to reopen with those details.
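One way to inspect those objects, assuming the namespace and CSV name from this report (adjust for your own install):

    NS=kubevirt-hyperconverged

    # CSV phase and conditions
    oc get csv -n "$NS" \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
    oc describe csv kubevirt-hyperconverged-operator.v1.2.0 -n "$NS"

    # Subscription state, including currentCSV/installedCSV
    oc get subscription -n "$NS" -o yaml

    # InstallPlan phase and approval
    oc get installplan -n "$NS"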
Created attachment 1740922 [details] test and OLM logs
@krizza - I am reopening this bug because it is not solved, and it does seem related to OLM.

As a result of your comment, we modified the test to check the CSV status before the upgrade. The test now:

1. Subscribes to the *old* CSV.
2. Waits for the deployment to finish by polling the CSV status:
> Msg "Read the CSV to make sure the deployment is done"
> ./hack/retry.sh 30 10 "${CMD} get ClusterServiceVersion -n ${HCO_NAMESPACE} kubevirt-hyperconverged-operator.v${INITIAL_CHANNEL} -o jsonpath='{ .status.phase }' | grep 'Succeeded'"
3. After the deployment has completed successfully, the test patches the subscription with the *new* channel (see the sketch after this comment).
4. The test waits for the upgrade to complete; but the upgrade *never even starts*.

The test prints the CSVs after it fails. The *old* CSV was modified after it had already succeeded. Is that what prevents the upgrade from starting? It also seems, just a guess, that the old CSV is modified *as a result* of the upgrade request (patching the subscription). Here are the CSV conditions after the test failed:

> - lastTransitionTime: "2020-12-21T09:05:03Z"
>   lastUpdateTime: "2020-12-21T09:05:03Z"
>   message: install strategy completed with no errors
>   phase: Succeeded
>   reason: InstallSucceeded
> - lastTransitionTime: "2020-12-21T09:05:21Z"
>   lastUpdateTime: "2020-12-21T09:05:21Z"
>   message: |
>     installing: waiting for deployment hco-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
>   phase: Failed
>   reason: ComponentUnhealthy
> - lastTransitionTime: "2020-12-21T09:05:22Z"
>   lastUpdateTime: "2020-12-21T09:05:22Z"
>   message: requirements not met
>   phase: Pending
>   reason: RequirementsNotMet

The test files are here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600

Since this does not happen every time - sometimes the upgrade does succeed (but fails later because of the certificate issue) - something is not stable. It smells like a race condition, but maybe it is another issue. The test log and the OLM logs are attached here in olm_issue.tar.gz.
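For reference, a minimal sketch of the patch step described in (3), assuming a merge patch on the Subscription's channel field; the subscription name and the TARGET_CHANNEL/TARGET_VERSION variables are placeholders, not the exact names used by the test script:

    ${CMD} patch subscription hco-subscription -n ${HCO_NAMESPACE} \
      --type merge -p '{"spec":{"channel":"'"${TARGET_CHANNEL}"'"}}'

    # Then poll until the new CSV reaches Succeeded, same pattern as step (2):
    ./hack/retry.sh 30 10 "${CMD} get ClusterServiceVersion -n ${HCO_NAMESPACE} kubevirt-hyperconverged-operator.v${TARGET_VERSION} -o jsonpath='{ .status.phase }' | grep 'Succeeded'"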
Quick hint (not sure if relevant): one of the operators in the CSV has:

  - name: cluster-network-addons-operator
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: cluster-network-addons-operator
      strategy:
        type: Recreate

I'm just wondering if the issue is there. strategy=Recreate causes all existing pods to be killed before new ones are created, so I'm wondering whether this can cause a race condition where OLM detects that the initial CSV is missing one of its operators, flags it as Pending, and blocks the upgrade. I also want to add that the issue is not 100% systematic, which is why I suspect something racy here.

In our case we are trying this upgrade:
kubevirt-hyperconverged-operator.v1.3.0 -> kubevirt-hyperconverged-operator.v100.0.0
but sometimes (pretty often but not always) we get stuck in:

  oc get clusterserviceversions -n kubevirt-hyperconverged
  NAME                                        DISPLAY                                    VERSION   REPLACES                                  PHASE
  kubevirt-hyperconverged-operator.v1.3.0     KubeVirt HyperConverged Cluster Operator   1.3.0     kubevirt-hyperconverged-operator.v1.2.0   Pending
  kubevirt-hyperconverged-operator.v100.0.0   KubeVirt HyperConverged Cluster Operator   100.0.0   kubevirt-hyperconverged-operator.v1.3.0   Pending

where the initial CSV is reverted back to Pending in the middle of the upgrade.
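To check whether the Recreate rollout coincides with the old CSV flipping to Pending, something like the following can be run in a second terminal while the upgrade is triggered (a rough sketch; the namespace and the pod label are taken from the snippet above):

    # Watch CSV phase transitions during the upgrade
    oc get csv -n kubevirt-hyperconverged -w \
      -o custom-columns=NAME:.metadata.name,PHASE:.status.phase

    # In parallel, watch the component operator's pods being terminated/recreated
    oc get pods -n kubevirt-hyperconverged -w \
      -l name=cluster-network-addons-operator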
(In reply to Kevin Rizza from comment #1)
> Those logs are a red herring. That just means OLM's internal cache was
> older than what was currently available in the API server, and OLM then
> retried. You'll find those messages in the OLM operator and Catalog operator
> logs all over the place, they are not indicative of a real error condition.

In our case, the "Operation cannot be fulfilled" errors seem to repeat consistently and "indefinitely" (for about 40 minutes of test run). E.g., see the log below and search for "Operation cannot be fulfilled":

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1010/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-aws/1340933487674265600/artifacts/hco-e2e-upgrade-aws/gather-extra/pods/openshift-operator-lifecycle-manager_olm-operator-7bdcb8db56-m9t8s_olm-operator.log

Could this indicate something other than a stale cache on the OLM side?
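A quick way to see whether these are short retry bursts or a sustained condition is to print the first and last "ts" values of the JSON-formatted conflict messages in the downloaded log (the file name is just the artifact saved from the link above):

    grep 'Operation cannot be fulfilled' olm-operator.log \
      | grep -o '"ts":[0-9.]*' \
      | sed -n '1p;$p'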
Created attachment 1741407 [details] OLM debug logs
Attempted to fix in https://github.com/operator-framework/operator-lifecycle-manager/pull/1933
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633