1723818 – OLM upgrade failure from 4.1 to 4.2 due to packageserver csv OwnerConflict

Summary: OLM upgrade failure from 4.1 to 4.2 due to packageserver csv OwnerConflict

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.1.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Evan Cordell
QA Contact:	Cuiping HUO
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1724801 1731123
Depends On:	1733015
Blocks:

Reported:	2019-06-25 13:34 UTC by Russell Bryant
Modified:	2019-10-16 06:32 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:32:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:32:36 UTC

Description Russell Bryant 2019-06-25 13:34:43 UTC

Description of problem:

This problem was observed in the e2e-aws-upgrade-4.1-to-4.2 CI job.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/119

The top level failure message was:

Jun 25 03:09:26.477: INFO: Unexpected error occurred: Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator operator-lifecycle-manager-packageserver is still updating

I looked in the OLM operator log:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/119/artifacts/e2e-aws-upgrade/pods/openshift-operator-lifecycle-manager_olm-operator-6fb8dc66f8-p6ssk_olm-operator.log

and found the following error message repeated several times:

time="2019-06-25T03:10:44Z" level=info msg="error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"packageserver.v0.9.0\": the object has been modified; please apply your changes to the latest version and try again" csv=packageserver.v0.9.0 id=a9I6B namespace=openshift-operator-lifecycle-manager phase=Installing
E0625 03:10:44.262560       1 queueinformer_operator.go:274] sync {"update" "openshift-operator-lifecycle-manager/packageserver.v0.9.0"} failed: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "packageserver.v0.9.0": the object has been modified; please apply your changes to the latest version and try again


Version-Release number of selected component (if applicable):

operator-lifecycle-manager commit ID: 586ffaf57b5da9cc2301b01e2ea10ce6117928c9

operator-lifecycle-manager commit ID for the upgrade: 6bf64d01349f8ca67749cb8849edfeebe39b475f

Comment 1 Mark McLoughlin 2019-06-25 14:30:14 UTC

Likely related to a PR merged yesterday - https://github.com/operator-framework/operator-lifecycle-manager/pull/863 where the operator-lifecycle-manager-packageserver ClusterOperator was added

Comment 2 Abu Kashem 2019-06-25 20:40:31 UTC

Investigated the issue further, here are the findings.

packageserver fails to deploy since it can't adopt ownership of the `APIService` object. The status of the new version of the csv (packageserver.v0.10.1) reflects this.

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    olm.operatorGroup: olm-operators
    olm.operatorNamespace: openshift-operator-lifecycle-manager
    olm.targetNamespaces: openshift-operator-lifecycle-manager
  creationTimestamp: 2019-06-25T02:05:19Z
  generation: 16
  labels:
    olm.api.4bca9f23e412d79d: provided
    olm.clusteroperator.name: operator-lifecycle-manager-packageserver
  name: packageserver.v0.10.1
  namespace: openshift-operator-lifecycle-manager
status:
  certsLastUpdated: null
  certsRotateAt: null
  message: unable to adopt APIService
  phase: Failed
  conditions:
  - lastTransitionTime: 2019-06-25T02:05:19Z
    lastUpdateTime: 2019-06-25T02:05:19Z
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: 2019-06-25T02:05:19Z
    lastUpdateTime: 2019-06-25T02:05:19Z
    message: unable to adopt APIService
    phase: Failed
    reason: OwnerConflict
  lastTransitionTime: 2019-06-25T02:05:19Z
  lastUpdateTime: 2019-06-25T02:05:19Z
  message: unable to adopt APIService
  phase: Failed
  reason: OwnerConflict


The APIService object has the following labels
"labels": {
   "olm.owner": "packageserver.v0.9.0",
   "olm.owner.kind": "ClusterServiceVersion",
   "olm.owner.namespace": "openshift-operator-lifecycle-manager"                
},

The csv name changed between 4.1 and 4.2, from 'packageserver.v0.9.0' to 'packageserver.v0.10.1'. Because the `olm.owner` label on the APIService still names the old csv, this mismatch causes olm to throw an 'OwnerConflict' error. https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operator.go#L1411-L1413
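
The ownership check can be pictured roughly as follows. This is a minimal sketch, not OLM's actual code: the `ownerConflict` helper and the `csvRef` type are hypothetical names made up for illustration, and only the three label keys and the csv names are taken from the output above.

```go
package main

import "fmt"

// Label keys copied from the APIService labels shown above.
const (
	ownerKey          = "olm.owner"
	ownerKindKey      = "olm.owner.kind"
	ownerNamespaceKey = "olm.owner.namespace"
)

// csvRef is a hypothetical stand-in for the identity of a ClusterServiceVersion.
type csvRef struct {
	Name      string
	Namespace string
}

// ownerConflict reports whether the labels name an owner other than csv.
// Assumption: an object without an olm.owner label is unowned and never conflicts.
func ownerConflict(labels map[string]string, csv csvRef) bool {
	owner, labeled := labels[ownerKey]
	if !labeled {
		return false
	}
	if labels[ownerKindKey] != "ClusterServiceVersion" {
		return true
	}
	return owner != csv.Name || labels[ownerNamespaceKey] != csv.Namespace
}

func main() {
	const ns = "openshift-operator-lifecycle-manager"
	apiServiceLabels := map[string]string{
		ownerKey:          "packageserver.v0.9.0",
		ownerKindKey:      "ClusterServiceVersion",
		ownerNamespaceKey: ns,
	}

	// The 4.1 csv matches the labels, so there is no conflict.
	fmt.Println(ownerConflict(apiServiceLabels, csvRef{Name: "packageserver.v0.9.0", Namespace: ns})) // false

	// The 4.2 csv has a different name, so a strict name comparison reports a conflict.
	fmt.Println(ownerConflict(apiServiceLabels, csvRef{Name: "packageserver.v0.10.1", Namespace: ns})) // true
}
```

Under a strict comparison like this one, any rename of the csv across an upgrade leaves the APIService owned by a name that the new csv can never match.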

Comment 3 Abu Kashem 2019-06-25 20:49:47 UTC

Next steps:
- Reproduce by adding an e2e test that simulates this upgrade scenario where the name of the csv changes.
- Make changes to 'apiServiceOwnerConflicts' to make sure we can adopt an APIService when the current owner csv has been removed or is being replaced.
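
The relaxed adoption rule proposed in the second step could look something like the sketch below. It is an illustration under stated assumptions, not the change that was eventually merged: `canAdopt` and the trimmed-down `csv` type are hypothetical, and the map of existing csvs stands in for whatever lookup OLM really performs.

```go
package main

import "fmt"

// csv is a hypothetical, trimmed-down view of a ClusterServiceVersion.
type csv struct {
	Name     string
	Replaces string // name of the csv this one replaces, if any
}

// canAdopt sketches the proposed rule: adoption is allowed when the APIService
// is unowned, already owned by the adopter, owned by a csv that no longer
// exists, or owned by the csv that the adopter replaces.
func canAdopt(currentOwner string, adopter csv, existing map[string]csv) bool {
	if currentOwner == "" || currentOwner == adopter.Name {
		return true
	}
	if _, found := existing[currentOwner]; !found {
		return true // the owning csv has been removed
	}
	return adopter.Replaces == currentOwner
}

func main() {
	oldName := "packageserver.v0.9.0"
	newCSV := csv{Name: "packageserver.v0.10.1"}

	// Old owner already removed: adoption is allowed.
	fmt.Println(canAdopt(oldName, newCSV, map[string]csv{})) // true

	// Old owner still present and not being replaced: still a conflict.
	existing := map[string]csv{oldName: {Name: oldName}}
	fmt.Println(canAdopt(oldName, newCSV, existing)) // false

	// Old owner present but replaced by the adopter: adoption is allowed.
	replacing := csv{Name: "packageserver.v0.10.1", Replaces: oldName}
	fmt.Println(canAdopt(oldName, replacing, existing)) // true
}
```

An e2e test for the first step would exercise the same three cases against a real cluster, with the csv renamed between the two versions.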

Comment 4 Abu Kashem 2019-06-27 20:52:32 UTC

*** Bug 1724801 has been marked as a duplicate of this bug. ***

Comment 5 Clayton Coleman 2019-06-28 02:10:14 UTC

Setting priority appropriately, all upgrade jobs are blocked by this.

If there’s no easy fix (a few hours), let’s revert the breaking change and make sure we test upgrades before we re-introduce it.

Did the upgrade job pass when this PR merged?  I would have expected it to fail, unless we delivered setting the value and then using it in two separate PRs.

Comment 6 Clayton Coleman 2019-06-28 02:13:21 UTC

All 4.1 to 4.2 upgrade jobs, that is

Comment 7 Clayton Coleman 2019-06-29 05:24:21 UTC

It looks like this is still failing after https://github.com/operator-framework/operator-lifecycle-manager/pull/925#issuecomment-506928237 merged:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/131

Comment 8 Abu Kashem 2019-06-30 23:42:17 UTC

This PR ( https://github.com/operator-framework/operator-lifecycle-manager/pull/937 ) should fix this issue. It's merged and included in this release - https://openshift-release.svc.ci.openshift.org/releasestream/4.2.0-0.ci/release/4.2.0-0.ci-2019-06-30-145631.

Comment 9 Matthew Staebler 2019-07-18 21:30:01 UTC

This failed again: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/189

Comment 10 Matthew Staebler 2019-07-18 21:45:39 UTC

And https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/190

Comment 11 Ben Parees 2019-07-18 22:15:10 UTC

Moving this back to New to ensure it gets attention.  If you feel it's a new issue (I imagine it is), please open a new BZ and move this back to modified.

Comment 12 Abu Kashem 2019-07-18 22:38:48 UTC

CSV status:
 
  lastTransitionTime: 2019-07-18T11:17:18Z
  lastUpdateTime: 2019-07-18T11:17:18Z
  message: unable to adopt APIService
  phase: Failed
  reason: OwnerConflict


I think it's a regression: packageserver fails to install since it can't adopt ownership of the APIService object.

Comment 13 Jian Zhang 2019-07-19 02:20:29 UTC

We hit this issue again; see bug 1731123 for more details. I'm changing the status back to ASSIGNED, since an earlier PR was meant to fix this but it didn't work.
I also changed the version to 4.1.z since this is a 4.1.z upgrade issue. Correct me if I'm wrong.

Comment 14 Jian Zhang 2019-07-19 02:22:37 UTC

*** Bug 1731123 has been marked as a duplicate of this bug. ***

Comment 15 Evan Cordell 2019-07-19 02:39:38 UTC

https://github.com/operator-framework/operator-lifecycle-manager/pull/957 should fix this issue.

The code to ensure this issue doesn't occur wasn't looking at the right namespace because of the wrong syntax. I will follow up with an e2e test to verify this in the future.

Comment 17 Cuiping HUO 2019-07-25 07:38:37 UTC

Verification is blocked because the upgrade from 4.1 to 4.2 fails, as https://bugzilla.redhat.com/show_bug.cgi?id=1733015 shows.

Comment 18 Zhang Cheng 2019-07-25 08:34:21 UTC

Changing Target Release to 4.2 since this issue occurred during the upgrade from 4.1 to 4.2.
QE will double-check once the blocking issue is fixed.

Comment 20 Cuiping HUO 2019-07-31 09:43:45 UTC

Verified.
OLM version: 0.11.0
git commit: d2209c409b35f1db4669c474044decc6995f624d

$ oc get clusteroperator | grep package
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-30-073644   True        False         False      13h

$ oc get csv
NAME                                        DISPLAY                   VERSION              REPLACES   PHASE
packageserver                               Package Server            0.11.0                          Succeeded

$ oc get csv packageserver -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    olm.operatorGroup: olm-operators
    olm.operatorNamespace: openshift-operator-lifecycle-manager
    olm.targetNamespaces: openshift-operator-lifecycle-manager
  creationTimestamp: "2019-07-30T09:21:35Z"
  generation: 328
  labels:
    olm.api.4bca9f23e412d79d: provided
    olm.clusteroperator.name: operator-lifecycle-manager-packageserver
    olm.version: 0.11.0
  name: packageserver
  namespace: openshift-operator-lifecycle-manager
  resourceVersion: "1326273"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-operator-lifecycle-manager/clusterserviceversions/packageserver
  uid: 6d3684c3-b2ab-11e9-838b-0050568b69c6
status:
  certsLastUpdated: "2019-07-30T20:30:46Z"
  certsRotateAt: "2021-07-28T20:30:43Z"
  conditions:
  - lastTransitionTime: "2019-07-30T09:22:33Z"
    lastUpdateTime: "2019-07-30T09:22:33Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: "2019-07-30T09:22:38Z"
    lastUpdateTime: "2019-07-30T09:22:38Z"
    message: all requirements found, attempting install
    phase: InstallReady
    reason: AllRequirementsMet
  - lastTransitionTime: "2019-07-30T09:22:59Z"
    lastUpdateTime: "2019-07-30T09:22:59Z"
    message: waiting for install components to report healthy
    phase: Installing
    reason: InstallSucceeded
  - lastTransitionTime: "2019-07-30T09:22:59Z"
    lastUpdateTime: "2019-07-30T09:23:07Z"
    message: APIServices not installed
    phase: Installing
    reason: InstallWaiting
  - lastTransitionTime: "2019-07-30T09:23:32Z"
    lastUpdateTime: "2019-07-30T09:23:32Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
  - lastTransitionTime: "2019-07-30T20:29:48Z"
    lastUpdateTime: "2019-07-30T20:29:48Z"
    message: APIServices not installed
    phase: Failed
    reason: ComponentUnhealthy
  - lastTransitionTime: "2019-07-30T20:30:43Z"
    lastUpdateTime: "2019-07-30T20:30:43Z"
    message: APIServices not installed
    phase: Pending
    reason: NeedsReinstall
  - lastTransitionTime: "2019-07-30T20:30:43Z"
    lastUpdateTime: "2019-07-30T20:30:43Z"
    message: all requirements found, attempting install
    phase: InstallReady
    reason: AllRequirementsMet
  - lastTransitionTime: "2019-07-30T20:30:43Z"
    lastUpdateTime: "2019-07-30T20:30:43Z"
    message: waiting for install components to report healthy
    phase: Installing
    reason: InstallSucceeded
  - lastTransitionTime: "2019-07-30T20:30:43Z"
    lastUpdateTime: "2019-07-30T20:30:48Z"
    message: APIServices not installed
    phase: Installing
    reason: InstallWaiting
  - lastTransitionTime: "2019-07-30T20:31:11Z"
    lastUpdateTime: "2019-07-30T20:31:11Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
  lastTransitionTime: "2019-07-30T20:31:11Z"
  lastUpdateTime: "2019-07-30T20:31:11Z"
  message: install strategy completed with no errors
  phase: Succeeded
  reason: InstallSucceeded
  requirementStatus:
  - group: operators.coreos.com
    kind: ClusterServiceVersion
    message: CSV minKubeVersion (1.11.0) less than server version (v1.14.0+1682e38)
    name: packageserver
    status: Present
    version: v1alpha1
  - group: apiregistration.k8s.io
    kind: APIService
    message: ""
    name: v1.packages.operators.coreos.com
    status: DeploymentFound
    version: v1
  - dependents:
    - group: rbac.authorization.k8s.io
      kind: PolicyRule
      message: cluster rule:{"verbs":["create","get"],"apiGroups":["authorization.k8s.io"],"resources":["subjectaccessreviews"]}
      status: Satisfied
      version: v1beta1
    - group: rbac.authorization.k8s.io
      kind: PolicyRule
      message: cluster rule:{"verbs":["get","list","watch"],"apiGroups":[""],"resources":["configmaps"]}
      status: Satisfied
      version: v1beta1
    - group: rbac.authorization.k8s.io
      kind: PolicyRule
      message: cluster rule:{"verbs":["get","list","watch"],"apiGroups":["operators.coreos.com"],"resources":["catalogsources"]}
      status: Satisfied
      version: v1beta1
    - group: rbac.authorization.k8s.io
      kind: PolicyRule
      message: cluster rule:{"verbs":["get","list"],"apiGroups":["packages.operators.coreos.com"],"resources":["packagemanifests"]}
      status: Satisfied
      version: v1beta1
    group: ""
    kind: ServiceAccount
    message: ""
    name: olm-operator-serviceaccount
    status: Present
    version: v1

$ oc get clusterrole packageserver.v0.9.0-nl2jh -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2019-07-30T01:29:19Z"
  labels:
    olm.owner: packageserver.v0.9.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operator-lifecycle-manager
  name: packageserver.v0.9.0-nl2jh
  resourceVersion: "6509"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/packageserver.v0.9.0-nl2jh
  uid: 73f91eaf-b269-11e9-8c50-0050568b2d02
rules:
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - operators.coreos.com
  resources:
  - catalogsources
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - packages.operators.coreos.com
  resources:
  - packagemanifests
  verbs:
  - get
  - list

Comment 21 errata-xmlrpc 2019-10-16 06:32:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
