Bug 1741475

Summary: | OLM doesn't create new role for CLO when upgrading CLO from 4.1 to 4.2. | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang>
Component: | OLM | Assignee: | Evan Cordell <ecordell>
OLM sub component: | OLM | QA Contact: | Qiaoling Tang <qitang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | bandrade, chuo, jfan, scolange
Version: | 4.2.0 | |
Target Milestone: | --- | |
Target Release: | 4.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-10-16 06:36:02 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Qiaoling Tang
2019-08-15 09:17:00 UTC
OLM failed to create "prometheusrules" in the existing apiGroup "monitoring.coreos.com". Requirement status:

```
Message: namespaced rule:{"verbs":["*"],"apiGroups":["monitoring.coreos.com"],"resources":["servicemonitors","prometheusrules"]}
Status:  NotSatisfied
Version: v1beta1
Group:   rbac.authorization.k8s.io
```

Based on my understanding, the root cause is that OLM failed to create new resources in an already existing API group. From my observation, the ServiceAccount, Role, RoleBinding, and Secret resources created by OLM all carry labels and ownerReferences indicating which CSV owns them, e.g.:

```yaml
labels:
  olm.owner: clusterlogging.4.1.12-201908130938
  olm.owner.kind: ClusterServiceVersion
  olm.owner.namespace: openshift-logging
name: clusterlogging.4.1.12-201908130938-gp5h9
namespace: openshift-logging
ownerReferences:
- apiVersion: operators.coreos.com/v1alpha1
  blockOwnerDeletion: false
  controller: false
  kind: ClusterServiceVersion
  name: clusterlogging.4.1.12-201908130938
  uid: aa96d9d5-bf26-11e9-ba3f-0a1f60e86372
```

When upgrading to another version, OLM would create new resources or update existing resources as needed. So my guess is that the problem here is: when upgrading CLO from 4.1 to 4.2, the permissions in the role are changed, but OLM doesn't create a new role, so the upgrade fails.

I tried adding the correct permissions to the role by hand, and the upgrade could then go on. Since I did not update the labels and ownerReferences, the resources related to CSV clusterlogging.4.1.12-201908130938 were all deleted after the upgrade to 4.2 succeeded.

Could you share the InstallPlans that are generated in the namespace? That will help debug this.

I wrote an additional e2e test to verify this here: https://github.com/operator-framework/operator-lifecycle-manager/pull/998 (our CI is currently blocked on another issue that will be resolved soon, and these should pass)
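As an aside for readers: the unsatisfied namespaced rule quoted in the requirement status above corresponds to a Role rule roughly like the one sketched below. The name is an assumption for illustration only; in practice OLM generates this role from the CSV's install strategy permissions rather than it being created by hand.

```yaml
# Sketch only: a Role carrying the rule from the NotSatisfied requirement above.
# The metadata.name is hypothetical; OLM normally derives the real role from the
# CSV's permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: clusterlogging-monitoring-example   # hypothetical name
  namespace: openshift-logging
rules:
- apiGroups:
  - monitoring.coreos.com
  resources:
  - servicemonitors
  - prometheusrules
  verbs:
  - "*"
```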
Created attachment 1606813 [details]
install plan
The new InstallPlan is in the Failed state:

```yaml
conditions:
- lastTransitionTime: "2019-08-22T00:57:11Z"
  lastUpdateTime: "2019-08-22T00:57:11Z"
  message: 'error missing existing CRD version(s) in new CRD: clusterloggings.logging.openshift.io:
    not allowing CRD (clusterloggings.logging.openshift.io) update with unincluded
    version {v1 true true nil nil []}'
  reason: InstallComponentFailed
  status: "False"
  type: Installed
phase: Failed
```

From the error, it looks like CLO removed a CRD apiversion in an update. Please see our WIP docs on CRD versioning rules: https://github.com/operator-framework/operator-lifecycle-manager/blob/61d66d74ca8e76cef7692d1cc4cbac7da7b3a87a/Documentation/design/dependency-resolution.md (these will be merged shortly).

The e2e test that I linked above passes with no issues, and tests both amplification and attenuation of permissions between operator upgrades.

Verified with the latest nightly build; the CLO could upgrade successfully.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-22-201424   True        False         2m33s   Cluster version is 4.2.0-0.nightly-2019-08-22-201424

$ oc exec -n openshift-operator-lifecycle-manager olm-operator-69bc98c6ff-kz9bg -- olm --version
OLM version: 0.11.0
git commit: 55d504a1de95e8820d0dcc02b14f6c8d15edff4f

$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                                     PHASE
clusterlogging.v4.2.0           Cluster Logging          4.2.0     clusterlogging.4.1.12-201908130938           Succeeded
elasticsearch-operator.v4.2.0   Elasticsearch Operator   4.2.0     elasticsearch-operator.4.1.12-201908130938   Succeeded

$ oc get role
NAME                          AGE
clusterlogging.v4.2.0-wjnw8   36s
log-collector-privileged      4m56s
sharing-config-reader         5m2s

$ oc get rolebindings
NAME                                                         AGE
clusterlogging.v4.2.0-wjnw8-cluster-logging-operator-k4j47   45s
log-collector-privileged-binding                             5m5s
openshift-logging-sharing-config-reader-binding              5m10s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
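For context on the CRD versioning error reported above: OLM refuses a CRD update that drops a version still present on the cluster, so the updated CRD has to keep listing it. The manifest below is a minimal illustrative sketch under that assumption, not the actual cluster-logging CRD shipped by the operator.

```yaml
# Illustrative sketch, not the real clusterloggings CRD: when OLM applies an
# updated CRD, every version already served on the cluster (here v1, from the
# error message above) must still appear in spec.versions, or the InstallPlan fails.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: clusterloggings.logging.openshift.io
spec:
  group: logging.openshift.io
  names:
    kind: ClusterLogging
    listKind: ClusterLoggingList
    plural: clusterloggings
    singular: clusterlogging
  scope: Namespaced
  versions:
  - name: v1        # existing version; must remain included in the update
    served: true
    storage: true
```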