Bug 1839754 - [sriov] Failed to upgrade from 4.4 to 4.5 for sriov operator
Summary: [sriov] Failed to upgrade from 4.4 to 4.5 for sriov operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: All
OS: All
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1846239
Reported: 2020-05-25 12:54 UTC by zhaozhanqi
Modified: 2020-10-27 16:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1846239
Environment:
Last Closed: 2020-10-27 16:01:02 UTC
Target Upstream Version:
Embargoed:


Attachments
oc describe csv (87.56 KB, text/plain)
2020-05-25 12:54 UTC, zhaozhanqi


Links
System ID Private Priority Status Summary Last Updated
Github openshift sriov-network-operator pull 246 0 None closed Bug 1839754: Remove required feilds from SriovNetworkNodeState CRD 2020-11-19 14:36:29 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:01:27 UTC

Description zhaozhanqi 2020-05-25 12:54:10 UTC
Created attachment 1691930 [details]
oc describe csv

Description of problem:
When upgrading the sriov operator from 4.4 (4.4.0-202005221118) to 4.5 (4.5.0-202005220507), the new CSV stays in the Pending phase:
oc get csv
NAME                                        DISPLAY                   VERSION              REPLACES                                    PHASE
sriov-network-operator.4.4.0-202005221118   SR-IOV Network Operator   4.4.0-202005221118                                               Replacing
sriov-network-operator.4.5.0-202005220507   SR-IOV Network Operator   4.5.0-202005220507   sriov-network-operator.4.4.0-202005221118   Pending

see the attachment for `oc describe csv`


Version-Release number of selected component (if applicable):
4.4 to 4.5

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 zhaozhanqi 2020-05-26 08:06:41 UTC
more details:

The SCC setup for the sriov namespace differs between 4.4 and 4.5. See the change made for 4.5: https://github.com/openshift/sriov-network-operator/commit/b2b549210cf242a884eb33cb6876bbcb9c4fc106

1. For 4.4, the sriov operator namespace, OperatorGroup and Subscription were created with the following yaml:
 
echo 'apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  labels:
    openshift.io/run-level: "1"
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subsription
  namespace: openshift-sriov-network-operator
spec:
  channel: "4.4"
  name: sriov-network-operator
  source: qe-app-registry
  sourceNamespace: openshift-marketplace'  | oc create -f  -

2. After creating the above, the 4.4 sriov operator works well.
3. Update the subscription `channel` to `4.5` to trigger the upgrade.
4. Then check the csv:

#oc get csv
NAME                                        DISPLAY                   VERSION              REPLACES                                    PHASE
sriov-network-operator.4.4.0-202005221118   SR-IOV Network Operator   4.4.0-202005221118                                               Replacing
sriov-network-operator.4.5.0-202005220507   SR-IOV Network Operator   4.5.0-202005220507   sriov-network-operator.4.4.0-202005221118   Pending

5. please see the attachment for `oc describe csv sriov-network-operator.4.5.0-202005220507`
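
Step 3 above could be scripted along these lines. This is only a sketch: it builds an `oc patch` command with a JSON merge patch for the Subscription from step 1 and prints it instead of running it. The helper name `channel_patch_command` is made up for illustration.

```python
import json
from typing import List


def channel_patch_command(name: str, namespace: str, channel: str) -> List[str]:
    """Build (but do not run) an `oc patch` command that switches an OLM
    Subscription to another channel using a JSON merge patch."""
    patch = json.dumps({"spec": {"channel": channel}})
    return [
        "oc", "patch", "subscriptions.operators.coreos.com", name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]


# Names taken from the Subscription manifest in step 1.
cmd = channel_patch_command(
    "sriov-network-operator-subsription",
    "openshift-sriov-network-operator",
    "4.5",
)
print(" ".join(cmd))
```

To actually apply it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine that is logged in to the cluster.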

Comment 3 Jian Zhang 2020-05-26 08:14:13 UTC
It failed on the SCC rule requirement:
...
      Kind:     PolicyRule
      Message:  namespaced rule:{"verbs":["use"],"apiGroups":["security.openshift.io"],"resources":["securitycontextconstraints"],"resourceNames":["privileged"]}
      Status:   NotSatisfied
      Version:  v1
...

However, I can create the role manually:
mac:~ jianzhang$ oc create role sriov-plugin --verb=use --resource=securitycontextconstraints --resource-name=privileged -n openshift-sriov-network-operator
role.rbac.authorization.k8s.io/sriov-plugin created
mac:~ jianzhang$ oc get role
NAME                                              CREATED AT
sriov-network-operator.4.4.0-202005221118-6k2mq   2020-05-25T09:53:19Z
sriov-network-operator.4.4.0-202005221118-pkv5v   2020-05-25T09:53:18Z
sriov-plugin                                      2020-05-26T07:46:27Z
mac:~ jianzhang$ oc get role sriov-plugin -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: "2020-05-26T07:46:27Z"
  managedFields:
  - apiVersion: rbac.authorization.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:rules: {}
    manager: oc
    operation: Update
    time: "2020-05-26T07:46:27Z"
  name: sriov-plugin
  namespace: openshift-sriov-network-operator
  resourceVersion: "692545"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/openshift-sriov-network-operator/roles/sriov-plugin
  uid: 5faf3b24-7ba1-4235-9675-34c9af792895
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
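
To make the comparison explicit, here is a rough sketch of how a required rule can be checked against the rules of an existing Role. It is a simplification written for this comment, not OLM's or Kubernetes' actual implementation: it only accepts the requirement when a single granted rule covers it, and the function names are made up.

```python
from typing import Dict, List

Rule = Dict[str, List[str]]

# The rule reported as NotSatisfied in the CSV status above.
REQUIRED_RULE: Rule = {
    "verbs": ["use"],
    "apiGroups": ["security.openshift.io"],
    "resources": ["securitycontextconstraints"],
    "resourceNames": ["privileged"],
}

# The rules of the manually created `sriov-plugin` Role shown above.
SRIOV_PLUGIN_ROLE_RULES: List[Rule] = [
    {
        "apiGroups": ["security.openshift.io"],
        "resourceNames": ["privileged"],
        "resources": ["securitycontextconstraints"],
        "verbs": ["use"],
    }
]


def _covers(granted: List[str], wanted: List[str]) -> bool:
    """True if every wanted item is granted explicitly or via the '*' wildcard."""
    return "*" in granted or all(item in granted for item in wanted)


def rule_satisfied(required: Rule, granted_rules: List[Rule]) -> bool:
    """Return True if at least one granted rule fully covers the required rule."""
    wanted_names = required.get("resourceNames", [])
    for rule in granted_rules:
        names = rule.get("resourceNames", [])
        # No resourceNames on a granted rule means it applies to all names.
        names_ok = not names or (
            bool(wanted_names) and all(n in names for n in wanted_names)
        )
        if (
            names_ok
            and _covers(rule.get("verbs", []), required["verbs"])
            and _covers(rule.get("apiGroups", []), required["apiGroups"])
            and _covers(rule.get("resources", []), required["resources"])
        ):
            return True
    return False


print(rule_satisfied(REQUIRED_RULE, SRIOV_PLUGIN_ROLE_RULES))
```

Under this simplified check the manually created role covers the required rule, so the rule itself is valid and can be created in this namespace.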

Comment 17 Alexander Greene 2020-06-10 17:13:31 UTC
@jian thank you for providing me with a cluster where the upgrade issue is present. I found the following condition in the installPlan Status associated with the 4.5 SRIOV Operator:
```
...
status:
  catalogSources:
  - qe-app-registry
  conditions:
  - lastTransitionTime: "2020-06-10T09:01:42Z"
    lastUpdateTime: "2020-06-10T09:01:42Z"
    message: 'error validating existing CRs agains new CRD''s schema: sriovnetworknodestates.sriovnetwork.openshift.io:
      error validating custom resource against new schema &apiextensions.CustomResourceValidation{OpenAPIV3Schema:(*apiextensions.JSONSchemaProps)(0xc0010f9e00)}:
      [].spec.interfaces.vfGroups.policyName: Required value'
    reason: InstallComponentFailed
    status: "False"
    type: Installed
  phase: Failed
...
```

Based on the presence of this condition, OLM is working as intended.

This condition signals that the SRIOV operator has added a required field to its CRD, and that CRs already on the cluster do not have that field set. It is best practice to:
* Introduce the field as an optional field and update the operator to set the field to some value that implements existing behavior.
* In a future release of the operator (likely a different channel based on your release strategy), mark the field as required and make sure to update the API Version.
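
As a rough illustration of the first recommendation, the operator could fill in the new field on existing objects while the field is still optional. The sketch below is not the SRIOV operator's code: the default value `legacy-default` and the function name are hypothetical, and it assumes `spec.interfaces` and `vfGroups` are lists, as the error path above suggests.

```python
import copy
from typing import Any, Dict

# Hypothetical default that would preserve the pre-existing behavior.
DEFAULT_POLICY_NAME = "legacy-default"


def default_policy_names(node_state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the CR in which every vfGroup has a policyName.

    Values that are already set are left untouched.
    """
    patched = copy.deepcopy(node_state)
    interfaces = (patched.get("spec") or {}).get("interfaces") or []
    for iface in interfaces:
        for group in iface.get("vfGroups") or []:
            group.setdefault("policyName", DEFAULT_POLICY_NAME)
    return patched


# Illustrative CR content, not taken from the affected cluster.
old_cr = {
    "spec": {
        "interfaces": [
            {"pciAddress": "0000:00:00.0", "vfGroups": [{"resourceName": "example"}]}
        ]
    }
}
print(default_policy_names(old_cr))
```

Once every stored CR carries a value, a later release can mark the field as required without failing validation of existing objects.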

In this case, the CR you created does not have the .spec.interfaces.vfGroups.policyName field, which is required in the CRD shipped with SRIOV Operator 4.5. This happened because the SRIOV team rolled out a change to their API in a way that OLM does not support. OLM follows the guidelines suggested by the sig-architecture group [1].
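
The kind of check that fails here can be sketched as follows. This is a toy validator that only understands `required`, `properties` and `items`. It is not OLM's validation code, and the schema and CR are trimmed-down stand-ins for the real SriovNetworkNodeState objects.

```python
from typing import Any, Dict, List

# Trimmed-down stand-in for the 4.5 CRD schema: policyName is required.
NEW_SCHEMA: Dict[str, Any] = {
    "properties": {
        "spec": {
            "properties": {
                "interfaces": {
                    "items": {
                        "properties": {
                            "vfGroups": {
                                "items": {
                                    "required": ["policyName"],
                                    "properties": {"policyName": {"type": "string"}},
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}


def missing_required(obj: Any, schema: Dict[str, Any], path: str = "") -> List[str]:
    """Collect 'Required value' errors for fields the schema requires but obj lacks."""
    errors: List[str] = []
    if isinstance(obj, list):
        # Validate every list element against the `items` sub-schema.
        for item in obj:
            errors += missing_required(item, schema.get("items", {}), path)
        return errors
    if not isinstance(obj, dict):
        return errors
    for name in schema.get("required", []):
        if name not in obj:
            errors.append(f"{path}.{name}: Required value")
    for name, sub_schema in schema.get("properties", {}).items():
        if name in obj:
            errors += missing_required(obj[name], sub_schema, f"{path}.{name}")
    return errors


# A CR as it could have been stored by the 4.4 operator (illustrative values).
existing_cr = {
    "spec": {
        "interfaces": [
            {"pciAddress": "0000:00:00.0", "vfGroups": [{"resourceName": "example"}]}
        ]
    }
}
print(missing_required(existing_cr, NEW_SCHEMA))
```

Any CR stored before the upgrade that lacks the field produces an error of this shape, which is why the install plan stops before the new CRD is applied.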

For whoever installed the operator, possible workarounds include either of the following steps:
* Update the existing CRs to include the required field.
* Delete the existing CRs that do not include the required field.

OLM would then be able to perform the upgrade.

I am going to mark this as `Not A Bug`. @Jian, I suggest creating a new bug against the SRIOV operator.

Ref:
[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#on-compatibility

Comment 18 Alexander Greene 2020-06-10 17:53:49 UTC
To close the loop on this, I created a Doc PR [1] against OLM-Book which suggests reviewing [2] when changing the CRD schema.

Ref:
[1] https://github.com/operator-framework/olm-book/pull/42
[2] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#on-compatibility

Comment 19 Jian Zhang 2020-06-11 05:58:48 UTC
Hi Alex,

Many thanks for the information! Moving this bug to the Networking team.

Comment 23 zhaozhanqi 2020-06-15 06:06:46 UTC
Marking this as 'verified' so that it can be backported to 4.5.

Comment 25 errata-xmlrpc 2020-10-27 16:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

