Bug 1919581 - OCP 4.7: Special Resource Operator (SRO) fails to deploy from OperatorHub
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Brett Thurber
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-23 20:08 UTC by Walid A.
Modified: 2021-10-12 19:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2021-10-12 19:51:42 UTC
Target Upstream Version:
Embargoed:


Attachments
Installed Operators Screen (1.63 MB, image/png)
2021-01-23 20:10 UTC, Walid A.
SRO operator subscription tab screen shot (1.61 MB, image/png)
2021-01-23 20:11 UTC, Walid A.


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:3686 0 None None None 2021-10-12 19:52:10 UTC

Description Walid A. 2021-01-23 20:08:16 UTC
Description of problem:
The Special Resource Operator (SRO) fails to deploy from OperatorHub on an IPI-installed OCP 4.7 cluster with cluster-wide entitlement configured.

Both SRO and the Node Feature Discovery (NFD) Operator that it tries to install alongside fail to deploy:
- NFD gets stuck in "Pending - Upgrade Available"
- SRO gets stuck in "Pending - Up to date"
Both remain in this state even after 20 minutes.

Note: When SRO is deployed from OperatorHub, it also tries to install the older version 4.5 of the NFD Operator.

# oc get events -n test-sro
LAST SEEN   TYPE     REASON                OBJECT                                                   MESSAGE
15m         Normal   RequirementsUnknown   clusterserviceversion/nfd.v4.5.0                         requirements not yet checked
15m         Normal   RequirementsNotMet    clusterserviceversion/nfd.v4.5.0                         one or more requirements couldn't be found
15m         Normal   RequirementsUnknown   clusterserviceversion/special-resource-operator.v0.0.1   requirements not yet checked
15m         Normal   RequirementsNotMet    clusterserviceversion/special-resource-operator.v0.0.1   one or more requirements couldn't be found
17m         Normal   CreatedSCCRanges      namespace/test-sro                        

# oc get csv -A | grep sro
test-sro                                           nfd.v4.5.0                         Node Feature Discovery             4.5.0                                                     Pending
test-sro                                           special-resource-operator.v0.0.1   Special Resource Operator          0.0.1      
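
A minimal sketch of how the stuck ClusterServiceVersions could be picked out programmatically, assuming the list has been saved with `oc get csv -A -o json`. The helper name `stuck_csvs` and the sample data are hypothetical; only the `metadata.namespace`, `metadata.name`, and `status.phase` fields are read.

```python
import json


def stuck_csvs(csv_list_json):
    """Return (namespace, name, phase) for each CSV whose phase is not Succeeded."""
    items = json.loads(csv_list_json).get("items", [])
    stuck = []
    for item in items:
        meta = item.get("metadata", {})
        # A CSV without a status yet is reported as "Unknown" (our own placeholder).
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Succeeded":
            stuck.append((meta.get("namespace"), meta.get("name"), phase))
    return stuck


# Illustrative input only: shaped like `oc get csv -A -o json`; the third
# entry is a made-up healthy operator added for contrast.
SAMPLE_CSVS = json.dumps({
    "items": [
        {"metadata": {"namespace": "test-sro", "name": "nfd.v4.5.0"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "test-sro",
                      "name": "special-resource-operator.v0.0.1"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "test-nfd",
                      "name": "example-healthy-operator.v1.0.0"},
         "status": {"phase": "Succeeded"}},
    ]
})

if __name__ == "__main__":
    for namespace, name, phase in stuck_csvs(SAMPLE_CSVS):
        print(f"{namespace}/{name}: {phase}")
```

Reading the JSON form avoids depending on the column alignment of the tabular output, which can be cut off as in the listing above.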

NFD Operator version 4.7 had already been deployed successfully from OperatorHub, and a g4dn.xlarge instance with an NVIDIA GPU had been added as a worker node through a new machineset and correctly labeled by NFD:

feature.node.kubernetes.io/pci-10de.present=true
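
As a rough sketch, the presence of this label could be checked from saved `oc get nodes -o json` output. The function name `nodes_with_label` and the node names in the sample are assumptions made for illustration; only the label key comes from this report.

```python
import json

# Label key reported above for the NVIDIA PCI vendor ID (10de).
GPU_LABEL = "feature.node.kubernetes.io/pci-10de.present"


def nodes_with_label(node_list_json, label=GPU_LABEL, value="true"):
    """Return the names of nodes whose labels contain label=value."""
    names = []
    for node in json.loads(node_list_json).get("items", []):
        meta = node.get("metadata", {})
        if meta.get("labels", {}).get(label) == value:
            names.append(meta.get("name"))
    return names


# Hypothetical node list; real output carries many more fields and labels.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": "worker-gpu-0", "labels": {GPU_LABEL: "true"}}},
        {"metadata": {"name": "worker-1", "labels": {}}},
    ]
})

if __name__ == "__main__":
    print(nodes_with_label(SAMPLE_NODES))
```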



Version-Release number of selected component (if applicable):
Server Version: 4.7.0-0.nightly-2021-01-22-104107
Kubernetes Version: v1.20.0+f0a2ec9

How reproducible:
at least once

Steps to Reproduce:
1. Perform an IPI install of a 4.7 nightly build on AWS.
2. Add a new machineset to bring up a g4dn.xlarge instance with a GPU resource.
3. From the OpenShift Console, deploy the Node Feature Discovery (NFD) Operator from OperatorHub and create an instance of the operand, all in a new project called "test-nfd".
4. Enable cluster-wide entitlement on the cluster by following the steps in https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift
5. Create a new project "test-sro".
6. From the OpenShift Console, deploy the Special Resource Operator into the "test-sro" namespace you just created (the only option offered).

Actual results:
Both the NFD (version 4.5) and SRO operators fail to deploy and remain stuck in the Pending state.

Expected results:
Under Installed Operators on OpenShift console, both operators should have Status Succeeded, Up to Date

Additional info:
Screenshots from the OpenShift console will be uploaded as attachments.

Comment 1 Walid A. 2021-01-23 20:10:10 UTC
Created attachment 1750145
Installed Operators Screen

Comment 2 Walid A. 2021-01-23 20:11:15 UTC
Created attachment 1750146
SRO operator subscription tab screen shot

Comment 3 dagray 2021-01-25 19:39:05 UTC
Thanks Walid. Once the following PR is merged, I think we will be able to deploy SRO from OperatorHub again.

https://github.com/operator-framework/community-operators/pull/3037

Comment 4 dagray 2021-01-29 16:52:05 UTC
FYI Walid and Zvonko, the above PR has merged; however, the SRO bundle has not yet been updated in the catalog, for reasons that are still unknown. I am in touch with the community-operators team and they will investigate next week.

Comment 5 dagray 2021-01-29 20:11:39 UTC
@wabouham this should now be fixed. One note: install NFD before SRO. Currently NFD 4.5 is being installed as the dependency for SRO, and this is not working. However, if you install NFD 4.6 (community or OpenShift operator version) first, it will work. You can even create the simple-kmod specialresource from OperatorHub, as I added it as an example.
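
A small sketch of a pre-install guard based on the advice in this comment, assuming the installed NFD version is available as a string such as a CSV name. The helper names `parse_version` and `nfd_new_enough` are hypothetical, and the 4.6 threshold is taken from this comment only; it is not a documented requirement of SRO.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a string such as 'nfd.v4.5.0' or '4.6.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def nfd_new_enough(installed, minimum=(4, 6, 0)):
    """Return True when the installed NFD version is at least `minimum`."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    for candidate in ("nfd.v4.5.0", "4.6.0", "v4.7.0-202104030128.p0"):
        verdict = "ok" if nfd_new_enough(candidate) else "too old"
        print(f"{candidate}: {verdict}")
```

Comparing integer tuples rather than raw strings keeps the ordering correct once a component reaches two digits, for example 4.10 versus 4.6.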

Comment 6 Walid A. 2021-04-15 16:07:07 UTC
@dagray I am still seeing the same issue on OCP 4.7.0-0.nightly-2021-04-10-082109. I started by installing NFD operator v4.7.0-202104030128.p0 from OperatorHub, and later installed SRO from OperatorHub. It still tries to deploy NFD Operator v4.5.0, and both SRO and NFD operator v4.5.0 stay in the Pending state (Up to Date) indefinitely on that cluster.

Comment 7 dagray 2021-04-19 15:40:10 UTC
@wabouham, I see that the SRO community operator has been updated. The change seemed to work for me, so moving to ON_QA.

Comment 8 Walid A. 2021-04-26 21:21:39 UTC
Verified on OCP 4.7.8 that we can deploy SRO from OperatorHub on the OpenShift Console and create the simple-kmod specialresource successfully.
NFD operator was deployed before deploying SRO.
The kernel module simple-kmod was present on all 3 worker nodes.

# oc debug node/<worker_node>
.
.
sh-4.4# lsmod | grep simple
simple_procfs_kmod     16384  0
simple_kmod            16384  0


# oc get pods -n driver-container-base
NAME                                     READY   STATUS      RESTARTS   AGE
driver-container-base-bf9b01f15741109d   0/1     Completed   0          17m

# oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-bf9b01f15741109d-1-build     0/1     Completed   0          12m
simple-kmod-driver-container-bf9b01f15741109d-5qrtz   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-n8rfk   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-wktng   1/1     Running     0          12m

# oc get pods -n openshift-operators
NAME                                                   READY   STATUS    RESTARTS   AGE
nfd-master-7zqbj                                       1/1     Running   0          35m
nfd-master-bb9gh                                       1/1     Running   0          35m
nfd-master-p8gr7                                       1/1     Running   0          35m
nfd-operator-d8fcb8746-d6tt4                           1/1     Running   0          36m
nfd-worker-4klnn                                       1/1     Running   0          35m
nfd-worker-7968m                                       1/1     Running   0          35m
nfd-worker-czq8s                                       1/1     Running   0          35m
special-resource-controller-manager-587d8fdb9b-tzzdc   2/2     Running   0          25m
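
The pod listings above can be checked mechanically. The following is a minimal sketch that assumes the default tabular `oc get pods` output has been captured as text; the helper name `unhealthy_pods` is hypothetical, and "healthy" is defined here simply as Completed, or Running with every container ready.

```python
def unhealthy_pods(table):
    """Return (name, ready, status) for pods that are neither Completed
    nor Running with all containers ready."""
    problems = []
    for line in table.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        healthy = status == "Completed" or (
            status == "Running" and ready_count == total
        )
        if not healthy:
            problems.append((name, ready, status))
    return problems


# Output copied from the simple-kmod namespace listing in this comment.
SIMPLE_KMOD_PODS = """\
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-bf9b01f15741109d-1-build     0/1     Completed   0          12m
simple-kmod-driver-container-bf9b01f15741109d-5qrtz   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-n8rfk   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-wktng   1/1     Running     0          12m
"""

if __name__ == "__main__":
    print(unhealthy_pods(SIMPLE_KMOD_PODS))
```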

Comment 13 errata-xmlrpc 2021-10-12 19:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3686

