Description of problem:

The Special Resource Operator (SRO) fails to deploy from OperatorHub on an IPI-installed OCP 4.7 cluster with cluster-wide entitlement configured. Both the Node Feature Discovery (NFD) Operator that SRO pulls in as a dependency and SRO itself fail to deploy:

- NFD gets stuck in "Pending - Upgrade Available"
- SRO gets stuck in "Pending - Up to date"

Both remain in this state indefinitely, still Pending after 20 minutes.

Note: When deploying SRO from OperatorHub, it also tries to install the older version 4.5 of the NFD Operator.

# oc get events -n test-sro
LAST SEEN   TYPE     REASON                OBJECT                                                    MESSAGE
15m         Normal   RequirementsUnknown   clusterserviceversion/nfd.v4.5.0                          requirements not yet checked
15m         Normal   RequirementsNotMet    clusterserviceversion/nfd.v4.5.0                          one or more requirements couldn't be found
15m         Normal   RequirementsUnknown   clusterserviceversion/special-resource-operator.v0.0.1    requirements not yet checked
15m         Normal   RequirementsNotMet    clusterserviceversion/special-resource-operator.v0.0.1    one or more requirements couldn't be found
17m         Normal   CreatedSCCRanges      namespace/test-sro

# oc get csv -A | grep sro
test-sro   nfd.v4.5.0                         Node Feature Discovery      4.5.0   Pending
test-sro   special-resource-operator.v0.0.1   Special Resource Operator   0.0.1

The NFD Operator version 4.7 had earlier been deployed successfully from OperatorHub, and a g4dn.xlarge instance with an NVIDIA GPU resource was added as a worker node via a new MachineSet. NFD labeled the node correctly:

feature.node.kubernetes.io/pci-10de.present=true

Version-Release number of selected component (if applicable):
Server Version: 4.7.0-0.nightly-2021-01-22-104107
Kubernetes Version: v1.20.0+f0a2ec9

How reproducible:
At least once

Steps to Reproduce:
1. IPI install on AWS of a 4.7 nightly build.
2. Add a new MachineSet to create a g4dn.xlarge worker instance with a GPU resource.
3. From the OpenShift Console, deploy the Node Feature Discovery (NFD) Operator from OperatorHub and create an instance of the operand, all in a new project called "test-nfd".
4. Enable cluster-wide entitlement on the cluster following the steps in https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift
5. Create a new project "test-sro".
6. From the OpenShift Console, deploy the Special Resource Operator in the namespace just created, "test-sro" (the only option).

Actual results:
Both the NFD (version 4.5) and SRO operators fail to deploy and are stuck in the Pending state.

Expected results:
Under Installed Operators in the OpenShift Console, both operators should show Status "Succeeded, Up to date".

Additional info:
Screenshots from the OpenShift Console will be uploaded as attachments.
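For reference, step 2 of the reproducer can be done with a MachineSet along the following lines. This is a sketch, not taken from the report: the name, `<infra-id>` (the cluster infrastructure ID from `oc get infrastructure cluster`), availability zone, and the omitted AWS providerSpec details (AMI, subnet, IAM profile) are placeholders that are normally copied from an existing worker MachineSet; only `instanceType: g4dn.xlarge` is the relevant change.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infra-id>-gpu-worker-us-east-1a        # placeholder name
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: <infra-id>
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infra-id>
      machine.openshift.io/cluster-api-machineset: <infra-id>-gpu-worker-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infra-id>
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: <infra-id>-gpu-worker-us-east-1a
    spec:
      providerSpec:
        value:
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          instanceType: g4dn.xlarge   # GPU instance type for this reproducer
          # remaining AWS fields (ami, subnet, iamInstanceProfile, etc.)
          # copied unchanged from an existing worker MachineSet
```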
Created attachment 1750145 [details] Installed Operators Screen
Created attachment 1750146 [details] SRO operator subscription tab screen shot
Thanks Walid. Once the following PR is merged, I think we will be able to deploy SRO from OperatorHub again. https://github.com/operator-framework/community-operators/pull/3037
FYI Walid and Zvonko, the above PR has merged; however, the SRO bundle has not yet been updated in the catalog for an unknown reason. I am in touch with the community-operators team and they will investigate next week.
@wabouham this should now be fixed. One note: install NFD before SRO. Currently NFD 4.5 is being installed as the dependency for SRO and this is not working. However, if you install NFD 4.6 (community or openshift operator version) first it will work. You can even create the simple-kmod specialresource from OperatorHub as I added it as an example.
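For completeness, installing NFD first can also be done from the CLI with an OLM Subscription instead of the console. The sketch below is an assumption, not from this report: the channel, starting CSV, and catalog source names vary by cluster and NFD version, so check them with `oc get packagemanifest nfd -n openshift-marketplace` before applying.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd                          # assumed package name; verify via packagemanifest
  namespace: openshift-operators
spec:
  channel: "4.7"                     # placeholder; use the channel the packagemanifest lists
  name: nfd
  source: redhat-operators           # or community-operators for the community build
  sourceNamespace: openshift-marketplace
```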
@dagray I am still seeing the same issues on OCP 4.7.0-0.nightly-2021-04-10-082109. I started with installing the NFD Operator v4.7.0-202104030128.p0 from OperatorHub, and then later installed SRO from OperatorHub. It still tries to deploy the NFD Operator v4.5.0, and both SRO and NFD v4.5.0 remain in the Pending state ("Up to date") indefinitely on that cluster.
@wabouham, I see that the SRO community operator has been updated. The change seemed to work for me, so moving to ON_QA.
Verified on OCP 4.7.8 that we can deploy SRO from OperatorHub on the OpenShift Console and create the simple-kmod specialresource successfully. The NFD Operator was deployed before deploying SRO. The kernel module simple-kmod was present on all 3 worker nodes.

# oc debug node/<worker_node>
.
.
sh-4.4# lsmod | grep simple
simple_procfs_kmod     16384  0
simple_kmod            16384  0

# oc get pods -n driver-container-base
NAME                                     READY   STATUS      RESTARTS   AGE
driver-container-base-bf9b01f15741109d   0/1     Completed   0          17m

# oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-bf9b01f15741109d-1-build     0/1     Completed   0          12m
simple-kmod-driver-container-bf9b01f15741109d-5qrtz   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-n8rfk   1/1     Running     0          12m
simple-kmod-driver-container-bf9b01f15741109d-wktng   1/1     Running     0          12m

# oc get pods -n openshift-operators
NAME                                                    READY   STATUS    RESTARTS   AGE
nfd-master-7zqbj                                        1/1     Running   0          35m
nfd-master-bb9gh                                        1/1     Running   0          35m
nfd-master-p8gr7                                        1/1     Running   0          35m
nfd-operator-d8fcb8746-d6tt4                            1/1     Running   0          36m
nfd-worker-4klnn                                        1/1     Running   0          35m
nfd-worker-7968m                                        1/1     Running   0          35m
nfd-worker-czq8s                                        1/1     Running   0          35m
special-resource-controller-manager-587d8fdb9b-tzzdc    2/2     Running   0          25m
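For anyone reproducing the verification above from the CLI rather than OperatorHub, the simple-kmod example is created as a SpecialResource custom resource. This is a hedged sketch: the apiVersion/group shown is the one used by later SRO releases and may differ for the community bundle verified here, so confirm it with `oc get crd specialresources` first.

```yaml
apiVersion: sro.openshift.io/v1beta1   # assumed group/version; check the installed CRD
kind: SpecialResource
metadata:
  name: simple-kmod
spec:
  namespace: simple-kmod               # matches the namespace in the pod listing above
  chart:
    name: simple-kmod                  # example chart shipped with SRO
```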
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.33 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3686