Description of problem:

I am creating this BZ as a placeholder for when the Special Resource Operator will be able to support deploying the NVIDIA GPU driver stack from OperatorHub.

This is on an OCP 4.6 nightly IPI install on AWS: 3 masters, 3 worker nodes, and one GPU instance (g3.4xlarge) added as a new machineset. The Node Feature Discovery (NFD) operator was deployed successfully from OperatorHub and it labeled the nodes correctly. Cluster-wide entitlement was configured according to the procedure in:
https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift

The Special Resource Operator (SRO), when deployed from OperatorHub, fails to deploy the NVIDIA GPU driver stack needed to run GPU workloads. When deploying SRO in the custom namespace "nvidia-gpu", only the special-resource-operator pod was created in that namespace, along with the SpecialResource CR:

# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS    RESTARTS   AGE
special-resource-operator-546cf98d94-zwcg2   1/1     Running   0          10m

# oc logs -n nvidia-gpu special-resource-operator-546cf98d94-zwcg2
{"level":"info","ts":1601332114.7923079,"logger":"cmd","msg":"Go Version: go1.13.8"}
{"level":"info","ts":1601332114.7923307,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1601332114.7923355,"logger":"cmd","msg":"Version of operator-sdk: v0.10.0"}
{"level":"info","ts":1601332114.7927125,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1601332115.088913,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1601332115.101232,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1601332115.4072561,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1601332115.4082434,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.408354,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4084454,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4085052,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4085648,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4086235,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4086754,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.40873,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4087865,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4088757,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4089956,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.4090657,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1601332115.8612587,"logger":"cmd","msg":"Could not create metrics Service","error":"failed to create or get service for metrics: services \"special-resource-operator-metrics\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1601332115.8810937,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1601332115.9813724,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"specialresource"}
{"level":"info","ts":1601332116.0815537,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"specialresource","worker count":1}
{"level":"info","ts":1601332128.2989557,"logger":"specialresource","msg":"Reconciling SpecialResource","Namespace":"nvidia-gpu","Name":"nvidia-gpu"}
{"level":"info","ts":1601332128.298994,"logger":"specialresource","msg":"Looking for Hardware Configuration ConfigMaps with label specialresource.openshift.io/config: true"}

# oc get specialresource -n nvidia-gpu -o yaml
apiVersion: v1
items:
- apiVersion: sro.openshift.io/v1alpha1
  kind: SpecialResource
  metadata:
    creationTimestamp: "2020-09-28T22:28:48Z"
    generation: 1
    managedFields:
    - apiVersion: sro.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:size: {}
        f:status:
          .: {}
          f:state: {}
      manager: Mozilla
      operation: Update
      time: "2020-09-28T22:28:48Z"
    name: nvidia-gpu
    namespace: nvidia-gpu
    resourceVersion: "1687968"
    selfLink: /apis/sro.openshift.io/v1alpha1/namespaces/nvidia-gpu/specialresources/nvidia-gpu
    uid: 4c5e8367-3ef1-4632-8f4d-bd54df2eb82b
  spec:
    size: 3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Version-Release number of selected component (if applicable):
Server Version: 4.6.0-0.nightly-2020-09-27-075304
Kubernetes Version: v1.19.0+e465e66

How reproducible:
Always

Steps to Reproduce:
1. Create an IPI install of an OCP 4.6 nightly build on AWS, 3 masters and 3 worker nodes
2. Create a new machineset for a g3.4xlarge instance with one GPU on AWS
3. Deploy NFD from OperatorHub in a custom namespace
4. Set up cluster-wide entitlement according to the documentation link above
5. Create namespace "nvidia-gpu"
6. Deploy SRO from OperatorHub in the "nvidia-gpu" namespace and create a SpecialResource instance

Actual results:
Only the special-resource-operator pod is created in the nvidia-gpu namespace:

# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS    RESTARTS   AGE
special-resource-operator-546cf98d94-zwcg2   1/1     Running   0          10m

Expected results:

# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-gpu-device-feature-discovery-2pssp    1/1     Running     0          23m
nvidia-gpu-device-grafana-7d96c95c88-rmjbz   1/1     Running     0          23m
nvidia-gpu-device-monitoring-2tz4p           2/2     Running     0          25m
nvidia-gpu-device-plugin-z48tk               1/1     Running     0          26m
nvidia-gpu-driver-build-1-build              0/1     Completed   0          31m
nvidia-gpu-driver-container-rhel8-mjsjs      1/1     Running     0          32m
nvidia-gpu-runtime-enablement-m7p95          1/1     Running     0          28m
special-resource-operator-668875dcf9-s5fgx   1/1     Running     0          32m

Additional info:
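For reference, step 6 (deploying SRO from OperatorHub into the "nvidia-gpu" namespace) can equivalently be expressed as OLM manifests applied from the CLI. The sketch below is only illustrative: the package name, channel, and catalog source are assumptions and should be checked against `oc get packagemanifests -n openshift-marketplace` before use.

```
# Hypothetical sketch: subscribe to SRO in the nvidia-gpu namespace via OLM.
# Package name, channel, and source are assumptions, not confirmed values.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-og
  namespace: nvidia-gpu
spec:
  targetNamespaces:
  - nvidia-gpu
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: special-resource-operator
  namespace: nvidia-gpu
spec:
  channel: "4.6"                    # assumed channel name
  name: special-resource-operator   # assumed package name
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

Applying these with `oc apply -f` should produce the same installation the console performs when choosing the "nvidia-gpu" namespace as the install target.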
The current workaround is to deploy from source using the branch `sro-list`.
This is still not creating the nvidia-gpu device plugin and driver containers when SRO is deployed from OperatorHub and the nvidia-gpu special resource is created from the console. This was tested on OCP 4.6.5:

# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS    RESTARTS   AGE
nfd-operator-67876f6f5-6msc8                 1/1     Running   0          11m
special-resource-operator-5fd68b4586-hhg2w   1/1     Running   0          11m

# oc logs -n nvidia-gpu special-resource-operator-5fd68b4586-hhg2w
{"level":"info","ts":1606341000.667781,"logger":"cmd","msg":"Go Version: go1.13.8"}
{"level":"info","ts":1606341000.6678047,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1606341000.6678097,"logger":"cmd","msg":"Version of operator-sdk: v0.10.0"}
{"level":"info","ts":1606341000.6682265,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1606341000.7991717,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1606341000.8066115,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1606341000.9134064,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1606341000.9153435,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.915477,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9155464,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9156294,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9157097,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.915772,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.915836,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9159052,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9159617,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9160302,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.916095,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341000.9161553,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource","source":"kind source: /, Kind="}
{"level":"info","ts":1606341001.0426922,"logger":"cmd","msg":"Could not create metrics Service","error":"failed to create or get service for metrics: services \"special-resource-operator-metrics\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1606341001.0624459,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1606341001.1626635,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"specialresource"}
{"level":"info","ts":1606341001.262836,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"specialresource","worker count":1}
{"level":"info","ts":1606341106.9086497,"logger":"specialresource","msg":"Reconciling SpecialResource","Namespace":"nvidia-gpu","Name":"nvidia-gpu"}
{"level":"info","ts":1606341106.9086807,"logger":"specialresource","msg":"Looking for Hardware Configuration ConfigMaps with label specialresource.openshift.io/config: true"}
@wabouham the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1919581 should also work on 4.6. Please verify.
Verified on 4.6.0-0.nightly-2021-04-17-182039 that we can deploy SRO from OperatorHub and create the simple-kmod special resource successfully. The NFD operator was deployed before deploying SRO.

# oc debug node/<worker_node>
.
.
sh-4.4# chroot /host
sh-4.4# lsmod | grep kmod
simple_procfs_kmod     16384  0
simple_kmod            16384  0
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...

# oc get pods -n driver-container-base
NAME                                    READY   STATUS      RESTARTS   AGE
driver-container-base-ee060ca2c5056b7   0/1     Completed   0          20m

# oc get pods -n simple-kmod
NAME                                                 READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-ee060ca2c5056b7-1-build     0/1     Completed   0          16m
simple-kmod-driver-container-ee060ca2c5056b7-74fhf   1/1     Running     0          16m
simple-kmod-driver-container-ee060ca2c5056b7-g4rvp   1/1     Running     0          16m
simple-kmod-driver-container-ee060ca2c5056b7-k54w2   1/1     Running     0          16m

# oc get pods -n openshift-operators
NAME                                                   READY   STATUS    RESTARTS   AGE
nfd-master-9nc78                                       1/1     Running   0          101m
nfd-master-fvndg                                       1/1     Running   0          101m
nfd-master-ws94z                                       1/1     Running   0          101m
nfd-operator-576d77d47f-rkhjd                          1/1     Running   0          103m
nfd-worker-bjcgw                                       1/1     Running   0          101m
nfd-worker-hqfmn                                       1/1     Running   0          101m
nfd-worker-j4thh                                       1/1     Running   0          101m
special-resource-controller-manager-765fbc7f54-mbhsw   2/2     Running   0          21m
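For context, the simple-kmod special resource used in this verification is an instance of the same SpecialResource CRD as the nvidia-gpu CR dumped in the original report. A minimal sketch of such a CR is shown below; the name and spec contents are assumptions based on that earlier dump, and the exact fields depend on the SRO version installed.

```
# Hypothetical sketch of a SpecialResource instance for the simple-kmod
# recipe; spec fields are assumed by analogy with the nvidia-gpu CR above.
apiVersion: sro.openshift.io/v1alpha1
kind: SpecialResource
metadata:
  name: simple-kmod
spec:
  size: 3   # assumed, mirroring the nvidia-gpu CR's spec
```

Creating this CR is what triggers SRO to run the driver-container build and DaemonSets listed in the verification output.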
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.47 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3737