Description of problem:
When deploying SRO 4.9 without NFD, the SRO operator automatically deploys NFD 4.10 in the same namespace. The operator pod then throws "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", which causes the SRO operator to keep restarting.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy SRO 4.9 from OperatorHub.
2. NFD 4.10 is automatically deployed in the same namespace.

Actual results:
The SRO operator keeps restarting with the error "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4":

2021-12-20T07:19:35.363Z INFO cache Nodes {"num": 3}
2021-12-20T07:19:35.460Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:8207b4e6371144d8a715617ddf1f5958b87e26a015da23cfec7ccbefab9cd49f"}
2021-12-20T07:19:37.833Z INFO registry DTK {"kernel-version": "4.18.0-305.28.1.el8_4.x86_64"}
2021-12-20T07:19:37.833Z INFO registry DTK {"rt-kernel-version": "4.18.0-305.28.1.rt7.100.el8_4.x86_64"}
2021-12-20T07:19:37.833Z INFO registry DTK {"rhel-version": "8.4"}
2021-12-20T07:19:37.833Z INFO exit OnError: upgrade.UpdateInfo[upgrade.go:116] OSVersion mismatch NFD: 4.10 vs. DTK: 8.4

Expected results:
The SRO operator deploys successfully.

Additional info:
The same error is also reproducible with matching SRO and NFD versions:
1. Deploy NFD 4.10.
2. Deploy SRO 4.10 using "make deploy" from the source code.
3. Create the simple-kmod SpecialResource.

2022-01-04T03:47:01.447Z ERROR controller.specialresource Reconciler error {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "special-resource-preamble", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", "errorVerbose": "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
After some testing I was able to see a difference in the NFD pods' behavior.

Setup:
- OCP 4.10 through cluster bot.
- SRO 4.10 deployed from source.
- NFD 4.10 deployed from source.

SRO matches the contents of the label "feature.node.kubernetes.io/system-os_release.RHEL_VERSION" against the OS release it finds in the driver toolkit (DTK). If these don't match, we get the error shown in this BZ. The label is set by the NFD pods after creating a CR, which is taken straight from the samples directory: https://github.com/openshift/cluster-nfd-operator/blob/master/config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml

When using version 4.9 of the node-feature-discovery image we get the following labels:

feature.node.kubernetes.io/system-os_release.ID: rhcos
feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION: "4.10"
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
feature.node.kubernetes.io/system-os_release.RHEL_VERSION: "8.4"
feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

When using 4.10 we get these:

feature.node.kubernetes.io/system-os_release.ID: rhcos
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

All of these labels are derived from the same file on the worker:

$ oc debug node/ip-10-0-147-108.us-east-2.compute.internal
Starting pod/ip-10-0-147-108us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.147.108
If you don't see a command prompt, try pressing enter.
sh-4.4# cat /host/etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.84.202112230202-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.84.202112230202-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.10/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.10"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202112230202-0'

With NFD 4.10 we have lost the RHEL_VERSION label, which is the one SRO uses to compare with the DTK. When it is not available, SRO falls back to the VERSION_ID labels, which match the OCP version rather than the OS version, hence the error message.
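The comparison described above can be sketched in Go. This is a hypothetical illustration of the fallback behavior, not SRO's actual code; the function name osVersionFromLabels and the hard-coded values are assumptions based on the labels and log output shown in this report.

```go
package main

import "fmt"

const prefix = "feature.node.kubernetes.io/system-os_release."

// osVersionFromLabels sketches the fallback described above: prefer the
// RHEL_VERSION node label set by NFD, and only fall back to the
// VERSION_ID.major/minor labels (which carry the OCP version, not the OS
// version) when RHEL_VERSION is missing.
func osVersionFromLabels(labels map[string]string) string {
	if v, ok := labels[prefix+"RHEL_VERSION"]; ok {
		return v
	}
	// Fallback path: with NFD 4.10 the RHEL_VERSION label is gone, so this
	// yields "4.10", which mismatches the DTK's rhel-version "8.4".
	return labels[prefix+"VERSION_ID.major"] + "." + labels[prefix+"VERSION_ID.minor"]
}

func main() {
	dtkRHEL := "8.4" // rhel-version reported by the driver toolkit

	nfd49 := map[string]string{ // labels produced by the 4.9 image
		prefix + "RHEL_VERSION":     "8.4",
		prefix + "VERSION_ID.major": "4",
		prefix + "VERSION_ID.minor": "10",
	}
	nfd410 := map[string]string{ // 4.10 image: RHEL_VERSION label lost
		prefix + "VERSION_ID.major": "4",
		prefix + "VERSION_ID.minor": "10",
	}

	for _, c := range []struct {
		name   string
		labels map[string]string
	}{{"NFD 4.9", nfd49}, {"NFD 4.10", nfd410}} {
		nfdVer := osVersionFromLabels(c.labels)
		if nfdVer != dtkRHEL {
			fmt.Printf("%s: OSVersion mismatch NFD: %s vs. DTK: %s\n", c.name, nfdVer, dtkRHEL)
		} else {
			fmt.Printf("%s: versions match (%s)\n", c.name, nfdVer)
		}
	}
}
```

Running this prints a match for the 4.9 label set and the mismatch error for the 4.10 set, which is exactly the difference the two label dumps above show.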
Routing to Eduardo Arango, as this looks like an NFD issue.
Verified Result:

[mirroradmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         29m     Cluster version is 4.10.0-0.nightly-2022-01-11-065245

[mirroradmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                                                         STATUS   ROLES    AGE   VERSION
liqcui-oc4101-k2dmj-master-0.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-1.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-2.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-b-72x79.c.openshift-qe.internal   Ready    worker   36m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-c-cv99r.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754

[mirroradmin@ec2-18-217-45-133 ~]$ oc describe node liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal | grep featur
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
feature.node.kubernetes.io/kernel-selinux.enabled=true
feature.node.kubernetes.io/kernel-version.full=4.18.0-305.30.1.el8_4.x86_64
feature.node.kubernetes.io/kernel-version.major=4
feature.node.kubernetes.io/kernel-version.minor=18
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-1af4.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=rhcos
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=410.84.202201101959-0
feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
feature.node.kubernetes.io/system-os_release.VERSION_ID=4.10
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=10
nfd.node.kubernetes.io/feature-labels:

The RHEL_VERSION label is present again, and simple-kmod deploys successfully:

$ oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          6m36s
simple-kmod-driver-container-7a2fc1535ea1b11f-6dsrf   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-f79lv   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-qt8wd   1/1     Running     0          7m32s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.3 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0057