Bug 2036809

Summary: Special Resource Operator (SRO) - OSVersion mismatch NFD error
Product: OpenShift Container Platform
Component: Node Feature Discovery Operator
Reporter: liqcui
Assignee: Carlos Eduardo Arango Gutierrez <carangog>
QA Contact: liqcui
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.10
CC: aos-bugs, pacevedo, sejug
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2022-03-10 15:56:48 UTC

Description liqcui 2022-01-04 03:57:58 UTC
Description of problem:
When SRO 4.9 is deployed without NFD, the SRO operator automatically deploys NFD 4.10 in the same namespace. The operator pod then throws the error "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", which causes the SRO operator to keep restarting.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy SRO 4.9 from OperatorHub.
2. NFD 4.10 is automatically deployed in the same namespace.


Actual results:

The SRO operator keeps restarting with the error "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4":
2021-12-20T07:19:35.363Z        INFO    cache   Nodes   {"num": 3}
2021-12-20T07:19:35.460Z        INFO    upgrade         History {"entry": "registry.ci.openshift.org/ocp/release@sha256:8207b4e6371144d8a715617ddf1f5958b87e26a015da23cfec7ccbefab9cd49f"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"kernel-version": "4.18.0-305.28.1.el8_4.x86_64"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"rt-kernel-version": "4.18.0-305.28.1.rt7.100.el8_4.x86_64"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"rhel-version": "8.4"}
2021-12-20T07:19:37.833Z        INFO    exit    OnError: upgrade.UpdateInfo[upgrade.go:116] OSVersion mismatch NFD: 4.10 vs. DTK: 8.4  

Expected results:
The SRO operator should deploy successfully.

Additional info:

Alternative way to reproduce:
1. Deploy NFD 4.10.
2. Deploy SRO 4.10 from source code using make deploy.
3. Create simple-kmod.

2022-01-04T03:47:01.447Z        ERROR   controller.specialresource      Reconciler error        {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "special-resource-preamble", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", "errorVerbose": "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
^C

Comment 1 Pablo Acevedo 2022-01-04 17:12:18 UTC
After some testing, I was able to see a difference in the behavior of the NFD pods.

Setup:
- OCP 4.10 through cluster bot.
- SRO 4.10 deployed from source.
- NFD 4.10 deployed from source.

SRO matches the contents of the label "feature.node.kubernetes.io/system-os_release.RHEL_VERSION" against the OS release it finds in the Driver Toolkit (DTK). If these don't match, we get the error shown in the BZ.
This label is set by the NFD pods after a CR is created. The CR is taken straight from the samples directory in https://github.com/openshift/cluster-nfd-operator/blob/master/config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml.

When using version 4.9 of the node-feature-discovery image, we get the following labels:
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION: "4.10"
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION: "8.4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

When using version 4.10, we get these:
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

All of these are taken from the same file on the worker:
$ oc debug node/ip-10-0-147-108.us-east-2.compute.internal
Starting pod/ip-10-0-147-108us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.147.108
If you don't see a command prompt, try pressing enter.
sh-4.4# cat /host/etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.84.202112230202-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.84.202112230202-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.10/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.10"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202112230202-0'
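
To illustrate how the os-release fields above relate to the node labels, here is a minimal Python sketch. It is not NFD's actual implementation. The helper names (parse_os_release, to_labels) and the choice of exported keys are assumptions made for illustration; only the label prefix and the field values come from the output in this comment.

```python
import shlex

# Label prefix as it appears in the node labels quoted in this comment.
LABEL_PREFIX = "feature.node.kubernetes.io/system-os_release."


def parse_os_release(text):
    """Parse os-release KEY=value lines into a dict, stripping shell-style quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # handles both "..." and '...' quoting
        fields[key] = parts[0] if parts else ""
    return fields


def to_labels(fields, keys):
    """Build labels for the requested keys only (the key list is hypothetical)."""
    return {LABEL_PREFIX + k: fields[k] for k in keys if k in fields}


# A few lines copied from the /host/etc/os-release output above.
sample = '''ID="rhcos"
VERSION_ID="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202112230202-0'
'''

fields = parse_os_release(sample)
labels = to_labels(fields, ["ID", "RHEL_VERSION", "VERSION_ID"])
for name, value in sorted(labels.items()):
    print(f"{name}: {value}")
```

If RHEL_VERSION is left out of the key list passed to to_labels, the corresponding label is simply absent, which mirrors the difference between the two label sets shown earlier.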

We have lost the RHEL_VERSION label, which is the one SRO compares against the DTK. If it is not available, SRO falls back to the VERSION_ID labels, which match the OCP version and not the OS version, thus producing the error message.
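
The fallback described here can be sketched in Python as follows. This is a hypothetical reconstruction for illustration, not SRO's code (SRO is written in Go). The function names and the exact fallback rule are assumptions; the label names and the error text are taken from this bug.

```python
PREFIX = "feature.node.kubernetes.io/system-os_release."


def os_version_from_labels(labels):
    """Prefer RHEL_VERSION; fall back to VERSION_ID when it is missing (assumed rule)."""
    rhel = labels.get(PREFIX + "RHEL_VERSION")
    if rhel:
        return rhel
    return labels.get(PREFIX + "VERSION_ID", "")


def check_os_version(labels, dtk_rhel_version):
    """Raise ValueError when the node label version differs from the DTK version."""
    nfd_version = os_version_from_labels(labels)
    if nfd_version != dtk_rhel_version:
        raise ValueError(
            f"OSVersion mismatch NFD: {nfd_version} vs. DTK: {dtk_rhel_version}"
        )
    return nfd_version


# Label set produced with the 4.9 image (RHEL_VERSION present).
labels_49 = {PREFIX + "RHEL_VERSION": "8.4", PREFIX + "VERSION_ID": "4.10"}
# Label set produced with the 4.10 image (RHEL_VERSION missing).
labels_410 = {PREFIX + "VERSION_ID": "4.10"}

print(check_os_version(labels_49, "8.4"))
try:
    check_os_version(labels_410, "8.4")
except ValueError as err:
    print(err)
```

With the RHEL_VERSION label present the comparison succeeds; without it, the fallback value 4.10 is compared against the DTK's 8.4 and the check fails.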

Comment 2 Pablo Acevedo 2022-01-07 14:59:07 UTC
Routing to Eduardo Arango, as this looks like an NFD issue.

Comment 4 liqcui 2022-01-11 14:17:20 UTC
Verified Result:

[mirroradmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         29m     Cluster version is 4.10.0-0.nightly-2022-01-11-065245
[mirroradmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                                                         STATUS   ROLES    AGE   VERSION
liqcui-oc4101-k2dmj-master-0.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-1.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-2.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-b-72x79.c.openshift-qe.internal   Ready    worker   36m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-c-cv99r.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754
[mirroradmin@ec2-18-217-45-133 ~]$  oc describe node liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal |grep featur
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-305.30.1.el8_4.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-1af4.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=410.84.202201101959-0
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.10
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=10
                    nfd.node.kubernetes.io/feature-labels:
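
The check performed by eye on the output above could be scripted along these lines. This is a minimal sketch that assumes the grep output has been captured as text; the helper name parse_label_lines is hypothetical, and the sample lines are copied from the node description above.

```python
REQUIRED = "feature.node.kubernetes.io/system-os_release.RHEL_VERSION"


def parse_label_lines(text):
    """Turn 'key=value' lines from the node description into a dict."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skips blank lines and the annotation line without '='
        key, _, value = line.partition("=")
        labels[key] = value
    return labels


output = """
    feature.node.kubernetes.io/system-os_release.ID=rhcos
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.10
    nfd.node.kubernetes.io/feature-labels:
"""

labels = parse_label_lines(output)
print(labels.get(REQUIRED, "<missing>"))
```

A "<missing>" result would indicate that the node still lacks the RHEL_VERSION label that SRO relies on.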

oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          6m36s
simple-kmod-driver-container-7a2fc1535ea1b11f-6dsrf   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-f79lv   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-qt8wd   1/1     Running     0          7m32s

Comment 7 errata-xmlrpc 2022-03-10 15:56:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.3 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0057