Bug 2036809 - Special Resource Operator(SRO) - OSVersion mismatch NFD error
Summary: Special Resource Operator(SRO) - OSVersion mismatch NFD error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Carlos Eduardo Arango Gutierrez
QA Contact: liqcui
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-04 03:57 UTC by liqcui
Modified: 2022-03-10 15:57 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 15:56:48 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                        Private  Priority  Status  Summary                                           Last Updated
Github                  openshift node-feature-discovery pull 73  0        None      open    Bug 2036809: Add dropped RHEL_VERSION after sync  2022-01-10 13:58:03 UTC
Red Hat Product Errata  RHBA-2022:0057                            0        None      None    None                                              2022-03-10 15:57:16 UTC

Description liqcui 2022-01-04 03:57:58 UTC
Description of problem:
When SRO 4.9 is deployed without NFD, the SRO operator automatically deploys NFD 4.10 in the same namespace. The operator pod then throws the error "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", which causes the SRO operator to keep restarting.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy SRO 4.9 from OperatorHub.
2. NFD 4.10 is automatically deployed in the same namespace.


Actual results:

The SRO operator keeps restarting with the error "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4":
2021-12-20T07:19:35.363Z        INFO    cache   Nodes   {"num": 3}
2021-12-20T07:19:35.460Z        INFO    upgrade         History {"entry": "registry.ci.openshift.org/ocp/release@sha256:8207b4e6371144d8a715617ddf1f5958b87e26a015da23cfec7ccbefab9cd49f"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"kernel-version": "4.18.0-305.28.1.el8_4.x86_64"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"rt-kernel-version": "4.18.0-305.28.1.rt7.100.el8_4.x86_64"}
2021-12-20T07:19:37.833Z        INFO    registry        DTK     {"rhel-version": "8.4"}
2021-12-20T07:19:37.833Z        INFO    exit    OnError: upgrade.UpdateInfo[upgrade.go:116] OSVersion mismatch NFD: 4.10 vs. DTK: 8.4  

Expected results:
The SRO operator should deploy successfully.

Additional info:

Alternatively, the error can be reproduced as follows:
1. Deploy NFD 4.10.
2. Deploy SRO 4.10 from source code using make deploy.
3. Create simple-kmod.

2022-01-04T03:47:01.447Z        ERROR   controller.specialresource      Reconciler error        {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "special-resource-preamble", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: OSVersion mismatch NFD: 4.10 vs. DTK: 8.4", "errorVerbose": "OSVersion mismatch NFD: 4.10 vs. DTK: 8.4\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
^C

Comment 1 Pablo Acevedo 2022-01-04 17:12:18 UTC
After some testing I was able to see a difference in the behavior of the NFD pods.

Setup:
- OCP 4.10 through cluster bot.
- SRO 4.10 deployed from source.
- NFD 4.10 deployed from source.

SRO matches the contents of the label "feature.node.kubernetes.io/system-os_release.RHEL_VERSION" against the OS release it finds in the driver toolkit (DTK). If these don't match, we get the error shown in the BZ.
This label is set by NFD pods after creating a CR, which is taken straight from the samples directory in https://github.com/openshift/cluster-nfd-operator/blob/master/config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml.

With version 4.9 of the node-feature-discovery image, we get the following labels:
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION: "4.10"
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION: "8.4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

With version 4.10, we get these:
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 410.84.202112230202-0
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "4.10"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "4"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "10"

All of these are taken from the same file on the worker:
$ oc debug node/ip-10-0-147-108.us-east-2.compute.internal
Starting pod/ip-10-0-147-108us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.147.108
If you don't see a command prompt, try pressing enter.
sh-4.4# cat /host/etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.84.202112230202-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.84.202112230202-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.10/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.10"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202112230202-0'
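The mapping from os-release keys to node labels can be sketched as follows. This is a minimal Python illustration, not NFD's actual implementation (NFD is written in Go). The function names `parse_os_release` and `os_release_labels` and the `exported_keys` parameter are hypothetical, and the derived VERSION_ID.major and VERSION_ID.minor labels are left out for brevity.

```python
# Label prefix as it appears in the node labels quoted in this comment.
LABEL_PREFIX = "feature.node.kubernetes.io/system-os_release."


def parse_os_release(text):
    """Parse KEY=VALUE lines of an os-release file into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip one pair of matching single or double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values


def os_release_labels(text, exported_keys):
    """Build labels only for the keys listed in exported_keys (hypothetical)."""
    values = parse_os_release(text)
    return {LABEL_PREFIX + k: values[k] for k in exported_keys if k in values}


# Excerpt of the os-release file shown above.
SAMPLE = """ID="rhcos"
VERSION_ID="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202112230202-0'
"""

# If RHEL_VERSION is not in the exported key list, no label is produced for
# it, even though the value is still present in the file.
with_rhel = os_release_labels(SAMPLE, ["ID", "VERSION_ID", "RHEL_VERSION"])
without_rhel = os_release_labels(SAMPLE, ["ID", "VERSION_ID"])
```

The point of the sketch is that the file content is identical in both cases. Only the set of keys turned into labels differs.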

We have lost the RHEL_VERSION label, which is the one SRO compares against the DTK. If it is not available, SRO falls back to the VERSION_ID labels, which match the OCP version rather than the OS version, hence the error message.
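The comparison and fallback described in this comment can be sketched as follows. This is a hypothetical Python illustration, not SRO's actual code (SRO is written in Go). The function names, the exception type, and the exact fallback rule are assumptions based only on the behavior described here and on the error message in the report.

```python
PREFIX = "feature.node.kubernetes.io/system-os_release."


class OSVersionMismatch(Exception):
    """Raised when the NFD-reported OS version differs from the DTK one."""


def nfd_os_version(labels):
    """Prefer the RHEL_VERSION label; fall back to VERSION_ID (assumed rule)."""
    rhel = labels.get(PREFIX + "RHEL_VERSION")
    if rhel:
        return rhel
    return labels.get(PREFIX + "VERSION_ID", "")


def check_os_version(labels, dtk_rhel_version):
    """Compare the version derived from node labels with the DTK RHEL version."""
    nfd = nfd_os_version(labels)
    if nfd != dtk_rhel_version:
        raise OSVersionMismatch(
            f"OSVersion mismatch NFD: {nfd} vs. DTK: {dtk_rhel_version}"
        )
    return nfd


# Labels as produced by the 4.10 image (RHEL_VERSION missing).
labels_410 = {PREFIX + "ID": "rhcos", PREFIX + "VERSION_ID": "4.10"}
# Labels as produced by the 4.9 image (RHEL_VERSION present).
labels_49 = {**labels_410, PREFIX + "RHEL_VERSION": "8.4"}
```

With `labels_49` the check passes against a DTK version of "8.4". With `labels_410` the fallback yields "4.10" and the check raises the mismatch error.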

Comment 2 Pablo Acevedo 2022-01-07 14:59:07 UTC
Routing to Eduardo Arango, as it looks like an NFD issue.

Comment 4 liqcui 2022-01-11 14:17:20 UTC
Verified Result:

[mirroradmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         29m     Cluster version is 4.10.0-0.nightly-2022-01-11-065245
[mirroradmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                                                         STATUS   ROLES    AGE   VERSION
liqcui-oc4101-k2dmj-master-0.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-1.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-master-2.c.openshift-qe.internal         Ready    master   48m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-b-72x79.c.openshift-qe.internal   Ready    worker   36m   v1.22.1+6859754
liqcui-oc4101-k2dmj-worker-c-cv99r.c.openshift-qe.internal   Ready    worker   38m   v1.22.1+6859754
[mirroradmin@ec2-18-217-45-133 ~]$  oc describe node liqcui-oc4101-k2dmj-worker-a-kxbff.c.openshift-qe.internal |grep featur
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-305.30.1.el8_4.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-1af4.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=410.84.202201101959-0
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.10
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=10
                    nfd.node.kubernetes.io/feature-labels:
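
A check like the one above can be automated. The following Python sketch parses `key=value` label lines, such as the output of the `oc describe node ... | grep featur` command, and reports whether the RHEL_VERSION label is present. The helper name `parse_label_lines` is hypothetical, and the parsing is deliberately simple: it assumes label values contain no whitespace.

```python
RHEL_LABEL = "feature.node.kubernetes.io/system-os_release.RHEL_VERSION"


def parse_label_lines(output):
    """Collect key=value pairs from lines of label output into a dict."""
    labels = {}
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Use the last token so a leading "Labels:" column is ignored.
        token = tokens[-1]
        if "=" not in token:
            continue
        key, _, value = token.partition("=")
        labels[key] = value
    return labels


# Excerpt of the verified output above.
SAMPLE_OUTPUT = """
    feature.node.kubernetes.io/system-os_release.ID=rhcos
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.10
    nfd.node.kubernetes.io/feature-labels:
"""

labels = parse_label_lines(SAMPLE_OUTPUT)
has_rhel_version = RHEL_LABEL in labels
```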

oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          6m36s
simple-kmod-driver-container-7a2fc1535ea1b11f-6dsrf   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-f79lv   1/1     Running     0          7m32s
simple-kmod-driver-container-7a2fc1535ea1b11f-qt8wd   1/1     Running     0          7m32s

Comment 7 errata-xmlrpc 2022-03-10 15:56:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.3 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0057

