Description of problem:

1. It's an SNO deployment, so master and worker are on the same node.
2. install-config.yaml was used during deployment to set the fips option to true.
3. As this is a telco use case, some kernel parameters need to be tuned (for use with SR-IOV, for example).
4. To tune those parameters we use PAO (Performance Addon Operator) and apply a profile [1].
5. The parameters below were changed on the FIPS-enabled cluster. We don't know which parameter is incompatible with the RT kernel, but the error message points to FIPS: something is trying to revert the FIPS setting [2].
6. If we disable the RT kernel and deploy the performance profile, it works fine.

It's difficult to tell whether this is an issue with the RT kernel or whether there is any correlation with FIPS.

[1]
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perfprofile-policy
spec:
  additionalKernelArgs:
  - idle=poll
  - rcupdate.rcu_normal_after_boot=0
  - nosmt
  cpu:
    isolated: 4-29
    reserved: 0-3,30-31
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 100
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true

[2] "can't reconcile config rendered-master-e8e27928dc39315c31039269fa97acaf with rendered-master-d93b203982e5acfc85030325016e6a90: detected change to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable".

Version-Release number of selected component (if applicable):

How reproducible: Every time with FIPS & RT kernel via performance profile

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
As per the error messages, the master node is stuck in a degraded state because one of the MachineConfigs is trying to change the FIPS status.
~~~
master: 'pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node sno-offline.ocp410.lab.com is reporting: \"can''t reconcile config rendered-master-e8e27928dc39315c31039269fa97acaf with rendered-master-d93b203982e5acfc85030325016e6a90: detected change to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable\""'
~~~

Current: rendered-master-e8e27928dc39315c31039269fa97acaf
~~~
extensions: []
fips: true
~~~

Desired: rendered-master-d93b203982e5acfc85030325016e6a90
~~~
extensions: []
fips: false
kernelArguments:
~~~

This MachineConfig is triggering the change: 50-nto-master
~~~
extensions: null
fips: false
kernelArguments:
- skew_tick=1
~~~
Based on the origin of the snippet with fips: false I am moving this to NTO.
David Gray did some investigation on this yesterday. This is not a PAO bug, but at this point I'm not sure it is an NTO bug either. NTO doesn't (explicitly) set fips anywhere -- the field is left out when creating MachineConfigs. I'll dig more into this today.
The MachineConfig structure in MCO only allows true/false (it is NOT tri-state): https://github.com/openshift/machine-config-operator/blob/master/pkg/apis/machineconfiguration.openshift.io/v1/types.go#L205 And the MCO documentation says "If any of the configuration has FIPS enabled, it'll be set." here https://github.com/openshift/machine-config-operator/blob/61159678d9bf051a5e8a017210a349f8c643b910/docs/MachineConfiguration.md?plain=1#L226 The MCO logic seems to confirm that: https://github.com/openshift/machine-config-operator/blob/eff34a518152841050b161b81cebd656ff6ff4cd/pkg/controller/common/helpers.go#L100
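To make the documented rule concrete, here is a minimal Python sketch of "if any of the configuration has FIPS enabled, it'll be set". It is not the MCO's Go code; the `MachineConfigSpec` class and the `merged_fips` helper are hypothetical names used only for illustration. Because the field is a plain boolean, an MC that omits fips (as NTO does) looks the same as one that sets fips: false, and the documented merge should still yield true regardless of ordering.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MachineConfigSpec:
    """Hypothetical, simplified stand-in for the fields discussed in this bug."""
    name: str
    fips: bool = False      # not tri-state: "omitted" and "false" are the same value
    kernel_type: str = ""


def merged_fips(configs: Iterable[MachineConfigSpec]) -> bool:
    """Documented rule: FIPS is set if any input MachineConfig enables it."""
    return any(mc.fips for mc in configs)


# MC names taken from this report; the field values mirror the grep output above.
configs = [
    MachineConfigSpec("50-nto-master"),                               # fips omitted
    MachineConfigSpec("50-realtime-kernel", kernel_type="realtime"),  # fips omitted
    MachineConfigSpec("99-master-fips", fips=True),
]

print(merged_fips(configs))
print(merged_fips(list(reversed(configs))))
```

Under this rule the rendered config for the pool should keep fips: true in both orderings, which is why a rendered config with fips: false is unexpected.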
Is it possible there was no MachineConfig with FIPS set at all in the running cluster at the time this bug happened?
Searching for a minimum reproducer. With a fips-enabled cluster and a simple MC targeting master, I cannot reproduce the issue right now. There could be another issue/race to do with the fact that the realtime kernel is also enabled at the same time -- I didn't try that yet. Will also look at the attached must-gather and possibly involve MCO folk. Based on the docs (and https://bugzilla.redhat.com/show_bug.cgi?id=2096496#c4) I doubt NTO/PAO needs to do things differently.

$ oc get -o yaml mc/99-master-fips|grep -iC1 fips
    machineconfiguration.openshift.io/role: master
  name: 99-master-fips
  resourceVersion: "1715"
--
  extensions: null
  fips: true
  kernelArguments: null

$ oc get -o yaml mc/50-nto-master|grep -iC1 fips
  extensions: null
  fips: false
  kernelArguments:
  - trigger-sno-fips-issue=1
  kernelType: ""

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   True      False      False      1              1                   1                     0                      159m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      159m
I believe I have a minimal(ish) reproducer without involving PAO. Will try without involving NTO later on.

1) Install a fips enabled cluster.

2)
oc create -f- <<'EOF'
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-sno-fips
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [bootloader]
      cmdline_ocp_realtime=trigger-sno-fips-issue=1
    name: openshift-sno-fips
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "master"
    priority: 20
    profile: openshift-sno-fips
EOF

3) Wait for MCP to be updated.

4) Enable realtime kernel for the cluster.
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF

5)
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   False     True       True       1              0                   0                     1                      3h5m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      3h5m

$ oc get no
NAME   STATUS   ROLES           AGE    VERSION
sno    Ready    master,worker   3h8m   v1.23.5+3afdacb

$ oc get mcp/master -o yaml|grep -i reconci
message: 'Node sno is reporting: "can''t reconcile config rendered-master-4da522dcf9c8593c573d26f7e021cc8a flag; refusing to modify FIPS on a running cluster: unreconcilable"'

MCO team, could you please take a look?
Minimum reproducer without NTO/PAO.

1) Install a fips enabled cluster.

2)
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-fips-bz-poc
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
  - trigger-sno-fips-issue=1
EOF

3) Wait for MCP to be updated.

4)
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF
This is an issue with the MCO's MC merge logic. It is not a regression but an issue that has always existed, and it essentially prevents you from setting up a FIPS + realtime enabled cluster IF AND ONLY IF the FIPS MachineConfig has higher alphanumeric priority than the realtime MC (which is normally the case, since the FIPS MC has a 99 prefix).

I will apply a patch for this. In the meantime, there is a workaround:

1. Create a 01-fips MC that is just:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 01-fips-master
spec:
  fips: true
```

This is just a dummy MC. In theory it gets overwritten by the 99-fips MC anyway, and you can't otherwise change FIPS, so it is a no-op. If you are reinstalling, just drop this into the install manifests folder (while also specifying FIPS: true in the install configs). No other action should be needed.

2. If you are on a running cluster and it isn't degraded, just apply this MC before the realtime MC.

3. If you are on a running cluster and it is already degraded, then after applying the MC a new rendered MC should be generated (we'll call this rendered-master-cccc; it should match the current desired config, with the only change being FIPS: true instead of FIPS: false). You can diff the two MCs to make sure. Find the node that is degraded, and do:

`oc edit node/xxxx`

You should see in the annotations that:

machineconfiguration.openshift.io/currentConfig: rendered-master-aaaa
machineconfiguration.openshift.io/desiredConfig: rendered-master-bbbb

The node is attempting an update from the old MC to the second-newest MC, which is the broken one. Change that rendered-master-bbbb by hand to the rendered-master-cccc that was just generated.

Marking as high priority. Is there a reason this is considered urgent severity? The customer case seems to be normal severity; I just wanted to make sure I understand why this was marked urgent.
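For the "diff the two MCs" check in workaround step 3, a small Python sketch along these lines could confirm that the broken and the newly generated rendered configs differ only in the fips field. The `changed_fields` helper is hypothetical, and the two dictionaries are illustrative stand-ins for the `spec` sections of rendered-master-bbbb and rendered-master-cccc, not data taken from a real cluster.

```python
def changed_fields(old_spec: dict, new_spec: dict) -> dict:
    """Return {field: (old_value, new_value)} for top-level fields that differ."""
    keys = old_spec.keys() | new_spec.keys()
    return {
        key: (old_spec.get(key), new_spec.get(key))
        for key in sorted(keys)
        if old_spec.get(key) != new_spec.get(key)
    }


# Illustrative values only: assumed shape of the two rendered MC specs.
spec_bbbb = {"extensions": [], "fips": False, "kernelType": "realtime"}
spec_cccc = {"extensions": [], "fips": True, "kernelType": "realtime"}

diff = changed_fields(spec_bbbb, spec_cccc)
print(diff)
```

If anything other than fips shows up in the result, the newly rendered config is not a drop-in replacement for the desired one, and the desiredConfig annotation should not be edited by hand.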
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069