Bug 2096496
Summary: | FIPS issue on OCP SNO with RT Kernel via performance profile | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Anil Dhingra <adhingra> |
Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | high | CC: | aygarg, dagray, jerzhang, jmencak, mco-triage, mkrejci, nm-s, sregidor |
Version: | 4.10 | ||
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
If you attempt to create a cluster with both FIPS and the realtime kernel enabled, the MCO is highly likely to degrade due to a merge logic issue within the code.
Consequence:
The MCO will degrade. This can be worked around with the steps in https://bugzilla.redhat.com/show_bug.cgi?id=2096496#c10
Fix:
Apply the workaround above, or update to a version with the MCO fix.
Result:
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-08-10 11:17:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2099686 |
Description
Anil Dhingra
2022-06-14 03:58:51 UTC
Based on the origin of the snippet with fips: false, I am moving this to NTO.

David Gray did some investigation on this yesterday. This is not a PAO bug, but at this point I'm not sure this is an NTO bug either. NTO doesn't (explicitly) set fips anywhere; it is left out when creating MachineConfigs. I'll dig more into this today.

The MachineConfig structure in MCO only allows true/false (it is NOT tri-state):
https://github.com/openshift/machine-config-operator/blob/master/pkg/apis/machineconfiguration.openshift.io/v1/types.go#L205
And the MCO documentation says "If any of the configuration has FIPS enabled, it'll be set." here:
https://github.com/openshift/machine-config-operator/blob/61159678d9bf051a5e8a017210a349f8c643b910/docs/MachineConfiguration.md?plain=1#L226
The MCO logic seems to confirm that:
https://github.com/openshift/machine-config-operator/blob/eff34a518152841050b161b81cebd656ff6ff4cd/pkg/controller/common/helpers.go#L100
Is it possible there was no MachineConfig with FIPS set at all in the running cluster at the time this bug happened?

Searching for a minimum reproducer. With a FIPS-enabled cluster and a simple MC targeting master, I cannot reproduce the issue right now. There could be another issue/race related to the realtime kernel being enabled at the same time; I haven't tried that yet. I will also look at the attached must-gather and possibly involve the MCO folks. Based on the docs (and https://bugzilla.redhat.com/show_bug.cgi?id=2096496#c4) I doubt NTO/PAO needs to do things differently.
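The documented rule quoted above ("If any of the configuration has FIPS enabled, it'll be set") amounts to a logical OR across the fips fields of all MachineConfigs in a pool. The following is a minimal shell sketch of that rule only, not the MCO's actual Go implementation. The function name `merge_fips` is hypothetical, and an MC that omits fips is assumed to be passed in as `false`, since the field is a plain boolean.

```shell
# merge_fips: hypothetical helper illustrating the documented OR rule.
# Arguments: the fips value ("true" or "false") of each MachineConfig.
# Prints "true" if any argument is "true", otherwise "false".
merge_fips() {
  for value in "$@"; do
    if [ "$value" = "true" ]; then
      echo true
      return 0
    fi
  done
  echo false
}

# An MC that omits fips reads as false; 99-master-fips has fips: true.
merge_fips false true
```

Under this rule, an MC that leaves fips unset should not be able to turn FIPS off in the rendered config, which is why the refusal to modify FIPS seen in this bug is unexpected.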
$ oc get -o yaml mc/99-master-fips|grep -iC1 fips
    machineconfiguration.openshift.io/role: master
  name: 99-master-fips
  resourceVersion: "1715"
--
  extensions: null
  fips: true
  kernelArguments: null

$ oc get -o yaml mc/50-nto-master|grep -iC1 fips
  extensions: null
  fips: false
  kernelArguments:
  - trigger-sno-fips-issue=1
  kernelType: ""

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   True      False      False      1              1                   1                     0                      159m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      159m

I believe I have a minimal(ish) reproducer without involving PAO. Will try without involving NTO later on.

1) Install a fips enabled cluster.

2)
oc create -f- <<'EOF'
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-sno-fips
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [bootloader]
      cmdline_ocp_realtime=trigger-sno-fips-issue=1
    name: openshift-sno-fips
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "master"
    priority: 20
    profile: openshift-sno-fips
EOF

3) Wait for MCP to be updated.

4) Enable realtime kernel for the cluster.
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF

5)
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   False     True       True       1              0                   0                     1                      3h5m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      3h5m

$ oc get no
NAME   STATUS   ROLES           AGE    VERSION
sno    Ready    master,worker   3h8m   v1.23.5+3afdacb

$ oc get mcp/master -o yaml|grep -i reconci
    message: 'Node sno is reporting: "can''t reconcile config rendered-master-4da522dcf9c8593c573d26f7e021cc8a flag; refusing to modify FIPS on a running cluster: unreconcilable"'

MCO team, could you please take a look?

Minimum reproducer without NTO/PAO.

1) Install a fips enabled cluster.

2)
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-fips-bz-poc
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - trigger-sno-fips-issue=1
EOF

3) Wait for MCP to be updated.

4)
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF

This is an issue with the MCO's MC merge logic. This is not a regression but an issue that has always existed, which essentially prevents you from setting up a FIPS + realtime enabled cluster IF AND ONLY IF the FIPS MachineConfig has higher alphanumeric priority compared to the realtime MC (which is normally the case, since the FIPS MC has a 99 prefix).

I will apply a patch for this. In the meantime, there is a workaround:

1.
Create a 01-fips MC that is just:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 01-fips-master
spec:
  fips: true
```
This is just a dummy MC. In theory it gets overwritten by the 99-fips MC anyway, and you can't otherwise change FIPS, so it is a no-op. If you are reinstalling, just drop this into the install manifests folder (while also specifying FIPS: true in the install configs). No other action should be needed.

2. If on a running cluster and you aren't degraded, just apply this MC before the realtime MC.

3. If on a running cluster and you are already degraded, after applying the MC, a new rendered-MC should be generated (we'll call this rendered-master-cccc, which should be the current desired config with the only change being FIPS: true instead of FIPS: false). You can diff the two MCs to make sure. Find the node that has the degradation, and do:
`oc edit node/xxxx`
You should see in the annotations that:
machineconfiguration.openshift.io/currentConfig: rendered-master-aaaa
machineconfiguration.openshift.io/desiredConfig: rendered-master-bbbb
It is trying an update from an old MC to the second newest MC that is broken. Change that rendered-master-bbbb by hand to the rendered-master-cccc that was just generated.

Marking as high priority.

Is there a reason this is considered urgent severity? The customer case seems to be normal severity; I just wanted to make sure I understand why this was marked urgent.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069
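For step 3 of the workaround above, the desiredConfig annotation can also be changed non-interactively instead of through `oc edit`. This is a minimal sketch, assuming a JSON merge patch applied with `oc patch`. The helper name `desired_config_patch` is hypothetical, and `xxxx` and `rendered-master-cccc` are the placeholders from the workaround, not real names.

```shell
# desired_config_patch: hypothetical helper that prints a JSON merge patch
# setting the desiredConfig annotation to the given rendered MC name.
desired_config_patch() {
  printf '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"%s"}}}\n' "$1"
}

# Print the patch for the placeholder name used in the workaround.
desired_config_patch rendered-master-cccc

# To apply it on a real cluster (node name xxxx is a placeholder):
# oc patch node xxxx --type merge -p "$(desired_config_patch rendered-master-cccc)"
```

The `oc patch` call is left commented out so the snippet can be run without a cluster; inspect the printed patch before applying it to the degraded node.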