Bug 2096496

Summary:	FIPS issue on OCP SNO with RT Kernel via performance profile
Product:	OpenShift Container Platform	Reporter:	Anil Dhingra <adhingra>
Component:	Machine Config Operator	Assignee:	Yu Qi Zhang <jerzhang>
Machine Config Operator sub component:	Machine Config Operator	QA Contact:	Rio Liu <rioliu>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	high	CC:	aygarg, dagray, jerzhang, jmencak, mco-triage, mkrejci, nm-s, sregidor
Version:	4.10
Target Milestone:	---
Target Release:	4.11.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: If you attempt to create a cluster with both FIPS and the realtime kernel enabled, the MCO is very likely to degrade because of a merge logic issue in the code. Consequence: The MCO degrades. This can be worked around as described in https://bugzilla.redhat.com/show_bug.cgi?id=2096496#c10 Fix: Apply the workaround above or update to a version with the MCO fix. Result:	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 11:17:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2099686

Description Anil Dhingra 2022-06-14 03:58:51 UTC

Description of problem:
1 - It's an SNO deployment, so master and worker are on the same node.
2 - FIPS was enabled during deployment by setting the fips option to true in install-config.yaml.
3 - Because this is a telco use case, some kernel parameters need to be tuned (for example, for use with SR-IOV).

4 - To tune those parameters we use PAO (Performance Addon Operator) and apply a profile [1].
5 - The parameters below were changed on the FIPS-enabled cluster. We do not know which parameter is incompatible with the RT kernel, but the error message points to FIPS: something is trying to revert the FIPS setting [2].
6 - If we disable the RT kernel and deploy the performance profile, it works fine.

It's difficult to tell whether this is an issue with the RT kernel alone or whether it is correlated with FIPS.

[1]
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perfprofile-policy
spec:
  additionalKernelArgs:
    - idle=poll
    - rcupdate.rcu_normal_after_boot=0
    - nosmt
  cpu:
    isolated: 4-29
    reserved: 0-3,30-31
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 100
        size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
[2]
"can't reconcile config rendered-master-e8e27928dc39315c31039269fa97acaf with rendered-master-d93b203982e5acfc85030325016e6a90: detected change to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable".


Version-Release number of selected component (if applicable):


How reproducible:

Every time with FIPS & RT kernel via performance profile 

Additional info:

According to the error messages, the master node is stuck in a degraded state because one of the MachineConfigs is trying to change the FIPS status.

~~~
      master: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
      status on sync": "Node sno-offline.ocp410.lab.com is reporting: \"can''t reconcile
      config rendered-master-e8e27928dc39315c31039269fa97acaf with rendered-master-d93b203982e5acfc85030325016e6a90:
      detected change to FIPS flag; refusing to modify FIPS on a running cluster:
      unreconcilable\""'
~~~
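For triage, the two rendered config names can be pulled out of a message like the one above with a small script. This is a minimal sketch, assuming the message keeps the wording shown in this report; the function name `parse_reconcile_error` and the regular expression are hypothetical and not part of any OpenShift tool.

```python
import re

# Pattern derived from the single sample message in this report (an assumption).
# "can'+t" also accepts the doubled quote that YAML single-quoting produces.
RECONCILE_RE = re.compile(
    r"can'+t reconcile config (rendered-[\w-]+) with (rendered-[\w-]+): (.*)"
)


def parse_reconcile_error(message):
    """Return (current, desired, reason), or None if the text does not match."""
    # Collapse the line wrapping that YAML output introduces.
    flat = " ".join(message.split())
    match = RECONCILE_RE.search(flat)
    if match is None:
        return None
    current, desired, reason = match.groups()
    # Drop trailing quote and backslash characters left over from nested quoting.
    return current, desired, reason.rstrip("\"'\\ ")


sample = (
    "can't reconcile config rendered-master-e8e27928dc39315c31039269fa97acaf "
    "with rendered-master-d93b203982e5acfc85030325016e6a90: detected change "
    "to FIPS flag; refusing to modify FIPS on a running cluster: unreconcilable"
)
print(parse_reconcile_error(sample))
```

The first element is the config the node currently runs and the second is the one it refuses to apply, which are the two rendered configs compared below.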

Current : rendered-master-e8e27928dc39315c31039269fa97acaf 

~~~
  extensions: []
  fips: true
~~~

Desired : rendered-master-d93b203982e5acfc85030325016e6a90

~~~
  extensions: []
  fips: false
  kernelArguments:
~~~

This MachineConfig is triggering the change:

50-nto-master

~~~
  extensions: null
  fips: false
  kernelArguments:
  - skew_tick=1
~~~

Comment 1 Martin Sivák 2022-06-14 06:43:46 UTC

Based on the origin of the snippet with fips: false I am moving this to NTO.

Comment 3 Jiří Mencák 2022-06-15 05:35:40 UTC

David Gray did some investigation on this yesterday.  This is not a PAO bug, but at this point I'm not sure this is an NTO bug either.  NTO doesn't (explicitly) set fips anywhere -- it is left out when creating MachineConfigs.  I'll dig more into this today.

Comment 4 Martin Sivák 2022-06-15 06:44:29 UTC

The MachineConfig structure in MCO only allows true/false (it is NOT tri-state): https://github.com/openshift/machine-config-operator/blob/master/pkg/apis/machineconfiguration.openshift.io/v1/types.go#L205

And the MCO documentation says "If any of the configuration has FIPS enabled, it'll be set." here https://github.com/openshift/machine-config-operator/blob/61159678d9bf051a5e8a017210a349f8c643b910/docs/MachineConfiguration.md?plain=1#L226

The MCO logic seems to confirm that: https://github.com/openshift/machine-config-operator/blob/eff34a518152841050b161b81cebd656ff6ff4cd/pkg/controller/common/helpers.go#L100
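The documented behaviour quoted above can be written down as a toy model. This is a minimal sketch of the stated rule only, not the MCO's Go code; the function name `merged_fips` and the dictionary layout are assumptions made for illustration.

```python
def merged_fips(machine_configs):
    """FIPS is on in the rendered config if any input MachineConfig enables it."""
    # A missing field counts as False, mirroring a two-state (not tri-state) boolean.
    return any(mc.get("spec", {}).get("fips", False) for mc in machine_configs)


configs = [
    {"metadata": {"name": "50-nto-master"}, "spec": {"fips": False}},
    {"metadata": {"name": "99-master-fips"}, "spec": {"fips": True}},
]
print(merged_fips(configs))  # True
```

Under this rule an MC that leaves fips unset, or sets it to false, should never turn FIPS off as long as another MC enables it.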

Comment 5 Martin Sivák 2022-06-15 06:46:30 UTC

Is it possible there was no MachineConfig with FIPS set at all in the running cluster at the time this bug happened?

Comment 6 Jiří Mencák 2022-06-15 12:18:45 UTC

Searching for a minimum reproducer.  With a FIPS-enabled cluster and a simple MC targeting master, I cannot reproduce the issue right now.  There could be another issue/race to do with the fact that the realtime kernel is also enabled at the same time -- I didn't do that yet.  Will also look at the attached must-gather and possibly involve MCO folk.  Based on the docs (and https://bugzilla.redhat.com/show_bug.cgi?id=2096496#c4) I doubt NTO/PAO needs to do things differently.

$ oc get -o yaml mc/99-master-fips|grep -iC1 fips
    machineconfiguration.openshift.io/role: master
  name: 99-master-fips
  resourceVersion: "1715"
--
  extensions: null
  fips: true
  kernelArguments: null

$ oc get -o yaml mc/50-nto-master|grep -iC1 fips
  extensions: null
  fips: false
  kernelArguments:
  - trigger-sno-fips-issue=1
  kernelType: ""

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   True      False      False      1              1                   1                     0                      159m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      159m

Comment 7 Jiří Mencák 2022-06-15 12:45:17 UTC

I believe I have a minimal(ish) reproducer without involving PAO.  Will try without involving NTO later on.

1) Install a fips enabled cluster.

2) oc create -f- <<'EOF'
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-sno-fips
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [bootloader]
      cmdline_ocp_realtime=trigger-sno-fips-issue=1
    name: openshift-sno-fips

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "master"
    priority: 20
    profile: openshift-sno-fips
EOF

3) Wait for MCP to be updated.

4) Enable realtime kernel for the cluster.
oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF

5) The master pool becomes degraded:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da522dcf9c8593c573d26f7e021cc8a   False     True       True       1              0                   0                     1                      3h5m
worker   rendered-worker-894edb5e1755f80795fd35bafebde9aa   True      False      False      0              0                   0                     0                      3h5m

$ oc get no
NAME   STATUS   ROLES           AGE    VERSION
sno    Ready    master,worker   3h8m   v1.23.5+3afdacb

$ oc get mcp/master -o yaml|grep -i reconci
    message: 'Node sno is reporting: "can''t reconcile config rendered-master-4da522dcf9c8593c573d26f7e021cc8a
      flag; refusing to modify FIPS on a running cluster: unreconcilable"'

MCO team, could you please take a look?

Comment 8 Jiří Mencák 2022-06-15 13:57:50 UTC

Minimum reproducer without NTO/PAO.

1) Install a fips enabled cluster.

2) oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-fips-bz-poc
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
  - trigger-sno-fips-issue=1
EOF

3) Wait for MCP to be updated.

4) oc create -f- <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-realtime-kernel
spec:
  config:
    ignition:
      version: 3.2.0
  kernelType: "realtime"
EOF

Comment 10 Yu Qi Zhang 2022-06-20 19:22:55 UTC

This is an issue with the MCO's MC merge logic. This is not a regression but an issue that has always existed, which essentially prevents you from setting up a FIPS + realtime enabled cluster IF AND ONLY IF the FIPS MachineConfig has higher alphanumeric priority than the realtime MC (which is normally the case, since the FIPS MC has a 99 prefix).
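To make the ordering dependence concrete, here is a toy model of how such a merge bug could arise. It is a sketch under stated assumptions, not the actual MCO code: it assumes a single loop over name-sorted MCs that stops as soon as it sees a realtime MC, so any FIPS MC sorted after it is never inspected. All function names and the flattened dictionary layout are hypothetical.

```python
def merge_with_early_exit(machine_configs):
    """Toy merge: one loop handles both fields and stops at the first realtime MC."""
    fips, kernel_type = False, ""
    for mc in sorted(machine_configs, key=lambda m: m["name"]):
        if mc.get("fips", False):
            fips = True
        if mc.get("kernelType", "") == "realtime":
            kernel_type = "realtime"
            break  # later MCs, such as 99-master-fips, are never inspected
    return {"fips": fips, "kernelType": kernel_type}


def merge_order_independent(machine_configs):
    """Toy merge where each field is computed over all MCs, regardless of order."""
    realtime = any(mc.get("kernelType", "") == "realtime" for mc in machine_configs)
    return {
        "fips": any(mc.get("fips", False) for mc in machine_configs),
        "kernelType": "realtime" if realtime else "",
    }


cluster = [
    {"name": "99-master-fips", "fips": True},
    {"name": "50-fips-bz-poc"},
    {"name": "50-realtime-kernel", "kernelType": "realtime"},
]
print(merge_with_early_exit(cluster))
print(merge_order_independent(cluster))
# Adding an MC that sorts before the realtime one restores FIPS in the toy model.
print(merge_with_early_exit(cluster + [{"name": "01-fips-master", "fips": True}]))
```

In this model the first call loses the FIPS flag, while the other two keep it.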

I will apply a patch for this. In the meantime, there is a workaround:

1. Create a 01-fips MC that is just:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: 01-fips-master
spec:
  fips: true
```

This is just a dummy MC. In theory it gets overwritten by the 99-fips MC anyway, and you can't otherwise change FIPS, so it is effectively a no-op.

If you are reinstalling, just drop this into the install manifests folder (while also specifying fips: true in install-config.yaml). No other action should be needed.

2. If on a running cluster and you aren't degraded, just apply this MC before the realtime MC.

3. If on a running cluster and you are already degraded, after applying the MC, a new rendered-MC should be generated (we'll call this rendered-master-cccc which should be the current desired config, with the only change being FIPS: true instead of FIPS: false). You can diff the two MCs to make sure. Find the node that has the degradation, and do:

`oc edit node/xxxx`

You should see in the annotations that:

machineconfiguration.openshift.io/currentConfig: rendered-master-aaaa
machineconfiguration.openshift.io/desiredConfig: rendered-master-bbbb

It is trying an update from an old MC to the second newest MC that is broken. Change that rendered-master-bbbb by hand to the rendered-master-cccc that was just generated.
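The hand edit in step 3 could also be expressed as a merge patch instead of `oc edit`. This is a minimal sketch, assuming a merge patch on the node annotation is an acceptable substitute for the manual edit; the helper name is hypothetical, and `xxxx` and `rendered-master-cccc` are the same placeholders used above. The script only prints a command and does not run it.

```python
import json
import shlex

# Annotation named in the workaround above.
DESIRED_CONFIG_ANNOTATION = "machineconfiguration.openshift.io/desiredConfig"


def desired_config_patch_command(node_name, rendered_config):
    """Build an `oc patch` command line that sets the desiredConfig annotation."""
    patch = {"metadata": {"annotations": {DESIRED_CONFIG_ANNOTATION: rendered_config}}}
    # shlex.quote keeps the JSON intact when the command is pasted into a shell.
    return "oc patch node {} --type merge -p {}".format(
        shlex.quote(node_name), shlex.quote(json.dumps(patch))
    )


print(desired_config_patch_command("xxxx", "rendered-master-cccc"))
```

Review the printed command and diff the rendered configs before running anything against a degraded node.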

Marking as high priority. Is there a reason this is considered urgent severity? The customer case seems to be normal severity; I just wanted to make sure I understand why this was marked urgent.

Comment 19 errata-xmlrpc 2022-08-10 11:17:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069