Bug 1978581 - machine-config-operator: remove runlevel from mco namespace
Summary: machine-config-operator: remove runlevel from mco namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-02 08:40 UTC by Mark Cooper
Modified: 2022-03-12 04:36 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
GitHub openshift/machine-config-operator pull 2655 (open): Bug 1978581: remove run-level info from operators namespaces (last updated 2021-11-22 16:15:14 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:15 UTC)

Description Mark Cooper 2021-07-02 08:40:56 UTC
Description of problem:

Currently the openshift-machine-config-operator namespace sets the `run-level 1` annotation in two places:
 - https://github.com/openshift/machine-config-operator/blob/68e602d88967f30eebf86086d68f7c6303064893/install/0000_80_machine-config-operator_00_namespace.yaml#L13
 - https://github.com/openshift/installer/blob/6d778f911e79afad8ba2ff4301eda5b5cf4d8e9e/data/data/manifests/bootkube/04-openshift-machine-config-operator.yaml#L7

These appear to have been either copied from outdated documentation, or required before OCP 4.7 due to significant boot delays.

With some basic testing, it appears that these can be removed.

The run-level annotation prevents any Security Context Constraint (SCC) from being applied to pods within that namespace.

This also brings the namespace in line with https://bugzilla.redhat.com/show_bug.cgi?id=1805488.
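
For reference, the annotation shows up on the live namespace object roughly as follows (abridged output; while openshift.io/run-level is present, SCC admission is skipped for pods in the namespace):

$ oc get ns openshift-machine-config-operator -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/run-level: "1"
  name: openshift-machine-config-operator
...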

Actual results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level
    openshift.io/run-level: "1"

$ oc -n openshift-machine-config-operator get pod machine-config-operator-5c9f8b8457-nvjmf -o yaml | grep scc


Expected results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level

$ oc -n openshift-machine-config-operator get pod machine-config-operator-7465cbf688-84g2w -o yaml | grep scc
    openshift.io/scc: hostmount-anyuid
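
To spot-check every pod in the namespace rather than a single one, the same grep-based check can be extended (a sketch; once the run-level annotation is gone, each pod should carry an openshift.io/scc annotation):

$ oc -n openshift-machine-config-operator get pods -o yaml | grep 'openshift.io/scc:'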

Comment 1 Mark Cooper 2021-07-02 08:42:42 UTC
Two PRs have also been created:

 - https://github.com/openshift/machine-config-operator/pull/2655
 - https://github.com/openshift/installer/pull/5053

Comment 2 Yu Qi Zhang 2021-07-03 01:23:26 UTC
Given that this is a low-severity bug, the team is likely unable to work on this in the near future. If you are able to test with the PRs and can confirm they should not otherwise affect MCO operation, I am OK with helping review.

The one thing I would check is upgrade time, and whether running the MCO pods without run-level 1 slows down upgrades in any way.
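
For reference, one rough way to read the overall upgrade duration off a cluster (a sketch, not an official metric; the CI figures in the next comment come from the upgrade test suites instead) is the ClusterVersion history:

$ oc get clusterversion version \
    -o jsonpath='{range .status.history[*]}{.version}{" "}{.startedTime}{" "}{.completionTime}{"\n"}{end}'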

Comment 3 Mark Cooper 2021-08-30 02:03:55 UTC
So far the testing looks OK.

Looking at the PR: 

ci/prow/e2e-agnostic-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 58m50s
  [sig-cluster-lifecycle] Cluster completes upgrade: 58m50s
ci/prow/e2e-vsphere-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 54m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 54m40s


https://github.com/openshift/machine-config-operator/pull/2703

ci/prow/e2e-vsphere-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 1h4m10s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h4m10s
ci/prow/e2e-agnostic-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75m (105m on AWS): 1h0m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h0m40s

So overall I think the upgrades are OK and not negatively affected.

However, I still need to look at MCO upgrade times specifically; I am not sure yet how to find that.

Comment 4 Mark Cooper 2021-09-07 03:35:30 UTC
From Slack: 

@Yu Qi Zhang
I think that's a fine statistic to look at, although it encompasses all the operators, and I don't know how big the variance is for all the other operators when it comes to upgrade times (it shouldn't be too big).

Looking at MCO upgrade time itself would be best, but I am not sure there is a very straightforward metric for that off the top of my head. It would either have to be looking at the operator (events?) or at the logs of the MCC/MCDs to see when the rollout started/ended.

@Mark Cooper
*nods* I'll have a look.

@Mark Cooper
Hmmm, well I couldn't find any logs on the CI. I played around a bit unsuccessfully on a cluster, but even then, given the number of variables, I'm not sure exactly what would be a good baseline. Are there any previous bugs we can go off of that have tested the same thing, or had similar concerns, that you're aware of?

@Yu Qi Zhang
Sort of; we've had slowdowns in the past that have affected CI, but they've generally been pretty significant increases (50%), so they were very noticeable.
Given that we've had multiple test runs on multiple PRs, and the PRs we have merged have not had any issues, it seems unlikely we will run into any in our current setup, so I am fine with things as they are :slightly_smiling_face:
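
A possible way to approximate the MCO-specific window along those lines (a sketch only, assuming the machine-config ClusterOperator conditions, the MachineConfigPool status, and namespace events roughly bracket the rollout; this is not an established metric):

$ oc get clusteroperator machine-config \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.lastTransitionTime}{"\n"}{end}'
$ oc get mcp    # the UPDATED/UPDATING columns show per-pool rollout progress
$ oc -n openshift-machine-config-operator get events --sort-by=.lastTimestamp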

Comment 5 Mark Cooper 2021-10-18 23:21:13 UTC
Would it be worth revisiting this now for 4.10, @jerzhang?

Comment 6 Yu Qi Zhang 2021-11-02 20:50:52 UTC
Added some comments to the PR. Assigning to Mark since I am not directly working on it.

Comment 14 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

