Bug 1978581 - machine-config-operator: remove runlevel from mco namespace
Summary: machine-config-operator: remove runlevel from mco namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-02 08:40 UTC by Mark Cooper
Modified: 2022-03-12 04:36 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
GitHub openshift/machine-config-operator pull 2655 (open): Bug 1978581: remove run-level info from operators namespaces (last updated 2021-11-22 16:15:14 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:15 UTC)

Description Mark Cooper 2021-07-02 08:40:56 UTC
Description of problem:

Currently the openshift-machine-config-operator namespace sets the `run-level 1` annotation in two places:
 - https://github.com/openshift/machine-config-operator/blob/68e602d88967f30eebf86086d68f7c6303064893/install/0000_80_machine-config-operator_00_namespace.yaml#L13
 - https://github.com/openshift/installer/blob/6d778f911e79afad8ba2ff4301eda5b5cf4d8e9e/data/data/manifests/bootkube/04-openshift-machine-config-operator.yaml#L7

These appear to have been either copied from outdated documentation, or required before OCP 4.7 due to significant boot delays.

With some basic testing, it appears that these can be removed.

The run-level annotation prevents any Security Context Constraint (SCC) from being applied to pods within that namespace.

This also brings the namespace in line with https://bugzilla.redhat.com/show_bug.cgi?id=1805488.
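
For reference, the annotation shows up on the live namespace object roughly as follows (abridged output; while openshift.io/run-level is present, SCC admission is skipped for pods in the namespace):

$ oc get ns openshift-machine-config-operator -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/run-level: "1"
  name: openshift-machine-config-operator
...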

Actual results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level
    openshift.io/run-level: "1"

$ oc -n openshift-machine-config-operator get pod machine-config-operator-5c9f8b8457-nvjmf -o yaml | grep scc


Expected results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level

$ oc -n openshift-machine-config-operator get pod machine-config-operator-7465cbf688-84g2w -o yaml | grep scc
    openshift.io/scc: hostmount-anyuid
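
To spot-check every pod in the namespace rather than a single one, the same grep-based check can be extended (a sketch; once the run-level annotation is gone, each pod should carry an openshift.io/scc annotation):

$ oc -n openshift-machine-config-operator get pods -o yaml | grep 'openshift.io/scc:'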

Comment 1 Mark Cooper 2021-07-02 08:42:42 UTC
Two PRs have also been created:

 - https://github.com/openshift/machine-config-operator/pull/2655
 - https://github.com/openshift/installer/pull/5053

Comment 2 Yu Qi Zhang 2021-07-03 01:23:26 UTC
Given that this is a low-severity bug, the team is likely unable to work on this in the near future. If you are able to test with the PRs and can confirm they should not otherwise affect MCO operation, I am OK with helping review.

The one thing I would check is upgrade time, and whether running the MCO pods without run-level 1 slows down upgrades in any way.
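
For reference, one rough way to read the overall upgrade duration off a cluster (a sketch, not an official metric; the CI figures in the next comment come from the upgrade test suites instead) is the ClusterVersion history:

$ oc get clusterversion version \
    -o jsonpath='{range .status.history[*]}{.version}{" "}{.startedTime}{" "}{.completionTime}{"\n"}{end}'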

Comment 3 Mark Cooper 2021-08-30 02:03:55 UTC
So far the testing looks OK.

Looking at the PR: 

ci/prow/e2e-agnostic-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 58m50s
  [sig-cluster-lifecycle] Cluster completes upgrade: 58m50s
ci/prow/e2e-vsphere-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 54m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 54m40s


https://github.com/openshift/machine-config-operator/pull/2703

ci/prow/e2e-vsphere-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 1h4m10s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h4m10s
ci/prow/e2e-agnostic-upgrade
  [sig-cluster-lifecycle] cluster upgrade should complete in 75m (105m on AWS): 1h0m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h0m40s

So overall I think the upgrades are OK and not negatively affected.

However, I still need to look at MCO upgrade times specifically; I am not sure yet how to find that.

Comment 4 Mark Cooper 2021-09-07 03:35:30 UTC
From Slack: 

@Yu Qi Zhang
I think that's a fine statistic to look at, although it encompasses all the operators, and I don't know how big the variance is for all the other operators when it comes to upgrade times (it shouldn't be too big).

Looking at MCO upgrade time itself would be best, but I am not sure there is a very straightforward metric for that off the top of my head. It would either have to be looking at the operator (events?) or at the logs of the MCC/MCDs to see when the rollout started/ended.

@Mark Cooper
*nods* I'll have a look.

@Mark Cooper
Hmmm, well I couldn't find any logs on the CI. I played around a bit unsuccessfully on a cluster, but even then, given the number of variables, I'm not sure exactly what would be a good baseline. Are there any previous bugs we can go off of that have tested the same thing, or had similar concerns, that you're aware of?

@Yu Qi Zhang
Sort of; we've had slowdowns in the past that have affected CI, but they've generally been pretty significant increases (50%), so they were very noticeable.
Given that we've had multiple test runs on multiple PRs, and the PRs we have merged have not had any issues, it seems unlikely we will run into any in our current setup, so I am fine with things as they are :slightly_smiling_face:
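
A possible way to approximate the MCO-specific window along those lines (a sketch only, assuming the machine-config ClusterOperator conditions, the MachineConfigPool status, and namespace events roughly bracket the rollout; this is not an established metric):

$ oc get clusteroperator machine-config \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.lastTransitionTime}{"\n"}{end}'
$ oc get mcp    # the UPDATED/UPDATING columns show per-pool rollout progress
$ oc -n openshift-machine-config-operator get events --sort-by=.lastTimestamp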

Comment 5 Mark Cooper 2021-10-18 23:21:13 UTC
Would it be worth revisiting this now for 4.10, @jerzhang?

Comment 6 Yu Qi Zhang 2021-11-02 20:50:52 UTC
Added some comments to the PR. Assigning to Mark since I am not directly working on it.

Comment 14 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

