Description of problem:

Currently the openshift-machine-config-operator namespace sets `openshift.io/run-level: "1"` in two places:

- https://github.com/openshift/machine-config-operator/blob/68e602d88967f30eebf86086d68f7c6303064893/install/0000_80_machine-config-operator_00_namespace.yaml#L13
- https://github.com/openshift/installer/blob/6d778f911e79afad8ba2ff4301eda5b5cf4d8e9e/data/data/manifests/bootkube/04-openshift-machine-config-operator.yaml#L7

These appear to be either copy-and-paste based on outdated documentation, or to have been required before OCP 4.7 due to significant boot delays. Basic testing suggests they can be removed. The run-level annotation actually prevents any Security Context Constraint (SCC) from being applied to pods in that namespace. Removing it also brings the MCO in line with https://bugzilla.redhat.com/show_bug.cgi?id=1805488.

Actual results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level
    openshift.io/run-level: "1"

$ oc -n openshift-machine-config-operator get pod machine-config-operator-5c9f8b8457-nvjmf -o yaml | grep scc

Expected results:

$ oc get ns openshift-machine-config-operator -o yaml | grep run-level

$ oc -n openshift-machine-config-operator get pod machine-config-operator-7465cbf688-84g2w -o yaml | grep scc
    openshift.io/scc: hostmount-anyuid
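For anyone who wants to try this on a throwaway cluster before the manifests change, a rough manual check is to drop the annotation and recreate the operator pod so admission re-runs SCC assignment. The label selector below is an assumption about how the MCO deployment is labelled; if it does not match on your cluster, delete the machine-config-operator pod by name instead.

$ oc annotate namespace openshift-machine-config-operator openshift.io/run-level-
$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-operator
$ oc -n openshift-machine-config-operator get pods -o yaml | grep scc

After the pod is recreated, an openshift.io/scc annotation should appear on it, whereas with run-level 1 set no SCC is assigned at all.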
Two PRs have also been created:

- https://github.com/openshift/machine-config-operator/pull/2655
- https://github.com/openshift/installer/pull/5053
Given that this is a low-severity bug, the team is likely unable to work on it in the near future. If you are able to test with the PRs and can confirm they do not otherwise affect MCO operation, I am ok with helping review. The one thing I would check is upgrade time: whether running the MCO pods without run-level 1 slows down upgrades in any way.
So far the testing looks ok. Looking at the PR:

ci/prow/e2e-agnostic-upgrade:
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 58m50s
  [sig-cluster-lifecycle] Cluster completes upgrade: 58m50s
ci/prow/e2e-vsphere-upgrade:
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 54m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 54m40s

https://github.com/openshift/machine-config-operator/pull/2703:

ci/prow/e2e-vsphere-upgrade:
  [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes: 1h4m10s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h4m10s
ci/prow/e2e-agnostic-upgrade:
  [sig-cluster-lifecycle] cluster upgrade should complete in 75m (105m on AWS): 1h0m40s
  [sig-cluster-lifecycle] Cluster completes upgrade: 1h0m40s

Overall I think the upgrades are ok and are not negatively affected. However, we still need to look at MCO upgrade times specifically; I'm not sure yet how to find that.
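One rough way to bound the MCO's portion of an upgrade (just an approximation based on ClusterOperator status, not a dedicated metric) is to look at the machine-config ClusterOperator's condition transition times: when Progressing flips back to False and the operator reports the new version, the gap between those timestamps gives a coarse idea of how long the MCO rollout took.

$ oc get clusteroperator machine-config -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'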
From Slack:

Yu Qi Zhang: I think that's a fine statistic to look at, although that said, that encompasses all the operators, and I don't know how big the variance is for all the other operators when it comes to upgrade times (shouldn't be too big). Looking at MCO upgrade time itself would be best, but I am not sure if there is a very straightforward metric to see that off the top of my head. It would either have to be looking at the operator (events?) or the logs of the MCC/MCDs to see when it started/ended.

Mark Cooper: nods, i'll have a look

Mark Cooper: Hmmm well "I" couldn't find any logs on the CI. Played around a bit unsuccessfully on a cluster, but even then, given the number of variables, not sure exactly what would be a good baseline for it. Are there any previous bugs that we can go off that have tested the same? Or had similar concerns that you're aware of?

Yu Qi Zhang: Sort of, we've had slowdowns in the past that have affected CI, but they've generally been pretty significant increases (50%) so it was very noticeable. I think given that we've had multiple test runs on multiple PRs, and the PRs that we have merged have not had any issues, it seems unlikely we are going to run into any in our current setup, so I am fine with things as they are :slightly_smiling_face:
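For the record, another rough proxy discussed above (again an approximation, not an official metric) is the MachineConfigPool status: the Updating/Updated condition transition times roughly bracket when a pool started and finished rolling out the new rendered config during an upgrade, and the namespace events give some extra context. The "worker" pool name below is just the default pool; adjust for custom pools.

$ oc get mcp worker -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'
$ oc -n openshift-machine-config-operator get events --sort-by=.lastTimestamp | tail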
Would it be worth revisiting this now for 4.10, @jerzhang?
Added some comments to the PR. Assigning to Mark since I am not directly working on it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056