Bug 2108858 - cluster-version operator should clear (pod) securityContext when the manifest does not set the property
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Over the Air Updates
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks: 2109983
 
Reported: 2022-07-19 21:22 UTC by Hongkai Liu
Modified: 2023-09-18 04:42 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2110501 (view as bug list)
Environment:
Last Closed: 2023-01-17 19:53:06 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 804 0 None open Bug 2108858: lib/resourcemerge: change SecurityContext reconcile 2022-07-25 22:15:54 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:53:30 UTC

Comment 4 Standa Laznicka 2022-07-20 08:34:21 UTC
I don't understand why all the above comments are private?

The admission messages need to be improved, it's true. What we can see is that there are two usable SCCs, but neither of them can match the pod due to the `runAsUser` field. These two SCCs are likely "restricted" and "restricted-v2".

I am not sure why this issue would pop up during upgrade. The SA running the above pod needs to have access to at least the "nonroot-v2" SCC, otherwise it won't be allowed to run. The only good explanation for this behavior during upgrade would be that there was previously a "run-level" annotation on the NS that would disable SCC admission. We explicitly removed this annotation in the latest MAO versions, and so if the pod did not have proper RBAC before, it might now start failing as SCC admission finally triggers.
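
To make that mechanism concrete, here is a small illustration in Go (a hypothetical helper, not the actual SCC admission code) of the exemption described above: a namespace annotated with openshift.io/run-level "1" (and, conventionally, "0") is skipped by SCC admission, while a missing or empty value is not exempt.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sccAdmissionApplies is a hypothetical helper mirroring the exemption
// described above: run-level "0" or "1" means SCC admission is skipped for
// the namespace; an absent or empty value means it is enforced.
func sccAdmissionApplies(ns *corev1.Namespace) bool {
	switch ns.Annotations["openshift.io/run-level"] {
	case "0", "1":
		return false
	default:
		return true
	}
}

func main() {
	preUpgrade := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{
		Name:        "openshift-machine-api",
		Annotations: map[string]string{"openshift.io/run-level": "1"},
	}}
	postUpgrade := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{
		Name:        "openshift-machine-api",
		Annotations: map[string]string{"openshift.io/run-level": ""},
	}}
	fmt.Println(sccAdmissionApplies(preUpgrade))  // false: SCC admission skipped
	fmt.Println(sccAdmissionApplies(postUpgrade)) // true: SCC admission enforced
}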

Comment 5 Ben Parees 2022-07-20 13:50:39 UTC
The openshift-machine-api NS on build01 (v4.10, not yet upgraded) indeed has "openshift.io/run-level: "1"" specified.

and build02 (upgraded to 4.12) indeed has openshift.io/run-level: "" specified (an empty value)

So that explains why the pod started getting rejected for admission during the upgrade to 4.12, and why the pod itself did not have an SCC annotation prior to 4.12. (It also means that when build01 is upgraded, it's going to hit this same problem.)

So that leaves the next steps for resolution: either the CVO needs to remove the securityContext values, or the MAO itself needs to do so.


I'm going to leave this on the OTA team until we hear a reason why the CVO is not clearing the securityContext values during reconciliation of the deployment.
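
For reference, a minimal sketch in Go of the kind of change the linked PR (openshift cluster-version-operator pull 804, lib/resourcemerge) makes; the function name and signature here are illustrative, not the actual CVO code. The idea is that the manifest's (possibly unset) pod securityContext always wins over whatever is currently in the cluster:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// ensurePodSecurityContext forces the in-cluster pod securityContext to match
// the manifest (required). In particular, when the manifest leaves the field
// unset, a value added out-of-band is cleared rather than preserved across
// reconciliation.
func ensurePodSecurityContext(modified *bool, existing *corev1.PodSpec, required corev1.PodSpec) {
	if !equality.Semantic.DeepEqual(existing.SecurityContext, required.SecurityContext) {
		*modified = true
		existing.SecurityContext = required.SecurityContext
	}
}

func main() {
	runAsNonRoot := true
	runAsUser := int64(65534)
	// In-cluster state has drifted from the manifest (the situation in this bug).
	existing := corev1.PodSpec{SecurityContext: &corev1.PodSecurityContext{
		RunAsNonRoot: &runAsNonRoot,
		RunAsUser:    &runAsUser,
	}}
	// The manifest does not set securityContext at all.
	required := corev1.PodSpec{}

	modified := false
	ensurePodSecurityContext(&modified, &existing, required)
	fmt.Println(modified, existing.SecurityContext) // true <nil>: the drifted value is cleared
}

Note that in the verification in comment 10 the field ends up as securityContext: {} rather than disappearing entirely; either way, the drifted runAsUser value no longer survives reconciliation.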

Comment 6 Ben Parees 2022-07-20 21:51:22 UTC
> The admission messages need to be improved, it's true. What we can see that there are two usable SCCs but neither of them can match the pod due to the `runAsUser` field. These two SCCs are likely "restricted" and "restricted-v2".

Standa, do you have a separate bug to track this message improvement?

Comment 7 Lalatendu Mohanty 2022-07-22 16:00:04 UTC
We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

    reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
    example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
    example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?

    reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
    example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
    example: Up to 90 seconds of API downtime. Check with curl ....
    example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?

    reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
    example: Issue resolves itself after five minutes.
    example: Admin can run a single: oc ....
    example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities.

Is this a regression?

    reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
    example: No, it has always been like this; we just never noticed.
    example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1.

Comment 8 Lalatendu Mohanty 2022-07-22 21:02:00 UTC
> Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

  Every cluster that originated in OCP 4.4.7 or earlier will encounter this bug.

> What is the impact? Is it serious enough to warrant removing update recommendations?

  The upgrade will get stuck. Availability of the OCP cluster is not impacted. If it impacts a significant % of the fleet, then we recommend removing the edge or using a conditional update for this.

> How involved is remediation?

  The workaround is to delete the deployment resource for machine-api-operator. The CVO then recreates a fresh deployment from the manifest, which resolves the issue.

   $ oc delete deploy -n openshift-machine-api machine-api-operator --as system:admin

> Is this a regression?

 It is not a regression for the CVO, as the CVO code responsible for this has not changed since the beginning of the CVO. However, code changes in the machine-api-operator have now exposed the issue. We can say the code change in the machine-api-operator is a regression.

Comment 9 Yang Yang 2022-07-25 14:03:37 UTC
Reproducing it by adding securityContext to a 4.11 cluster and then upgrading it to 4.12:

1. Install a 4.11 cluster
# ./flexy.sh 123962
connecting flexy job 123962
connecting using no proxy, while cluster proxy is set to: null
web console: kubeadmin:YVFYu-tJyGj-ULVaA-RvJa8
https://console-openshift-console.apps.yanyang-0725d.qe.gcp.devcluster.openshift.com
Client Version: 4.11.0-0.nightly-2022-06-30-005428
Kustomize Version: v4.5.4
Server Version: 4.11.0-rc.5
Kubernetes Version: v1.24.0+9546431
type exit (🍏) or press Ctrl+D (🐧) to clean up, once finished

2. Update openshift-machine-api/machine-api-operator deployment by adding securityContext
# oc edit deploy -n openshift-machine-api machine-api-operator

# oc get deploy -n openshift-machine-api machine-api-operator -oyaml | grep securityContext -A2

      securityContext:
        runAsNonRoot: true
        runAsUser: 65534

3. Upgrade to 4.12
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:f2762722d749479737024db5c37273bd98ecd277de3627ab619998950bb4bc31 --allow-explicit-upgrade --force

# oc adm upgrade 
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-07-24-180529 not found in the "stable-4.11" channel

info: An upgrade is in progress. Unable to apply 4.12.0-0.nightly-2022-07-24-180529: the workload openshift-machine-api/machine-api-operator cannot roll out

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11

CVO log says

E0725 09:19:11.303354       1 task.go:117] error running apply for deployment "openshift-machine-api/machine-api-operator" (213 of 802): deployment openshift-machine-api/machine-api-operator has a replica failure FailedCreate: pods "machine-api-operator-5b649b89c9-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.containers[0].securityContext.runAsUser: Invalid value: 65534: must be in the ranges: [1000470000, 1000479999], spec.containers[1].securityContext.runAsUser: Invalid value: 65534: must be in the ranges: [1000470000, 1000479999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

Okay, it's reproduced. Will verify it later.
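
For what it's worth, a tiny illustration in Go (not the admission code itself) of the range check the error above reports: the hard-coded runAsUser of 65534 falls outside the namespace's allocated UID range [1000470000, 1000479999], so the restricted SCCs cannot admit the pod.

package main

import "fmt"

// uidAllowed mimics the range check from the FailedCreate message above:
// the pod's runAsUser must fall inside the namespace's allocated UID range.
func uidAllowed(uid, min, max int64) bool {
	return uid >= min && uid <= max
}

func main() {
	// Values taken from the CVO log above.
	fmt.Println(uidAllowed(65534, 1000470000, 1000479999)) // false: admission rejects the pod
}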

Comment 10 Yang Yang 2022-07-26 05:52:56 UTC
Verifying before the PR merges:

1. Install a 4.11 cluster

2. Update openshift-machine-api/machine-api-operator deployment by adding securityContext
# oc get deploy -n openshift-machine-api machine-api-operator -oyaml | grep securityContext -A2
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534

3. Build a payload using cluster-bot

4. Upgrade to the payload built in step #3
# oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-y073bik/release:latest --allow-explicit-upgrade --force

# oc get deploy -n openshift-machine-api machine-api-operator -oyaml | grep securityContext -A2
      securityContext: {}
      serviceAccount: machine-api-operator
      serviceAccountName: machine-api-operator

# oc adm upgrade 
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.ci.test-2022-07-26-032905-ci-ln-y073bik-latest not found in the "stable-4.11" channel

Cluster version is 4.11.0-0.ci.test-2022-07-26-032905-ci-ln-y073bik-latest

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11

Upgrade is successful and securityContext is cleared. Looks good to me.

Comment 12 Yang Yang 2022-07-27 02:36:06 UTC
Based on comment #10, moving it to the verified state.

Comment 13 Jack Ottofaro 2022-08-02 22:03:01 UTC
Removing UpgradeBlocker since no request to block edges has been forthcoming and there are no obvious signs of a significant % of the fleet being impacted.

Comment 17 errata-xmlrpc 2023-01-17 19:53:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 18 Red Hat Bugzilla 2023-09-18 04:42:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

