Bug 1934400

Summary: [ocp_4][4.6][apiserver-auth] OAuth API servers are not ready - PreconditionNotReady
Product: OpenShift Container Platform
Component: apiserver-auth
Version: 4.6
Target Release: 4.8.0
Target Milestone: ---
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Vincent Lours <vlours>
Assignee: Standa Laznicka <slaznick>
QA Contact: pmali
CC: aos-bugs, mfojtik, openshift-bugs-escalate, peli, psonavan, rdey, rkshirsa, shishika, slaznick, sttts, wking
Flags: rkshirsa: needinfo-
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: A custom SCC containing the unlikely but possible combination of `defaultAllowPrivilegeEscalation: false` and `allowPrivilegedContainer: true` caused the privileged openshift-apiserver and oauth-apiserver pods to fail, because the SCC mutates the pods into a state that fails API validation.
Consequence: openshift-apiserver and oauth-apiserver pods could be prevented from starting, which might cause an outage of the OpenShift APIs.
Fix: Make the security context mutator ignore the `defaultAllowPrivilegeEscalation` field on containers that are already privileged.
Result: A custom SCC can no longer block privileged pods from starting.
Last Closed: 2021-07-27 22:49:27 UTC
Type: Bug
Bug Blocks: 1967359, 1989060    

Comment 10 Standa Laznicka 2021-03-03 12:04:13 UTC
(from a private comment)
> ~~~
>    - lastTransitionTime: "2021-03-02T06:33:06Z"
>      lastUpdateTime: "2021-03-02T06:33:06Z"
>      message: 'Pod "apiserver-d476db957-9csf9" is invalid: [spec.containers[0].securityContext: Invalid value: core.SecurityContext{Capabilities:(*core.Capabilities)(nil), Privileged:(*bool)(0xc0702fc836), SELinuxOptions:(*core.SELinuxOptions)(nil), WindowsOptions:(*core.WindowsSecurityContextOptions)(nil), RunAsUser:(*int64)(nil), RunAsGroup:(*int64)(nil), RunAsNonRoot:(*bool)(nil), ReadOnlyRootFilesystem:(*bool)(nil), AllowPrivilegeEscalation:(*bool) (0xc0702fc39c), ProcMount:(*core.ProcMountType)(nil), SeccompProfile:(*core.SeccompProfile)(nil)}: cannot set `allowPrivilegeEscalation` to false and `privileged` to true, spec.initContainers[0].securityContext: Invalid value: core.SecurityContext{Capabilities:(*core.Capabilities)(nil), Privileged:(*bool)(0xc0702fc835), SELinuxOptions:(*core.SELinuxOptions)(nil), WindowsOptions:(*core.WindowsSecurityContextOptions)(nil), RunAsUser:(*int64)(nil), RunAsGroup:(*int64)(nil), RunAsNonRoot:(*bool)(nil), ReadOnlyRootFilesystem:(*bool)(nil), AllowPrivilegeEscalation:(*bool)(0xc0702fc39c), ProcMount:(*core.ProcMountType)(nil), SeccompProfile:(*core.SeccompProfile)(nil)}: cannot set `allowPrivilegeEscalation` to false and `privileged` to true]'
>      reason: FailedCreate
>      status: "True"
>      type: ReplicaFailure
> ~~~
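
The validation error quoted above comes from upstream Kubernetes pod validation and can be reproduced directly with a minimal pod. A sketch (the pod name and image are arbitrary, and it assumes a user allowed to create privileged pods):

~~~
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: escalation-conflict-demo
spec:
  containers:
  - name: demo
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true                  # privileged implies escalation...
      allowPrivilegeEscalation: false   # ...so the API server rejects this combination
EOF
~~~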

Did they add any SCC that would allow privileged pods but would still cause `allowPrivilegeEscalation` to default to false? Are there custom SCCs in this cluster, even if possibly deployed by a 3rd party product?
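
A quick way to answer that from the cluster (a sketch; the column names are arbitrary):

~~~
# Show every SCC together with the fields relevant here; names beyond the
# default set (anyuid, hostaccess, privileged, restricted, ...) are custom.
oc get scc -o custom-columns=NAME:.metadata.name,ALLOWPRIV:.allowPrivilegedContainer,DEFAULTESCALATION:.defaultAllowPrivilegeEscalation,PRIORITY:.priority
~~~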

Comment 11 Standa Laznicka 2021-03-03 12:08:46 UTC
nvm, I see we already have SCCs in the must-gather. I'll check those.

Comment 12 Standa Laznicka 2021-03-03 12:12:40 UTC
Can you please get me the audit logs for the cluster?

Comment 13 Standa Laznicka 2021-03-03 14:36:10 UTC
Actually, I can see the issue; I overlooked a small detail.

The "vulnerability-advisor-scc" is to blame here. It apparently matches the pod's security context, and while it configures `allowPrivilegedContainer`, it also has `defaultAllowPrivilegeEscalation: false` and `priority: 1` which causes this behavior.

We can fix this, but to work around this behavior, either remove the `vulnerability-advisor-scc` SCC, or remove its `defaultAllowPrivilegeEscalation: false` field.
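
A sketch of both workaround variants (back up the SCC first, since it is third-party configuration):

~~~
# Back up, then either delete the SCC entirely...
oc get scc vulnerability-advisor-scc -o yaml > vulnerability-advisor-scc.bak.yaml
oc delete scc vulnerability-advisor-scc

# ...or drop only the offending field:
oc patch scc vulnerability-advisor-scc --type=json \
  -p '[{"op": "remove", "path": "/defaultAllowPrivilegeEscalation"}]'
~~~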

I assume this is not only going to be a problem for the oauth-apiserver, but for openshift-apiserver pods, too.

Comment 14 Vincent Lours 2021-03-04 04:11:04 UTC
Hi Standa,

Thank you so much for the workaround.
In addition to removing the `defaultAllowPrivilegeEscalation: false` field from the `vulnerability-advisor-scc` SCC, we also had to remove the same field from the following SCCs (see the sketch after this list):
- mutation-advisor-scc
- management-restricted
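
The same removal, applied to all three SCCs (a sketch):

~~~
for scc in vulnerability-advisor-scc mutation-advisor-scc management-restricted; do
  oc patch scc "$scc" --type=json \
    -p '[{"op": "remove", "path": "/defaultAllowPrivilegeEscalation"}]'
done
~~~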

I've reached out to IBM to check whether these SCCs are part of the MCM core installation.

Do you think there is something that should be changed in the `openshift-oauth-apiserver` config to avoid it being impacted by a `defaultAllowPrivilegeEscalation: false` field in an SCC?

In addition to that, do you think I should create a new BZ for the missing `openshift-oauth-apiserver` from the must-gather?

Thanks again for your support.

Cheers,
Vincent

Comment 16 Standa Laznicka 2021-03-04 08:32:50 UTC
I don't think you need to be concerned with SCCs that set `defaultAllowPrivilegeEscalation: false` if they're not setting `allowPrivilegedContainer` to `true`.
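
A sketch of a check that lists only the SCCs with the problematic combination (assumes `jq` is available):

~~~
oc get scc -o json | jq -r '.items[]
  | select(.allowPrivilegedContainer == true and .defaultAllowPrivilegeEscalation == false)
  | .metadata.name'
~~~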

> In addition to that, do you think I should create a new BZ for the missing `openshift-oauth-apiserver` from the must-gather?
I am not sure about the must-gather thing being a bug.

> Do you think there is something that should be changed in the `openshift-oauth-apiserver` config to avoid it being impacted by a `defaultAllowPrivilegeEscalation: false` field in an SCC?
I'll open a PR that should fix this.

Comment 18 Vincent Lours 2021-03-05 02:59:39 UTC
The KCS 5859251[1] has been created to address the issue and provide the workaround.

[1] https://access.redhat.com/solutions/5859251

Comment 29 Standa Laznicka 2021-04-30 10:04:30 UTC
Vincent, I don't see those failing pods anywhere in the must-gather, and you did not share the pods' statuses. When you get them, and if they match the description from comment 10, this fix will cover it. Otherwise it might be a different bugzilla.
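
A sketch of commands that would capture that state (the namespace and the deployment name `apiserver` follow comment 10; adjust them if the failing pods live elsewhere):

~~~
oc -n openshift-oauth-apiserver get pods -o yaml > oauth-apiserver-pods.yaml
oc -n openshift-oauth-apiserver get deployment apiserver \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'
~~~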

Comment 47 errata-xmlrpc 2021-07-27 22:49:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438