1824800 – openshift authentication operator is in a crashbackoffloop

Summary: openshift authentication operator is in a crashbackoffloop

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	apiserver-auth
Sub Component:
Version:	4.3.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Standa Laznicka
QA Contact:	pmali
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1842442

Reported:	2020-04-16 13:14 UTC by Matthew Robson
Modified:	2024-12-20 19:02 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: An incomplete security context on the oauth-server pods could cause the pods to crash when they pick up a custom SCC that overrides the default behavior. Consequence: The oauth-server pods start crash-looping. Fix: The security context of the oauth-server pods was modified to include the configuration they need in order to run. Result: A custom SCC does not prevent the oauth-server pods from running.
Clone Of:
Environment:
Last Closed:	2020-07-13 17:27:59 UTC
Target Upstream Version:
Embargoed:

Attachments

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-authentication-operator pull 273	0	None	closed	Bug 1824800: explicitly set oauth-server container's root file system to writable	2021-02-01 12:07:26 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:28:23 UTC

Internal Links: 1942725

Description Matthew Robson 2020-04-16 13:14:08 UTC

Description of problem:

The logs show:

Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Read-only file system

The root cause is that the customer was running a PoC of the security product StackRox. The product was correctly creating its own SCC, but they had given that SCC a priority of 100 with RunAsAny and readOnlyRootFilesystem: true. This put its priority ahead of anyuid for certain operators, causing them to crash as seen above.
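
For illustration, an SCC with the properties described above might look roughly like the following. This is a minimal sketch, not the customer's actual SCC: the name is hypothetical, "RunAsAny" is assumed to refer to the runAsUser strategy, and the remaining strategy fields are assumptions added only to make the fragment complete. The default anyuid SCC has a priority of 10, so a priority of 100 sorts ahead of it.

```yaml
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: example-third-party     # hypothetical name
priority: 100                    # from the report; higher than anyuid (10)
readOnlyRootFilesystem: true     # from the report; containers get a read-only root
runAsUser:
  type: RunAsAny                 # from the report (assumed to mean runAsUser)
seLinuxContext:
  type: RunAsAny                 # assumption
fsGroup:
  type: RunAsAny                 # assumption
supplementalGroups:
  type: RunAsAny                 # assumption
```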


Similarly to the DefaultSecurityContextConstraints_Mutated alerting, how can we prevent an ill-advised SCC from negatively impacting the platform and ensure ISVs are configuring things correctly?


Version-Release number of selected component (if applicable):
4.3.x

How reproducible:
Always

Steps to Reproduce:
1. Create an SCC as noted above
2. Observe operators like authentication
3.

Actual results:
Operators in crashloop without an obvious reason related to the SCC changes.

Expected results:
Warning or guidance around these types of platform impacting changes.

Additional info:
Setting the priority to less than anyuid fixed the issue.
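
As a rough sketch of that workaround, the relevant field of the third-party SCC would change along these lines. The value 5 is a hypothetical example; the report only states that the priority was set lower than that of anyuid, which is 10 by default.

```yaml
# Fragment of the third-party SCC after the workaround (sketch only).
priority: 5   # hypothetical value; anything below anyuid's default priority of 10
```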

Comment 5 Stefan Schimanski 2020-04-16 16:30:38 UTC

DefaultSecurityContextConstraints_Mutated is going to be reverted. PRs merged. The next z-stream release should have it removed.

The other topic must be analyzed. Have you done a comparison between the original and the installed SCC? Hard to believe that equal SCCs behave differently.

Comment 6 Matthew Robson 2020-04-16 17:20:29 UTC

It's not that 2 equal SCCs are behaving differently, it's the impact a 3rd-party SCC can have on the platform components.

A default install:

oc get pod authentication-operator-7fb9bc495c-5pt9p -o yaml | grep scc
    openshift.io/scc: anyuid

oc get pod oauth-openshift-594478b797-xkgxj -o yaml | grep scc
    openshift.io/scc: anyuid

3rd party tool comes along and creates its own SCC, as it should, but the SCC creates a conflict with anyuid.

oc apply -f securitycontextconstraints-collector.yaml
securitycontextconstraints.security.openshift.io/collector created

The full scc is attached above.

For a while, nothing may change as all of the pods are already running. 

An oauth change happens and the oauth pods start rolling:

oc get pods oauth-openshift-594478b797-9gc98 -o yaml | grep scc
    openshift.io/scc: collector

The first pod goes into CrashLoopBackOff because it is now using the collector SCC (which has a higher priority and sets a read-only root filesystem) instead of anyuid, which leads to the pod failing. This would be a bigger issue during an upgrade event.

There are 4 operators, plus the oauth pods, that use the anyuid SCC: authentication-operator, oauth-openshift, cluster-node-tuning-operator, openshift-service-catalog-apiserver-operator, and openshift-service-catalog-controller-manager-operator.

Comment 7 Standa Laznicka 2020-04-17 12:39:06 UTC

This is caused by the oauth-server pods not being specific enough about their security context, and by their service account's privileges being too broad.
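
Being "specific enough" can be sketched as follows, based on the title of the linked pull request (explicitly setting the oauth-server container's root file system to writable). This is an illustrative pod template fragment, not the operator's actual manifest; the container name is an assumption. With the field set explicitly, an SCC that requires a read-only root filesystem should no longer validate against the pod, so admission can fall through to another SCC such as anyuid.

```yaml
# Sketch of a Deployment pod template fragment (not the real manifest).
spec:
  template:
    spec:
      containers:
      - name: oauth-openshift             # assumed container name
        securityContext:
          readOnlyRootFilesystem: false   # explicit request for a writable root filesystem
```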

Comment 12 errata-xmlrpc 2020-07-13 17:27:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
