Bug 1824800 - openshift authentication operator is in a crashbackoffloop
Summary: openshift authentication operator is in a crashbackoffloop
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.0
Hardware: All
OS: Linux
Target Milestone: ---
Target Release: 4.5.0
Assignee: Standa Laznicka
QA Contact: pmali
Depends On:
Blocks: 1842442
Reported: 2020-04-16 13:14 UTC by Matthew Robson
Modified: 2020-07-13 17:28 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An incomplete security context on the oauth-server pods could let them pick up a custom SCC that changes the default behavior.
Consequence: The oauth-server pods start crash-looping.
Fix: The security context of the oauth-server pods now explicitly includes the configuration the pods need in order to run.
Result: A custom SCC no longer prevents the oauth-server pods from running.
Clone Of:
Last Closed: 2020-07-13 17:27:59 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 273 None closed Bug 1824800: explicitly set oauth-server container's root file system to writable 2020-09-03 10:48:04 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:28:23 UTC

Description Matthew Robson 2020-04-16 13:14:08 UTC
Description of problem:

The logs show:

Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Read-only file system

The root cause is that the customer was running a PoC of a security product, StackRox. The product correctly created its own SCC, but they had given that SCC a priority of 100 with RunAsAny and readOnlyRootFilesystem: true. This put its priority ahead of anyuid for certain operators, causing them to crash as seen above.
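
To illustrate why the priority value matters, here is a minimal Python sketch of priority-based SCC ordering. It is a simplified model, not the actual admission code: real SCC admission also ranks equal-priority SCCs by restrictiveness and only considers SCCs the pod's service account is allowed to use. The `SCC` class and `admission_order` function are hypothetical names, the default anyuid priority is assumed to be 10, and the lowered priority of 5 is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SCC:
    """Hypothetical, stripped-down stand-in for a SecurityContextConstraints object."""
    name: str
    priority: Optional[int]          # an unset priority is treated as 0
    read_only_root_filesystem: bool


def admission_order(sccs: List[SCC]) -> List[str]:
    """Return SCC names in the order they would be tried: highest priority first.

    Ties are broken by name only; restrictiveness ranking is left out of this sketch.
    """
    ranked = sorted(sccs, key=lambda s: (-(s.priority or 0), s.name))
    return [s.name for s in ranked]


# anyuid priority of 10 is assumed; collector priority of 100 comes from the report.
anyuid = SCC("anyuid", 10, False)
collector = SCC("collector", 100, True)

# With priority 100, collector is tried before anyuid.
as_reported = admission_order([anyuid, collector])

# With a priority below anyuid (5 is an illustrative value), anyuid is tried first again.
collector_lowered = SCC("collector", 5, True)
after_workaround = admission_order([anyuid, collector_lowered])

print(as_reported)
print(after_workaround)
```

In this model the only thing that changes between the two runs is the collector priority, which is consistent with the workaround noted under "Additional info".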

Similar to the DefaultSecurityContextConstraints_Mutated alerting, how can we prevent an ill-advised SCC from negatively impacting the platform and ensure ISVs are configuring things correctly?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create an SCC as described above
2. Observe operators like authentication

Actual results:
Operators are in a crash loop without an obvious reason related to the SCC changes.

Expected results:
A warning or guidance around these types of platform-impacting changes.

Additional info:
Setting the priority to less than anyuid fixed the issue.

Comment 5 Stefan Schimanski 2020-04-16 16:30:38 UTC
DefaultSecurityContextConstraints_Mutated is going to be reverted. The PRs have merged. The next z-stream release should have it removed.

The other topic must be analyzed. Have you done a comparison between the original and the installed SCC? Hard to believe that equal SCCs behave differently.

Comment 6 Matthew Robson 2020-04-16 17:20:29 UTC
It's not that two equal SCCs are behaving differently; it's the impact a third-party SCC can have on the platform components.

A default install:

oc get pod authentication-operator-7fb9bc495c-5pt9p -o yaml | grep scc
    openshift.io/scc: anyuid

oc get pod oauth-openshift-594478b797-xkgxj -o yaml | grep scc
    openshift.io/scc: anyuid

A third-party tool comes along and creates its own SCC, as it should, but the SCC creates a conflict with anyuid.

oc apply -f securitycontextconstraints-collector.yaml
securitycontextconstraints.security.openshift.io/collector created

The full scc is attached above.

For a while, nothing may change as all of the pods are already running. 

An oauth change happens and the oauth pods start rolling:

oc get pods oauth-openshift-594478b797-9gc98 -o yaml | grep scc
    openshift.io/scc: collector

The first pod goes into CrashLoopBackOff because it is now using the collector SCC (which has a higher priority and sets the root filesystem to read-only) instead of anyuid, which leads to the pods failing. This would be a bigger issue during an upgrade event.

Four operators and the oauth pods use the anyuid SCC: authentication-operator, oauth-openshift, cluster-node-tuning-operator, openshift-service-catalog-apiserver-operator and openshift-service-catalog-controller-manager-operator.
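
The per-pod `grep scc` checks above can be generalized. The following Python sketch groups pods by their `openshift.io/scc` annotation, given a pod list in the JSON shape that `oc get pods --all-namespaces -o json` produces. The function name `pods_by_scc` is hypothetical, and the embedded sample (including the namespaces) is illustrative data made up for this example, not output captured from the affected cluster.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Illustrative sample only; pod names are taken from this report, namespaces are assumed.
SAMPLE_POD_LIST = json.dumps({
    "items": [
        {"metadata": {
            "name": "authentication-operator-7fb9bc495c-5pt9p",
            "namespace": "openshift-authentication-operator",
            "annotations": {"openshift.io/scc": "anyuid"}}},
        {"metadata": {
            "name": "oauth-openshift-594478b797-9gc98",
            "namespace": "openshift-authentication",
            "annotations": {"openshift.io/scc": "collector"}}},
    ]
})


def pods_by_scc(pod_list_json: str) -> Dict[str, List[str]]:
    """Group "namespace/name" pod identifiers by the SCC recorded in their annotations."""
    grouped = defaultdict(list)
    for item in json.loads(pod_list_json).get("items", []):
        meta = item.get("metadata", {})
        annotations = meta.get("annotations") or {}
        scc = annotations.get("openshift.io/scc", "<none>")
        grouped[scc].append(f"{meta.get('namespace')}/{meta.get('name')}")
    return dict(grouped)


result = pods_by_scc(SAMPLE_POD_LIST)
for scc_name, pods in sorted(result.items()):
    print(scc_name, pods)
```

Run against a real cluster's pod list, any platform pod that shows up under a third-party SCC name rather than its expected SCC would be a candidate for the failure described here.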

Comment 7 Standa Laznicka 2020-04-17 12:39:06 UTC
This is caused by the oauth-server pods not being specific enough about their security context and their service account's privileges being too broad.
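
Why being more specific helps can be sketched with a small Python model. It assumes the following about SCC behavior: an SCC with readOnlyRootFilesystem set to true applies that setting to a container that leaves the field unset, and rejects a container that explicitly requests a writable root filesystem, so admission moves on to the next SCC. The function names and the two-entry SCC list are hypothetical, and the model ignores every other SCC field.

```python
from typing import List, Optional, Tuple


def admits(scc_read_only: bool, container_read_only: Optional[bool]) -> bool:
    """Model: a read-only SCC rejects containers that explicitly ask for a writable root."""
    if scc_read_only and container_read_only is False:
        return False
    return True


def effective_read_only(scc_read_only: bool, container_read_only: Optional[bool]) -> bool:
    """Model: an unset container field inherits the SCC's value; an explicit value is kept."""
    if container_read_only is None:
        return scc_read_only
    return container_read_only


def select_scc(
    ordered_sccs: List[Tuple[str, bool]],
    container_read_only: Optional[bool],
) -> Tuple[Optional[str], Optional[bool]]:
    """Return (SCC name, resulting read-only flag) for the first SCC that admits the container."""
    for name, scc_read_only in ordered_sccs:
        if admits(scc_read_only, container_read_only):
            return name, effective_read_only(scc_read_only, container_read_only)
    return None, None


# SCCs in priority order as described in this report: collector (read-only) ahead of anyuid.
ORDERED = [("collector", True), ("anyuid", False)]

# Unspecified security context: the container lands on collector with a read-only root.
unspecified = select_scc(ORDERED, None)

# Explicitly writable root filesystem: collector rejects it, anyuid admits it.
explicit_writable = select_scc(ORDERED, False)

print(unspecified)
print(explicit_writable)
```

Under these assumptions, an explicit writable-root setting is enough to keep the pod off the read-only SCC regardless of that SCC's priority, which matches the title of the linked pull request.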

Comment 12 errata-xmlrpc 2020-07-13 17:27:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

