Bug 1824800

Summary: openshift authentication operator is in a crashbackoffloop
Product: OpenShift Container Platform Reporter: Matthew Robson <mrobson>
Component: apiserver-auth Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA QA Contact: pmali
Severity: high Docs Contact:
Priority: high    
Version: 4.3.0 CC: aos-bugs, arghosh, mfojtik, scheng, slaznick, sttts, xxia
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: An incomplete security context on the oauth-server pods could cause them to crash when they pick up a custom SCC that overrides the default behavior.
Consequence: The oauth-server pods start crash-looping.
Fix: The security context of the oauth-server pods now includes the configuration they need in order to run.
Result: A custom SCC does not prevent the oauth-server pods from running.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:27:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1842442    

Description Matthew Robson 2020-04-16 13:14:08 UTC
Description of problem:

The logs show:

Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Read-only file system

The root cause is that the customer was running a PoC of a security product, StackRox. The product was correctly creating its own SCC, but that SCC had been given a priority of 100 with RunAsAny and readOnlyRootFilesystem: true. This put it ahead of anyuid in priority for certain operators, causing them to crash as seen above.


Similar to the DefaultSecurityContextConstraints_Mutated alerting, how can we prevent an ill-advised SCC from negatively impacting the platform and ensure ISVs are configuring things correctly?


Version-Release number of selected component (if applicable):
4.3.x

How reproducible:
Always

Steps to Reproduce:
1. Create an SCC as noted above
2. Observe operators like authentication

Actual results:
Operators in a crashloop without an obvious reason related to the SCC changes.

Expected results:
Warning or guidance around these types of platform impacting changes.

Additional info:
Setting the priority to less than anyuid fixed the issue.

Comment 5 Stefan Schimanski 2020-04-16 16:30:38 UTC
DefaultSecurityContextConstraints_Mutated is going to be reverted. PRs merged. Next z stream release should have it removed.

The other topic must be analyzed. Have you done a comparison between the original and the installed SCC? Hard to believe that equal SCCs behave differently.

Comment 6 Matthew Robson 2020-04-16 17:20:29 UTC
It's not that 2 equal SCCs are behaving differently, it's the impact a 3rd party SCC can have on the platform components.

A default install:

oc get pod authentication-operator-7fb9bc495c-5pt9p -o yaml | grep scc
    openshift.io/scc: anyuid

oc get pod oauth-openshift-594478b797-xkgxj -o yaml | grep scc
    openshift.io/scc: anyuid

3rd party tool comes along and creates its own SCC, as it should, but the SCC creates a conflict with anyuid.

oc apply -f securitycontextconstraints-collector.yaml
securitycontextconstraints.security.openshift.io/collector created

The full scc is attached above.

For a while, nothing may change as all of the pods are already running. 

An oauth change happens and the oauth pods start rolling:

oc get pods oauth-openshift-594478b797-9gc98 -o yaml | grep scc
    openshift.io/scc: collector

The first pod goes into a CrashLoopBackOff because it is now using the collector SCC instead of anyuid. The collector SCC has a higher priority and sets the root filesystem to read-only, which leads to the pods failing. This would be a bigger issue during an upgrade event.
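
The selection behaviour described here can be illustrated with a simplified model. This is a sketch, not the OpenShift admission code: it assumes SCCs are tried in descending priority order with the name as a tie-breaker, and it ignores the restrictiveness ordering and per-pod validation that the real admission plugin also applies. The priority of 100 for collector comes from this report, 10 is the default priority of anyuid, and the `Scc` class and `pick_scc` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scc:
    name: str
    priority: Optional[int]  # an unset priority is treated as 0
    read_only_root_fs: bool


def pick_scc(sccs: List[Scc]) -> Scc:
    """Return the SCC that is tried first: highest priority, then name."""
    return sorted(sccs, key=lambda s: (-(s.priority or 0), s.name))[0]


sccs = [
    Scc("anyuid", 10, False),
    Scc("collector", 100, True),   # third-party SCC from this report
    Scc("restricted", None, False),
]

chosen = pick_scc(sccs)
print(chosen.name, chosen.read_only_root_fs)
```

In this model, lowering the collector priority below 10 makes anyuid the first candidate again, which matches the workaround noted in the description.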

Four operators and the oauth pods use the anyuid SCC: authentication-operator, oauth-openshift, cluster-node-tuning-operator, openshift-service-catalog-apiserver-operator, and openshift-service-catalog-controller-manager-operator.
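
To check which SCC admitted each pod without grepping them one at a time, the `openshift.io/scc` annotation can be read from the JSON form of the pod list. The following is a minimal sketch that assumes the input is the output of `oc get pods -o json` (a list object with an `items` array); the `scc_by_pod` function name is hypothetical, and the sample data only mirrors the two oauth pods shown in this comment.

```python
import json
from typing import Dict


def scc_by_pod(pod_list_json: str) -> Dict[str, str]:
    """Map each pod name to the value of its openshift.io/scc annotation."""
    result: Dict[str, str] = {}
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        result[meta.get("name", "")] = annotations.get("openshift.io/scc", "<none>")
    return result


# Sample input shaped like `oc get pods -o json`, using pod names from this report.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "oauth-openshift-594478b797-xkgxj",
                      "annotations": {"openshift.io/scc": "anyuid"}}},
        {"metadata": {"name": "oauth-openshift-594478b797-9gc98",
                      "annotations": {"openshift.io/scc": "collector"}}},
    ]
})

for name, scc in scc_by_pod(sample).items():
    print(name, scc)
```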

Comment 7 Standa Laznicka 2020-04-17 12:39:06 UTC
This is caused by the oauth-server pods not being specific enough about their security context and their service account's privileges being too broad.
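
Why a more specific security context helps can be shown with a simplified model. An SCC with readOnlyRootFilesystem: true defaults the field to true when a container leaves it unset, but rejects a container that explicitly asks for a writable root filesystem, so admission moves on to the next SCC. The sketch below assumes that behaviour and a plain priority ordering. Which fields the actual fix sets is not stated in this report, so readOnlyRootFilesystem is used here only as an assumption, and the `SCCS` table and `admit` function are hypothetical.

```python
from typing import List, Optional, Tuple

# (name, priority, forces_read_only_root_fs): simplified, hypothetical SCCs
SCCS: List[Tuple[str, int, bool]] = [
    ("collector", 100, True),
    ("anyuid", 10, False),
]


def admit(requested_read_only: Optional[bool]) -> Tuple[str, bool]:
    """Return (SCC name, effective readOnlyRootFilesystem) for a container.

    requested_read_only is None when the pod spec leaves the field unset.
    """
    for name, _priority, forces_ro in sorted(SCCS, key=lambda s: -s[1]):
        if forces_ro and requested_read_only is False:
            # The explicit request conflicts with this SCC: try the next one.
            continue
        effective = True if forces_ro else bool(requested_read_only)
        return name, effective
    raise RuntimeError("no SCC admits the pod")


print(admit(None))   # field unset in the pod spec
print(admit(False))  # field set explicitly to false
```

With the field unset the model returns the collector SCC with a read-only root filesystem; with the field set explicitly to false it falls through to anyuid and the filesystem stays writable.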

Comment 12 errata-xmlrpc 2020-07-13 17:27:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409