Bug 1961472 - openshift-marketplace pods in CrashLoopBackOff state after RHACS installed with an SCC with readOnlyFileSystem set to true
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joe Lanford
QA Contact: Bruno Andrade
Depends On:
Blocks: 1962314
Reported: 2021-05-18 04:02 UTC by Matt Bargenquast
Modified: 2021-07-27 23:09 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Catalog registry pods do not have `readOnlyRootFilesystem: false` explicitly set in their securityContext field.
Consequence: If an SCC exists that enforces `readOnlyRootFilesystem: true` and otherwise matches the catalog registry pod securityContext, it is assigned to the catalog registry pod, causing the pod to fail in a crash loop.
Fix: Explicitly set `readOnlyRootFilesystem: false` when creating catalog registry pods.
Result: Catalog registry pods are no longer matched to SCCs that enforce a read-only root filesystem, and thus no longer fail.
Clone Of:
: 1962314 (view as bug list)
Last Closed: 2021-07-27 23:08:46 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift operator-framework-olm pull 82 | 0 | None | open | Bug 1961472: Explicitly set `readOnlyRootFilesystem: false` on created registry pods. | 2021-05-19 17:00:15 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:09:02 UTC

Description Matt Bargenquast 2021-05-18 04:02:18 UTC
Description of problem:
Various operator pods in "openshift-marketplace" are in a CrashLoopBackOff state after installation of Red Hat Advanced Cluster Security / StackRox.

They appear to be picking up one of the installed RHACS SCCs ('monitoring'), which is more restrictive and sets readOnlyRootFilesystem to "true".

Affected pods:
NAME                                             READY   STATUS             RESTARTS   AGE
certified-operators-mqzg7                        0/1     CrashLoopBackOff   7244       26d
community-operators-fkm62                        0/1     CrashLoopBackOff   7257       26d
redhat-marketplace-5x6fh                         0/1     CrashLoopBackOff   7270       26d
redhat-operators-dm9sc                           0/1     CrashLoopBackOff   7319       26d

Example Pod logs:

$ oc logs -n openshift-marketplace certified-operators-mqzg7
Error: open db-943778815: read-only file system
  opm registry serve [flags]

  -d, --database string          relative path to sqlite db (default "bundles.db")
      --debug                    enable debug logging
  -h, --help                     help for serve
  -p, --port string              port number to serve on (default "50051")
      --skip-migrate             do  not attempt to migrate to the latest db revision when starting
  -t, --termination-log string   path to a container termination log file (default "/dev/termination-log")

Version-Release number of selected component (if applicable):

OCP v4.7.6

How reproducible:

Always, once the SCC has been created and the pods have been recreated.

Steps to Reproduce:
1. Create the 'monitoring' SCC from RHACS (or a high priority SCC with readOnlyRootFilesystem set to true).
2. Delete the marketplace pods.
3. The pods are recreated and use the 'monitoring' SCC, resulting in a crash loop.
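
To confirm which SCC each recreated pod was admitted under, one option is to read the `openshift.io/scc` annotation from the output of `oc get pods -n openshift-marketplace -o json`. The following is a minimal sketch, not part of the original report: the helper name `pods_by_scc` and the sample data are hypothetical, and the sample is hand-written rather than captured from a cluster.

```python
import json


def pods_by_scc(pod_list):
    """Map pod name -> (admitted SCC, True if any container is in CrashLoopBackOff)."""
    report = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # The admitting SCC is recorded in the pod's annotations.
        scc = meta.get("annotations", {}).get("openshift.io/scc")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        crashing = any(
            (s.get("state", {}).get("waiting") or {}).get("reason") == "CrashLoopBackOff"
            for s in statuses
        )
        report[meta.get("name")] = (scc, crashing)
    return report


# Hand-written sample shaped like `oc get pods -o json` output (illustrative only).
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "certified-operators-mqzg7",
                   "annotations": {"openshift.io/scc": "monitoring"}},
      "status": {"containerStatuses": [
        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
      ]}
    },
    {
      "metadata": {"name": "example-healthy-pod",
                   "annotations": {"openshift.io/scc": "example-scc"}},
      "status": {"containerStatuses": [
        {"state": {"running": {}}}
      ]}
    }
  ]
}
""")

for name, (scc, crashing) in pods_by_scc(SAMPLE).items():
    print(f"{name}: scc={scc} crashlooping={crashing}")
```

In practice the dictionary passed to `pods_by_scc` would come from `json.load` on the saved `oc` output instead of the inline sample.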

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1942725

Comment 1 Joe Lanford 2021-05-19 13:49:58 UTC
> When evaluating SCCs, the admission controller runs through them by priority.  A priority of 'nil' is equal to a priority of 0, which is the highest priority; therefore, all of these SCCs are at the top of the list.  From there, they're evaluated from most restrictive to least restrictive until an SCC matches the requests in the pod's SecurityContext and applies the first one that matches.
> The root cause here is that the API server's security context specifies it needs to be privileged but it does _not_ specify that it needs a read/write root file system.  So, if the StackRox SCC is in place, that's the most restrictive, priority 0 SCC and it gets applied.  Later, when the API server tries to write something, it fails and bad things happen.

It appears that we need to explicitly set `securityContext.readOnlyRootFilesystem` to false to avoid matching the StackRox SCC. This seems somewhat unexpected to me: the default value of `securityContext.readOnlyRootFilesystem` is false, so setting it explicitly looks unnecessary, but it is very possible that I don't understand the background and reasoning.
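
The difference between an unset field and an explicit `false` can be illustrated with a deliberately simplified model. This sketch is an assumption-laden toy, not the admission controller's code: it looks only at `readOnlyRootFilesystem`, takes the SCCs already in evaluation order, and uses hypothetical names (`SCC`, `Container`, `select_scc`, `other-scc`). It assumes that an SCC enforcing a read-only root accepts a container that leaves the field unset (and then forces it to true) but rejects one that explicitly asks for false.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SCC:
    name: str
    read_only_root_filesystem: bool  # True: the SCC enforces a read-only root


@dataclass
class Container:
    name: str
    # None models "field not set in the pod spec".
    read_only_root_filesystem: Optional[bool] = None


def accepts(scc: SCC, container: Container) -> bool:
    # Only an explicit request for a writable root conflicts with an enforcing SCC.
    return not (scc.read_only_root_filesystem
                and container.read_only_root_filesystem is False)


def effective_read_only(scc: SCC, container: Container) -> bool:
    # An enforcing SCC sets the value to True; otherwise the container's own
    # value applies, with unset treated as False.
    if scc.read_only_root_filesystem:
        return True
    return bool(container.read_only_root_filesystem)


def select_scc(sccs_in_order: List[SCC],
               container: Container) -> Optional[Tuple[str, bool]]:
    """Return (SCC name, effective read-only flag) for the first SCC that accepts."""
    for scc in sccs_in_order:
        if accepts(scc, container):
            return scc.name, effective_read_only(scc, container)
    return None


sccs = [SCC("monitoring", True), SCC("other-scc", False)]
print(select_scc(sccs, Container("registry")))         # field unset
print(select_scc(sccs, Container("registry", False)))  # explicit false
```

Under these assumptions the unset container lands on the enforcing SCC and ends up with a read-only root, while the explicit `false` skips it and falls through to the next candidate.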

Looks like this is the pod that needs to be updated with an explicit container securityContext: https://github.com/operator-framework/operator-lifecycle-manager/blob/15790a8a2f07fe65a3dbf5a45a54d35e20f2cce9/pkg/controller/registry/reconciler/reconciler.go#L94

Comment 5 Bruno Andrade 2021-05-21 15:12:21 UTC
OCP Version: 4.8.0-0.nightly-2021-05-21-101954
OLM version: 0.17.0
git commit: ca1f0b69c3e2eb06ab4e62517fe5bd11e59a3239

1) Confirmed that catalog pod has attribute readOnlyRootFilesystem set to false

oc get pods redhat-operators-xl5jb -n openshift-marketplace -o yaml

  - image: registry.redhat.io/redhat/redhat-operator-index:v4.8
    imagePullPolicy: Always
        - MKNOD
      readOnlyRootFilesystem: false
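
The same check can be scripted against `oc get pods <name> -n openshift-marketplace -o json` output. The sketch below is illustrative only: the helper name `containers_missing_explicit_false`, the container name `registry`, and both sample pods are hypothetical and hand-written.

```python
def containers_missing_explicit_false(pod):
    """Return names of containers that do not set readOnlyRootFilesystem to False."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        security_context = container.get("securityContext") or {}
        # `is not False` also flags an absent field, which is the pre-fix state.
        if security_context.get("readOnlyRootFilesystem") is not False:
            missing.append(container.get("name"))
    return missing


# Hand-written samples shaped like a pod object (illustrative only).
FIXED_POD = {
    "spec": {"containers": [
        {"name": "registry",
         "securityContext": {"readOnlyRootFilesystem": False}},
    ]}
}
UNFIXED_POD = {
    "spec": {"containers": [
        {"name": "registry"},  # no securityContext at all
    ]}
}

print(containers_missing_explicit_false(FIXED_POD))
print(containers_missing_explicit_false(UNFIXED_POD))
```

An empty list means every container in the pod carries the explicit setting.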

2) Installed the Advanced Cluster Management for Kubernetes Operator

oc get csv -n open-cluster-management                                                                        
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.2.3   Advanced Cluster Management for Kubernetes   2.2.3     advanced-cluster-management.v2.2.2   Succeeded

3) Checked that the catalog pods are healthy:

oc get pods -n openshift-marketplace                                                                          
NAME                                                              READY   STATUS      RESTARTS   AGE
14bbd46d68f3ddd50b9328cee6854a36807ef784dac2bded9cc20638fbv7f5f   0/1     Completed   0          5m51s
certified-operators-jcpbp                                         1/1     Running     0          49m
community-operators-5qt64                                         1/1     Running     0          49m
marketplace-operator-99db68d8d-czzwm                              1/1     Running     0          52m
redhat-marketplace-5q5vk                                          1/1     Running     0          49m
redhat-operators-xl5jb                                            1/1     Running     0          49m

LGTM, marking as VERIFIED.

Comment 8 errata-xmlrpc 2021-07-27 23:08:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

