Bug 1919311 - Compliance operator pod fails with: Couldn't ensure directory","error":"mkdir /reports/0: permission denied"
Summary: Compliance operator pod fails with: Couldn't ensure directory","error":"mkdir...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Compliance Operator
Version: 4.6.z
Hardware: All
OS: Unspecified
high
urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jakub Hrozek
QA Contact: Prashant Dhamdhere
URL:
Whiteboard:
Depends On:
Blocks: 1940776 1940781
 
Reported: 2021-01-22 15:08 UTC by David Hernández Fernández
Modified: 2024-06-14 00:00 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1940776 (view as bug list)
Environment:
Last Closed: 2021-07-07 11:29:56 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-operators dump (13.66 MB, text/plain)
2021-01-22 15:08 UTC, David Hernández Fernández


Links
Github openshift compliance-operator pull 573 (open): Bug 1919311: Use a separate SA for resultserver, last updated 2021-02-19 14:06:15 UTC
Red Hat Product Errata RHBA-2021:2652, last updated 2021-07-07 11:31:09 UTC

Description David Hernández Fernández 2021-01-22 15:08:20 UTC
Created attachment 1749822 [details]
openshift-operators dump

#############################
Description of problem: 
#############################
Compliance pods failing with error: Couldn't ensure directory","error":"mkdir /reports/0: permission denied"
Same issue as https://github.com/openshift/compliance-operator/issues/464

- Issue: 
The customer created the ScanSetting and ScanSettingBinding objects below and two pods were created. The 'result-server' pod fails with CrashLoopBackOff; the other pod has a second container stuck in the 'NotReady' state.

$ cat scansetting.yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: my-companys-constraints
autoApplyRemediations: false
schedule: "*/5 * * * *"
rawResultStorage:
  size: "2Gi"
  rotation: 10
roles:
  - worker
  - master

$ cat scansettingbinding.yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-companys-compliance-requirements
profiles:
  # Cluster checks
  - name: ocp4-moderate
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: my-companys-constraints
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1

$ oc get pods | grep moderate
ocp4-moderate-api-checks-pod                   1/2     NotReady           7          113m
ocp4-moderate-rs-86d78d99f5-mpdc9              0/1     CrashLoopBackOff   26         113m

From the logs of the 'result-server' pod we can see the error (from the 'cmd' logger): {"level":"error","ts":1611234025.136292,"logger":"cmd","msg":"Couldn't ensure directory","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger) 

- SCC:
The SCC policy is correct in the openshift-operators namespace (anyuid):
$ oc get pod ocp4-moderate-api-checks-pod -oyaml > ocp4-moderate-api-checks-pod.yaml
$ oc get pods ocp4-moderate-rs-86d78d99f5-mpdc9 -oyaml > ocp4-moderate-rs-86d78d99f5-mpdc9.yaml
$ oc adm policy scc-subject-review -f ocp4-moderate-api-checks-pod.yaml
RESOURCE                           ALLOWED BY
Pod/ocp4-moderate-api-checks-pod   anyuid
$ oc adm policy scc-subject-review -f ocp4-moderate-rs-86d78d99f5-mpdc9.yaml
RESOURCE                                ALLOWED BY
Pod/ocp4-moderate-rs-86d78d99f5-mpdc9   anyuid

$ oc describe project openshift-operators
Name:			openshift-operators
Created:		10 months ago
Labels:			openshift.io/run-level=1
			openshift.io/scc=anyuid
Annotations:		openshift.io/node-selector=
			openshift.io/sa.scc.mcs=s0:c17,c14
			openshift.io/sa.scc.supplemental-groups=1000300000/10000
			openshift.io/sa.scc.uid-range=1000300000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:			<none>
Resource limits:	<none>

$ oc get scc anyuid
NAME     PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP    SUPGROUP   PRIORITY   READONLYROOTFS   VOLUMES
anyuid   false   <no value>   MustRunAs   RunAsAny    RunAsAny   RunAsAny   10         false            ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]

$ oc describe scc anyuid
Name:						anyuid
Priority:					10
Access:
  Users:					<none>
  Groups:					system:cluster-admins
Settings:
  Allow Privileged:				false
  Allow Privilege Escalation:			true
  Default Add Capabilities:			<none>
  Required Drop Capabilities:			MKNOD
  Allowed Capabilities:				<none>
  Allowed Seccomp Profiles:			<none>
  Allowed Volume Types:				configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:				<all>
  Allowed Unsafe Sysctls:			<none>
  Forbidden Sysctls:				<none>
  Allow Host Network:				false
  Allow Host Ports:				false
  Allow Host PID:				false
  Allow Host IPC:				false
  Read Only Root Filesystem:			false
  Run As User Strategy: RunAsAny
    UID:					<none>
    UID Range Min:				<none>
    UID Range Max:				<none>
  SELinux Context Strategy: MustRunAs
    User:					<none>
    Role:					<none>
    Type:					<none>
    Level:					<none>
  FSGroup Strategy: RunAsAny
    Ranges:					<none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:					<none>


- Operator
Details of the operator after trying to reinstall from OLM; all versions failed in the same way.
$ oc get installplan
NAME            CSV                           APPROVAL    APPROVED
install-7pg89   compliance-operator.v0.1.24   Automatic   true
install-vv4sp   compliance-operator.v0.1.17   Automatic   true

$ oc get csv
NAME                                         DISPLAY                                VERSION   REPLACES                              PHASE
compliance-operator.v0.1.17                  Compliance Operator                    0.1.17                                          Replacing
compliance-operator.v0.1.24                  Compliance Operator                    0.1.24    compliance-operator.v0.1.17           Failed

$ oc get subscription
NAME                              PACKAGE                           SOURCE             CHANNEL
compliance-operator               compliance-operator               redhat-operators   4.6

Attached is a dump of the openshift-operators namespace where the operator is installed. Let us know if you need more specific information.

Comment 10 Jakub Hrozek 2021-02-18 17:37:48 UTC
Matt helped me get on the right track by noticing that we use the default SA for the resultserver; then it clicked for me.

The resultserver uses the 'default' SA, which normally only gets to use the 'restricted' SCC. The restricted SCC causes the pod to be assigned a UID and GID from the namespace's range on admission, and at the same time the pod gets a .spec.securityContext.fsGroup. The fsGroup option in turn causes the PVC mount to be owned by root:GID, where GID is the one that was assigned on admission. That is the default behaviour, but neither of the customer cases uses exactly the default configuration.
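
For illustration, a resultserver pod admitted under the restricted SCC would carry a securityContext roughly shaped like the sketch below. The concrete values come from the namespace's openshift.io/sa.scc.* annotations (the ones from the description are used here as placeholders), and the container name is illustrative:

# Sketch of what restricted-SCC admission injects (values are placeholders)
spec:
  securityContext:
    fsGroup: 1000300000        # from openshift.io/sa.scc.supplemental-groups; makes the PVC mount group-writable
    seLinuxOptions:
      level: s0:c17,c14        # from openshift.io/sa.scc.mcs
  containers:
  - name: result-server        # illustrative container name
    securityContext:
      runAsUser: 1000300000    # from openshift.io/sa.scc.uid-range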

In one of the cases, the customer installed the operator into the openshift-operators namespace, which is annotated with openshift.io/scc:anyuid. In the other case, the customer runs something called "PlanetScale Operator for Vitess", whose description says "This operator should be deployed in an isolated namespace since the Pods it creates use the `default` service account and require the `use` permission on the `anyuid` Security Context Constraint (SCC) to run correctly.". In both cases the result is that the default SA uses the anyuid SCC instead of the restricted SCC, so the pod does not receive the IDs from the namespace or the fsGroup option; this is what causes the permission issue.
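
As a quick check (not from the original report), oc adm policy scc-review can show which SCC the default SA would be matched to for the resultserver manifest dumped earlier; for example, using the file name from the description:

$ oc adm policy scc-review -z default -f ocp4-moderate-rs-86d78d99f5-mpdc9.yaml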

As an immediate workaround, the customer who deploys into the openshift-operators namespace could deploy into the openshift-compliance namespace instead. The other customer, who also deployed the PlanetScale Operator to openshift-compliance, can instead deploy that operator elsewhere (as the operator itself suggests).
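
For the first case, redeploying into a dedicated namespace is just the usual OLM objects; a minimal sketch, reusing the package name and 4.6 channel from the subscription output above and assuming the redhat-operators catalog source:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-compliance
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: compliance-operator
  namespace: openshift-compliance
spec:
  targetNamespaces:
  - openshift-compliance
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: compliance-operator
  namespace: openshift-compliance
spec:
  channel: "4.6"
  name: compliance-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace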

That said, we should not assume so much in the Compliance Operator and should be more defensive. I tried just forcing the restricted SCC by adding an annotation to the RS deployment, and even to the pod template inside it, but that didn't work. What does seem to work is creating a separate SA for the resultserver. I'll send a PR with these changes; at first glance they seemed to work, but RBAC changes are tricky.
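
The shape of that change (the authoritative version is in the PR linked from this bug) is a dedicated ServiceAccount that is not granted anyuid, with the resultserver pod template pointed at it; roughly, with illustrative names:

# Dedicated SA so the resultserver no longer inherits whatever SCC the 'default' SA happens to have
apiVersion: v1
kind: ServiceAccount
metadata:
  name: resultserver              # illustrative; see the PR for the real name
  namespace: openshift-compliance
---
# ...and in the resultserver Deployment's pod template:
#   spec:
#     template:
#       spec:
#         serviceAccountName: resultserver   # instead of the implicit 'default' SA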

Since there seem to be workarounds for both cases, I'm not sure how urgent it is to deliver the fix to OCP; in other words, it depends on how feasible the workarounds are for the customers.

Comment 11 Jakub Hrozek 2021-02-19 12:50:10 UTC
To reproduce:
1. create a Role that allows the use of the anyuid SCC in the openshift-compliance namespace

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-anyuid
  namespace: openshift-compliance
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - anyuid
  resources:
  - securitycontextconstraints
  verbs:
  - use

2. bind that Role to the default SA

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-to-anyuid
  namespace: openshift-compliance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-anyuid
subjects:
- kind: ServiceAccount
  name: default

3. Start a ComplianceSuite
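
For step 3, creating the ScanSetting and ScanSettingBinding from the description in the openshift-compliance namespace is enough; the binding makes the operator generate the ComplianceSuite. For example:

apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-companys-compliance-requirements
  namespace: openshift-compliance
profiles:
- name: ocp4-moderate
  kind: Profile
  apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: my-companys-constraints
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1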

Comment 12 Jakub Hrozek 2021-02-19 12:57:51 UTC
Additional things to check when testing or reproducing the bug:
 - when the bug occurs, the resultserver pod will be annotated with openshift.io/scc: anyuid
 - when the bug occurs, the resultserver pod will NOT have the .spec.securityContext.fsGroup option set in its spec

Comment 13 xiyuan 2021-02-19 13:54:48 UTC
I wasn't able to reproduce this issue with a separate namespace.

But with the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c11, it reproduced.
$ oc get pod
NAME                                                            READY   STATUS             RESTARTS   AGE
compliance-operator-56894574c6-482gd                            1/1     Running            0          19m
my-companys-compliance-requirements-rerunner-1613742000-2hmpj   0/1     Completed          0          12m
my-companys-compliance-requirements-rerunner-1613742300-cmfzg   0/1     Completed          0          7m48s
my-companys-compliance-requirements-rerunner-1613742600-tkb8t   0/1     Completed          0          2m46s
ocp4-co2-pp-9c9cf6c9-gbx66                                      1/1     Running            0          19m
ocp4-moderate-api-checks-pod                                    1/2     NotReady           3          17m
ocp4-moderate-rs-5b74cbb679-jxrj5                               0/1     CrashLoopBackOff   8          17m
rhcos4-co2-pp-697dc89f57-f242x                                  1/1     Running            0          19m

$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.metadata.annotations} | jq -r
{
  "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"10.131.0.211/23\"],\"mac_address\":\"0a:58:0a:83:00:d3\",\"gateway_ips\":[\"10.131.0.1\"],\"ip_address\":\"10.131.0.211/23\",\"gateway_ip\":\"10.131.0.1\"}}",
  "k8s.v1.cni.cncf.io/network-status": "[{\n    \"name\": \"\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.211\"\n    ],\n    \"mac\": \"0a:58:0a:83:00:d3\",\n    \"default\": true,\n    \"dns\": {}\n}]",
  "k8s.v1.cni.cncf.io/networks-status": "[{\n    \"name\": \"\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.211\"\n    ],\n    \"mac\": \"0a:58:0a:83:00:d3\",\n    \"default\": true,\n    \"dns\": {}\n}]",
  "openshift.io/scc": "anyuid"
}

$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.spec.securityContext} | jq -r
{
  "seLinuxOptions": {
    "level": "s0:c26,c15"
  }
}

Comment 15 Prashant Dhamdhere 2021-02-19 16:10:22 UTC
I was able to reproduce this issue on an AWS cluster when the Compliance Operator is installed in the default namespace.
However, the Compliance Operator still upgraded successfully to the latest version, compliance-operator.v0.1.26.
The issue is not observed when the Compliance Operator is installed in the openshift-compliance namespace.

Steps to Reproduce:

1. Install OCP 4.6 with compliance-operator.v0.1.24 in the default namespace
2. Upgrade OCP to 4.7 and run a scan without upgrading the compliance-operator

Version and upgrade path:

OCP 4.6 (4.6.0-0.nightly-2021-02-18-050133) upgraded to OCP 4.7 (4.7.0-0.nightly-2021-02-18-110409)

Summarised steps:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-18-110409   True        False         114m    Cluster version is 4.7.0-0.nightly-2021-02-18-110409

$ oc get csv -ndefault
NAME                                           DISPLAY                            VERSION                 REPLACES   PHASE
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                             Succeeded
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0              Succeeded


$ oc get pods -ndefault|grep cis 
ocp4-cis-api-checks-pod                                 1/2     NotReady           2          41m
ocp4-cis-node-master-rs-7695f48597-n8kqc                0/1     CrashLoopBackOff   12         41m
ocp4-cis-node-worker-rs-7c677ddd5-9lljw                 0/1     CrashLoopBackOff   12         41m
ocp4-cis-rs-8bbdbfcc7-drtfl                             0/1     CrashLoopBackOff   12         41m


$ oc logs ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault
{"level":"error","ts":1613748573.055087,"logger":"cmd","msg":"Couldn't ensure directory","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.ensureDir\n\t/remote-source/app/cmd/manager/resultserver.go:111\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:169\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"dpanic","ts":1613748573.0561721,"logger":"cmd","msg":"odd number of arguments passed as key-value pairs for logging","ignored key":"/reports/0","stacktrace":"github.com/go-logr/zapr.handleFields\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:100\ngithub.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:133\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"error","ts":1613748573.0560951,"logger":"cmd","msg":"Error ensuring result path: %s","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}


$ oc get pod ocp4-cis-api-checks-pod -oyaml -ndefault> ocp4-api-checks-pod.yaml
$ oc get pod ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault -oyaml > ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml

$ oc adm policy scc-subject-review -f ocp4-api-checks-pod.yaml
RESOURCE                      ALLOWED BY   
Pod/ocp4-cis-api-checks-pod   anyuid       

$ oc adm policy scc-subject-review -f ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml
RESOURCE                                       ALLOWED BY   
Pod/ocp4-cis-node-master-rs-7695f48597-n8kqc   anyuid       

$ oc describe project default
Name:			default
Created:		10 hours ago
Labels:			olm.operatorgroup.uid/32454bdd-c2c9-4fbc-83a1-d07e4ab7f078=
Annotations:		openshift.io/sa.scc.mcs=s0:c6,c5
			openshift.io/sa.scc.supplemental-groups=1000040000/10000
			openshift.io/sa.scc.uid-range=1000040000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:			<none>
Resource limits:	<none>

$ oc get scc anyuid
NAME     PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP    SUPGROUP   PRIORITY   READONLYROOTFS   VOLUMES
anyuid   false   <no value>   MustRunAs   RunAsAny    RunAsAny   RunAsAny   10         false            ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]

$ oc describe scc anyuid
Name:						anyuid
Priority:					10
Access:						
  Users:					<none>
  Groups:					system:cluster-admins
Settings:					
  Allow Privileged:				false
  Allow Privilege Escalation:			true
  Default Add Capabilities:			<none>
  Required Drop Capabilities:			MKNOD
  Allowed Capabilities:				<none>
  Allowed Seccomp Profiles:			<none>
  Allowed Volume Types:				configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:				<all>
  Allowed Unsafe Sysctls:			<none>
  Forbidden Sysctls:				<none>
  Allow Host Network:				false
  Allow Host Ports:				false
  Allow Host PID:				false
  Allow Host IPC:				false
  Read Only Root Filesystem:			false
  Run As User Strategy: RunAsAny		
    UID:					<none>
    UID Range Min:				<none>
    UID Range Max:				<none>
  SELinux Context Strategy: MustRunAs		
    User:					<none>
    Role:					<none>
    Type:					<none>
    Level:					<none>
  FSGroup Strategy: RunAsAny			
    Ranges:					<none>
  Supplemental Groups Strategy: RunAsAny	
    Ranges:					<none>


$ oc delete scansettingbinding --all -ndefault
scansettingbinding.compliance.openshift.io "cis-test" deleted

$ oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   9h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   9h

$ oc get sub -ndefault
NAME                            PACKAGE               SOURCE                CHANNEL
openshift-compliance-operator   compliance-operator   compliance-operator   4.6

$ oc patch subscriptions openshift-compliance-operator -p '{"spec":{"source":"qe-app-registry"}}' --type='merge' -ndefault
subscription.operators.coreos.com/openshift-compliance-operator patched

$ oc get csv -ndefault -w
NAME                                           DISPLAY                            VERSION                 REPLACES                      PHASE
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Replacing
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Installing
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0                                 Succeeded
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Succeeded
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Deleting
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Deleting

$ oc get csv -ndefault 
NAME                                           DISPLAY                            VERSION                 REPLACES                      PHASE
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Succeeded
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0                                 Succeeded


$ oc get pods -ndefault
NAME                                   READY   STATUS     RESTARTS   AGE
compliance-operator-84d98f59fc-mp5v7   1/1     Running    0          96s
ocp4-default-pp-7f6cdd564d-m2gcg       1/1     Running    0          50s
rhcos4-default-pp-65c5c4c44b-96shk     1/1     Running    0          141m
rhcos4-default-pp-6b9984f5cd-jxwmd     0/1     Init:1/2   0          50s

$ oc get pods -ndefault
NAME                                   READY   STATUS    RESTARTS   AGE
compliance-operator-84d98f59fc-mp5v7   1/1     Running   0          2m22s
ocp4-default-pp-7f6cdd564d-m2gcg       1/1     Running   0          96s
rhcos4-default-pp-6b9984f5cd-jxwmd     1/1     Running   0          96s

Comment 22 xiyuan 2021-03-22 09:20:57 UTC
Moving to Verified per comment https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c16

Comment 26 errata-xmlrpc 2021-07-07 11:29:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Compliance Operator version 0.1.35 for OpenShift Container Platform 4.6-4.8), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2652

