Created attachment 1749822 [details]
openshift-operators dump

#############################
Description of problem:
#############################

Compliance pods are failing with the error:

  Couldn't ensure directory","error":"mkdir /reports/0: permission denied"

Same issue as https://github.com/openshift/compliance-operator/issues/464

- Issue: The customer created the 'scanning' and 'scanningbinding' objects below, and two pods are created. The 'result-server' pod fails with a CrashLoopBackOff. The other pod has a second container that is stuck in the 'NotReady' state.

$ cat scansetting.yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: my-companys-constraints
autoApplyRemediations: false
schedule: "*/5 * * * *"
rawResultStorage:
  size: "2Gi"
  rotation: 10
roles:
  - worker
  - master

$ cat scansettingbinding.yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-companys-compliance-requirements
profiles:
  # Cluster checks
  - name: ocp4-moderate
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: my-companys-constraints
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1

$ oc get pods | grep moderate
ocp4-moderate-api-checks-pod        1/2   NotReady           7    113m
ocp4-moderate-rs-86d78d99f5-mpdc9   0/1   CrashLoopBackOff   26   113m

From the logs of the 'result-server' pod we can see the error:

{"level":"error","ts":1611234025.136292,"logger":"cmd","msg":"Couldn't ensure directory","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger)

- SCC: The SCC policy is correct in the openshift-operators namespace (anyuid):

$ oc get pod ocp4-moderate-api-checks-pod -oyaml > ocp4-moderate-api-checks-pod.yaml
$ oc get pods ocp4-moderate-rs-86d78d99f5-mpdc9 -oyaml > ocp4-moderate-rs-86d78d99f5-mpdc9.yaml

$ oc adm policy scc-subject-review -f ocp4-moderate-api-checks-pod.yaml
RESOURCE                           ALLOWED BY
Pod/ocp4-moderate-api-checks-pod   anyuid

$ oc adm policy scc-subject-review -f ocp4-moderate-rs-86d78d99f5-mpdc9.yaml
RESOURCE                                ALLOWED BY
Pod/ocp4-moderate-rs-86d78d99f5-mpdc9   anyuid

$ oc describe project openshift-operators
Name:            openshift-operators
Created:         10 months ago
Labels:          openshift.io/run-level=1
                 openshift.io/scc=anyuid
Annotations:     openshift.io/node-selector=
                 openshift.io/sa.scc.mcs=s0:c17,c14
                 openshift.io/sa.scc.supplemental-groups=1000300000/10000
                 openshift.io/sa.scc.uid-range=1000300000/10000
Display Name:    <none>
Description:     <none>
Status:          Active
Node Selector:   <none>
Quota:           <none>
Resource limits: <none>

$ oc get scc anyuid
NAME     PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP    SUPGROUP   PRIORITY   READONLYROOTFS   VOLUMES
anyuid   false   <no value>   MustRunAs   RunAsAny    RunAsAny   RunAsAny   10         false            ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]

$ oc describe scc anyuid
Name:                           anyuid
Priority:                       10
Access:
  Users:                        <none>
  Groups:                       system:cluster-admins
Settings:
  Allow Privileged:             false
  Allow Privilege Escalation:   true
  Default Add Capabilities:     <none>
  Required Drop Capabilities:   MKNOD
  Allowed Capabilities:         <none>
  Allowed Seccomp Profiles:     <none>
  Allowed Volume Types:         configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:          <all>
  Allowed Unsafe Sysctls:       <none>
  Forbidden Sysctls:            <none>
  Allow Host Network:           false
  Allow Host Ports:             false
  Allow Host PID:               false
  Allow Host IPC:               false
  Read Only Root Filesystem:    false
  Run As User Strategy: RunAsAny
    UID:                        <none>
    UID Range Min:              <none>
    UID Range Max:              <none>
  SELinux Context Strategy: MustRunAs
    User:                       <none>
    Role:                       <none>
    Type:                       <none>
    Level:                      <none>
  FSGroup Strategy: RunAsAny
    Ranges:                     <none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:                     <none>

- Operator: details of the operator after trying to reinstall from OLM; all versions failed equally.

$ oc get installplan
NAME            CSV                           APPROVAL    APPROVED
install-7pg89   compliance-operator.v0.1.24   Automatic   true
install-vv4sp   compliance-operator.v0.1.17   Automatic   true

$ oc get csv
NAME                          DISPLAY               VERSION   REPLACES                      PHASE
compliance-operator.v0.1.17   Compliance Operator   0.1.17                                  Replacing
compliance-operator.v0.1.24   Compliance Operator   0.1.24    compliance-operator.v0.1.17   Failed

$ oc get subscription
NAME                  PACKAGE               SOURCE             CHANNEL
compliance-operator   compliance-operator   redhat-operators   4.6

Attached is a dump of the openshift-operators namespace where the operator is installed. Let us know if you need more specific information.
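For context, the stack trace in the logs points at main.ensureDir in cmd/manager/resultserver.go. A minimal sketch of the failing path (my simplification, not the exact upstream code): the result server tries to create a rotation subdirectory under the PVC mount point, and when the process has no write permission on the mount, the mkdir fails with the "permission denied" seen above.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureDir creates the per-rotation subdirectory under the PVC mount.
// Without fsGroup, the mount is owned by root with no group write bit,
// so a non-root process gets EACCES here.
func ensureDir(base string, index int) error {
	dir := filepath.Join(base, fmt.Sprint(index))
	if err := os.MkdirAll(dir, 0750); err != nil {
		return fmt.Errorf("ensuring %s: %w", dir, err)
	}
	return nil
}

func main() {
	// /reports is the PVC mount point used by the result server.
	if err := ensureDir("/reports", 0); err != nil {
		// Logged upstream as: "Couldn't ensure directory"
		fmt.Fprintln(os.Stderr, "Couldn't ensure directory:", err)
		os.Exit(1)
	}
}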
Matt had helped me get on the right track by noticing that we use the default SA for the resultserver, and then it clicked for me. The resultserver uses the 'default' SA, which normally gets to use only the 'restricted' SCC. The restricted SCC causes the pod to be assigned a UID and GID from the namespace's range on admission, and at the same time the pod gets a .spec.securityContext.fsGroup. The fsGroup option in turn causes the PVC mount to be owned by root:GID, where GID is the one that was assigned on admission.

That's the default behaviour, but neither of the customer cases uses exactly the default configuration. In one case, the customer installed the operator into the openshift-operators namespace, which is annotated with openshift.io/scc: anyuid. In the other case, the customer runs something called "PlanetScale Operator for Vitess", whose description says "This operator should be deployed in an isolated namespace since the Pods it creates use the `default` service account and require the `use` permission on the `anyuid` Security Context Constraint (SCC) to run correctly." In both cases the result is that the default SA uses the anyuid SCC instead of the restricted SCC; the pod then doesn't receive the IDs from the namespace or the fsGroup option, and this causes the permission issue.

As an immediate workaround, the customer who deploys into the openshift-operators namespace could deploy into the openshift-compliance namespace instead. The other customer, who also deployed the PlanetScale Operator to openshift-compliance, can instead deploy that operator elsewhere (as the operator itself suggests).

That said, we should not assume so much in the Compliance Operator and should be more defensive. I tried just forcing the restricted SCC by adding an annotation to the RS deployment, and even to the pod template inside it, but that didn't work. What does seem to work is creating a separate SA for the resultserver. I'll send a PR with these changes; on first glance they seemed to work, but RBAC changes are tricky.

Since there seem to be workarounds for both of the cases, I'm not sure how urgent it is to deliver the fix to OCP, IOW how feasible the workarounds are for the customers.
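To make the mechanism concrete, here is roughly what admission produces in the two cases. This is a sketch; the example IDs come from the namespace annotations in the dump above (openshift.io/sa.scc.uid-range=1000300000/10000, openshift.io/sa.scc.mcs=s0:c17,c14) and will vary per namespace.

# Admitted under the restricted SCC: IDs are injected from the namespace
# range, and fsGroup makes the kubelet set group ownership on the PVC
# mount, so the pod can write under /reports.
spec:
  securityContext:
    fsGroup: 1000300000          # example, from openshift.io/sa.scc.uid-range
    seLinuxOptions:
      level: s0:c17,c14
  containers:
  - name: result-server
    securityContext:
      runAsUser: 1000300000      # example, from openshift.io/sa.scc.uid-range

# Admitted under the anyuid SCC: no fsGroup and no runAsUser are injected,
# so the PVC mount stays owned by root and "mkdir /reports/0" fails.
spec:
  securityContext:
    seLinuxOptions:
      level: s0:c17,c14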
To reproduce:

1. Create a Role that allows the use of the anyuid SCC in the openshift-compliance namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-anyuid
  namespace: openshift-compliance
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - anyuid
  resources:
  - securitycontextconstraints
  verbs:
  - use

2. Bind that Role to the default SA:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-to-anyuid
  namespace: openshift-compliance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-anyuid
subjects:
- kind: ServiceAccount
  name: default

3. Start a ComplianceSuite.
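Instead of creating the Role and RoleBinding by hand, steps 1 and 2 can presumably be collapsed into the standard oc helper for SCC grants (I'd expect this to produce an equivalent RBAC grant):

$ oc adm policy add-scc-to-user anyuid -z default -n openshift-compliance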
Additional things to check when testing or reproducing the bug (the queries right below show how to check both):
- when the bug occurs, the resultserver pod is annotated with openshift.io/scc: anyuid
- when the bug occurs, the resultserver pod will NOT have the .spec.securityContext.fsGroup option set in its spec
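These are the same queries used in the reproduction below; the pod name is a placeholder:

# Which SCC admitted the pod? Expect "anyuid" when the bug occurs.
$ oc get pod <resultserver-pod> -o=jsonpath={.metadata.annotations} | jq -r '."openshift.io/scc"'

# Is fsGroup set? When the bug occurs, only seLinuxOptions shows up and fsGroup is absent.
$ oc get pod <resultserver-pod> -o=jsonpath={.spec.securityContext} | jq -r .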
I wasn't able to reproduce this issue with a separate namespace, but with the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c11 it reproduced.

$ oc get pod
NAME                                                            READY   STATUS             RESTARTS   AGE
compliance-operator-56894574c6-482gd                            1/1     Running            0          19m
my-companys-compliance-requirements-rerunner-1613742000-2hmpj   0/1     Completed          0          12m
my-companys-compliance-requirements-rerunner-1613742300-cmfzg   0/1     Completed          0          7m48s
my-companys-compliance-requirements-rerunner-1613742600-tkb8t   0/1     Completed          0          2m46s
ocp4-co2-pp-9c9cf6c9-gbx66                                      1/1     Running            0          19m
ocp4-moderate-api-checks-pod                                    1/2     NotReady           3          17m
ocp4-moderate-rs-5b74cbb679-jxrj5                               0/1     CrashLoopBackOff   8          17m
rhcos4-co2-pp-697dc89f57-f242x                                  1/1     Running            0          19m

$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.metadata.annotations} | jq -r
{
  "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"10.131.0.211/23\"],\"mac_address\":\"0a:58:0a:83:00:d3\",\"gateway_ips\":[\"10.131.0.1\"],\"ip_address\":\"10.131.0.211/23\",\"gateway_ip\":\"10.131.0.1\"}}",
  "k8s.v1.cni.cncf.io/network-status": "[{\n    \"name\": \"\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.211\"\n    ],\n    \"mac\": \"0a:58:0a:83:00:d3\",\n    \"default\": true,\n    \"dns\": {}\n}]",
  "k8s.v1.cni.cncf.io/networks-status": "[{\n    \"name\": \"\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.211\"\n    ],\n    \"mac\": \"0a:58:0a:83:00:d3\",\n    \"default\": true,\n    \"dns\": {}\n}]",
  "openshift.io/scc": "anyuid"
}

$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.spec.securityContext} | jq -r
{
  "seLinuxOptions": {
    "level": "s0:c26,c15"
  }
}
https://github.com/openshift/compliance-operator/pull/573
Able to reproduce this issue on an AWS cluster if the Compliance Operator is installed in the default namespace. However, the Compliance Operator successfully upgraded to the latest version, compliance-operator.v0.1.26. The issue is not observed if the Compliance Operator is installed in the openshift-compliance namespace.

Steps to Reproduce:
1. Install OCP 4.6 with compliance-operator.v0.1.24 in the default namespace
2. Upgrade OCP to 4.7 and perform a scan without upgrading the compliance-operator

Version and upgrade path:
OCP 4.6: 4.6.0-0.nightly-2021-02-18-050133, upgraded to OCP 4.7: 4.7.0-0.nightly-2021-02-18-110409

Summarising steps:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-18-110409   True        False         114m    Cluster version is 4.7.0-0.nightly-2021-02-18-110409

$ oc get csv -ndefault
NAME                                           DISPLAY                            VERSION                 REPLACES   PHASE
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                             Succeeded
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0              Succeeded

$ oc get pods -ndefault | grep cis
ocp4-cis-api-checks-pod                    1/2   NotReady           2    41m
ocp4-cis-node-master-rs-7695f48597-n8kqc   0/1   CrashLoopBackOff   12   41m
ocp4-cis-node-worker-rs-7c677ddd5-9lljw    0/1   CrashLoopBackOff   12   41m
ocp4-cis-rs-8bbdbfcc7-drtfl                0/1   CrashLoopBackOff   12   41m

$ oc logs ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault
{"level":"error","ts":1613748573.055087,"logger":"cmd","msg":"Couldn't ensure directory","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.ensureDir\n\t/remote-source/app/cmd/manager/resultserver.go:111\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:169\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"dpanic","ts":1613748573.0561721,"logger":"cmd","msg":"odd number of arguments passed as key-value pairs for logging","ignored key":"/reports/0","stacktrace":"github.com/go-logr/zapr.handleFields\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:100\ngithub.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:133\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"error","ts":1613748573.0560951,"logger":"cmd","msg":"Error ensuring result path: %s","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}

$ oc get pod ocp4-cis-api-checks-pod -oyaml -ndefault > ocp4-api-checks-pod.yaml
$ oc get pod ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault -oyaml > ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml

$ oc adm policy scc-subject-review -f ocp4-api-checks-pod.yaml
RESOURCE                      ALLOWED BY
Pod/ocp4-cis-api-checks-pod   anyuid

$ oc adm policy scc-subject-review -f ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml
RESOURCE                                       ALLOWED BY
Pod/ocp4-cis-node-master-rs-7695f48597-n8kqc   anyuid

$ oc describe project default
Name:            default
Created:         10 hours ago
Labels:          olm.operatorgroup.uid/32454bdd-c2c9-4fbc-83a1-d07e4ab7f078=
Annotations:     openshift.io/sa.scc.mcs=s0:c6,c5
                 openshift.io/sa.scc.supplemental-groups=1000040000/10000
                 openshift.io/sa.scc.uid-range=1000040000/10000
Display Name:    <none>
Description:     <none>
Status:          Active
Node Selector:   <none>
Quota:           <none>
Resource limits: <none>

$ oc get scc anyuid
NAME     PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP    SUPGROUP   PRIORITY   READONLYROOTFS   VOLUMES
anyuid   false   <no value>   MustRunAs   RunAsAny    RunAsAny   RunAsAny   10         false            ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]

$ oc describe scc anyuid
Name:                           anyuid
Priority:                       10
Access:
  Users:                        <none>
  Groups:                       system:cluster-admins
Settings:
  Allow Privileged:             false
  Allow Privilege Escalation:   true
  Default Add Capabilities:     <none>
  Required Drop Capabilities:   MKNOD
  Allowed Capabilities:         <none>
  Allowed Seccomp Profiles:     <none>
  Allowed Volume Types:         configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:          <all>
  Allowed Unsafe Sysctls:       <none>
  Forbidden Sysctls:            <none>
  Allow Host Network:           false
  Allow Host Ports:             false
  Allow Host PID:               false
  Allow Host IPC:               false
  Read Only Root Filesystem:    false
  Run As User Strategy: RunAsAny
    UID:                        <none>
    UID Range Min:              <none>
    UID Range Max:              <none>
  SELinux Context Strategy: MustRunAs
    User:                       <none>
    Role:                       <none>
    Type:                       <none>
    Level:                      <none>
  FSGroup Strategy: RunAsAny
    Ranges:                     <none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:                     <none>

$ oc delete scansettingbinding --all -ndefault
scansettingbinding.compliance.openshift.io "cis-test" deleted

$ oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   9h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   9h

$ oc get sub -ndefault
NAME                            PACKAGE               SOURCE                CHANNEL
openshift-compliance-operator   compliance-operator   compliance-operator   4.6

$ oc patch subscriptions openshift-compliance-operator -p '{"spec":{"source":"qe-app-registry"}}' --type='merge' -ndefault
subscription.operators.coreos.com/openshift-compliance-operator patched

$ oc get csv -ndefault -w
NAME                                           DISPLAY                            VERSION                 REPLACES                      PHASE
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Replacing
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Installing
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0                                 Succeeded
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Succeeded
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Deleting
compliance-operator.v0.1.24                    Compliance Operator                0.1.24                                                Deleting

$ oc get csv -ndefault
NAME                                           DISPLAY                            VERSION                 REPLACES                      PHASE
compliance-operator.v0.1.26                    Compliance Operator                0.1.26                  compliance-operator.v0.1.24   Succeeded
elasticsearch-operator.4.6.0-202102130420.p0   OpenShift Elasticsearch Operator   4.6.0-202102130420.p0                                 Succeeded

$ oc get pods -ndefault
NAME                                   READY   STATUS     RESTARTS   AGE
compliance-operator-84d98f59fc-mp5v7   1/1     Running    0          96s
ocp4-default-pp-7f6cdd564d-m2gcg       1/1     Running    0          50s
rhcos4-default-pp-65c5c4c44b-96shk     1/1     Running    0          141m
rhcos4-default-pp-6b9984f5cd-jxwmd     0/1     Init:1/2   0          50s

$ oc get pods -ndefault
NAME                                   READY   STATUS    RESTARTS   AGE
compliance-operator-84d98f59fc-mp5v7   1/1     Running   0          2m22s
ocp4-default-pp-7f6cdd564d-m2gcg       1/1     Running   0          96s
rhcos4-default-pp-6b9984f5cd-jxwmd     1/1     Running   0          96s
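One additional check once a build with the fix is installed: the resultserver pod should no longer run under the 'default' SA (the PR linked above switches it to a dedicated SA; the exact SA name is whatever the PR introduces). The pod name and namespace below are placeholders:

$ oc get pod <resultserver-pod> -n <namespace> -o jsonpath='{.spec.serviceAccountName}'
default        <- what buggy versions print; a fixed build should print the dedicated SA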
Move to Verified per comment https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c16
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Compliance Operator version 0.1.35 for OpenShift Container Platform 4.6-4.8), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2652