Bug 1919311
| Summary: | Compliance operator pod fails with: Couldn't ensure directory","error":"mkdir /reports/0: permission denied" |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Compliance Operator |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | high |
| Version: | 4.6.z |
| Target Release: | 4.8.0 |
| Hardware: | All |
| OS: | Unspecified |
| Reporter: | David Hernández Fernández <dahernan> |
| Assignee: | Jakub Hrozek <jhrozek> |
| QA Contact: | Prashant Dhamdhere <pdhamdhe> |
| CC: | jhrozek, josorior, mrogers, nkinder, nstielau, xiyuan |
| Doc Type: | No Doc Update |
| Type: | Bug |
| Last Closed: | 2021-07-07 11:29:56 UTC |
| Bug Blocks: | 1940776, 1940781 |
Description
David Hernández Fernández
2021-01-22 15:08:20 UTC
Matt had helped me get on the right track by noticing that we use the default SA for the resultserver, and then it clicked for me. The resultserver uses the 'default' SA, which normally gets to use only the 'restricted' SCC. The restricted SCC causes the pod to be assigned a UID and GID from the namespace's range on admission, and at the same time the pod gets a .spec.securityContext.fsGroup. The fsGroup option in turn causes the PVC mount to be owned by root:GID, where GID is the one that was assigned on admission.

That is the default behavior, but neither of the customer cases uses exactly the default configuration. In one case, the customer installed the operator into the openshift-operators namespace, which is annotated with openshift.io/scc:anyuid. In the other case, the customer runs the "PlanetScale Operator for Vitess", whose description says: "This operator should be deployed in an isolated namespace since the Pods it creates use the `default` service account and require the `use` permission on the `anyuid` Security Context Constraint (SCC) to run correctly." In both cases the result is that the default SA uses the anyuid SCC instead of the restricted SCC, so the pod receives neither the IDs from the namespace nor the fsGroup option, and this causes the permission issue.

As an immediate workaround, the customer who deploys into the openshift-operators namespace could deploy into the openshift-compliance namespace instead. The other customer, who also deployed the PlanetScale Operator to openshift-compliance, can deploy that operator elsewhere (as the operator itself suggests).

That said, we should not assume so much in the Compliance Operator, and we should be more defensive. I tried forcing the restricted SCC by adding an annotation to the RS deployment and even to the pod template inside it, but that didn't work. What does seem to work is creating a separate SA for the resultserver.
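The separate-SA approach might look roughly like the sketch below. This is a hedged illustration, not the contents of the actual PR: the ServiceAccount name `resultserver` is an assumption, and the point is only that the resultserver pods stop inheriting whatever extra SCC grants the `default` SA has accumulated in the namespace.

```yaml
# Illustrative sketch only (the name 'resultserver' is an assumption, not
# necessarily what the PR uses): give the resultserver pods a dedicated
# ServiceAccount so that SCC admission no longer depends on role bindings
# that someone may have attached to the 'default' SA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: resultserver
  namespace: openshift-compliance
```

The resultserver Deployment's pod template would then set `serviceAccountName: resultserver`. Because the fresh SA carries no extra role bindings, admission falls back to the `restricted` SCC, and the pod again receives its namespace-range UID/GID and the `fsGroup` that makes the PVC mount writable.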
I'll send a PR with these changes. On first glance they seemed to work, but RBAC changes are tricky. Since there seem to be workarounds for both of the cases, I'm not sure how urgent it is to deliver the fix to OCP, IOW how feasible the workarounds are for the customers.

To reproduce:

1. Create a Role that allows use of the anyuid SCC in the openshift-compliance namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-anyuid
  namespace: openshift-compliance
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - anyuid
  resources:
  - securitycontextconstraints
  verbs:
  - use
```

2. Bind that Role to the default SA:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-to-anyuid
  namespace: openshift-compliance
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-anyuid
subjects:
- kind: ServiceAccount
  name: default
```

3. Start a ComplianceSuite.

Additional things to check when testing or reproducing the bug:
- when the bug occurs, the resultserver pod is annotated with openshift.io/scc:anyuid
- when the bug occurs, the resultserver pod will NOT have the .spec.securityContext.fsGroup option set in its spec

I wasn't able to reproduce this issue with a separate namespace, but with the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c11 it reproduced.
$ oc get pod
NAME                                                            READY   STATUS             RESTARTS   AGE
compliance-operator-56894574c6-482gd                            1/1     Running            0          19m
my-companys-compliance-requirements-rerunner-1613742000-2hmpj   0/1     Completed          0          12m
my-companys-compliance-requirements-rerunner-1613742300-cmfzg   0/1     Completed          0          7m48s
my-companys-compliance-requirements-rerunner-1613742600-tkb8t   0/1     Completed          0          2m46s
ocp4-co2-pp-9c9cf6c9-gbx66                                      1/1     Running            0          19m
ocp4-moderate-api-checks-pod                                    1/2     NotReady           3          17m
ocp4-moderate-rs-5b74cbb679-jxrj5                               0/1     CrashLoopBackOff   8          17m
rhcos4-co2-pp-697dc89f57-f242x                                  1/1     Running            0          19m
$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.metadata.annotations} | jq -r
{
  "k8s.ovn.org/pod-networks": "{\"default\":{\"ip_addresses\":[\"10.131.0.211/23\"],\"mac_address\":\"0a:58:0a:83:00:d3\",\"gateway_ips\":[\"10.131.0.1\"],\"ip_address\":\"10.131.0.211/23\",\"gateway_ip\":\"10.131.0.1\"}}",
  "k8s.v1.cni.cncf.io/network-status": "[{\n \"name\": \"\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.131.0.211\"\n ],\n \"mac\": \"0a:58:0a:83:00:d3\",\n \"default\": true,\n \"dns\": {}\n}]",
  "k8s.v1.cni.cncf.io/networks-status": "[{\n \"name\": \"\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.131.0.211\"\n ],\n \"mac\": \"0a:58:0a:83:00:d3\",\n \"default\": true,\n \"dns\": {}\n}]",
  "openshift.io/scc": "anyuid"
}
$ oc get pod/ocp4-moderate-rs-5b74cbb679-jxrj5 -o=jsonpath={.spec.securityContext} | jq -r
{
  "seLinuxOptions": {
    "level": "s0:c26,c15"
  }
}
I was able to reproduce this issue on an AWS cluster when the Compliance operator is installed in the default namespace.
However, the Compliance operator successfully upgraded to the latest version, compliance-operator.v0.1.26.
The issue is not observed when the Compliance operator is installed in the openshift-compliance namespace.
Steps to Reproduce :
1. Install OCP 4.6 with compliance-operator.v0.1.24 in default namespace
2. Upgrade OCP to 4.7 version and perform scan without upgrading compliance-operator
Version and upgrade path:
OCP 4.6 : 4.6.0-0.nightly-2021-02-18-050133 upgrade to OCP 4.7 : 4.7.0-0.nightly-2021-02-18-110409
Summarising the steps:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-02-18-110409 True False 114m Cluster version is 4.7.0-0.nightly-2021-02-18-110409
$ oc get csv -ndefault
NAME DISPLAY VERSION REPLACES PHASE
compliance-operator.v0.1.24 Compliance Operator 0.1.24 Succeeded
elasticsearch-operator.4.6.0-202102130420.p0 OpenShift Elasticsearch Operator 4.6.0-202102130420.p0 Succeeded
$ oc get pods -ndefault|grep cis
ocp4-cis-api-checks-pod 1/2 NotReady 2 41m
ocp4-cis-node-master-rs-7695f48597-n8kqc 0/1 CrashLoopBackOff 12 41m
ocp4-cis-node-worker-rs-7c677ddd5-9lljw 0/1 CrashLoopBackOff 12 41m
ocp4-cis-rs-8bbdbfcc7-drtfl 0/1 CrashLoopBackOff 12 41m
$ oc logs ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault
{"level":"error","ts":1613748573.055087,"logger":"cmd","msg":"Couldn't ensure directory","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.ensureDir\n\t/remote-source/app/cmd/manager/resultserver.go:111\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:169\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"dpanic","ts":1613748573.0561721,"logger":"cmd","msg":"odd number of arguments passed as key-value pairs for logging","ignored key":"/reports/0","stacktrace":"github.com/go-logr/zapr.handleFields\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:100\ngithub.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:133\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
{"level":"error","ts":1613748573.0560951,"logger":"cmd","msg":"Error ensuring result path: %s","error":"mkdir /reports/0: permission denied","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nmain.server\n\t/remote-source/app/cmd/manager/resultserver.go:171\nmain.glob..func2\n\t/remote-source/app/cmd/manager/resultserver.go:49\ngithub.com/spf13/cobra.(*Command).execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:854\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:958\ngithub.com/spf13/cobra.(*Command).Execute\n\t/remote-source/deps/gomod/pkg/mod/github.com/spf13/cobra.1/command.go:895\nmain.main\n\t/remote-source/app/cmd/manager/main.go:34\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}
$ oc get pod ocp4-cis-api-checks-pod -oyaml -ndefault> ocp4-api-checks-pod.yaml
$ oc get pod ocp4-cis-node-master-rs-7695f48597-n8kqc -ndefault -oyaml > ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml
$ oc adm policy scc-subject-review -f ocp4-api-checks-pod.yaml
RESOURCE ALLOWED BY
Pod/ocp4-cis-api-checks-pod anyuid
$ oc adm policy scc-subject-review -f ocp4-cis-node-master-rs-7695f48597-n8kqc.yaml
RESOURCE ALLOWED BY
Pod/ocp4-cis-node-master-rs-7695f48597-n8kqc anyuid
$ oc describe project default
Name: default
Created: 10 hours ago
Labels: olm.operatorgroup.uid/32454bdd-c2c9-4fbc-83a1-d07e4ab7f078=
Annotations: openshift.io/sa.scc.mcs=s0:c6,c5
openshift.io/sa.scc.supplemental-groups=1000040000/10000
openshift.io/sa.scc.uid-range=1000040000/10000
Display Name: <none>
Description: <none>
Status: Active
Node Selector: <none>
Quota: <none>
Resource limits: <none>
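For contrast, under the `restricted` SCC, admission would derive the pod-level security context from the namespace annotations shown above. The fragment below is an illustrative reconstruction (values read off the annotations, not captured from a live pod), showing what the resultserver pod is missing when it is admitted by `anyuid` instead:

```yaml
# Illustrative reconstruction of what 'restricted' admission would inject
# into the pod spec, based on the 'default' namespace annotations above:
#   openshift.io/sa.scc.supplemental-groups=1000040000/10000 -> fsGroup
#   openshift.io/sa.scc.mcs=s0:c6,c5                         -> SELinux level
securityContext:
  fsGroup: 1000040000
  seLinuxOptions:
    level: s0:c6,c5
```

It is the injected `fsGroup` that makes the kubelet chown the PVC mount to the pod's group. `anyuid` uses a `RunAsAny` FSGroup strategy (see the `oc describe scc anyuid` output below), injects nothing, and leaves the mount root-owned, hence `mkdir /reports/0: permission denied`.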
$ oc get scc anyuid
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP PRIORITY READONLYROOTFS VOLUMES
anyuid false <no value> MustRunAs RunAsAny RunAsAny RunAsAny 10 false ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]
$ oc describe scc anyuid
Name: anyuid
Priority: 10
Access:
Users: <none>
Groups: system:cluster-admins
Settings:
Allow Privileged: false
Allow Privilege Escalation: true
Default Add Capabilities: <none>
Required Drop Capabilities: MKNOD
Allowed Capabilities: <none>
Allowed Seccomp Profiles: <none>
Allowed Volume Types: configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
Allowed Flexvolumes: <all>
Allowed Unsafe Sysctls: <none>
Forbidden Sysctls: <none>
Allow Host Network: false
Allow Host Ports: false
Allow Host PID: false
Allow Host IPC: false
Read Only Root Filesystem: false
Run As User Strategy: RunAsAny
UID: <none>
UID Range Min: <none>
UID Range Max: <none>
SELinux Context Strategy: MustRunAs
User: <none>
Role: <none>
Type: <none>
Level: <none>
FSGroup Strategy: RunAsAny
Ranges: <none>
Supplemental Groups Strategy: RunAsAny
Ranges: <none>
$ oc delete scansettingbinding --all -ndefault
scansettingbinding.compliance.openshift.io "cis-test" deleted
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer true 9h
gp2-csi ebs.csi.aws.com Delete WaitForFirstConsumer true 9h
$ oc get sub -ndefault
NAME PACKAGE SOURCE CHANNEL
openshift-compliance-operator compliance-operator compliance-operator 4.6
$ oc patch subscriptions openshift-compliance-operator -p '{"spec":{"source":"qe-app-registry"}}' --type='merge' -ndefault
subscription.operators.coreos.com/openshift-compliance-operator patched
$ oc get csv -ndefault -w
NAME DISPLAY VERSION REPLACES PHASE
compliance-operator.v0.1.24 Compliance Operator 0.1.24 Replacing
compliance-operator.v0.1.26 Compliance Operator 0.1.26 compliance-operator.v0.1.24 Installing
elasticsearch-operator.4.6.0-202102130420.p0 OpenShift Elasticsearch Operator 4.6.0-202102130420.p0 Succeeded
compliance-operator.v0.1.26 Compliance Operator 0.1.26 compliance-operator.v0.1.24 Succeeded
compliance-operator.v0.1.24 Compliance Operator 0.1.24 Deleting
compliance-operator.v0.1.24 Compliance Operator 0.1.24 Deleting
$ oc get csv -ndefault
NAME DISPLAY VERSION REPLACES PHASE
compliance-operator.v0.1.26 Compliance Operator 0.1.26 compliance-operator.v0.1.24 Succeeded
elasticsearch-operator.4.6.0-202102130420.p0 OpenShift Elasticsearch Operator 4.6.0-202102130420.p0 Succeeded
$ oc get pods -ndefault
NAME READY STATUS RESTARTS AGE
compliance-operator-84d98f59fc-mp5v7 1/1 Running 0 96s
ocp4-default-pp-7f6cdd564d-m2gcg 1/1 Running 0 50s
rhcos4-default-pp-65c5c4c44b-96shk 1/1 Running 0 141m
rhcos4-default-pp-6b9984f5cd-jxwmd 0/1 Init:1/2 0 50s
$ oc get pods -ndefault
NAME READY STATUS RESTARTS AGE
compliance-operator-84d98f59fc-mp5v7 1/1 Running 0 2m22s
ocp4-default-pp-7f6cdd564d-m2gcg 1/1 Running 0 96s
rhcos4-default-pp-6b9984f5cd-jxwmd 1/1 Running 0 96s
Move to Verified per comment https://bugzilla.redhat.com/show_bug.cgi?id=1919311#c16

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Compliance Operator version 0.1.35 for OpenShift Container Platform 4.6-4.8), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2652