Bug 1967953
| Summary: | Search Down After ACM Deploy Due to RunAsRoot CreateContainerConfigError | ||
|---|---|---|---|
| Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | James Young <jayoung> |
| Component: | Search / Analytics | Assignee: | Xavier <xdharmai> |
| Status: | CLOSED ERRATA | QA Contact: | Xiang Yin <xiyin> |
| Severity: | high | Docs Contact: | Mikela Dockery <mdockery> |
| Priority: | unspecified | ||
| Version: | rhacm-2.2.z | CC: | ashafi, ecai, jpadilla, xdharmai |
| Target Milestone: | --- | Flags: | ashafi: qe_test_coverage-, ming: rhacm-2.2.z+ |
| Target Release: | rhacm-2.2.6 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-02-08 18:03:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I found these 2 issues:
1. The Dockerfile is missing the line `USER 1001` which sets the image to a non-root user.
2. The Helm chart is missing `securityContext: { runAsNonRoot: true }`.
Thanks Jorge,

Do you anticipate any ill side effects of editing those fields in a running cluster? Do you know why their Dockerfile and Helm charts could have come configured incorrectly from the ACM deployment?

Hi James,

This is an isolated issue. There should not be any ill side effects from editing the search-operator deployment. I am checking in our lab on what caused this issue. I will keep you posted.

James,

For #1 we need to rebuild the image with the correct setting. As a workaround, the customer could change the securityContext in the search-operator deployment to allow the image to run as root. The workaround may raise security concerns, especially if it's a production environment.

Thanks Jorge, so will we need a patch to truly resolve this issue? I will propose the workaround to them, but I get the feeling it will not be met with approval due to the security concerns.

Hi James,

While we try the workaround suggested by Jorge, add the following to the search-operator deployment yaml:
```
securityContext:
  runAsUser: 1001
```
I wanted to update you on what we found in our deployments. I notice that we are running nonroot in our deployments, hence I am not clear why we landed in your error state. Are you able to share some more information, if it is not sensitive to share?
1. Deployment yaml file for the search-operator
2. Pod yaml file for the search-operator
3. The PodSecurityPolicy list and its describe output (`oc get psp -o yaml`)
4. The SCC configurations (`oc get scc -o yaml`)

Thanks,
Xavier

Hi James,
In your search-operator deployment, find the section that looks like this:
```
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: search-operator
serviceAccountName: search-operator
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/infra
  operator: Exists
```
Edit the search-operator deployment (`oc edit deploy search-operator`). Find the line `securityContext: {}` and change it so that the section looks like this:
```
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
  runAsUser: 1001
  runAsNonRoot: true
serviceAccount: search-operator
serviceAccountName: search-operator
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/infra
  operator: Exists
```
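For anyone who would rather script the edit than make it by hand, here is a minimal sketch that builds the same change as a merge patch. The function name `build_security_context_patch` is made up for illustration; the path `spec.template.spec.securityContext` is the pod-level security context of a Deployment, which is where the `securityContext: {}` line above sits.

```python
import json


def build_security_context_patch(run_as_user=1001):
    """Build a merge patch that sets the pod-level securityContext.

    Hypothetical helper: it only constructs the patch document and
    does not talk to the cluster.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "securityContext": {
                        "runAsUser": run_as_user,
                        "runAsNonRoot": True,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the patch as JSON so it can be passed to a patch command.
    print(json.dumps(build_security_context_patch()))
```

The printed JSON could then be applied with something like `oc patch deploy search-operator --type merge -p '<json>'`, assuming the current project is the one ACM is installed in. This has not been tried on the affected cluster.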
Hi Xavier

That resolved the issue, much thanks. Do you know why this deployment of ACM may have been missing those attributes in the first place?

James,

Those fields are added by the SecurityContextConstraints (SCC) admission controller. More info here: https://docs.openshift.com/container-platform/4.6/authentication/managing-security-context-constraints.html#admission_configuring-internal-oauth

In a cluster with a default installation, the admission controller assigns the SCC named "restricted". We found that in the cluster with the problem, the "restricted" SCC has been modified from the default values. As a result of those changes, the SCC admission controller assigned the SCC named "nonroot", which created this issue.

The default "restricted" SCC has `RunAsUser: MustRunAsRange`, which assigns a specific user within the valid range to the container. The default "nonroot" SCC has `RunAsUser: MustRunAsNonRoot`. This SCC doesn't assign a user to the container; it only enforces that the user ID is not 0 (root). Since a user ID is not passed to the container, it starts with the image's default user, which is root.

Thanks Jorge, is this expected behavior from ACM when the cluster is using the default nonroot SCC? Is this something we would look to address as a bug fix, or would it be something that just needs documenting? Thanks

Hi James,

We are addressing this as a bug fix, and it will be available in the 2.2.5 fix pack.

Thank you,
Xavier

G2Bsync 877355805 comment jlpadilla Fri, 09 Jul 2021 17:50:49 UTC G2Bsync

2 code changes were merged. Each would solve the issue independently, but we decided to merge both as an extra precaution.

### Errata Doc
In some clusters, the SecurityContextConstraints policy prevented the search operator pod from starting. We updated the Docker image to run with a non-root user by default. As an additional safeguard, we updated the security context in the search operator deployment with `runAsNonRoot`, which starts the container with a non-root user.
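To make the difference between the two SCC strategies discussed in this thread concrete, here is a simplified sketch of the decision. It is a model for illustration, not the actual admission controller or kubelet code. The names `apply_scc` and `container_start_error`, the example range start `1000620000`, and the second error message are assumptions; only the "image will run as root" message comes from the pod status in this bug.

```python
from typing import Optional

ROOT_UID = 0


def apply_scc(strategy: str, range_min: int = 1000620000) -> dict:
    """Return the security context fields a simplified SCC strategy sets.

    range_min is a hypothetical start of the namespace UID range.
    """
    if strategy == "MustRunAsRange":
        # "restricted" default: a concrete UID from the range is assigned.
        return {"runAsUser": range_min}
    if strategy == "MustRunAsNonRoot":
        # "nonroot" default: no UID assigned, only the non-root requirement.
        return {"runAsNonRoot": True}
    raise ValueError(f"unsupported strategy: {strategy}")


def container_start_error(security_context: dict, image_uid: int) -> Optional[str]:
    """Return an error message if the container would be rejected, else None."""
    if not security_context.get("runAsNonRoot"):
        return None
    run_as_user = security_context.get("runAsUser")
    if run_as_user is not None:
        if run_as_user == ROOT_UID:
            # Illustrative message, not copied from any real component.
            return "runAsUser is 0 but runAsNonRoot is set"
        return None
    if image_uid == ROOT_UID:
        # No UID was passed in, so the image's own user decides.
        return "container has runAsNonRoot and image will run as root"
    return None


if __name__ == "__main__":
    # Image built without `USER 1001`, so it defaults to root (UID 0).
    print(container_start_error(apply_scc("MustRunAsNonRoot"), image_uid=0))
    print(container_start_error(apply_scc("MustRunAsRange"), image_uid=0))
    # The workaround from this thread: explicit UID plus runAsNonRoot.
    print(container_start_error({"runAsUser": 1001, "runAsNonRoot": True}, image_uid=0))
```

In this model, either fix on its own clears the error: an image with `USER 1001` passes the check under "nonroot", and so does an explicit `runAsUser: 1001` in the deployment. That matches the note above that each of the two merged changes would solve the issue independently.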
Can we close this ticket? Github ticket for this issue has been closed.

Closing it after confirming with @jayoung

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
Description of the problem:
------------------
After deploying ACM, following deployments failed to deploy successfully: search-operator, search-prod-b6267-search-aggregator and search-prod-b6267-search-api. multiclusterhub remains in "Phase: installing".

    [root@openshift44 ~]# oc get po |grep -vi running
    NAME                                                   READY   STATUS                       RESTARTS   AGE
    search-operator-6996889b54-z8lkp                       0/1     CreateContainerConfigError   0          16h
    search-prod-b6267-search-aggregator-774f5579c6-gtq9c   0/1     CreateContainerConfigError   0          16h
    search-prod-b6267-search-aggregator-7d7bd88649-d7x7p   0/1     CreateContainerConfigError   0          16h
    search-prod-b6267-search-api-5748684755-hmd5h          0/1     CreateContainerConfigError   0          16h
    search-prod-b6267-search-api-578777bd7b-456jd          0/1     CreateContainerConfigError   0          16h
    search-prod-b6267-search-api-578777bd7b-dn46b          0/1     CreateContainerConfigError   0          16h

Looking at the status of these pods we see that the search-operator is failing due to a RunAsRoot error

    containerStatuses:
    - image: registry.redhat.io/rhacm2/search-rhel8@sha256:c8f2145a65b6495a58b4d402aa4431a3e2bb90356e52996a3c9c0f491b1cbeca
      imageID: ""
      lastState: {}
      name: search-operator
      ready: false
      restartCount: 0
      started: false
      state:
        waiting:
          message: container has runAsNonRoot and image will run as root
          reason: CreateContainerConfigError

and the rest are failing due to a missing redisgraph secret.

    state:
      waiting:
        message: secret "redisgraph-user-secret" not found
        reason: CreateContainerConfigError

Release version: ACM 2.2.3
Operator snapshot version:
OCP version: 4.5.24
Browser Info:

Steps to reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info: Occurred on initial deployment of ACM, all other pods and components appear to be up and running. The endpoint is accessible but most pages fail with an error regarding the backend service being unavailable.
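For triage on a similar cluster, the pod listing in the description can be filtered programmatically rather than with `grep -vi running`. This is a rough sketch that parses the plain-text table printed by `oc get po`. The helper name `failing_pods` is hypothetical, and the sample rows are copied from the listing in this bug.

```python
SAMPLE_OUTPUT = """\
NAME READY STATUS RESTARTS AGE
search-operator-6996889b54-z8lkp 0/1 CreateContainerConfigError 0 16h
search-prod-b6267-search-api-5748684755-hmd5h 0/1 CreateContainerConfigError 0 16h
"""


def failing_pods(oc_get_po_output: str) -> list:
    """Return (name, status) pairs for pods whose STATUS is not Running."""
    pods = []
    for line in oc_get_po_output.splitlines():
        fields = line.split()
        # Skip blank or truncated lines and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status != "Running":
            pods.append((name, status))
    return pods


if __name__ == "__main__":
    for name, status in failing_pods(SAMPLE_OUTPUT):
        print(f"{name}: {status}")
```

Like the original `grep`, this also reports pods in states such as `Completed`, so the result may need a second look before treating every entry as a failure.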