Bug 1967953

Summary: Search Down After ACM Deploy Due to RunAsRoot CreateContainerConfigError
Product: Red Hat Advanced Cluster Management for Kubernetes
Reporter: James Young <jayoung>
Component: Search / Analytics
Assignee: Xavier <xdharmai>
Status: CLOSED ERRATA
QA Contact: Xiang Yin <xiyin>
Severity: high
Docs Contact: Mikela Dockery <mdockery>
Priority: unspecified
Version: rhacm-2.2.z
CC: ashafi, ecai, jpadilla, xdharmai
Target Milestone: ---
Flags: ashafi: qe_test_coverage-, ming: rhacm-2.2.z+
Target Release: rhacm-2.2.6
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-02-08 18:03:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description James Young 2021-06-04 14:05:13 UTC
Description of the problem:
------------------
After deploying ACM, the following deployments failed to deploy successfully:

search-operator, search-prod-b6267-search-aggregator, and search-prod-b6267-search-api.
multiclusterhub remains in "Phase: installing".

[root@openshift44 ~]# oc get po |grep -vi running
NAME                                                              READY   STATUS                       RESTARTS   AGE
search-operator-6996889b54-z8lkp                                  0/1     CreateContainerConfigError   0          16h
search-prod-b6267-search-aggregator-774f5579c6-gtq9c              0/1     CreateContainerConfigError   0          16h
search-prod-b6267-search-aggregator-7d7bd88649-d7x7p              0/1     CreateContainerConfigError   0          16h
search-prod-b6267-search-api-5748684755-hmd5h                     0/1     CreateContainerConfigError   0          16h
search-prod-b6267-search-api-578777bd7b-456jd                     0/1     CreateContainerConfigError   0          16h
search-prod-b6267-search-api-578777bd7b-dn46b                     0/1     CreateContainerConfigError   0          16h
[root@openshift44 ~]#

Looking at the status of these pods, we see that the search-operator is failing due to a RunAsRoot error:
  containerStatuses:
  - image: registry.redhat.io/rhacm2/search-rhel8@sha256:c8f2145a65b6495a58b4d402aa4431a3e2bb90356e52996a3c9c0f491b1cbeca
    imageID: ""
    lastState: {}
    name: search-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: container has runAsNonRoot and image will run as root
        reason: CreateContainerConfigError

and the rest are failing due to a missing redisgraph secret.
    state:
      waiting:
        message: secret "redisgraph-user-secret" not found
        reason: CreateContainerConfigError
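
To see the waiting reason and message for every pod at once, instead of opening each pod's YAML, a JSONPath query along these lines may help. This is only a sketch: the function name `waiting_reasons` is hypothetical, and the namespace `open-cluster-management` is an assumption (substitute the namespace ACM was installed in).

```shell
# Print "<pod>  <waiting reason>  <waiting message>" for each pod.
# Pods with no waiting containers print only their name.
# Assumption: the namespace defaults to open-cluster-management; pass another as $1.
waiting_reasons() {
  local ns="${1:-open-cluster-management}"
  oc get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\t"}{.status.containerStatuses[*].state.waiting.message}{"\n"}{end}'
}
```

Usage: `waiting_reasons` or `waiting_reasons my-namespace`.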

Release version: 
ACM 2.2.3

Operator snapshot version: 

OCP version:
4.5.24

Browser Info: 

Steps to reproduce:
1. 
2. 
3.

Actual results:

Expected results:

Additional info:
Occurred on initial deployment of ACM, all other pods and components appear to be up and running. The endpoint is accessible but most pages fail with an error regarding the backend service being unavailable.

Comment 2 Jorge Padilla 2021-06-04 16:21:31 UTC
I found these 2 issues:

1. The Dockerfile is missing the line `USER 1001` which sets the image to a non-root user.
2. The Helm chart is missing `securityContext: { runAsNonRoot: true }`.
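
One way to confirm the first point is to check which user the image is configured to run as. The sketch below assumes `podman` is installed and the image has already been pulled locally (pulling from registry.redhat.io needs registry credentials); the function name `image_user` is hypothetical. An empty `User` field in the image config means the container defaults to root.

```shell
# Print the user configured in a locally available image.
# Assumption: podman is installed and the image is already pulled.
image_user() {
  local user
  user="$(podman image inspect --format '{{.Config.User}}' "$1")" || return 1
  if [ -z "$user" ]; then
    # No USER instruction in the Dockerfile: the container runs as root.
    echo "root (no USER set in image)"
  else
    echo "$user"
  fi
}
```

Usage: `image_user <image reference>`, for example with the image digest shown in the failing pod's status.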

Comment 3 James Young 2021-06-04 16:51:37 UTC
Thanks Jorge,

Do you anticipate any ill side-effects of editing those fields in a running cluster? Do you know why their dockerfile and helm charts could have come configured incorrectly from the ACM deployment?

Comment 4 Xavier 2021-06-04 17:16:40 UTC
Hi James,
This is an isolated issue; there should not be any ill side effects from editing the search-operator deployment. I am checking in our lab to find out what caused this issue. I will keep you posted.

Comment 5 Jorge Padilla 2021-06-04 18:48:47 UTC
James,

For #1 we need to rebuild the image with the correct setting.  As a workaround the customer could change the securityContext in the search-operator deployment to allow the image to run as root.  The workaround may raise security concerns, especially if it's a production environment.

Comment 6 James Young 2021-06-04 18:52:43 UTC
Thanks Jorge, so will we need a patch to truly resolve this issue? I will propose the workaround to them but I get the feeling it will not be met with approval due to the security concerns.

Comment 7 Xavier 2021-06-04 22:42:30 UTC
Hi James,
  While we try the workaround suggested by Jorge, please add the following to the search-operator deployment YAML:
```
securityContext:
  runAsUser: 1001
```
I also wanted to update you on what we found in our deployments. They run as non-root, so it is not clear to me why yours ended up in this error state.
Are you able to share some more information, if it is not sensitive?
1. Deployment YAML file for the search-operator
2. Pod YAML file for the search-operator
3. The PodSecurityPolicy list and its describe output (oc get psp -o yaml)
4. The SCC configurations (oc get scc -o yaml)

Thanks,
Xavier
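
The four items requested above could be collected in one pass with something like the following. This is a sketch under assumptions: the function name `collect_search_diag`, the output directory, and the default namespace `open-cluster-management` are all illustrative, and the pod name has to be supplied by the caller.

```shell
# Save the requested YAML dumps into a directory for sharing.
# $1 = search-operator pod name, $2 = namespace (assumed default below),
# $3 = output directory (hypothetical default ./search-diag).
collect_search_diag() {
  local pod="$1" ns="${2:-open-cluster-management}" out="${3:-./search-diag}"
  mkdir -p "$out" || return 1
  oc get deployment search-operator -n "$ns" -o yaml > "$out/search-operator-deploy.yaml"
  oc get pod "$pod" -n "$ns" -o yaml > "$out/search-operator-pod.yaml"
  # This may return an error if the cluster does not serve PodSecurityPolicy.
  oc get psp -o yaml > "$out/psp.yaml"
  oc get scc -o yaml > "$out/scc.yaml"
}
```

Usage: `collect_search_diag search-operator-6996889b54-z8lkp`. Review the files for sensitive data before attaching them.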

Comment 14 Xavier 2021-06-09 20:01:09 UTC
Hi James,
 In your search-operator deployment, find the section that looks like the one below:
```
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: search-operator
        serviceAccountName: search-operator
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
```

Edit the search-operator deployment (oc edit deploy search-operator). Find the line securityContext: {} and change it to match the following:

```
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          runAsUser: 1001
          runAsNonRoot: true
        serviceAccount: search-operator
        serviceAccountName: search-operator
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
```
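
For anyone who prefers a non-interactive change, the same edit can probably be applied with a merge patch instead of `oc edit`. This is a sketch, not a tested procedure: the function name `patch_search_operator` is hypothetical and the default namespace `open-cluster-management` is an assumption.

```shell
# Set the pod-level securityContext on the search-operator deployment,
# matching the manual edit described above.
# Assumption: the namespace defaults to open-cluster-management; pass another as $1.
patch_search_operator() {
  local ns="${1:-open-cluster-management}"
  oc patch deployment search-operator -n "$ns" --type merge \
    -p '{"spec":{"template":{"spec":{"securityContext":{"runAsUser":1001,"runAsNonRoot":true}}}}}'
}
```

Usage: `patch_search_operator`, then watch the new pod with `oc get po`.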

Comment 15 James Young 2021-06-16 13:13:27 UTC
Hi Xavier

That resolved the issue, much thanks. Do you know why this deployment of ACM may have been missing those attributes in the first place?

Comment 16 Jorge Padilla 2021-06-18 19:11:26 UTC
James,

Those fields are added by the SecurityContextConstraints (SCC) admission controller. More info here: https://docs.openshift.com/container-platform/4.6/authentication/managing-security-context-constraints.html#admission_configuring-internal-oauth

In a cluster with a default installation, the admission controller assigns the SCC named "restricted".
We found that in the cluster with the problem the "restricted" SCC has been modified from the default values.  As a result of those changes the SCC admission controller assigned the SCC named "nonroot" which created this issue.

The default "restricted" SCC has `RunAsUser: MustRunAsRange` which assigns a specific user within the valid range to the container.
The default "nonroot" SCC has `RunAsUser: MustRunAsNonRoot`. This SCC doesn't assign a user to the container; it only enforces that the user ID is not 0 (root). Since no user ID is passed to the container, it would start with the image's default user, root, which the runAsNonRoot check then rejects.
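
To check which SCC the admission controller assigned to a given pod, and which RunAsUser strategy that SCC uses, something like the following should work. It is a sketch: the function names `pod_scc` and `scc_run_as_user` are hypothetical, and the default namespace `open-cluster-management` is an assumption. The admitted SCC is recorded in the pod's `openshift.io/scc` annotation.

```shell
# Print the name of the SCC that admitted a pod.
# $1 = pod name, $2 = namespace (assumed default below).
pod_scc() {
  oc get pod "$1" -n "${2:-open-cluster-management}" \
    -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
}

# Print the RunAsUser strategy of an SCC, e.g. MustRunAsRange or MustRunAsNonRoot.
# $1 = SCC name.
scc_run_as_user() {
  oc get scc "$1" -o jsonpath='{.runAsUser.type}'
}
```

Usage: `scc_run_as_user "$(pod_scc search-operator-6996889b54-z8lkp)"`. Comparing `scc_run_as_user restricted` against a cluster with a default installation would show whether the "restricted" SCC has been modified.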

Comment 17 James Young 2021-06-25 18:31:29 UTC
Thanks Jorge, is this expected behavior from ACM when the cluster is using the default nonroot SCC? Is this something we would look to address as a bug fix or would it be something that just needs documenting? Thanks

Comment 18 Xavier 2021-07-02 03:24:09 UTC
Hi James,
  We are addressing this as a bug fix, and the fix will be available in the 2.2.5 fix pack.
Thank you,
Xavier

Comment 19 Mike Ng 2021-07-09 18:01:00 UTC
G2Bsync 877355805 comment 
 jlpadilla Fri, 09 Jul 2021 17:50:49 UTC 
 G2Bsync
2 code changes were merged. Each would solve the issue independently, but we decided to merge both as an extra precaution.

### Errata Doc
In some clusters the SecurityContextConstraints (SCC) policy prevented the search operator pod from starting. We updated the docker image to run with a non-root user by default. As an additional safeguard, we updated the security context in the search operator deployment with `runAsNonRoot`, which ensures the container starts as a non-root user.

Comment 20 Atif 2022-02-08 15:40:33 UTC
Can we close this ticket? Github ticket for this issue has been closed.

Comment 21 Atif 2022-02-08 18:03:27 UTC
Closing it after confirming with @jayoung

Comment 22 Red Hat Bugzilla 2023-09-15 01:09:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days