Bug 2007246

Summary: OpenShift Container Platform - Ingress Controller does not set allowPrivilegeEscalation in the router deployment
Product: OpenShift Container Platform
Reporter: Simon Reber <sreber>
Component: Networking
Assignee: Chad Scribner <cscribne>
Networking sub component: router
QA Contact: Shudi Li <shudili>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: amcdermo, aos-bugs, cscribne, ddharwar, hongli, mmasters, wking
Version: 4.9
Target Milestone: ---
Target Release: 4.11.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The default IngressController Deployment creates a container named "router" without requesting sufficient permissions in the container's `securityContext`. Consequence: Normally this does not cause an issue, but on clusters that have a Security Context Constraint (SCC) similar enough to the hostnetwork SCC, router pods could fail to start. Fix: Set `allowPrivilegeEscalation: true` in the `router` container's `securityContext` so that it matches the default hostnetwork SCC. Result: The router pods are admitted to the correct SCC and are created without error.
Story Points: ---
Clone Of:
: 2079034 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:37:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2079034    

Description Simon Reber 2021-09-23 11:56:58 UTC
Description of problem:

According to the "Red Hat OpenShift 4 Hardening Guide v1.1" (attached - see 5.2.5 "Minimize admission of containers with Allow Privilege Escalation set to true and SELinux context set to RunAsAny (Manual)"), `allowPrivilegeEscalation` should be set to `false` in a customer-specific Security Context Constraint (SCC) so that as many application pods as possible run with `allowPrivilegeEscalation` set to `false`. Applications that require `allowPrivilegeEscalation` set to `true` should either specify this in their `deployment`, so that the default `restricted` SCC is selected, or else provide/use a specific SCC that addresses their use case.

Thus, when creating a custom restricted SCC with `allowPrivilegeEscalation` set to `false`, the `router` is unable to run (the exact reason is not clear, but it appears to depend on the `no_new_privs` flag). Since this is the case, the `IngressController` should set `allowPrivilegeEscalation` to `true` in the `securityContext` of the `deployment` to make sure it always picks the `restricted` SCC, which has `allowPrivilegeEscalation` set to `true`.
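
For illustration, a minimal sketch of what such a change could look like in the router `deployment` - only the relevant fragment of the pod template is shown, everything else is elided:

spec:
  template:
    spec:
      containers:
      - name: router
        securityContext:
          # Explicitly request the permission so that an SCC which forbids
          # privilege escalation cannot be selected for the router pod.
          allowPrivilegeEscalation: true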

Alternatively, the `IngressController`, respectively the `IngressOperator`, could provide a specific SCC and link it to the `router` `ServiceAccount` so that the router runs with that SCC. This would make the router more independent and would prevent failures when more restrictive SCCs are created by the customer.

Take a look at https://docs.openshift.com/container-platform/4.8/authentication/managing-security-context-constraints.html#admission_configuring-internal-oauth to understand how the SCC is selected if nothing is defined.
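
As a quick check, the SCC a running pod was actually admitted with is recorded in its `openshift.io/scc` annotation, for example (the pod name is illustrative):

% oc -n openshift-ingress get pod <router-pod-name> -o yaml | grep 'openshift.io/scc'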


OpenShift release version:

 - OpenShift Container Platform 4.9.0-rc.1, 4.8.*, 4.7.*

Cluster Platform:

 - Any

How reproducible:

 - Always

Steps to Reproduce (in detail):
1. Create a SCC as attached (an illustrative sketch follows after these steps)
2. Restart the `router` pod and observe that it fails to start
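
Purely as an illustration - the actual attachment is not reproduced here - such an SCC could be modeled on the default hostnetwork SCC, but with `allowPrivilegeEscalation` set to `false` and a priority that makes it win during admission:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-no-escalation    # hypothetical name
priority: 10                        # higher priority, so it is preferred over the default SCCs
allowPrivilegeEscalation: false     # the setting this bug is about
allowPrivilegedContainer: false
allowHostNetwork: true
allowHostPorts: true
allowHostPID: false
allowHostIPC: false
readOnlyRootFilesystem: false
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: MustRunAs
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
volumes:
- configMap
- downwardAPI
- emptyDir
- projected
- secret
groups:
- system:authenticated              # hypothetical; anything that makes the SCC apply to the router ServiceAccount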


Actual results:

If an SCC with `allowPrivilegeEscalation` set to `false` is selected, the router fails to start and logs the following errors.

> [NOTICE] 265/114634 (19) : haproxy version is 2.2.15-5e8f49d
> [NOTICE] 265/114634 (19) : path to executable is /usr/sbin/haproxy
> [ALERT] 265/114634 (19) : Starting frontend public: cannot bind socket [0.0.0.0:80]
> [ALERT] 265/114634 (19) : Starting frontend public_ssl: cannot bind socket [0.0.0.0:443]
> E0923 11:46:58.907432       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:47:02.963872       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:47:28.874274       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:47:32.961877       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:47:58.878652       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:48:02.971619       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> E0923 11:48:28.881073       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
> I0923 11:48:28.983101       1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease"  
> E0923 11:48:32.962518       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused

Expected results:

The `router` should either be able to run with `allowPrivilegeEscalation` set to `false`, or else specify that requirement in the `Deployment`, or provide its own specific SCC, to prevent issues when customers create a more restrictive SCC than what is provided by default.

Impact of the problem:

The router fails to start and thus does not work. In the worst case, the environment could become completely unavailable because the routers are not working as expected and are selecting the most restrictive SCC even though they require certain capabilities.

Additional info:

Comment 3 Miciah Dashiel Butler Masters 2021-09-23 16:17:00 UTC
Out of curiosity, did you test on OpenShift 4.6 and determine that the issue does not affect it, or have you just not tested on OpenShift 4.6?

Comment 4 Simon Reber 2021-09-23 18:01:00 UTC
(In reply to Miciah Dashiel Butler Masters from comment #3)
> Out of curiosity, did you test on OpenShift 4.6 and determine that the issue
> does not affect it, or have you just not tested on OpenShift 4.6?
Sorry, I only checked on 4.9-rc as well as on 4.8 and 4.7 - but I suspect the behavior was always like that and therefore OpenShift Container Platform 4 in general is affected.

Comment 31 Shudi Li 2022-04-27 05:57:48 UTC
Verified it with 4.11.0-0.nightly-2022-04-26-181148: 
1. A securityContext with allowPrivilegeEscalation: true is added to deployment/router-default
a,
% oc -n openshift-ingress get deployment.apps/router-default -o yaml | grep -i -A1 securityContext
        securityContext:
          allowPrivilegeEscalation: true
--
      securityContext: {}
      serviceAccount: router
%
b,
% oc -n openshift-ingress get pod/router-default-54c658ddc-8r99h -o yaml | grep -i -A1 securityContext
    securityContext:
      allowPrivilegeEscalation: true
--
  securityContext:
    fsGroup: 1000590000
% 

2. Create the customer SCC and delete a router pod; a new router pod is created successfully (same as bug comment 23)
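A hedged sketch of those commands, assuming the customer SCC is saved as custom-scc.yaml (file name and pod name are illustrative):
% oc apply -f custom-scc.yaml
% oc -n openshift-ingress delete pod router-default-54c658ddc-8r99h
% oc -n openshift-ingress get pods        # the replacement router pod should reach Running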

3. Create an ingress-controller; a securityContext with allowPrivilegeEscalation: true is also added to its deployment
%oc -n openshift-ingress get deployment.apps/router-internalapps2   -o yaml | grep -i -A1 securityContext     
        securityContext:
          allowPrivilegeEscalation: true
--
      securityContext: {}
      serviceAccount: router
%

4. Try to modify deployment/router-default to set securityContext/allowPrivilegeEscalation to false; the ingress-controller reverts it to true. A router pod is terminated and a new one is created.
a,
%oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS   AGE
router-default-54c658ddc-b9sdx   1/1     Running   0          4h28m
router-default-54c658ddc-zdxpq   1/1     Running   0          137m
%

b, Edit deployment/router-default and try to configure spec/containers/securityContext/allowPrivilegeEscalation with false
% oc -n openshift-ingress edit deployment/router-default
Warning: would violate PodSecurity "restricted:latest": unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default edited
% 

c,
% oc -n openshift-ingress get all                       
NAME                                  READY   STATUS        RESTARTS   AGE
pod/router-default-54c658ddc-8t6hr    0/1     Pending       0          36s
pod/router-default-54c658ddc-b9sdx    1/1     Running       0          4h35m
pod/router-default-54c658ddc-zdxpq    1/1     Terminating   0          144m
pod/router-default-58dc79958d-rdwv2   1/1     Terminating   0          37s

NAME                              TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
service/router-default            LoadBalancer   172.30.243.173   34.122.142.53   80:32169/TCP,443:30398/TCP   4h35m
service/router-internal-default   ClusterIP      172.30.99.215    <none>          80/TCP,443/TCP,1936/TCP      4h35m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/router-default   1/2     2            1           4h35m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/router-default-54c658ddc    2         2         1       4h35m
replicaset.apps/router-default-58dc79958d   0         0         0       149m
% 

d, 
% oc -n openshift-ingress get pods
NAME                             READY   STATUS    RESTARTS   AGE
router-default-54c658ddc-8t6hr   1/1     Running   0          5m13s
router-default-54c658ddc-b9sdx   1/1     Running   0          4h40m
%

Comment 33 errata-xmlrpc 2022-08-10 10:37:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069