Bug 2180456

Summary: grant access to unprivileged containers to the kubelet podresources socket
Product: Red Hat Enterprise Linux 9
Component: container-selinux
Version: 9.2
Reporter: Francesco Romani <fromani>
Assignee: Daniel Walsh <dwalsh>
QA Contact: atomic-bugs <atomic-bugs>
CC: dornelas, dwalsh, grajaiya, jnovy, lsm5, mboddu, travier, tsweeney
Status: CLOSED MIGRATED
Severity: urgent
Priority: unspecified
Keywords: MigratedToJIRA
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-09-11 19:41:35 UTC
Type: Bug

Attachments: example pod yaml

Description Francesco Romani 2023-03-21 14:11:04 UTC
Created attachment 1952417 [details]
example pod yaml

Description of problem:
Containerized workloads running as `container_t` want to connect to the kubelet podresources API through a unix domain socket at `/var/lib/kubelet/pod-resources/kubelet.sock`.

Both the directory and the socket (`/var/lib/kubelet/pod-resources/kubelet.sock`) are currently labeled `container_var_lib_t`.

They should probably be treated the same way as device plugin sockets.

This is especially important to enable the NUMA-aware scheduler.

Version-Release number of selected component (if applicable):
selinux-policy-38.1.8-1.el9.noarch
selinux-policy-targeted-38.1.8-1.el9.noarch
container-selinux-2.199.0-1.el9.noarch


How reproducible:
100%

Steps to Reproduce:
1. Run an unprivileged container and mount the kubelet podresources socket into it (see the attached example yaml)
2. Try to access the podresources socket. An example test client is available at https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/releases/tag/v0.0.1 - `podrescli --podresources-socket unix:///host-podresources-socket/kubelet.sock`

Actual results:
avc denials:
```
type=AVC msg=audit(1679405027.674:79402): avc:  denied  { write } for  pid=2236325 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
type=AVC msg=audit(1679405031.376:79403): avc:  denied  { write } for  pid=2236496 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
type=AVC msg=audit(1679405032.046:79404): avc:  denied  { write } for  pid=2236590 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
```

Expected results:
access should be granted

Additional info:
At the moment we are using a custom policy, and we can keep using it as a stopgap fix. However, the policy is also broken with the packages listed above.

policy: https://github.com/k8stopologyawareschedwg/deployer/blob/main/pkg/assets/rte/selinuxpolicy-ocp411.cil
denials:
```
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```
On that system (a different nightly, though) the label was
```
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 15 22:02 /var/lib/kubelet/pod-resources/kubelet.sock
```

Comment 1 Martin Sivák 2023-03-23 17:26:30 UTC
I reproduced the issue and there is something really weird going on.
 
RTE is denied access to the kubelet.sock by SELinux as Francesco said already:


type=AVC msg=audit(1679590155.184:76): avc:  denied  { connectto } for  pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
    Was caused by:
        Missing type enforcement (TE) allow rule.

        You can use audit2allow to generate a loadable module to allow this access.


Notice the target context though: system_u:system_r:unconfined_service_t:s0

What baffles me is that the file has the proper context both from the node side and from the container side:

Node:

sh-5.1# ls -alZ /var/lib/kubelet/pod-resources/kubelet.sock
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 23 16:29 /var/lib/kubelet/pod-resources/kubelet.sock

Container:

The file is mounted into a container:

  volumeMounts:
  - mountPath: /host-podresources-socket/kubelet.sock
    name: host-podresources-socket

  volumes:
  - hostPath:
      path: /var/lib/kubelet/pod-resources/kubelet.sock
      type: Socket
    name: host-podresources-socket

And the end result from within the container looks like this:

sh-4.4# cd host-podresources-socket/
sh-4.4# ls -alZ
total 0
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0    26 Mar 23 16:44 .
dr-xr-xr-x. 1 root root system_u:object_r:container_file_t:s0    85 Mar 23 16:44 ..
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0  0 Mar 23 16:32 kubelet.sock


So where is the unconfined_service_t coming from? On a file?

Comment 2 Martin Sivák 2023-03-23 17:29:09 UTC
I am also raising the severity as this is blocking the GA of a long-awaited OCP feature. The workaround we have is insecure, as it simply allows access to unconfined_service_t.

Comment 4 Daniel Walsh 2023-03-25 11:14:18 UTC
If you want a container to talk to kubelet.sock, you should run it as spc_t, since this seems like a major escape vector for the container.

If you still need a confined container, then whoever creates

srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0  0 Mar 23 16:32 kubelet.sock

needs to make sure it is labeled container_file_t.

The second AVC is not related to container-selinux at all:

type=AVC msg=audit(1679590155.184:76): avc:  denied  { connectto } for  pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0

This does not seem to be related to the first AVC, since now we have a new SELinux type, rte.process, which does not come from the container-selinux package. I have no idea where it comes from.

The unconfined_service_t indicates that you set up a systemd service to run an application labeled bin_t. You might need to change the label of that executable to something like kubelet_exec_t, but I am just shooting in the dark, since I do not know this policy or tool.

Comment 5 Martin Sivák 2023-03-26 13:39:22 UTC
Daniel, I do not think you are reading the AVC denial correctly.

The source context is the container process. rte.process comes from a custom policy defined at https://github.com/k8stopologyawareschedwg/deployer/blob/main/pkg/assets/rte/selinuxpolicy/ocp_v4.12.cil which we have exactly for the reasons you mentioned (a confined container with an exception to access /var/lib/kubelet/pod-resources/kubelet.sock).

Notice that this is a different kubelet.sock (the pod-resources API, a reporting API used for NUMA capacity reporting) than the one you are thinking of (the main kubelet API).

scontext=system_u:system_r:rte.process:s0

The target context is a unix stream socket (the above-mentioned /var/lib/kubelet/pod-resources/kubelet.sock); notice the tclass. That label does not make sense at all for a unix domain socket. The socket file itself has the proper label container_var_lib_t, as mentioned above.

tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket

The denied operation ({ connectto }) is also a socket connection.

Comment 6 Timothée Ravier 2023-03-27 16:35:09 UTC
Which application is creating this socket? `/var/lib/kubelet/pod-resources/kubelet.sock`

Comment 7 Francesco Romani 2023-03-27 19:25:31 UTC
(In reply to Timothée Ravier from comment #6)
> Which application is creating this socket?
> `/var/lib/kubelet/pod-resources/kubelet.sock`

Kubelet is. The kubelet exposes an API to monitor the assignment of compute resources to pods: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources . Clients connecting to that endpoint _cannot_ change the kubelet state; the API is for monitoring only.
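
For context, a minimal sketch of such a monitoring client (illustrative only, not the actual podrescli source; the socket path assumes the mount used in the daemonset below) could look like this:
```
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Path where the host socket is mounted inside the container
	// (see the daemonset volumeMounts below).
	const socket = "unix:///host-podresources-socket/kubelet.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connecting to the socket is the step that fails when the sock_file
	// label does not allow access from the container's domain.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		log.Fatalf("cannot connect to %q: %v", socket, err)
	}
	defer conn.Close()

	// List is read-only: it reports which resources are assigned to pods.
	cli := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := cli.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List failed: %v", err)
	}
	for _, pod := range resp.GetPodResources() {
		fmt.Printf("%s/%s: %d containers\n", pod.GetNamespace(), pod.GetName(), len(pod.GetContainers()))
	}
}
```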

Regarding this denial
```
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

What's puzzling is that the application runs through a daemonset which uses the policy linked above. The daemonset YAML looks like this:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2023-03-27T08:14:16Z"
  generation: 1
  name: numaresourcesoperator-worker
  namespace: numaresources
  ownerReferences:
  - apiVersion: nodetopology.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NUMAResourcesOperator
    name: numaresourcesoperator
    uid: 04b34ca6-8a85-4956-aa59-441a03b10e29
  resourceVersion: "2356026"
  uid: fdc60be1-5b98-4f20-9fca-28ca674836c7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: resource-topology
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: resource-topology
    spec:
      containers:
      - args:
        - --sleep-interval=10s
        - --sysfs=/host-sys
        - --podresources-socket=unix:///host-podresources-socket/kubelet.sock
        - --v=2
        - --refresh-node-resources
        - --oci-hint-dir=/run/rte
        - --pods-fingerprint
        - --pods-fingerprint-status-file=/run/pfpstatus/dump.json
        command:
        - /bin/resource-topology-exporter
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: REFERENCE_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: REFERENCE_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: REFERENCE_CONTAINER_NAME
          value: shared-pool-container
        - name: METRICS_PORT
          value: "2112"
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: resource-topology-exporter
        ports:
        - containerPort: 2112
          name: metrics-port
          protocol: TCP
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsUser: 0
          seLinuxOptions:
            level: s0
            type: rte.process
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host-sys
          name: host-sys
          readOnly: true
        - mountPath: /host-podresources-socket/kubelet.sock
          name: host-podresources-socket
        - mountPath: /host-run-rte
          name: host-run-rte
        - mountPath: /etc/resource-topology-exporter/
          name: rte-config-volume
        - mountPath: /run/pfpstatus
          name: run-pfpstatus
      - args:
        - while true; do sleep 30s; done
        command:
        - /bin/sh
        - -c
        - --
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: shared-pool-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      readinessGates:
      - conditionType: PodresourcesFetched
      - conditionType: NodeTopologyUpdated
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rte
      serviceAccountName: rte
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /sys
          type: Directory
        name: host-sys
      - hostPath:
          path: /var/lib/kubelet/pod-resources/kubelet.sock
          type: Socket
        name: host-podresources-socket
      - hostPath:
          path: /run/rte
          type: DirectoryOrCreate
        name: host-run-rte
      - configMap:
          defaultMode: 420
          name: numaresourcesoperator-worker
          optional: true
        name: rte-config-volume
      - emptyDir:
          medium: Memory
          sizeLimit: 8Mi
        name: run-pfpstatus
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2
```

Comment 8 Timothée Ravier 2023-03-28 17:24:39 UTC
If all containers should be able to access it, then it should likely be labeled `system_u:object_r:container_file_t`, as Dan said.

Not sure where this should be done, but we likely need either a policy rule for it or the kubelet should make sure the socket gets the right label.

Comment 9 Francesco Romani 2023-03-28 19:03:25 UTC
(In reply to Timothée Ravier from comment #8)
> If all containers should be able to access it then it should likely be
> labeled as `system_u:object_r:container_file_t` as Dan said.
> 
> Not sure where this should be done but likely either need a policy rule for
> it or the kubelet should make sure it gets the right label.

Makes sense. Should we move this bug to the kubelet or to the openshift-hyperkube package?

I'm not sure this would solve the unconfined_service_t denial, but fixing the kubelet.sock label is surely a good start.

Comment 10 Timothée Ravier 2023-03-29 11:37:18 UTC
Input from the SELinux folks would be appreciated on the best approach:
- change the kubelet to make sure the socket gets created with the right label (a rough sketch of this option follows below)
- or add a type transition rule to the policy so that the socket gets the right label when the kubelet creates it

Not sure which one is best / recommended.
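
For the first option, a purely illustrative sketch (not actual kubelet code), assuming the kubelet or a helper relabels the socket right after creating it using the opencontainers go-selinux bindings:
```
package main

import (
	"log"

	selinux "github.com/opencontainers/selinux/go-selinux"
)

// Hypothetical helper: relabel the pod-resources socket so that confined
// containers (container_t) are allowed to connect to it.
func relabelPodResourcesSocket(path string) error {
	if !selinux.GetEnabled() {
		return nil // nothing to do on hosts without SELinux
	}
	// container_file_t is the type confined containers may read/write.
	return selinux.Chcon(path, "system_u:object_r:container_file_t:s0", false)
}

func main() {
	if err := relabelPodResourcesSocket("/var/lib/kubelet/pod-resources/kubelet.sock"); err != nil {
		log.Fatalf("relabeling pod-resources socket: %v", err)
	}
}
```
The second option would achieve the same label through a transition rule in the policy, without touching kubelet code.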

Comment 11 Daniel Walsh 2023-04-01 20:28:30 UTC
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0

This is a process label on the service, not a file on disk. Basically you have a systemd service that is labeled bin_t and is listening on "/var/lib/kubelet/pod-resources/kubelet.sock".

And resource-topolo is attempting to connect to it. That service might be kubernetes (the kubelet?).

We have the following labeling.

/usr/s?bin/kubelet.*		--	gen_context(system_u:object_r:kubelet_exec_t,s0)

Is this the correct binary that the systemd service is running? And is it labeled kubelet_exec_t?

Comment 13 Martin Sivák 2023-04-03 13:56:02 UTC
Oh, thanks Dan for explaining this. So this is not a denial of access to the socket file, but a denial to talk to the process on the other side of the socket. I did not know SELinux could do this.

For the sake of the public side of the bug: Francesco reported that the kubelet is indeed running with the unconfined_service_t context, which is unexpected.

Comment 14 RHEL Program Management 2023-09-11 19:36:21 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 15 RHEL Program Management 2023-09-11 19:41:35 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.