This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2180456 - grant access to unprivileged containers to the kubelet podresources socket
Summary: grant access to unprivileged containers to the kubelet podresources socket
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: container-selinux
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assignee: Daniel Walsh
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-21 14:11 UTC by Francesco Romani
Modified: 2023-09-11 19:41 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-11 19:41:35 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
example pod yaml (2.18 KB, text/plain)
2023-03-21 14:11 UTC, Francesco Romani
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-3128 0 None Migrated None 2023-09-11 19:39:36 UTC
Red Hat Issue Tracker RHELPLAN-152667 0 None None None 2023-03-22 06:45:19 UTC

Description Francesco Romani 2023-03-21 14:11:04 UTC
Created attachment 1952417 [details]
example pod yaml

Description of problem:
Containerized workloads running as container_t want to connect to the kubelet podresources API through a unix domain socket at `/var/lib/kubelet/pod-resources/kubelet.sock`.

Both `/var/lib/kubelet/pod-resources` and `/var/lib/kubelet/pod-resources/kubelet.sock` are currently labeled `container_var_lib_t`.

They should probably be treated like device plugin sockets.

This is especially important to enable the NUMA-aware scheduler.
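
For reference, one way to compare what the loaded policy assigns to the pod-resources path versus the device-plugins path, and what is actually on disk (a quick sketch; `matchpathcon` comes from libselinux-utils, and the device-plugins path is only shown here for the comparison suggested above):

```
# Default labels the loaded policy assigns to the two socket paths
matchpathcon /var/lib/kubelet/pod-resources/kubelet.sock \
             /var/lib/kubelet/device-plugins/kubelet.sock

# Label actually present on the socket on disk
ls -Z /var/lib/kubelet/pod-resources/kubelet.sock
```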

Version-Release number of selected component (if applicable):
selinux-policy-38.1.8-1.el9.noarch
selinux-policy-targeted-38.1.8-1.el9.noarch
container-selinux-2.199.0-1.el9.noarch


How reproducible:
100%

Steps to Reproduce:
1. run an unprivileged container, mounting the kubelet podresources socket (see the attached example yaml)
2. try to access the podresources socket. An example test client is available at https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/releases/tag/v0.0.1 - `podrescli --podresources-socket unix:///host-podresources-socket/kubelet.sock` (a sketch for surfacing the resulting denials on the host follows below)
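
As mentioned in step 2, the resulting denials can be surfaced on the host node roughly like this (a sketch; assumes auditd is running and root access on the node):

```
# Show recent AVC denials involving the podresources socket
ausearch -m avc -ts recent | grep kubelet.sock
```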

Actual results:
avc denials:
```
type=AVC msg=audit(1679405027.674:79402): avc:  denied  { write } for  pid=2236325 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
type=AVC msg=audit(1679405031.376:79403): avc:  denied  { write } for  pid=2236496 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
type=AVC msg=audit(1679405032.046:79404): avc:  denied  { write } for  pid=2236590 comm="podrescli" name="kubelet.sock" dev="sdb4" ino=144703618 scontext=system_u:system_r:container_t:s0:c25,c26 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=sock_file permissive=0
```

Expected results:
access should be granted

Additional info:
At the moment we are using a custom policy, and we can keep using it as a stopgap fix. The policy is, however, also broken with the packages above.

policy: https://github.com/k8stopologyawareschedwg/deployer/blob/main/pkg/assets/rte/selinuxpolicy-ocp411.cil
denials:
```
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```
On that system (a different nightly build, though) the label was
```
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 15 22:02 /var/lib/kubelet/pod-resources/kubelet.sock
```
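
For completeness, the generic way to turn denials like the ones above into a local stopgap module is audit2allow. This is only an illustration of the stopgap approach: it grants exactly what was denied (including the over-broad unconfined_service_t access), so it is not the fix being requested here. The module name below is arbitrary:

```
# Generate and load a local policy module from the recorded denials
# (stopgap only: it allows exactly what was denied, nothing narrower)
ausearch -m avc -ts recent | grep kubelet.sock | audit2allow -M local_podresources
semodule -i local_podresources.pp
```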

Comment 1 Martin Sivák 2023-03-23 17:26:30 UTC
I reproduced the issue and there is something really weird going on.
 
RTE is denied access to the kubelet.sock by SELinux as Francesco said already:


type=AVC msg=audit(1679590155.184:76): avc:  denied  { connectto } for  pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
    Was caused by:
        Missing type enforcement (TE) allow rule.

        You can use audit2allow to generate a loadable module to allow this access.


Notice the target context though: system_u:system_r:unconfined_service_t:s0

What baffles me is that the file has the proper context both from the node side and from the container side:

Node:

sh-5.1# ls -alZ /var/lib/kubelet/pod-resources/kubelet.sock
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 23 16:29 /var/lib/kubelet/pod-resources/kubelet.sock

Container:

The file is mounted into a container:

  volumeMounts:
  - mountPath: /host-podresources-socket/kubelet.sock
    name: host-podresources-socket

  volumes:
  - hostPath:
      path: /var/lib/kubelet/pod-resources/kubelet.sock
      type: Socket
    name: host-podresources-socket

And the end result from within the container looks like this:

sh-4.4# cd host-podresources-socket/
sh-4.4# ls -alZ
total 0
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0    26 Mar 23 16:44 .
dr-xr-xr-x. 1 root root system_u:object_r:container_file_t:s0    85 Mar 23 16:44 ..
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0  0 Mar 23 16:32 kubelet.sock


So where is the unconfined_service_t coming from? On a file?

Comment 2 Martin Sivák 2023-03-23 17:29:09 UTC
I am also raising the severity as this is blocking the GA of a long-awaited OCP feature. The workaround we have is insecure as it simply allows access to unconfined_service_t.

Comment 4 Daniel Walsh 2023-03-25 11:14:18 UTC
If you want a container to talk to the kubelet.sock, you should run it as spc_t, since this seems like a major escape vector for the container.

If you still need a confined container, then whoever creates

srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0  0 Mar 23 16:32 kubelet.sock

needs to make sure it is labeled container_file_t.
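
As a rough sketch of that relabeling on a node (the file-context pattern below is an assumption for illustration; the proper place for such a rule would be container-selinux's file contexts or the kubelet itself):

```
# One-off relabel of the existing socket (lost when kubelet recreates it)
chcon -t container_file_t /var/lib/kubelet/pod-resources/kubelet.sock

# Persistent local file-context rule, then apply it
semanage fcontext -a -t container_file_t '/var/lib/kubelet/pod-resources(/.*)?'
restorecon -RFv /var/lib/kubelet/pod-resources
```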

The second AVC is not related to container-selinux at all.

type=AVC msg=audit(1679590155.184:76): avc:  denied  { connectto } for  pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0

This does not seem to be related to the first AVC, since now we have a new SELinux type, rte.process, which does not come from the container-selinux package. No idea where it comes from.

The unconfined_service_t indicates that you set up a systemd service to run an application labeled bin_t. You might need to change the label of this executable to something like kubelet_exec_t, but I am just shooting in the dark, since I do not know this policy or tool.

Comment 5 Martin Sivák 2023-03-26 13:39:22 UTC
Daniel, I do not think you are reading the AVC denial correctly.

The source context is the container process. rte.process comes from a custom policy defined here: https://github.com/k8stopologyawareschedwg/deployer/blob/main/pkg/assets/rte/selinuxpolicy/ocp_v4.12.cil , which we have exactly for the reasons you mentioned (a confined container with an exception to access /var/lib/kubelet/pod-resources/kubelet.sock).

Notice that this is a different kubelet.sock (the pod-resources API, a reporting API for NUMA capacity reporting) than the one you are thinking of (the main kubelet API).

scontext=system_u:system_r:rte.process:s0

The target context is a unix stream socket (the above-mentioned /var/lib/kubelet/pod-resources/kubelet.sock); notice the tclass. The label does not make sense at all for a unix domain socket. The socket file itself has the proper label, container_var_lib_t, as mentioned above.

tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket

The denied operation, "avc:  denied  { connectto }", is also a socket connection.
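
A quick way to see which process is actually listening on the other end of that socket, and in which SELinux domain it runs (a sketch; <PID> is whatever ss reports for the listener):

```
# Find the process listening on the podresources socket ...
ss -xlp | grep 'pod-resources/kubelet.sock'
# ... then check its SELinux domain (replace <PID> with the PID reported above)
ps -Zp <PID>
```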

Comment 6 Timothée Ravier 2023-03-27 16:35:09 UTC
Which application is creating this socket? `/var/lib/kubelet/pod-resources/kubelet.sock`

Comment 7 Francesco Romani 2023-03-27 19:25:31 UTC
(In reply to Timothée Ravier from comment #6)
> Which application is creating this socket?
> `/var/lib/kubelet/pod-resources/kubelet.sock`

Kubelet is. Kubelet exposes an API to monitor the assignment of compute resources to pods: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources . Clients connecting to that endpoint _cannot_ change the kubelet state. The API is for monitoring only.

Regarding this denial
```
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

What's puzzling is that the application is run through a daemonset which uses the policy linked above. The daemonset YAML looks like this:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2023-03-27T08:14:16Z"
  generation: 1
  name: numaresourcesoperator-worker
  namespace: numaresources
  ownerReferences:
  - apiVersion: nodetopology.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NUMAResourcesOperator
    name: numaresourcesoperator
    uid: 04b34ca6-8a85-4956-aa59-441a03b10e29
  resourceVersion: "2356026"
  uid: fdc60be1-5b98-4f20-9fca-28ca674836c7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: resource-topology
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: resource-topology
    spec:
      containers:
      - args:
        - --sleep-interval=10s
        - --sysfs=/host-sys
        - --podresources-socket=unix:///host-podresources-socket/kubelet.sock
        - --v=2
        - --refresh-node-resources
        - --oci-hint-dir=/run/rte
        - --pods-fingerprint
        - --pods-fingerprint-status-file=/run/pfpstatus/dump.json
        command:
        - /bin/resource-topology-exporter
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: REFERENCE_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: REFERENCE_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: REFERENCE_CONTAINER_NAME
          value: shared-pool-container
        - name: METRICS_PORT
          value: "2112"
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: resource-topology-exporter
        ports:
        - containerPort: 2112
          name: metrics-port
          protocol: TCP
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsUser: 0
          seLinuxOptions:
            level: s0
            type: rte.process
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host-sys
          name: host-sys
          readOnly: true
        - mountPath: /host-podresources-socket/kubelet.sock
          name: host-podresources-socket
        - mountPath: /host-run-rte
          name: host-run-rte
        - mountPath: /etc/resource-topology-exporter/
          name: rte-config-volume
        - mountPath: /run/pfpstatus
          name: run-pfpstatus
      - args:
        - while true; do sleep 30s; done
        command:
        - /bin/sh
        - -c
        - --
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: shared-pool-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      readinessGates:
      - conditionType: PodresourcesFetched
      - conditionType: NodeTopologyUpdated
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rte
      serviceAccountName: rte
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /sys
          type: Directory
        name: host-sys
      - hostPath:
          path: /var/lib/kubelet/pod-resources/kubelet.sock
          type: Socket
        name: host-podresources-socket
      - hostPath:
          path: /run/rte
          type: DirectoryOrCreate
        name: host-run-rte
      - configMap:
          defaultMode: 420
          name: numaresourcesoperator-worker
          optional: true
        name: rte-config-volume
      - emptyDir:
          medium: Memory
          sizeLimit: 8Mi
        name: run-pfpstatus
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2
```

Comment 8 Timothée Ravier 2023-03-28 17:24:39 UTC
If all containers should be able to access it then it should likely be labeled as `system_u:object_r:container_file_t` as Dan said.

Not sure where this should be done, but we likely either need a policy rule for it, or the kubelet should make sure the socket gets the right label.

Comment 9 Francesco Romani 2023-03-28 19:03:25 UTC
(In reply to Timothée Ravier from comment #8)
> If all containers should be able to access it then it should likely be
> labeled as `system_u:object_r:container_file_t` as Dan said.
> 
> Not sure where this should be done but likely either need a policy rule for
> it or the kubelet should make sure it gets the right label.

Makes sense. Should we move this bug to the kubelet or to the openshift-hyperkube package?

I'm not sure this would solve the unconfined_service_t denial, but fixing the kubelet.sock label is surely a good start.

Comment 10 Timothée Ravier 2023-03-29 11:37:18 UTC
Input from SELinux folks would be appreciated on the best approach:
- change kubelet to make sure the socket gets created with the right label
- or add a type transition rule to the policy to set the socket to the right label on creation by the kubelet

Not sure which one is best / recommended.
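
Purely as an illustration of the second option (it assumes the kubelet would run in a dedicated domain such as kubelet_t, whereas this bug shows it running as unconfined_service_t, and the rule below is a sketch rather than a proposed patch), a named file transition could be packaged in a small local CIL module:

```
# Hypothetical local CIL module adding a named file transition for the socket
cat > local_podresources_transition.cil <<'EOF'
; When a kubelet_t process creates "kubelet.sock" inside a container_var_lib_t
; directory, label the new sock_file as container_file_t.
(typetransition kubelet_t container_var_lib_t sock_file "kubelet.sock" container_file_t)
EOF
semodule -i local_podresources_transition.cil
```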

Comment 11 Daniel Walsh 2023-04-01 20:28:30 UTC
type=AVC msg=audit(1679044968.338:5383): avc:  denied  { connectto } for  pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0

This is a process label on the service, not a file on disk. Basically you have a systemd service whose executable is labeled bin_t and which is listening on "/var/lib/kubelet/pod-resources/kubelet.sock".

And resource-topolo is attempting to connect to it. The listener might be kubernetes (the kubelet?).

We have the following labeling.

/usr/s?bin/kubelet.*		--	gen_context(system_u:object_r:kubelet_exec_t,s0)

Is this the binary that is actually running in the systemd service? And is it labeled kubelet_exec_t?
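
That can be checked directly on the node, for example (a sketch; the path assumes the usual kubelet binary location matched by the file-context line above):

```
# Domain the running kubelet is actually in
ps -eZ | grep kubelet

# On-disk label of the kubelet binary vs. what the policy expects
ls -Z /usr/bin/kubelet
matchpathcon /usr/bin/kubelet

# Relabel it if the two differ, then restart the service
restorecon -v /usr/bin/kubelet
```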

Comment 13 Martin Sivák 2023-04-03 13:56:02 UTC
Oh, thanks Dan for explaining this. So this is not a denial to access the socket file, but a denial to talk to the other side of the socket. I did not know SELinux could do this.

For the sake of the public side of the bug, Francesco reported that kubelet is indeed running with the unconfined_service_t context, which is unexpected.

Comment 14 RHEL Program Management 2023-09-11 19:36:21 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 15 RHEL Program Management 2023-09-11 19:41:35 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

