Bug 2180456
Summary:          grant access to unprivileged containers to the kubelet podresources socket
Product:          Red Hat Enterprise Linux 9
Component:        container-selinux
Version:          9.2
Status:           CLOSED MIGRATED
Severity:         urgent
Priority:         unspecified
Reporter:         Francesco Romani <fromani>
Assignee:         Daniel Walsh <dwalsh>
QA Contact:       atomic-bugs <atomic-bugs>
CC:               dornelas, dwalsh, grajaiya, jnovy, lsm5, mboddu, travier, tsweeney
Target Milestone: rc
Target Release:   ---
Keywords:         MigratedToJIRA
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         If docs needed, set a value
Last Closed:      2023-09-11 19:41:35 UTC
Type:             Bug
Description
Francesco Romani, 2023-03-21 14:11:04 UTC
I reproduced the issue and there is something really weird going on. RTE is denied access to the kubelet.sock by SELinux, as Francesco said already:

```
type=AVC msg=audit(1679590155.184:76): avc: denied { connectto } for pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

This was caused by a missing type enforcement (TE) allow rule. You can use audit2allow to generate a loadable module to allow this access.

Notice the target context though: system_u:system_r:unconfined_service_t:s0.

What baffles me is that the file has the proper context from both the node side and the container side.

Node:

```
sh-5.1# ls -alZ /var/lib/kubelet/pod-resources/kubelet.sock
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 23 16:29 /var/lib/kubelet/pod-resources/kubelet.sock
```

Container: the file is mounted into a container:

```
volumeMounts:
- mountPath: /host-podresources-socket/kubelet.sock
  name: host-podresources-socket
volumes:
- hostPath:
    path: /var/lib/kubelet/pod-resources/kubelet.sock
    type: Socket
  name: host-podresources-socket
```

And the end result from within the container looks like this:

```
sh-4.4# cd host-podresources-socket/
sh-4.4# ls -alZ
total 0
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0 26 Mar 23 16:44 .
dr-xr-xr-x. 1 root root system_u:object_r:container_file_t:s0 85 Mar 23 16:44 ..
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 23 16:32 kubelet.sock
```

So where is the unconfined_service_t coming from? On a file?

I am also raising the severity, as this is blocking the GA of a long-awaited OCP feature. The workaround we have is insecure, as it simply allows the access to unconfined_service_t.

If you want a container to talk to the kubelet.sock, you should run it as spc_t, since this seems like a major escape vector for the container.
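(Editorial note: the insecure workaround mentioned above boils down to a single policy rule of roughly the following shape. This is a hedged CIL sketch reconstructed for illustration, not a copy of the deployed policy.)

```
; Hypothetical sketch of the insecure workaround: allow the RTE
; container domain to connectto any unix stream socket whose peer
; process runs as unconfined_service_t. This is far too broad,
; because unconfined_service_t covers every mislabeled systemd
; service on the host, not just the kubelet.
(allow rte.process unconfined_service_t (unix_stream_socket (connectto)))
```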
If you still need a confined container, then whoever creates

```
srwxr-xr-x. 1 root root system_u:object_r:container_var_lib_t:s0 0 Mar 23 16:32 kubelet.sock
```

needs to make sure it is labeled container_file_t.

The second AVC is not related to container-selinux at all:

```
type=AVC msg=audit(1679590155.184:76): avc: denied { connectto } for pid=12566 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

This does not seem to be related to the first AVC, since now we have a new SELinux label type, rte.process, which does not come from the container-selinux package. No idea where it comes from. The unconfined_service_t indicates that you set up a systemd service running an application labeled bin_t; you might need to change the label of this executable to something like kubelet_exec_t, but I am just shooting in the dark, since I do not know this policy or tool.

Daniel, I do not think you are reading the AVC denial correctly. The source context is the container process. rte.process comes from a custom policy defined at https://github.com/k8stopologyawareschedwg/deployer/blob/main/pkg/assets/rte/selinuxpolicy/ocp_v4.12.cil, which we have exactly for the reason you mentioned (a confined container with an exception to access /var/lib/kubelet/pod-resources/kubelet.sock). Notice that this is a different kubelet.sock (the pod-resources API is a reporting API for NUMA capacity) than the one you are thinking of (the main kubelet API).

```
scontext=system_u:system_r:rte.process:s0
```

The target context is a unix stream socket (the above-mentioned /var/lib/kubelet/pod-resources/kubelet.sock); notice the tclass. That label does not make sense at all for a unix domain socket: the socket file itself has the proper label container_var_lib_t, as mentioned above.
```
tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket
```

The denied operation, "avc: denied { connectto }", is also a socket connection.

Which application is creating this socket? `/var/lib/kubelet/pod-resources/kubelet.sock`

(In reply to Timothée Ravier from comment #6)
> Which application is creating this socket?
> `/var/lib/kubelet/pod-resources/kubelet.sock`

Kubelet is. Kubelet exposes an API to monitor the assignment of compute resources to pods: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources. Clients connecting to that endpoint _cannot_ change the kubelet state; the API is for monitoring only.

Regarding this denial:

```
type=AVC msg=audit(1679044968.338:5383): avc: denied { connectto } for pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

what's puzzling is that the application is run through a daemonset, which uses the policy linked above.
The daemonset YAML looks like this:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2023-03-27T08:14:16Z"
  generation: 1
  name: numaresourcesoperator-worker
  namespace: numaresources
  ownerReferences:
  - apiVersion: nodetopology.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NUMAResourcesOperator
    name: numaresourcesoperator
    uid: 04b34ca6-8a85-4956-aa59-441a03b10e29
  resourceVersion: "2356026"
  uid: fdc60be1-5b98-4f20-9fca-28ca674836c7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: resource-topology
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: resource-topology
    spec:
      containers:
      - args:
        - --sleep-interval=10s
        - --sysfs=/host-sys
        - --podresources-socket=unix:///host-podresources-socket/kubelet.sock
        - --v=2
        - --refresh-node-resources
        - --oci-hint-dir=/run/rte
        - --pods-fingerprint
        - --pods-fingerprint-status-file=/run/pfpstatus/dump.json
        command:
        - /bin/resource-topology-exporter
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: REFERENCE_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: REFERENCE_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: REFERENCE_CONTAINER_NAME
          value: shared-pool-container
        - name: METRICS_PORT
          value: "2112"
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: resource-topology-exporter
        ports:
        - containerPort: 2112
          name: metrics-port
          protocol: TCP
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsUser: 0
          seLinuxOptions:
            level: s0
            type: rte.process
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host-sys
          name: host-sys
          readOnly: true
        - mountPath: /host-podresources-socket/kubelet.sock
          name: host-podresources-socket
        - mountPath: /host-run-rte
          name: host-run-rte
        - mountPath: /etc/resource-topology-exporter/
          name: rte-config-volume
        - mountPath: /run/pfpstatus
          name: run-pfpstatus
      - args:
        - while true; do sleep 30s; done
        command:
        - /bin/sh
        - -c
        - --
        image: quay.io/fromani/numaresources-operator:4.13.1011
        imagePullPolicy: IfNotPresent
        name: shared-pool-container
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      readinessGates:
      - conditionType: PodresourcesFetched
      - conditionType: NodeTopologyUpdated
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rte
      serviceAccountName: rte
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /sys
          type: Directory
        name: host-sys
      - hostPath:
          path: /var/lib/kubelet/pod-resources/kubelet.sock
          type: Socket
        name: host-podresources-socket
      - hostPath:
          path: /run/rte
          type: DirectoryOrCreate
        name: host-run-rte
      - configMap:
          defaultMode: 420
          name: numaresourcesoperator-worker
          optional: true
        name: rte-config-volume
      - emptyDir:
          medium: Memory
          sizeLimit: 8Mi
        name: run-pfpstatus
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2
```

If all containers should be able to access it, then it should likely be labeled as `system_u:object_r:container_file_t`, as Dan said. Not sure where this should be done, but we likely either need a policy rule for it, or the kubelet should make sure the socket gets the right label.

(In reply to Timothée Ravier from comment #8)
> If all containers should be able to access it then it should likely be
> labeled as `system_u:object_r:container_file_t` as Dan said.
>
> Not sure where this should be done but likely either need a policy rule for
> it or the kubelet should make sure it gets the right label.

Makes sense. Should we move this bug to the kubelet or to the openshift-hyperkube package?
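(Editorial note: the "policy rule" option mentioned above could be sketched as a named file transition in CIL. This is a hedged illustration, not an actual proposed patch; it assumes the kubelet runs in a dedicated kubelet_t domain, which the thread later shows is not currently the case, since the kubelet is running as unconfined_service_t.)

```
; Hypothetical CIL sketch: when the kubelet (assumed domain kubelet_t)
; creates kubelet.sock inside a directory labeled container_var_lib_t,
; label the new socket container_file_t so confined containers can
; reach it without any relabeling step.
(typetransition kubelet_t container_var_lib_t sock_file "kubelet.sock" container_file_t)
```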
I'm not sure this would solve the unconfined_service_t denial, but fixing the kubelet.sock label is surely a good start.

Input from the SELinux folks would be appreciated on the best approach:
- change kubelet to make sure the socket gets created with the right label, or
- add a type transition rule to the policy to set the socket to the right label on creation by the kubelet.

Not sure which one is best / recommended.

```
type=AVC msg=audit(1679044968.338:5383): avc: denied { connectto } for pid=10664 comm="resource-topolo" path="/var/lib/kubelet/pod-resources/kubelet.sock" scontext=system_u:system_r:rte.process:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
```

This is a process label on the service, not a file on disk. Basically you have a systemd service labeled bin_t which is listening on /var/lib/kubelet/pod-resources/kubelet.sock, and resource-topolo is attempting to connect to it. It might be kubernetes (the kubelet?). We have the following labeling:

```
/usr/s?bin/kubelet.* -- gen_context(system_u:object_r:kubelet_exec_t,s0)
```

Is this the correct service that is running in the systemd service? And is it labeled kubelet_exec_t?

Oh, thanks Dan for explaining this. So this is not a denial to access the socket file, but a denial to talk to the other side of the socket. I did not know SELinux could do this. For the sake of the public side of the bug: Francesco reported that kubelet is indeed running with the unconfined_service_t context, which is unexpected.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated.
Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and will begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.