Bug 1593211 - Prometheus: kubernetes-service-endpoint, SELinux
Summary: Prometheus: kubernetes-service-endpoint, SELinux
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-20 09:46 UTC by pk
Modified: 2018-10-11 07:21 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: node_exporter starts with the wifi collector enabled by default. Consequence: the wifi collector requires SELinux permissions that aren't enabled. This causes AVC denials though it doesn't stop node_exporter. Fix: node_exporter starts with the wifi collector being explicitly disabled. Result: SELinux doesn't report any AVC denial.
Clone Of:
Environment:
Last Closed: 2018-10-11 07:20:54 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 8914 | None | closed | Disable the wifi collector in node_exporter | 2020-02-17 17:05:50 UTC
Github | prometheus node_exporter issues 1008 | None | open | Node-Exporter : memory usage too high (OOME) | 2020-02-17 17:05:50 UTC
Github | prometheus node_exporter issues 649 | None | closed | SELinux denies wifi controller netlink acces | 2020-02-17 17:05:50 UTC
Red Hat Product Errata | RHBA-2018:2652 | None | None | None | 2018-10-11 07:21:18 UTC

Description pk 2018-06-20 09:46:53 UTC
Description of problem:
The kubernetes-service-endpoints targets go down at random; the logs show SELinux-related denials.
Manually opening port 9100 brought the endpoints back up, but they went DOWN again after some time.

The port 9100 rule had disappeared from the iptables configuration.

Version-Release number of selected component (if applicable):
OCP 3.9.20 on RHEL 7.5

[root@mn-infra-general01 redhat]# oc version
oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
Every time. After configuring iptables to allow the proper port (9100/tcp), the endpoints went down again after some time. The configuration is not persistent.


Steps to Reproduce:
1. Configure iptables.
2. Restart the iptables service.
3. The endpoints recover but fail again after some time.

Actual results:
Random endpoints DOWN.

Expected results:
Endpoints to be up all the time.

Additional info:
[root@mn-infra-general01 redhat]# sealert -l 249798f8-134f-42ed-9b18-6448f2c7e20e
SELinux is preventing /usr/bin/node_exporter from create access on the netlink_socket Unknown.

*****  Plugin catchall_boolean (89.3 confidence) suggests   ******************

If you want to allow virt to sandbox use netlink
Then you must tell SELinux about this by enabling the 'virt_sandbox_use_netlink' boolean.

Do
setsebool -P virt_sandbox_use_netlink 1

*****  Plugin catchall (11.6 confidence) suggests   **************************

If you believe that node_exporter should be allowed create access on the Unknown netlink_socket by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# ausearch -c 'node_exporter' --raw | audit2allow -M my-nodeexporter
# semodule -i my-nodeexporter.pp


Additional Information:
Source Context                system_u:system_r:container_t:s0:c0,c10
Target Context                system_u:system_r:container_t:s0:c0,c10
Target Objects                Unknown [ netlink_socket ]
Source                        node_exporter
Source Path                   /usr/bin/node_exporter
Port                          <Unknown>
Host                          mn-infra-general01
Source RPM Packages           
Target RPM Packages           
Policy RPM                    selinux-policy-3.13.1-192.el7_5.3.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     mn-infra-general01
Platform                      Linux mn-infra-general01 3.10.0-862.3.2.el7.x86_64
                              #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64
Alert Count                   76
First Seen                    2018-06-20 11:02:08 +07
Last Seen                     2018-06-20 16:24:50 +07
Local ID                      249798f8-134f-42ed-9b18-6448f2c7e20e

Raw Audit Messages
type=AVC msg=audit(1529486690.697:63842): avc:  denied  { create } for  pid=24140 comm="node_exporter" scontext=system_u:system_r:container_t:s0:c0,c10 tcontext=system_u:system_r:container_t:s0:c0,c10 tclass=netlink_socket


type=SYSCALL msg=audit(1529486690.697:63842): arch=x86_64 syscall=socket success=no exit=EACCES a0=10 a1=3 a2=10 a3=0 items=0 ppid=24120 pid=24140 auid=4294967295 uid=1000090000 gid=0 euid=1000090000 suid=1000090000 fsuid=1000090000 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=node_exporter exe=/usr/bin/node_exporter subj=system_u:system_r:container_t:s0:c0,c10 key=(null)

Hash: node_exporter,container_t,container_t,netlink_socket,create
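
For readers decoding the raw audit messages above, here is a minimal Python sketch (not part of the original report) that pulls the key fields out of an AVC record and interprets the socket() arguments of the paired SYSCALL record. The helper names parse_avc and decode_socket_args are hypothetical, the constant tables cover only the values seen in this report, and the sketch assumes that a0..a2 are printed in hexadecimal, as in raw audit records.

```python
import re

# Numeric values from the Linux headers; only the ones seen in this report.
ADDRESS_FAMILIES = {16: "AF_NETLINK"}
SOCKET_TYPES = {3: "SOCK_RAW"}
NETLINK_PROTOCOLS = {16: "NETLINK_GENERIC"}

# Sample records copied (SYSCALL one shortened) from the report above.
AVC_LINE = (
    'type=AVC msg=audit(1529486690.697:63842): avc:  denied  { create } for  '
    'pid=24140 comm="node_exporter" '
    'scontext=system_u:system_r:container_t:s0:c0,c10 '
    'tcontext=system_u:system_r:container_t:s0:c0,c10 tclass=netlink_socket'
)
SYSCALL_LINE = (
    'type=SYSCALL msg=audit(1529486690.697:63842): arch=x86_64 syscall=socket '
    'success=no exit=EACCES a0=10 a1=3 a2=10 a3=0 items=0'
)


def parse_avc(line):
    """Return the denied permissions, comm, contexts and class of an AVC record."""
    perms = re.search(r"denied\s+\{\s*([^}]*?)\s*\}", line)
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    return {
        "permissions": perms.group(1).split() if perms else [],
        "comm": fields.get("comm", "").strip('"'),
        "scontext": fields.get("scontext"),
        "tcontext": fields.get("tcontext"),
        "tclass": fields.get("tclass"),
    }


def decode_socket_args(syscall_line):
    """Interpret a0..a2 of a socket() SYSCALL record (assumed to be hex)."""
    fields = dict(re.findall(r"\b(a[0-2])=([0-9a-fA-F]+)", syscall_line))
    family, sock_type, protocol = (
        int(fields[key], 16) for key in ("a0", "a1", "a2")
    )
    # Unknown numbers are returned as-is rather than guessed.
    return (
        ADDRESS_FAMILIES.get(family, family),
        SOCKET_TYPES.get(sock_type, sock_type),
        NETLINK_PROTOCOLS.get(protocol, protocol),
    )


print(parse_avc(AVC_LINE))
print(decode_socket_args(SYSCALL_LINE))
```

Read this way, the denied call is a generic netlink socket being created from inside the container_t domain.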

Comment 1 pk 2018-06-20 16:23:18 UTC
Another SELinux issue, with another port:

[root@mn-master01 redhat]# sealert -l 3391218e-8f7b-457c-b4ba-ccb614d38c7e
SELinux is preventing /usr/bin/node_exporter from module_request access on the system Unknown.

*****  Plugin catchall_boolean (89.3 confidence) suggests   ******************

If you want to allow domain to kernel load modules
Then you must tell SELinux about this by enabling the 'domain_kernel_load_modules' boolean.

Do
setsebool -P domain_kernel_load_modules 1

*****  Plugin catchall (11.6 confidence) suggests   **************************

If you believe that node_exporter should be allowed module_request access on the Unknown system by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# ausearch -c 'node_exporter' --raw | audit2allow -M my-nodeexporter
# semodule -i my-nodeexporter.pp


Additional Information:
Source Context                system_u:system_r:container_t:s0:c0,c10
Target Context                system_u:system_r:kernel_t:s0
Target Objects                Unknown [ system ]
Source                        node_exporter
Source Path                   /usr/bin/node_exporter
Port                          <Unknown>
Host                          mn-master01
Source RPM Packages           
Target RPM Packages           
Policy RPM                    selinux-policy-3.13.1-192.el7_5.3.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     mn-master01
Platform                      Linux mn-master01 3.10.0-862.3.2.el7.x86_64 #1 SMP
                              Tue May 15 18:22:15 EDT 2018 x86_64 x86_64
Alert Count                   360
First Seen                    2018-06-20 17:16:44 +07
Last Seen                     2018-06-20 23:15:44 +07
Local ID                      3391218e-8f7b-457c-b4ba-ccb614d38c7e

Raw Audit Messages
type=AVC msg=audit(1529511344.593:160889): avc:  denied  { module_request } for  pid=120919 comm="node_exporter" kmod="net-pf-16-proto-16-family-nl80211" scontext=system_u:system_r:container_t:s0:c0,c10 tcontext=system_u:system_r:kernel_t:s0 tclass=system


type=SYSCALL msg=audit(1529511344.593:160889): arch=x86_64 syscall=sendmsg success=yes exit=EPIPE a0=6 a1=c420034580 a2=0 a3=0 items=0 ppid=120895 pid=120919 auid=4294967295 uid=1000090000 gid=0 euid=1000090000 suid=1000090000 fsuid=1000090000 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=node_exporter exe=/usr/bin/node_exporter subj=system_u:system_r:container_t:s0:c0,c10 key=(null)

Hash: node_exporter,container_t,kernel_t,system,module_request
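
The kmod value in the AVC record above follows the kernel's module alias convention for generic netlink families. As a rough illustration (the function name decode_kmod_alias is hypothetical and not something used in this report), the alias can be split into its parts like this:

```python
import re

# Alias layout: net-pf-<protocol family>-proto-<protocol>[-family-<genl name>]
KMOD_ALIAS = re.compile(r"^net-pf-(\d+)-proto-(\d+)(?:-family-(\S+))?$")


def decode_kmod_alias(alias):
    """Split a network module alias into its numeric and named parts."""
    match = KMOD_ALIAS.match(alias)
    if match is None:
        return None  # not a net-pf-* alias
    return {
        "protocol_family": int(match.group(1)),  # 16 is AF_NETLINK
        "protocol": int(match.group(2)),         # 16 is NETLINK_GENERIC
        "genl_family": match.group(3),           # e.g. nl80211
    }


print(decode_kmod_alias("net-pf-16-proto-16-family-nl80211"))
```

nl80211 is the netlink interface used to configure 802.11 wireless devices, so this request points at something probing WiFi interfaces from inside the container.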

Comment 2 pk 2018-06-21 02:53:23 UTC
In iptables, the ports with the issue are jetdirect and jetcmeserver:

Chain OS_FIREWALL_ALLOW (1 references)
target     prot opt source               destination         
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:2379
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:2380
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:pcsync-http
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:senomix02
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:senomix02
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:websm
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:10250
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetdirect
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetcmeserver
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:4789
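
The listing above shows service names because iptables -L resolves ports through /etc/services unless -n is given. The following sketch, which is only an illustration, extracts the accepted destination ports from such a listing. The SERVICE_PORTS table is an assumption based on the usual /etc/services entries (jetdirect as 9100 and jetcmeserver as 1936, the two port numbers discussed elsewhere in this report), and allowed_ports is a hypothetical helper name.

```python
import re

# Assumed name-to-port mapping, as commonly found in /etc/services.
SERVICE_PORTS = {"jetdirect": 9100, "jetcmeserver": 1936}

SAMPLE_LISTING = """\
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:2379
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetdirect
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetcmeserver
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:4789
"""


def allowed_ports(listing, services=SERVICE_PORTS):
    """Return the (protocol, port) pairs accepted in an `iptables -L` listing."""
    ports = set()
    for line in listing.splitlines():
        match = re.search(r"^ACCEPT\s+(tcp|udp)\b.*\bdpt:(\S+)", line)
        if not match:
            continue
        proto, dpt = match.groups()
        # Numeric ports are converted; unknown names are kept as strings.
        port = int(dpt) if dpt.isdigit() else services.get(dpt, dpt)
        ports.add((proto, port))
    return ports


ports = allowed_ports(SAMPLE_LISTING)
print(("tcp", 9100) in ports, ("tcp", 1936) in ports)
```

Running the same check periodically on a node would show whether the rules for these two ports are still present when the targets go down.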

Comment 3 Frederic Branczyk 2018-06-21 07:49:52 UTC
Reassigning to Paul Gier as this is about the old Prometheus tech preview.

Comment 4 Simon Pasquier 2018-06-22 09:24:40 UTC
I've checked in my local environment (all-in-one OpenShift 3.9 deployed with Ansible) and I'm also getting some SELinux errors.

I get the first AVC, reported as "netlink_generic_socket Unknown". According to the upstream issue [1], it is caused by the wifi collector, which probes the WiFi interfaces. For OpenShift installations, there's no need to use the wifi collector: we should be passing the "--no-collector.wifi" option on the node_exporter command line.
I don't see the second reported AVC, but given the kernel module name (net-pf-16-proto-16-family-nl80211), it also relates to the wifi collector, which is enabled by default.

I've also spotted [2] which reports the same issue but for Fedora (where it is legitimate to have the wifi collector working).

Having said that, I'm not sure that this is the reason why Prometheus can't scrape the node-exporter targets, as those SELinux issues don't prevent node-exporter from starting. Noting that the problematic ports are 1936 and 9100, I would rather relate your problem to [3] and [4]. Are you sure that the required ports on the firewall are still open once the targets go down again?

[1] https://github.com/prometheus/node_exporter/issues/649
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1585415
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1563888
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1552235
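
To make the proposed "--no-collector.wifi" change concrete, here is a small, purely illustrative sketch of normalizing a node_exporter argument list so that the wifi collector ends up disabled. The helper with_wifi_disabled is hypothetical; how the actual openshift-ansible change passes the flag is not shown in this report.

```python
def with_wifi_disabled(args):
    """Return a copy of node_exporter arguments with the wifi collector off."""
    # Drop any existing wifi toggle, then disable the collector explicitly.
    cleaned = [
        arg for arg in args
        if arg not in ("--collector.wifi", "--no-collector.wifi")
    ]
    return cleaned + ["--no-collector.wifi"]


print(with_wifi_disabled(["--web.listen-address=:9100", "--collector.wifi"]))
```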

Comment 5 pk 2018-06-25 02:01:40 UTC
The firewall/iptables settings were gone when the targets went down. We had to reconfigure them, after which the node came up again.
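
One way to confirm this from the Prometheus side is a plain TCP reachability check against the scrape ports. The sketch below is an assumption-laden illustration rather than anything run in this report: port_open is a hypothetical helper, and a False result only means the connection was refused, dropped, or timed out.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, calling port_open("mn-infra-general01", 9100) and port_open("mn-infra-general01", 1936) from another node while a target is DOWN would indicate whether the firewall rule for that port is still in effect.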

Comment 6 Simon Pasquier 2018-06-25 13:05:43 UTC
@pk I suppose that, for some reason, the firewall rules that allowed Prometheus to scrape the targets were removed at some point. The firewall issues are already tracked here:

https://bugzilla.redhat.com/show_bug.cgi?id=1563888
https://bugzilla.redhat.com/show_bug.cgi?id=1552235

I propose that you follow those tickets to track progress. This ticket will track only the SELinux issue.

Comment 7 Simon Pasquier 2018-07-03 08:17:36 UTC
https://github.com/openshift/openshift-ansible/pull/8914 has been merged into upstream.

It has also been backported to the 3.9 release: https://github.com/openshift/openshift-ansible/pull/9007

And it will be backported to 3.10 after the initial release: https://github.com/openshift/openshift-ansible/pull/9006

Comment 9 Junqi Zhao 2018-08-31 06:57:40 UTC
The kubernetes-service-endpoints targets are not down, and I did not see the following error in /var/log/audit/audit.log:
type=AVC msg=audit(1502978293.035:120770): avc:  denied  { create } for  pid=1938 comm="node_exporter" scontext=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 tcontext=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 tclass=netlink_socket
type=SYSCALL msg=audit(1502978293.035:120770): arch=c000003e syscall=41 success=no exit=-13 a0=10 a1=3 a2=10 a3=0 items=0 ppid=1854 pid=1938 auid=4294967295 uid=992 gid=992 euid=992 suid=992 fsuid=992 egid=992 sgid=992 fsgid=992 tty=(none) ses=4294967295 comm="node_exporter" exe="/opt/gitlab/embedded/bin/node_exporter" subj=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 key=(null)
type=NETFILTER_CFG msg=audit(1502978298.830:120771): table=filter family=2 entries=8

# rpm -qa | grep ansible
openshift-ansible-docs-3.11.0-0.25.0.git.0.7497e69.el7.noarch
openshift-ansible-roles-3.11.0-0.25.0.git.0.7497e69.el7.noarch
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-playbooks-3.11.0-0.25.0.git.0.7497e69.el7.noarch

image:
prometheus-node-exporter-v3.11.0-0.25.0.0

# openshift version
openshift v3.11.0-0.25.0

Comment 11 errata-xmlrpc 2018-10-11 07:20:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

