Description of problem:
Random kubernetes-service-endpoints targets go DOWN; the audit log shows SELinux denials for node_exporter. Manually adding port 9100 to iptables brings the endpoints back up, but they go DOWN again after some time and the 9100/tcp rule disappears from iptables.

Version-Release number of selected component (if applicable):
OCP 3.9.20 on RHEL 7.5

[root@mn-infra-general01 redhat]# oc version
oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
Every time. After configuring iptables to open the proper port (9100/tcp), the target goes DOWN again after some time; the configuration is not persistent.

Steps to Reproduce:
1. Configure iptables to open 9100/tcp (example commands after this comment).
2. Restart the iptables service.
3. The endpoint recovers but fails again after some time.

Actual results:
Random endpoints DOWN.

Expected results:
Endpoints stay up all the time.

Additional info:
[root@mn-infra-general01 redhat]# sealert -l 249798f8-134f-42ed-9b18-6448f2c7e20e
SELinux is preventing /usr/bin/node_exporter from create access on the netlink_socket Unknown.

***** Plugin catchall_boolean (89.3 confidence) suggests ******************

If you want to allow virt to sandbox use netlink
Then you must tell SELinux about this by enabling the 'virt_sandbox_use_netlink' boolean.

Do
setsebool -P virt_sandbox_use_netlink 1

***** Plugin catchall (11.6 confidence) suggests **************************

If you believe that node_exporter should be allowed create access on the Unknown netlink_socket by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.

Do
allow this access for now by executing:
# ausearch -c 'node_exporter' --raw | audit2allow -M my-nodeexporter
# semodule -i my-nodeexporter.pp

Additional Information:
Source Context                system_u:system_r:container_t:s0:c0,c10
Target Context                system_u:system_r:container_t:s0:c0,c10
Target Objects                Unknown [ netlink_socket ]
Source                        node_exporter
Source Path                   /usr/bin/node_exporter
Port                          <Unknown>
Host                          mn-infra-general01
Source RPM Packages
Target RPM Packages
Policy RPM                    selinux-policy-3.13.1-192.el7_5.3.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     mn-infra-general01
Platform                      Linux mn-infra-general01 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64
Alert Count                   76
First Seen                    2018-06-20 11:02:08 +07
Last Seen                     2018-06-20 16:24:50 +07
Local ID                      249798f8-134f-42ed-9b18-6448f2c7e20e

Raw Audit Messages
type=AVC msg=audit(1529486690.697:63842): avc: denied { create } for pid=24140 comm="node_exporter" scontext=system_u:system_r:container_t:s0:c0,c10 tcontext=system_u:system_r:container_t:s0:c0,c10 tclass=netlink_socket

type=SYSCALL msg=audit(1529486690.697:63842): arch=x86_64 syscall=socket success=no exit=EACCES a0=10 a1=3 a2=10 a3=0 items=0 ppid=24120 pid=24140 auid=4294967295 uid=1000090000 gid=0 euid=1000090000 suid=1000090000 fsuid=1000090000 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=node_exporter exe=/usr/bin/node_exporter subj=system_u:system_r:container_t:s0:c0,c10 key=(null)

Hash: node_exporter,container_t,container_t,netlink_socket,create
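Example commands for the manual workaround described above (a sketch only; the exact commands were not captured in this report, and the OS_FIREWALL_ALLOW chain name is taken from the iptables output attached below):

# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT
# systemctl restart iptables

The rule later disappears again, which is consistent with the firewall configuration not being persisted (see the firewall bugs linked further down in this report).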
Another SELinux issue, on another node (mn-master01):

[root@mn-master01 redhat]# sealert -l 3391218e-8f7b-457c-b4ba-ccb614d38c7e
SELinux is preventing /usr/bin/node_exporter from module_request access on the system Unknown.

***** Plugin catchall_boolean (89.3 confidence) suggests ******************

If you want to allow domain to kernel load modules
Then you must tell SELinux about this by enabling the 'domain_kernel_load_modules' boolean.

Do
setsebool -P domain_kernel_load_modules 1

***** Plugin catchall (11.6 confidence) suggests **************************

If you believe that node_exporter should be allowed module_request access on the Unknown system by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.

Do
allow this access for now by executing:
# ausearch -c 'node_exporter' --raw | audit2allow -M my-nodeexporter
# semodule -i my-nodeexporter.pp

Additional Information:
Source Context                system_u:system_r:container_t:s0:c0,c10
Target Context                system_u:system_r:kernel_t:s0
Target Objects                Unknown [ system ]
Source                        node_exporter
Source Path                   /usr/bin/node_exporter
Port                          <Unknown>
Host                          mn-master01
Source RPM Packages
Target RPM Packages
Policy RPM                    selinux-policy-3.13.1-192.el7_5.3.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     mn-master01
Platform                      Linux mn-master01 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64
Alert Count                   360
First Seen                    2018-06-20 17:16:44 +07
Last Seen                     2018-06-20 23:15:44 +07
Local ID                      3391218e-8f7b-457c-b4ba-ccb614d38c7e

Raw Audit Messages
type=AVC msg=audit(1529511344.593:160889): avc: denied { module_request } for pid=120919 comm="node_exporter" kmod="net-pf-16-proto-16-family-nl80211" scontext=system_u:system_r:container_t:s0:c0,c10 tcontext=system_u:system_r:kernel_t:s0 tclass=system

type=SYSCALL msg=audit(1529511344.593:160889): arch=x86_64 syscall=sendmsg success=yes exit=EPIPE a0=6 a1=c420034580 a2=0 a3=0 items=0 ppid=120895 pid=120919 auid=4294967295 uid=1000090000 gid=0 euid=1000090000 suid=1000090000 fsuid=1000090000 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=node_exporter exe=/usr/bin/node_exporter subj=system_u:system_r:container_t:s0:c0,c10 key=(null)

Hash: node_exporter,container_t,kernel_t,system,module_request
In iptables, the ports with the issue are jetdirect and jetcmeserver:

Chain OS_FIREWALL_ALLOW (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:2379
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:2380
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:pcsync-http
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:senomix02
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:senomix02
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:websm
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:10250
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetdirect
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:jetcmeserver
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:4789
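Side note (not part of the original output): the names above are resolved from /etc/services; listing the chain numerically shows the port numbers instead, e.g.

# iptables -L OS_FIREWALL_ALLOW -n

jetdirect corresponds to 9100/tcp, the node_exporter port.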
Reassigning to Paul Gier as this is about the old Prometheus tech preview.
I've checked in my local environment (all-in-one OpenShift 3.9 deployed with Ansible) and I'm also getting some SELinux errors.

I got the first AVC about "netlink_generic_socket Unknown". Looking at this upstream issue [1], it is caused by the wifi collector that probes the WiFi interfaces. For OpenShift installations there's no need for the wifi collector: we should be passing the "--no-collector.wifi" option on the node_exporter command line.

I don't see the second reported AVC, but given the kernel module name (net-pf-16-proto-16-family-nl80211), it also relates to the wifi collector, which is enabled by default. I've also spotted [2], which reports the same issue but for Fedora (where it is legitimate to have the wifi collector working).

Having said that, I'm not sure that this is the reason why Prometheus can't scrape the node-exporter targets, as those SELinux issues don't prevent node-exporter from starting. Noting that the problematic ports are 1936 and 9100, I would rather relate your problem to [3] and [4]. Are you sure that the required ports on the firewall are still open once the targets go down again?

[1] https://github.com/prometheus/node_exporter/issues/649
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1585415
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1563888
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1552235
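As a side note, one way to check whether an existing node-exporter deployment already passes that flag (the namespace and daemonset names below are assumptions, adjust them to your installation):

# oc -n openshift-metrics get daemonset prometheus-node-exporter -o yaml | grep -- '--no-collector.wifi'

If nothing is printed, the wifi collector is still enabled and AVCs like the ones above are expected.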
The firewall/iptables settings were gone when the targets went down. We had to reconfigure them, and then the node came up again.
@pk I suppose that for some reason the firewall rules that were allowing Prometheus to scrape the targets disappeared at some point. The firewall issues are already tracked here:

https://bugzilla.redhat.com/show_bug.cgi?id=1563888
https://bugzilla.redhat.com/show_bug.cgi?id=1552235

I propose that you follow those tickets to track their progress. This ticket will track only the SELinux issue.
https://github.com/openshift/openshift-ansible/pull/8914 has been merged upstream.

It has also been backported to the 3.9 release:
https://github.com/openshift/openshift-ansible/pull/9007

And it will be backported to 3.10 after the initial release:
https://github.com/openshift/openshift-ansible/pull/9006
The kubernetes-service-endpoints target is not down, and we did not see the following error in /var/log/audit/audit.log:

type=AVC msg=audit(1502978293.035:120770): avc: denied { create } for pid=1938 comm="node_exporter" scontext=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 tcontext=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 tclass=netlink_socket
type=SYSCALL msg=audit(1502978293.035:120770): arch=c000003e syscall=41 success=no exit=-13 a0=10 a1=3 a2=10 a3=0 items=0 ppid=1854 pid=1938 auid=4294967295 uid=992 gid=992 euid=992 suid=992 fsuid=992 egid=992 sgid=992 fsgid=992 tty=(none) ses=4294967295 comm="node_exporter" exe="/opt/gitlab/embedded/bin/node_exporter" subj=system_u:system_r:svirt_lxc_net_t:s0:c182,c991 key=(null)
type=NETFILTER_CFG msg=audit(1502978298.830:120771): table=filter family=2 entries=8

# rpm -qa | grep ansible
openshift-ansible-docs-3.11.0-0.25.0.git.0.7497e69.el7.noarch
openshift-ansible-roles-3.11.0-0.25.0.git.0.7497e69.el7.noarch
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-playbooks-3.11.0-0.25.0.git.0.7497e69.el7.noarch

image: prometheus-node-exporter-v3.11.0-0.25.0.0

# openshift version
openshift v3.11.0-0.25.0
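For completeness, a sketch of how the audit log can be checked for remaining node_exporter denials (these commands are a suggestion, not quoted from the verification run):

# ausearch -m avc -c node_exporter
# grep denied /var/log/audit/audit.log | grep node_exporter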
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652