Bug 1608288
Summary: | prometheus-node-exporter daemonset DESIRED number is 0 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
Component: | Monitoring | Assignee: | Paul Gier <pgier> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.11.0 | CC: | aos-bugs, jforrest, pgier |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 3.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: Installing both the legacy Prometheus components (openshift-metrics) and the newer OpenShift cluster monitoring stack (openshift-monitoring) caused a port conflict between the node_exporter containers.
Consequence: This prevented the legacy node exporter daemonset from deploying correctly.
Fix: Changed the port used by the legacy node_exporter deployment from 9100 to 9102. (An initial change to 9101 also conflicted, because 9101 is the upstream port used by kube-rbac-proxy in the newer node-exporter pods.)
Result: Both Prometheus installations can now be performed successfully in a single cluster; however, it is recommended to use only the newer Prometheus installation method.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-10-11 07:22:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Junqi Zhao
2018-07-25 09:17:09 UTC
Closing as WORKSFORME. Re-tested, and the prometheus-node-exporter pods could be created:

# oc get pod
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     6/6       Running   0          35m
prometheus-node-exporter-6t6sf   1/1       Running   0          34m
prometheus-node-exporter-d8zhc   1/1       Running   0          34m
prometheus-node-exporter-x9dcd   1/1       Running   0          34m

# oc get ds
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3         3            3            <none>          35m

# rpm -qa | grep ansible
openshift-ansible-docs-3.11.0-0.9.0.git.0.195bae3None.noarch
ansible-2.6.1-1.el7ae.noarch
openshift-ansible-roles-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-playbooks-3.11.0-0.9.0.git.0.195bae3None.noarch

prometheus-node-exporter-v3.11.0-0.9.0.0

The issue is reproduced again on GCE; the prometheus-node-exporter daemonset DESIRED number is 0 again:

# oc get po
NAME           READY     STATUS    RESTARTS   AGE
prometheus-0   6/6       Running   0          13m

# oc get ds
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   0         0         0         0            0            <none>          12m

Image: prometheus-node-exporter/images/v3.11.0-0.11.0.0

Re-opening.

The reason is that there are already node-exporter pods in the cluster monitoring namespace, openshift-monitoring:

# oc -n openshift-monitoring get pod | grep node-exporter
node-exporter-4ddmj   2/2       Running   0          5h
node-exporter-6r4z8   2/2       Running   0          5h
node-exporter-lsd4v   2/2       Running   0          5h
node-exporter-tnbbg   2/2       Running   0          5h

Their port is 9100, and the node-exporter in the openshift-metrics namespace also uses port 9100. So if cluster monitoring is deployed first and Prometheus is then deployed under the openshift-metrics project, this issue appears.

PR to change the port of the legacy node_exporter install to 9101 to avoid the conflict with the newer monitoring: https://github.com/openshift/openshift-ansible/pull/9706

The latest openshift-ansible is openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch and it does not contain the fix; will verify this defect after the new openshift-ansible comes out.

@Paul, will this fix be backported to 3.9 and 3.10?

AFAIK, this issue doesn't affect 3.9 and 3.10, since the newer Prometheus installer doesn't exist in 3.9 and doesn't deploy node_exporter in 3.10, so I wasn't planning to backport it. If you can reproduce the issue in the earlier versions, or you think it might be a problem in the future, I can backport it.

Port 9101 is used by the newer Prometheus node-exporter pod's kube-rbac-proxy container:

# oc describe pod node-exporter-pgqwj -n openshift-monitoring
*************************snipped************************
kube-rbac-proxy:
    Container ID:  docker://ab61986095d56418e0439d356d9bc450cea4a44583a50e55da370b685db48c6f
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy@sha256:db2eb1c395773ade1bae8b75b4b5127250c09ee44a163d8e651f9955badca00a
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --secure-listen-address=:9100
      --upstream=http://127.0.0.1:9101/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
*************************snipped************************

This prevents prometheus-node-exporter from starting; another port is needed, and 9102 works.
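For reference, a minimal sketch of the newer node-exporter DaemonSet in openshift-monitoring, reconstructed from the describe output above. This is not the exact manifest created by the cluster-monitoring-operator; the node-exporter image reference, its listen flag, hostNetwork, and the selector/labels are assumptions consistent with what the bug shows (node_exporter bound to 127.0.0.1:9101, kube-rbac-proxy on host port 9100):

    # Sketch only: reconstructed from the `oc describe pod` output above,
    # not the exact DaemonSet generated by the cluster-monitoring-operator.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: openshift-monitoring
    spec:
      selector:
        matchLabels:
          app: node-exporter            # label/selector are illustrative
      template:
        metadata:
          labels:
            app: node-exporter
        spec:
          hostNetwork: true             # assumption, consistent with the host-level netstat output below
          containers:
          - name: node-exporter
            image: registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11   # illustrative image reference
            args:
            # assumption: the bug only shows that node_exporter ends up listening on 127.0.0.1:9101
            - --web.listen-address=127.0.0.1:9101
          - name: kube-rbac-proxy
            image: registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11
            args:                       # copied from the describe output above
            - --secure-listen-address=:9100
            - --upstream=http://127.0.0.1:9101/
            - --tls-cert-file=/etc/tls/private/tls.crt
            - --tls-private-key-file=/etc/tls/private/tls.key
            ports:
            - containerPort: 9100
              hostPort: 9100            # host port scraped by Prometheus over TLS

With this layout, ports 9100 and 9101 are both occupied on every node by the openshift-monitoring pod, which is why the legacy exporter can bind neither of them.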
# oc get pod -n openshift-metrics
NAME                             READY     STATUS             RESTARTS   AGE
prometheus-0                     6/6       Running            0          10m
prometheus-node-exporter-jbmvg   0/1       CrashLoopBackOff   6          9m
prometheus-node-exporter-nqll9   0/1       CrashLoopBackOff   6          9m
prometheus-node-exporter-zvqch   0/1       CrashLoopBackOff   6          9m
********************************************************************************
# oc logs prometheus-node-exporter-jbmvg -n openshift-metrics
time="2018-08-24T01:47:14Z" level=info msg="Starting node_exporter (version=0.16.0, branch=, revision=)" source="node_exporter.go:82"
time="2018-08-24T01:47:14Z" level=info msg="Build context (go=go1.10.2, user=mockbuild.eng.bos.redhat.com, date=20180823-13:59:32)" source="node_exporter.go:83"
time="2018-08-24T01:47:14Z" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2018-08-24T01:47:14Z" level=info msg=" - arp" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bcache" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bonding" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - cpu" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - diskstats" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - edac" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - entropy" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filefd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filesystem" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - hwmon" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - infiniband" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - ipvs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - loadavg" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - mdadm" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - meminfo" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netdev" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfsd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - sockstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - stat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - textfile" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - time" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - timex" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - uname" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - xfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - zfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg="Listening on :9101" source="node_exporter.go:111"
time="2018-08-24T01:47:14Z" level=fatal msg="listen tcp :9101: bind: address already in use" source="node_exporter.go:114"
********************************************************************************
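A similarly minimal sketch of the legacy openshift-metrics daemonset once it is moved off the conflicting ports. The real template ships with openshift-ansible (see the PR linked below); the image reference, hostNetwork setting, and labels here are placeholders:

    # Sketch only: illustrates the port change for the legacy exporter; the actual
    # template is maintained in openshift-ansible (see the PR linked below).
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: prometheus-node-exporter
      namespace: openshift-metrics
    spec:
      selector:
        matchLabels:
          app: prometheus-node-exporter   # label/selector are illustrative
      template:
        metadata:
          labels:
            app: prometheus-node-exporter
        spec:
          hostNetwork: true               # assumption, matching the newer exporter
          containers:
          - name: node-exporter
            image: registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11   # placeholder image reference
            args:
            - --web.listen-address=:9102  # 9100 and 9101 are already taken by the openshift-monitoring pod
            ports:
            - containerPort: 9102
              hostPort: 9102

The verification output below shows the legacy exporter listening on :9102 once the port change is in place.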
# netstat -anlp | grep 9101
tcp        0      0 127.0.0.1:9101          0.0.0.0:*               LISTEN      7927/node_exporter
tcp        0      0 127.0.0.1:49732         127.0.0.1:9101          ESTABLISHED 8309/kube-rbac-prox
tcp        0      0 127.0.0.1:9101          127.0.0.1:49732         ESTABLISHED 7927/node_exporter

Ok, sorry about that, I didn't realize 9101 was also being used. I have created a new PR to change the port to 9102: https://github.com/openshift/openshift-ansible/pull/9749

Deployed cluster monitoring first and then deployed Prometheus; the prometheus-node-exporter pods can now be started, and they use port 9102:

# oc get pod -n openshift-monitoring
NAME                                           READY     STATUS    RESTARTS   AGE
alertmanager-main-0                            3/3       Running   0          14m
alertmanager-main-1                            3/3       Running   0          13m
alertmanager-main-2                            3/3       Running   0          13m
cluster-monitoring-operator-84cb5868d9-8ftvn   1/1       Running   0          26m
grafana-568fbc644d-86gs2                       2/2       Running   0          24m
kube-state-metrics-5fbc788767-tbsx9            3/3       Running   0          11m
node-exporter-6fqkf                            2/2       Running   0          12m
node-exporter-kdrnv                            2/2       Running   0          12m
node-exporter-s8gjz                            2/2       Running   0          12m
prometheus-k8s-0                               4/4       Running   4          20m
prometheus-k8s-1                               4/4       Running   0          17m
prometheus-operator-dd5d8897c-wgv5d            1/1       Running   0          26m

# oc get pod -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     6/6       Running   0          9m
prometheus-node-exporter-52fzq   1/1       Running   0          9m
prometheus-node-exporter-fsfrv   1/1       Running   0          9m
prometheus-node-exporter-rgqxk   1/1       Running   0          9m

# oc logs prometheus-node-exporter-fsfrv -n openshift-metrics
******************snipped**********************
time="2018-08-28T03:05:31Z" level=info msg="Listening on :9102" source="node_exporter.go:111"

# oc get ds -n openshift-metrics
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3         3            3            <none>          10m

env:
# rpm -qa | grep ansible
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652