Description of problem:
Installed prometheus and prometheus-node-exporter. The prometheus pod starts up:

NAME             READY   STATUS    RESTARTS   AGE
pod/prometheus-0   6/6     Running   0          1h

but the prometheus-node-exporter daemonset DESIRED number is 0, and no prometheus-node-exporter pod is created:

NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-node-exporter   0         0         0       0            0           <none>          1h

# oc get ds prometheus-node-exporter -o yaml
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: DaemonSet
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"extensions/v1beta1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app":"prometheus-node-exporter","role":"monitoring"},"name":"prometheus-node-exporter","namespace":"openshift-metrics"},"spec":{"template":{"metadata":{"labels":{"app":"prometheus-node-exporter","role":"monitoring"},"name":"prometheus-exporter"},"spec":{"containers":[{"args":["--no-collector.wifi"],"image":"registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11.0","name":"node-exporter","ports":[{"containerPort":9100,"name":"scrape"}],"resources":{"limits":{"cpu":"200m","memory":"50Mi"},"requests":{"cpu":"100m","memory":"30Mi"}},"volumeMounts":[{"mountPath":"/host/proc","name":"proc","readOnly":true},{"mountPath":"/host/sys","name":"sys","readOnly":true}]}],"hostNetwork":true,"hostPID":true,"serviceAccountName":"prometheus-node-exporter","volumes":[{"hostPath":{"path":"/proc"},"name":"proc"},{"hostPath":{"path":"/sys"},"name":"sys"}]}},"updateStrategy":{"type":"RollingUpdate"}}}
    creationTimestamp: 2018-07-25T08:02:23Z
    generation: 1
    labels:
      app: prometheus-node-exporter
      role: monitoring
    name: prometheus-node-exporter
    namespace: openshift-metrics
    resourceVersion: "61254"
    selfLink: /apis/extensions/v1beta1/namespaces/openshift-metrics/daemonsets/prometheus-node-exporter
    uid: 104edfb8-8fe1-11e8-8be9-42010af00006
  spec:
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: prometheus-node-exporter
        role: monitoring
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: prometheus-node-exporter
          role: monitoring
        name: prometheus-exporter
      spec:
        containers:
        - args:
          - --no-collector.wifi
          image: registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11.0
          imagePullPolicy: IfNotPresent
          name: node-exporter
          ports:
          - containerPort: 9100
            hostPort: 9100
            name: scrape
            protocol: TCP
          resources:
            limits:
              cpu: 200m
              memory: 50Mi
            requests:
              cpu: 100m
              memory: 30Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /host/proc
            name: proc
            readOnly: true
          - mountPath: /host/sys
            name: sys
            readOnly: true
        dnsPolicy: ClusterFirst
        hostNetwork: true
        hostPID: true
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: prometheus-node-exporter
        serviceAccountName: prometheus-node-exporter
        terminationGracePeriodSeconds: 30
        volumes:
        - hostPath:
            path: /proc
            type: ""
          name: proc
        - hostPath:
            path: /sys
            type: ""
          name: sys
    templateGeneration: 1
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
  status:
    currentNumberScheduled: 0
    desiredNumberScheduled: 0
    numberMisscheduled: 0
    numberReady: 0
    observedGeneration: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Version-Release number of selected component (if applicable):
# rpm -qa | grep openshift-ansible
openshift-ansible-roles-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-docs-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-playbooks-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install prometheus and prometheus-node-exporter

Actual results:
prometheus-node-exporter daemonset DESIRED number is 0

Expected results:
prometheus-node-exporter pods should be created

Additional info:
# parameter settings
openshift_prometheus_state=present
openshift_prometheus_node_exporter_install=true
openshift_prometheus_node_selector={'role': 'node'}
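A DaemonSet with DESIRED 0 usually means the controller found no node where the pod can run, either because no node matches a selector or because a node-level predicate (such as an already-occupied host port) fails everywhere. Some hedged first diagnostics, assuming the namespace and names above (the role=node label comes from the openshift_prometheus_node_selector setting):

Events on the DaemonSet often name the failing predicate:
# oc -n openshift-metrics describe ds prometheus-node-exporter

Check whether any node carries the label the node selector expects:
# oc get nodes -l role=node

Check whether another hostNetwork exporter is already running somewhere:
# oc get pods --all-namespaces -o wide | grep node-exporter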
Re-tested; prometheus-node-exporter pods could be created. Closing as WORKSFORME.

# oc get pod
NAME                             READY   STATUS    RESTARTS   AGE
prometheus-0                     6/6     Running   0          35m
prometheus-node-exporter-6t6sf   1/1     Running   0          34m
prometheus-node-exporter-d8zhc   1/1     Running   0          34m
prometheus-node-exporter-x9dcd   1/1     Running   0          34m

# oc get ds
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3       3            3           <none>          35m

# rpm -qa | grep ansible
openshift-ansible-docs-3.11.0-0.9.0.git.0.195bae3None.noarch
ansible-2.6.1-1.el7ae.noarch
openshift-ansible-roles-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-playbooks-3.11.0-0.9.0.git.0.195bae3None.noarch
prometheus-node-exporter-v3.11.0-0.9.0.0
The issue is reproduced again on GCE; the prometheus-node-exporter daemonset DESIRED number is 0 again.

# oc get po
NAME           READY   STATUS    RESTARTS   AGE
prometheus-0   6/6     Running   0          13m

# oc get ds
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   0         0         0       0            0           <none>          12m

Image: prometheus-node-exporter/images/v3.11.0-0.11.0.0

Re-opening it.
The reason is that there are node-exporter pods in the cluster monitoring namespace, openshift-monitoring:

# oc -n openshift-monitoring get pod | grep node-exporter
node-exporter-4ddmj   2/2   Running   0   5h
node-exporter-6r4z8   2/2   Running   0   5h
node-exporter-lsd4v   2/2   Running   0   5h
node-exporter-tnbbg   2/2   Running   0   5h

Their host port is 9100, and the node-exporter in the openshift-metrics namespace also uses host port 9100. Both daemonsets run with hostNetwork: true, so only one of them can bind the port on a given node, and the DaemonSet controller schedules zero pods for the second one (DESIRED stays 0). So if we deploy cluster monitoring first and then deploy prometheus under the openshift-metrics project, we hit this issue.
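A couple of hedged checks to confirm a host-port collision like this (the DaemonSet name node-exporter in openshift-monitoring is inferred from the pod names above):

Compare the host ports the two daemonsets request:
# oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.spec.template.spec.containers[*].ports[*].hostPort}{"\n"}'
# oc -n openshift-metrics get ds prometheus-node-exporter -o jsonpath='{.spec.template.spec.containers[*].ports[*].hostPort}{"\n"}'

On any node, see which process already owns 9100:
# netstat -tlnp | grep ':9100'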
PR to change the port of the legacy node_exporter install to 9101, to avoid the conflict with the newer monitoring stack:
https://github.com/openshift/openshift-ansible/pull/9706
The latest openshift-ansible is openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch, and it does not contain the fix; will verify this defect after a new openshift-ansible build comes out.
@Paul Will this fix be backported to 3.9 and 3.10?
AFAIK, this issue doesn't affect 3.9 or 3.10: the newer prometheus installer doesn't exist in 3.9, and in 3.10 it doesn't deploy node_exporter. So I wasn't planning to backport it. If you can reproduce the issue on the earlier versions, or you think it might be a problem in the future, I can backport it.
9101 is used by the newer prometheus node-exporter pod's kube-rbac-proxy container (it serves :9100 and proxies to node_exporter listening on 127.0.0.1:9101):

# oc describe pod node-exporter-pgqwj -n openshift-monitoring
*************************snipped************************
kube-rbac-proxy:
    Container ID:  docker://ab61986095d56418e0439d356d9bc450cea4a44583a50e55da370b685db48c6f
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy@sha256:db2eb1c395773ade1bae8b75b4b5127250c09ee44a163d8e651f9955badca00a
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --secure-listen-address=:9100
      --upstream=http://127.0.0.1:9101/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
*************************snipped************************

This prevents prometheus-node-exporter from starting; we need to use another port. 9102 is ok.

# oc get pod -n openshift-metrics
NAME                             READY   STATUS             RESTARTS   AGE
prometheus-0                     6/6     Running            0          10m
prometheus-node-exporter-jbmvg   0/1     CrashLoopBackOff   6          9m
prometheus-node-exporter-nqll9   0/1     CrashLoopBackOff   6          9m
prometheus-node-exporter-zvqch   0/1     CrashLoopBackOff   6          9m

********************************************************************************
# oc logs prometheus-node-exporter-jbmvg -n openshift-metrics
time="2018-08-24T01:47:14Z" level=info msg="Starting node_exporter (version=0.16.0, branch=, revision=)" source="node_exporter.go:82"
time="2018-08-24T01:47:14Z" level=info msg="Build context (go=go1.10.2, user=mockbuild.eng.bos.redhat.com, date=20180823-13:59:32)" source="node_exporter.go:83"
time="2018-08-24T01:47:14Z" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2018-08-24T01:47:14Z" level=info msg=" - arp" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bcache" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bonding" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - cpu" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - diskstats" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - edac" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - entropy" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filefd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filesystem" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - hwmon" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - infiniband" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - ipvs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - loadavg" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - mdadm" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - meminfo" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netdev" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfsd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - sockstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - stat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - textfile" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - time" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - timex" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - uname" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - xfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - zfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg="Listening on :9101" source="node_exporter.go:111"
time="2018-08-24T01:47:14Z" level=fatal msg="listen tcp :9101: bind: address already in use" source="node_exporter.go:114"
********************************************************************************

# netstat -anlp | grep 9101
tcp   0   0   127.0.0.1:9101    0.0.0.0:*         LISTEN        7927/node_exporter
tcp   0   0   127.0.0.1:49732   127.0.0.1:9101    ESTABLISHED   8309/kube-rbac-prox
tcp   0   0   127.0.0.1:9101    127.0.0.1:49732   ESTABLISHED   7927/node_exporter
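Before settling on a replacement port, it helps to survey what is already bound on a node in that range, and to check sidecar loopback upstreams as well as published hostPorts. A hedged sketch, run on a node (the DaemonSet name node-exporter in openshift-monitoring is inferred from the pod names above):

Listeners already claimed in the 9100-9109 range:
# ss -tlnp | awk '$4 ~ /:910[0-9]$/'

Loopback upstreams hidden in sidecar args (here, kube-rbac-proxy's --upstream on 9101):
# oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.spec.template.spec.containers[*].args}'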
Ok, sorry about that, I didn't realize 9101 was also being used. I have created a new PR to change the port to 9102 (https://github.com/openshift/openshift-ansible/pull/9749).
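For clusters waiting on that PR, a minimal manual workaround along the same lines would be to repoint the legacy DaemonSet at 9102. This is a hedged sketch, not the installer's actual change: it assumes the single-container DaemonSet layout shown in the description (whose only arg is --no-collector.wifi) and uses node_exporter 0.16's --web.listen-address flag; with hostNetwork: true, it is the listen address, not the hostPort field, that decides the bound port.

# oc -n openshift-metrics patch ds prometheus-node-exporter --type=json -p '[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--web.listen-address=:9102"},
  {"op":"replace","path":"/spec/template/spec/containers/0/ports/0/containerPort","value":9102},
  {"op":"replace","path":"/spec/template/spec/containers/0/ports/0/hostPort","value":9102}
]'

Any scrape configuration pointing at the old port would need the same update.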
Deployed cluster monitoring first, then deployed prometheus: the prometheus-node-exporter pods start, and they use port 9102 now.

# oc get pod -n openshift-monitoring
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            3/3     Running   0          14m
alertmanager-main-1                            3/3     Running   0          13m
alertmanager-main-2                            3/3     Running   0          13m
cluster-monitoring-operator-84cb5868d9-8ftvn   1/1     Running   0          26m
grafana-568fbc644d-86gs2                       2/2     Running   0          24m
kube-state-metrics-5fbc788767-tbsx9            3/3     Running   0          11m
node-exporter-6fqkf                            2/2     Running   0          12m
node-exporter-kdrnv                            2/2     Running   0          12m
node-exporter-s8gjz                            2/2     Running   0          12m
prometheus-k8s-0                               4/4     Running   4          20m
prometheus-k8s-1                               4/4     Running   0          17m
prometheus-operator-dd5d8897c-wgv5d            1/1     Running   0          26m

# oc get pod -n openshift-metrics
NAME                             READY   STATUS    RESTARTS   AGE
prometheus-0                     6/6     Running   0          9m
prometheus-node-exporter-52fzq   1/1     Running   0          9m
prometheus-node-exporter-fsfrv   1/1     Running   0          9m
prometheus-node-exporter-rgqxk   1/1     Running   0          9m

# oc logs prometheus-node-exporter-fsfrv -n openshift-metrics
******************snipped**********************
time="2018-08-28T03:05:31Z" level=info msg="Listening on :9102" source="node_exporter.go:111"

# oc get ds -n openshift-metrics
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3       3            3           <none>          10m

env:
# rpm -qa | grep ansible
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch
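Since the legacy exporter runs on the host network without a kube-rbac-proxy in front of it, a quick hedged spot-check from any node should show plain-HTTP metrics on the new port:

# curl -s http://localhost:9102/metrics | head -5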
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652