Bug 1608288
Summary: | prometheus-node-exporter daemonset DESIRED number is 0 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
Component: | Monitoring | Assignee: | Paul Gier <pgier> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.11.0 | CC: | aos-bugs, jforrest, pgier |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 3.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: Installing both the legacy Prometheus components (openshift-metrics) and the newer OpenShift cluster monitoring stack (openshift-monitoring) caused a port conflict between the node_exporter containers.
Consequence: This prevented the legacy node exporter daemonset from deploying correctly.
Fix: Changed the port used by the legacy node_exporter deployment from 9100 to 9102. (An initial change to 9101 also conflicted, because 9101 is the upstream port used by kube-rbac-proxy in the newer node-exporter pods.)
Result: Both Prometheus installations can now be performed successfully in a single cluster; however, it is recommended to use only the newer Prometheus installation method.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-10-11 07:22:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Junqi Zhao
2018-07-25 09:17:09 UTC
Closing as WORKSFORME. Re-tested, and the prometheus-node-exporter pods could be created:

# oc get pod
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     6/6       Running   0          35m
prometheus-node-exporter-6t6sf   1/1       Running   0          34m
prometheus-node-exporter-d8zhc   1/1       Running   0          34m
prometheus-node-exporter-x9dcd   1/1       Running   0          34m

# oc get ds
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3         3            3            <none>          35m

# rpm -qa | grep ansible
openshift-ansible-docs-3.11.0-0.9.0.git.0.195bae3None.noarch
ansible-2.6.1-1.el7ae.noarch
openshift-ansible-roles-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-3.11.0-0.9.0.git.0.195bae3None.noarch
openshift-ansible-playbooks-3.11.0-0.9.0.git.0.195bae3None.noarch

prometheus-node-exporter-v3.11.0-0.9.0.0

The issue is reproduced again on GCE; the prometheus-node-exporter daemonset DESIRED number is 0 again:

# oc get po
NAME           READY     STATUS    RESTARTS   AGE
prometheus-0   6/6       Running   0          13m

# oc get ds
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   0         0         0         0            0            <none>          12m

Image: prometheus-node-exporter/images/v3.11.0-0.11.0.0

Re-opening.

The reason is that there are already node-exporter pods in the cluster monitoring namespace, openshift-monitoring:

# oc -n openshift-monitoring get pod | grep node-exporter
node-exporter-4ddmj   2/2       Running   0          5h
node-exporter-6r4z8   2/2       Running   0          5h
node-exporter-lsd4v   2/2       Running   0          5h
node-exporter-tnbbg   2/2       Running   0          5h

Their port is 9100, and the node-exporter in the openshift-metrics namespace also uses port 9100. So if cluster monitoring is deployed first and Prometheus is then deployed under the openshift-metrics project, this issue appears.

PR to change the port of the legacy node_exporter install to 9101 to avoid the conflict with the newer monitoring: https://github.com/openshift/openshift-ansible/pull/9706

The latest openshift-ansible is openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch and it does not contain the fix; will verify this defect after the new openshift-ansible comes out.

@Paul, will this fix be backported to 3.9 and 3.10?

AFAIK, this issue doesn't affect 3.9 and 3.10, since the newer Prometheus installer doesn't exist in 3.9 and doesn't deploy node_exporter in 3.10, so I wasn't planning to backport it. If you can reproduce the issue in the earlier versions, or you think it might be a problem in the future, I can backport it.

Port 9101 is used by the newer Prometheus node-exporter pod's kube-rbac-proxy container:

# oc describe pod node-exporter-pgqwj -n openshift-monitoring
*************************snipped************************
kube-rbac-proxy:
    Container ID:  docker://ab61986095d56418e0439d356d9bc450cea4a44583a50e55da370b685db48c6f
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy@sha256:db2eb1c395773ade1bae8b75b4b5127250c09ee44a163d8e651f9955badca00a
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --secure-listen-address=:9100
      --upstream=http://127.0.0.1:9101/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
*************************snipped************************

This prevents prometheus-node-exporter from starting; another port is needed, and 9102 works.
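For reference, a minimal sketch of the newer node-exporter DaemonSet in openshift-monitoring, reconstructed from the describe output above. This is not the exact manifest created by the cluster-monitoring-operator; the node-exporter image reference, its listen flag, hostNetwork, and the selector/labels are assumptions consistent with what the bug shows (node_exporter bound to 127.0.0.1:9101, kube-rbac-proxy on host port 9100):

    # Sketch only: reconstructed from the `oc describe pod` output above,
    # not the exact DaemonSet generated by the cluster-monitoring-operator.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: openshift-monitoring
    spec:
      selector:
        matchLabels:
          app: node-exporter            # label/selector are illustrative
      template:
        metadata:
          labels:
            app: node-exporter
        spec:
          hostNetwork: true             # assumption, consistent with the host-level netstat output below
          containers:
          - name: node-exporter
            image: registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11   # illustrative image reference
            args:
            # assumption: the bug only shows that node_exporter ends up listening on 127.0.0.1:9101
            - --web.listen-address=127.0.0.1:9101
          - name: kube-rbac-proxy
            image: registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11
            args:                       # copied from the describe output above
            - --secure-listen-address=:9100
            - --upstream=http://127.0.0.1:9101/
            - --tls-cert-file=/etc/tls/private/tls.crt
            - --tls-private-key-file=/etc/tls/private/tls.key
            ports:
            - containerPort: 9100
              hostPort: 9100            # host port scraped by Prometheus over TLS

With this layout, ports 9100 and 9101 are both occupied on every node by the openshift-monitoring pod, which is why the legacy exporter can bind neither of them.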
# oc get pod -n openshift-metrics
NAME                             READY     STATUS             RESTARTS   AGE
prometheus-0                     6/6       Running            0          10m
prometheus-node-exporter-jbmvg   0/1       CrashLoopBackOff   6          9m
prometheus-node-exporter-nqll9   0/1       CrashLoopBackOff   6          9m
prometheus-node-exporter-zvqch   0/1       CrashLoopBackOff   6          9m
********************************************************************************
# oc logs prometheus-node-exporter-jbmvg -n openshift-metrics
time="2018-08-24T01:47:14Z" level=info msg="Starting node_exporter (version=0.16.0, branch=, revision=)" source="node_exporter.go:82"
time="2018-08-24T01:47:14Z" level=info msg="Build context (go=go1.10.2, user=mockbuild.eng.bos.redhat.com, date=20180823-13:59:32)" source="node_exporter.go:83"
time="2018-08-24T01:47:14Z" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2018-08-24T01:47:14Z" level=info msg=" - arp" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bcache" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - bonding" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - cpu" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - diskstats" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - edac" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - entropy" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filefd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - filesystem" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - hwmon" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - infiniband" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - ipvs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - loadavg" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - mdadm" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - meminfo" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netdev" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - netstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - nfsd" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - sockstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - stat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - textfile" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - time" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - timex" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - uname" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - xfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg=" - zfs" source="node_exporter.go:97"
time="2018-08-24T01:47:14Z" level=info msg="Listening on :9101" source="node_exporter.go:111"
time="2018-08-24T01:47:14Z" level=fatal msg="listen tcp :9101: bind: address already in use" source="node_exporter.go:114"
********************************************************************************
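A similarly minimal sketch of the legacy openshift-metrics daemonset once it is moved off the conflicting ports. The real template ships with openshift-ansible (see the PR linked below); the image reference, hostNetwork setting, and labels here are placeholders:

    # Sketch only: illustrates the port change for the legacy exporter; the actual
    # template is maintained in openshift-ansible (see the PR linked below).
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: prometheus-node-exporter
      namespace: openshift-metrics
    spec:
      selector:
        matchLabels:
          app: prometheus-node-exporter   # label/selector are illustrative
      template:
        metadata:
          labels:
            app: prometheus-node-exporter
        spec:
          hostNetwork: true               # assumption, matching the newer exporter
          containers:
          - name: node-exporter
            image: registry.reg-aws.openshift.com:443/openshift3/prometheus-node-exporter:v3.11   # placeholder image reference
            args:
            - --web.listen-address=:9102  # 9100 and 9101 are already taken by the openshift-monitoring pod
            ports:
            - containerPort: 9102
              hostPort: 9102

The verification output below shows the legacy exporter listening on :9102 once the port change is in place.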
# netstat -anlp | grep 9101
tcp        0      0 127.0.0.1:9101          0.0.0.0:*               LISTEN      7927/node_exporter
tcp        0      0 127.0.0.1:49732         127.0.0.1:9101          ESTABLISHED 8309/kube-rbac-prox
tcp        0      0 127.0.0.1:9101          127.0.0.1:49732         ESTABLISHED 7927/node_exporter

Ok, sorry about that, I didn't realize 9101 was also being used. I have created a new PR to change the port to 9102: https://github.com/openshift/openshift-ansible/pull/9749

Deployed cluster monitoring first and then deployed Prometheus; the prometheus-node-exporter pods can now be started, and they use port 9102:

# oc get pod -n openshift-monitoring
NAME                                           READY     STATUS    RESTARTS   AGE
alertmanager-main-0                            3/3       Running   0          14m
alertmanager-main-1                            3/3       Running   0          13m
alertmanager-main-2                            3/3       Running   0          13m
cluster-monitoring-operator-84cb5868d9-8ftvn   1/1       Running   0          26m
grafana-568fbc644d-86gs2                       2/2       Running   0          24m
kube-state-metrics-5fbc788767-tbsx9            3/3       Running   0          11m
node-exporter-6fqkf                            2/2       Running   0          12m
node-exporter-kdrnv                            2/2       Running   0          12m
node-exporter-s8gjz                            2/2       Running   0          12m
prometheus-k8s-0                               4/4       Running   4          20m
prometheus-k8s-1                               4/4       Running   0          17m
prometheus-operator-dd5d8897c-wgv5d            1/1       Running   0          26m

# oc get pod -n openshift-metrics
NAME                             READY     STATUS    RESTARTS   AGE
prometheus-0                     6/6       Running   0          9m
prometheus-node-exporter-52fzq   1/1       Running   0          9m
prometheus-node-exporter-fsfrv   1/1       Running   0          9m
prometheus-node-exporter-rgqxk   1/1       Running   0          9m

# oc logs prometheus-node-exporter-fsfrv -n openshift-metrics
******************snipped**********************
time="2018-08-28T03:05:31Z" level=info msg="Listening on :9102" source="node_exporter.go:111"

# oc get ds -n openshift-metrics
NAME                       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-node-exporter   3         3         3         3            3            <none>          10m

env:
# rpm -qa | grep ansible
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652