Description of problem:
When using an external hostname as the Hawkular Sink, the heapster pod repeatedly shows the following error:

E0211 13:06:36.080999 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection

The metrics URL appears to work, but none of the metrics are actually being updated in the web console.

https://metrics.hackathon.openshift.com/hawkular/metrics
https://console.hackathon.openshift.com/console/

Version-Release number of selected component (if applicable):
image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Deploy metrics using an SSL cert which does *not* contain the hostname 'hawkular-metrics'.
2. 'oc edit rc/heapster' and change --sink to an external hostname that does match the SSL cert, such as --sink=hawkular:https://metrics.hackathon.openshift.com:443
3. Create a pod and wait 5-10 minutes for metrics to update in the web console (under Browse -> Pods, select the running pod, then select the 'metrics' tab).

Actual results:
The metrics graphs remain blank and never update.

Expected results:
The metrics should begin to populate shortly after the pod begins running.
Additional info:

[root@hackathon-master-d8a69 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-lugoc   1/1       Running   0          25m
hawkular-metrics-1pc3b       1/1       Running   0          25m
heapster-3wkge               1/1       Running   0          6m

[root@hackathon-master-d8a69 ~]# oc logs heapster-3wkge
Starting Heapster with the following arguments: --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0210 18:16:27.520393 1 heapster.go:60] heapster --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0210 18:16:27.535092 1 heapster.go:61] Heapster version 0.18.0
I0210 18:16:27.535685 1 kube_factory.go:168] Using Kubernetes client with master "https://api.hackathon.openshift.com" and version "v1"
I0210 18:16:27.535712 1 kube_factory.go:169] Using kubelet port 10250
I0210 18:16:27.536299 1 driver.go:491] Initialised Hawkular Sink with parameters {_system https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) 0xc2081b2480 }
I0210 18:16:27.673404 1 heapster.go:71] Starting heapster on port 8082
E0210 18:17:07.540620 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:17:47.552054 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:18:27.562796 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:19:07.572175 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:19:47.590230 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:20:27.604911 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:21:07.614956 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:21:47.636795 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:22:27.646143 1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection

[root@hackathon-master-d8a69 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-lugoc   1/1       Running   0          18h
hawkular-metrics-1pc3b       1/1       Running   0          18h
heapster-b2zne               1/1       Running   0          3m

[root@hackathon-master-d8a69 ~]# oc get pods heapster-b2zne -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"heapster","uid":"7ca798c0-d048-11e5-98af-12e685e684e7","apiVersion":"v1","resourceVersion":"990497"}}
    openshift.io/scc: restricted
  creationTimestamp: 2016-02-11T16:53:29Z
  generateName: heapster-
  labels:
    metrics-infra: heapster
    name: heapster
  name: heapster-b2zne
  namespace: openshift-infra
  resourceVersion: "1167798"
  selfLink: /api/v1/namespaces/openshift-infra/pods/heapster-b2zne
  uid: f9dd250f-d0df-11e5-97de-12b12f3dccab
spec:
  containers:
  - command:
    - ./heapster-wrapper.sh
    - --wrapper.username_file=/hawkular-account/hawkular-metrics.username
    - --wrapper.password_file=/hawkular-account/hawkular-metrics.password
    - --wrapper.allowed_users_file=/secrets/heapster.allowed-users
    - --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
    - --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^/system.slice.*|^/user.slice)
    - --logtostderr=true
    - --tls_cert=/secrets/heapster.cert
    - --tls_key=/secrets/heapster.key
    - --tls_client_ca=/secrets/heapster.client-ca
    - --allowed_users=%allowed_users%
    image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
    imagePullPolicy: IfNotPresent
    name: heapster
    ports:
    - containerPort: 8082
      name: http-endpoint
      protocol: TCP
    resources: {}
    securityContext:
      privileged: false
      runAsUser: 1000010000
      seLinuxOptions:
        level: s0:c3,c2
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /secrets
      name: heapster-secrets
    - mountPath: /hawkular-cert
      name: hawkular-metrics-certificate
    - mountPath: /hawkular-account
      name: hawkular-metrics-account
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: heapster-token-5rtto
      readOnly: true
  dnsPolicy: ClusterFirst
  host: ip-172-31-56-194.ec2.internal
  imagePullSecrets:
  - name: heapster-dockercfg-xuyhq
  nodeName: ip-172-31-56-194.ec2.internal
  nodeSelector:
    type: compute
  restartPolicy: Always
  securityContext:
    seLinuxOptions:
      level: s0:c3,c2
  serviceAccount: heapster
  serviceAccountName: heapster
  terminationGracePeriodSeconds: 30
  volumes:
  - name: heapster-secrets
    secret:
      secretName: heapster-secrets
  - name: hawkular-metrics-certificate
    secret:
      secretName: hawkular-metrics-certificate
  - name: hawkular-metrics-account
    secret:
      secretName: hawkular-metrics-account
  - name: heapster-token-5rtto
    secret:
      secretName: heapster-token-5rtto
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-02-11T16:53:32Z
    status: "True"
    type: Ready
  containerStatuses:
  - containerID: docker://816b4e18d319f130b0c99611cf3158c38f4e5284b2ee0b0ea61ab361d84c31cf
    image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
    imageID: docker://800434e622032d5bf46a24da3f498e58eb89fd613ce57a42ce66d938dfe21abd
    lastState: {}
    name: heapster
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-02-11T16:53:32Z
  hostIP: 172.31.56.194
  phase: Running
  podIP: 10.1.13.34
  startTime: 2016-02-11T16:53:29Z

[root@hackathon-master-d8a69 ~]# host metrics.hackathon.openshift.com
metrics.hackathon.openshift.com is an alias for hackathon-infra-450595167.us-east-1.elb.amazonaws.com.
hackathon-infra-450595167.us-east-1.elb.amazonaws.com has address 52.22.172.105
hackathon-infra-450595167.us-east-1.elb.amazonaws.com has address 52.73.52.6
This is actually looking like an issue with the backing storage. When I deploy metrics using USE_PERSISTENT_STORAGE=false (and self-signed certs), the connection issues between the metrics pods disappear, and metrics function normally. When deploying with EBS PV storage and self-signed certs, network connection errors appear in the logs.

From ops osevg cluster:
E0211 16:16:36.042692 1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.113.205:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.113.205:9042] Operation timed out))

From ops hackathon cluster:
E0211 16:15:17.083519 1 driver.go:234] Could not update tags: Put https://hawkular-metrics:443/hawkular/metrics/counters/ddtest%2F73c08667-d029-11e5-8a25-12a2c75626b3%2Fmemory%2Fmajor_page_faults/tags: net/http: request canceled while waiting for connection

In order to get EBS PVs to mount on the hawkular-cassandra pod, I have to edit the pvc by hand and remove the line 'ReadWriteMany'. That might have something to do with the problem.
I think this is an issue of the hard-coded value ReadWriteMany in the deployer: https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41 This prevents usage with EBS Persistent Volumes, which are ReadWriteOnce. If I'm understanding this correctly, each cassandra pod will receive its own PV anyway, so there shouldn't be a need for multiple pods to write to the same PV, right? If that's the case, can we have ReadWriteMany removed from the deployer?
After re-creating my EBS PVs using ReadWriteMany, metrics now deploys successfully, even with the signed certs and custom --sink option in the heapster rc. Since EBS doesn't actually support RWM, this could cause some issues later on. It would be a better long-term solution to remove ReadWriteMany from the PV requirements instead.
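For illustration only, here is a minimal sketch of what a ReadWriteOnce volume and claim pair for EBS might look like. It is not taken from the deployer templates: the object names, the 10Gi size, the fsType, and the volumeID are all placeholders, and it assumes one PV per Cassandra pod as discussed above.

```yaml
# Hypothetical EBS-backed PV; the name, size, fsType and volumeID are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-cassandra-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce            # the access mode EBS volumes support
  awsElasticBlockStore:
    volumeID: vol-xxxxxxxx   # placeholder, substitute a real EBS volume ID
    fsType: ext4
---
# Hypothetical claim that would bind to the PV above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-cassandra-claim
spec:
  accessModes:
  - ReadWriteOnce            # requested mode must be offered by the PV for binding
  resources:
    requests:
      storage: 10Gi
```

A claim only binds to a volume that offers the access mode it requests, so a claim asking for ReadWriteMany would not be expected to bind to a PV declared as above.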
Fixed in our 3.2 containers.
Got the following error messages while testing with images built from https://github.com/openshift/origin-metrics using an NFS PV:

oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6q47v   0/1       Pending            0          22m
hawkular-metrics-7yer2       0/1       CrashLoopBackOff   6          22m
heapster-bxeod               0/1       CrashLoopBackOff   6          22m
metrics-deployer-bnhx1       0/1       Completed          0          22m

oc logs -f heapster-bxeod
Endpoint Check in effect. Checking https://hawkular-metrics:443/hawkular/metrics/status
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 28. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 28]. Retrying.

@mwringe Could you please help confirm whether the bug fix PR has been merged? I didn't find it in the recently merged PR list:
https://github.com/openshift/origin-metrics/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged

When creating my NFS PV I used accessMode ReadWriteOnce, and I noticed that accessMode ReadWriteMany still exists in the metrics deployer. Will we address this issue by deleting it here?
https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41

I am in the process of setting up an environment with an EBS PV to try this, and will update you later.

Thanks,
Xia
@mwringe Thanks a lot for the info. I tested with these OSE 3.2.0 images using an NFS PV. The original error message about "closed network connection" no longer appears, and CPU and memory metrics are shown in the web console. Closing this issue as fixed.

Here are the images tested:
openshift3/metrics-hawkular-metrics d1fe5a5605da
openshift3/metrics-cassandra d01f8f782def
openshift3/metrics-deployer a7099fb4216c
openshift3/metrics-heapster 341bad0bb73f
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064