1306805 – Metrics updates fail with 'closed network connection'

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Matt Wringe
QA Contact:	chunchen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	OSOPS_V3

Reported:	2016-02-11 18:13 UTC by Stefanie Forrester
Modified:	2016-09-30 02:16 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-05-12 16:28:45 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2016:1064	0	normal	SHIPPED_LIVE	Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update	2016-05-12 20:19:17 UTC

Description Stefanie Forrester 2016-02-11 18:13:51 UTC

Description of problem:

When using an external hostname as the Hawkular Sink, the heapster pod repeatedly shows the following error:

E0211 13:06:36.080999       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection

The metrics URL appears to work, but none of the metrics are actually being updated in the web console.

https://metrics.hackathon.openshift.com/hawkular/metrics

https://console.hackathon.openshift.com/console/

Version-Release number of selected component (if applicable):
image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Deploy metrics using an SSL cert which does *not* contain the hostname 'hawkular-metrics'.
2. Run 'oc edit rc/heapster' and change --sink to an external hostname that does match the SSL cert, such as --sink=hawkular:https://metrics.hackathon.openshift.com:443
3. Create a pod and wait 5-10 minutes for metrics to update in the web console (under Browse -> Pods, select the running pod, then select the 'metrics' tab).
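
The --sink value from step 2 is a sink name followed by a URL whose query string carries the sink options. The sketch below is not part of the original report and does not reproduce heapster's own parsing; it only splits such a value with the Python standard library so the hostname that has to match the SSL cert is easy to check. The function name parse_sink and the shortened SINK sample are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def parse_sink(sink):
    """Split a '<name>:<url>' sink value into name, host, port and options."""
    # Everything before the first ':' is the sink name (e.g. 'hawkular').
    name, _, url = sink.partition(":")
    parts = urlsplit(url)
    # Keep only the first value of each query parameter.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "sink": name,
        "host": parts.hostname,
        "port": parts.port,
        "options": options,
    }

# Shortened sample based on the --sink argument shown in this report.
SINK = (
    "hawkular:https://metrics.hackathon.openshift.com:443"
    "?tenant=_system&labelToTenant=pod_namespace"
    "&caCert=/hawkular-cert/hawkular-metrics-ca.certificate"
)

if __name__ == "__main__":
    print(parse_sink(SINK))
```

The "host" entry is the name to compare against the names in the certificate.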

Actual results:
The metrics graphs remain blank and never update.

Expected results:
The metrics should begin to populate shortly after the pod begins running.

Additional info:


[root@hackathon-master-d8a69 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-lugoc   1/1       Running   0          25m
hawkular-metrics-1pc3b       1/1       Running   0          25m
heapster-3wkge               1/1       Running   0          6m

[root@hackathon-master-d8a69 ~]# oc logs heapster-3wkge
Starting Heapster with the following arguments: --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0210 18:16:27.520393       1 heapster.go:60] heapster --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0210 18:16:27.535092       1 heapster.go:61] Heapster version 0.18.0
I0210 18:16:27.535685       1 kube_factory.go:168] Using Kubernetes client with master "https://api.hackathon.openshift.com" and version "v1"
I0210 18:16:27.535712       1 kube_factory.go:169] Using kubelet port 10250
I0210 18:16:27.536299       1 driver.go:491] Initialised Hawkular Sink with parameters {_system https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=dZ4tTFhzhmu_6NB&filter=label(container_name:^/system.slice.*|^/user.slice) 0xc2081b2480 }
I0210 18:16:27.673404       1 heapster.go:71] Starting heapster on port 8082
E0210 18:17:07.540620       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:17:47.552054       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:18:27.562796       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:19:07.572175       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:19:47.590230       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:20:27.604911       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:21:07.614956       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:21:47.636795       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:22:27.646143       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.73.52.6:443: use of closed network connection
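
The errors above alternate between two remote addresses and two endpoints (gauges and counters). A short script can tally them from a saved copy of the log. This is a minimal sketch that is not part of the original report; the regular expression is an assumption based only on the log lines shown here, and the names ERROR_RE, tally_closed_connections and SAMPLE_LOG are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the heapster output in this report:
# E<mmdd> <time> <pid> driver.go:<line>] Post <url>: read tcp <ip:port>: use of closed network connection
ERROR_RE = re.compile(
    r"^E\d{4} \S+\s+\d+ driver\.go:\d+\] Post (?P<url>\S+): "
    r"read tcp (?P<addr>[\d.]+:\d+): use of closed network connection"
)

def tally_closed_connections(log_text):
    """Count 'closed network connection' errors per remote address and per endpoint."""
    by_addr = Counter()
    by_endpoint = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if not match:
            continue
        by_addr[match.group("addr")] += 1
        # The URL ends in /<endpoint>/data, e.g. /gauges/data or /counters/data.
        by_endpoint[match.group("url").rsplit("/", 2)[-2]] += 1
    return by_addr, by_endpoint

# Three lines copied from the log above.
SAMPLE_LOG = """
E0210 18:17:07.540620       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.73.52.6:443: use of closed network connection
E0210 18:17:47.552054       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/gauges/data: read tcp 52.22.172.105:443: use of closed network connection
E0210 18:18:27.562796       1 driver.go:311] Post https://metrics.hackathon.openshift.com:443/hawkular/metrics/counters/data: read tcp 52.22.172.105:443: use of closed network connection
"""

if __name__ == "__main__":
    addrs, endpoints = tally_closed_connections(SAMPLE_LOG)
    print(dict(addrs))
    print(dict(endpoints))
```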



[root@hackathon-master-d8a69 ~]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-lugoc   1/1       Running   0          18h
hawkular-metrics-1pc3b       1/1       Running   0          18h
heapster-b2zne               1/1       Running   0          3m

[root@hackathon-master-d8a69 ~]# oc get pods heapster-b2zne -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"openshift-infra","name":"heapster","uid":"7ca798c0-d048-11e5-98af-12e685e684e7","apiVersion":"v1","resourceVersion":"990497"}}
    openshift.io/scc: restricted
  creationTimestamp: 2016-02-11T16:53:29Z
  generateName: heapster-
  labels:
    metrics-infra: heapster
    name: heapster
  name: heapster-b2zne
  namespace: openshift-infra
  resourceVersion: "1167798"
  selfLink: /api/v1/namespaces/openshift-infra/pods/heapster-b2zne
  uid: f9dd250f-d0df-11e5-97de-12b12f3dccab
spec:
  containers:
  - command:
    - ./heapster-wrapper.sh
    - --wrapper.username_file=/hawkular-account/hawkular-metrics.username
    - --wrapper.password_file=/hawkular-account/hawkular-metrics.password
    - --wrapper.allowed_users_file=/secrets/heapster.allowed-users
    - --source=kubernetes:https://api.hackathon.openshift.com?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250
    - --sink=hawkular:https://metrics.hackathon.openshift.com:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^/system.slice.*|^/user.slice)
    - --logtostderr=true
    - --tls_cert=/secrets/heapster.cert
    - --tls_key=/secrets/heapster.key
    - --tls_client_ca=/secrets/heapster.client-ca
    - --allowed_users=%allowed_users%
    image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
    imagePullPolicy: IfNotPresent
    name: heapster
    ports:
    - containerPort: 8082
      name: http-endpoint
      protocol: TCP
    resources: {}
    securityContext:
      privileged: false
      runAsUser: 1000010000
      seLinuxOptions:
        level: s0:c3,c2
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /secrets
      name: heapster-secrets
    - mountPath: /hawkular-cert
      name: hawkular-metrics-certificate
    - mountPath: /hawkular-account
      name: hawkular-metrics-account
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: heapster-token-5rtto
      readOnly: true
  dnsPolicy: ClusterFirst
  host: ip-172-31-56-194.ec2.internal
  imagePullSecrets:
  - name: heapster-dockercfg-xuyhq
  nodeName: ip-172-31-56-194.ec2.internal
  nodeSelector:
    type: compute
  restartPolicy: Always
  securityContext:
    seLinuxOptions:
      level: s0:c3,c2
  serviceAccount: heapster
  serviceAccountName: heapster
  terminationGracePeriodSeconds: 30
  volumes:
  - name: heapster-secrets
    secret:
      secretName: heapster-secrets
  - name: hawkular-metrics-certificate
    secret:
      secretName: hawkular-metrics-certificate
  - name: hawkular-metrics-account
    secret:
      secretName: hawkular-metrics-account
  - name: heapster-token-5rtto
    secret:
      secretName: heapster-token-5rtto
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-02-11T16:53:32Z
    status: "True"
    type: Ready
  containerStatuses:
  - containerID: docker://816b4e18d319f130b0c99611cf3158c38f4e5284b2ee0b0ea61ab361d84c31cf
    image: registry.access.redhat.com/openshift3/metrics-heapster:3.1.0
    imageID: docker://800434e622032d5bf46a24da3f498e58eb89fd613ce57a42ce66d938dfe21abd
    lastState: {}
    name: heapster
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-02-11T16:53:32Z
  hostIP: 172.31.56.194
  phase: Running
  podIP: 10.1.13.34
  startTime: 2016-02-11T16:53:29Z


[root@hackathon-master-d8a69 ~]# host metrics.hackathon.openshift.com
metrics.hackathon.openshift.com is an alias for hackathon-infra-450595167.us-east-1.elb.amazonaws.com.
hackathon-infra-450595167.us-east-1.elb.amazonaws.com has address 52.22.172.105
hackathon-infra-450595167.us-east-1.elb.amazonaws.com has address 52.73.52.6

Comment 1 Stefanie Forrester 2016-02-12 17:56:55 UTC

This is actually looking like an issue with the backing storage. When I deploy metrics using USE_PERSISTENT_STORAGE=false (and self-signed certs), the connection issues between the metrics pods disappear, and metrics function normally. When deploying with EBS PV storage and self-signed certs, network connection errors appear in the logs.

From ops osevg cluster:

E0211 16:16:36.042692       1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.113.205:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.113.205:9042] Operation timed out))

From ops hackathon cluster:

E0211 16:15:17.083519       1 driver.go:234] Could not update tags: Put https://hawkular-metrics:443/hawkular/metrics/counters/ddtest%2F73c08667-d029-11e5-8a25-12a2c75626b3%2Fmemory%2Fmajor_page_faults/tags: net/http: request canceled while waiting for connection

In order to get EBS PVs to mount on the hawkular-cassandra pod, I have to edit the pvc by hand and remove the line 'ReadWriteMany'. That might have something to do with the problem.

Comment 2 Stefanie Forrester 2016-02-12 18:53:45 UTC

I think this is an issue of the hard-coded value ReadWriteMany in the deployer:

https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41

This prevents usage with EBS Persistent Volumes, which are ReadWriteOnce. If I'm understanding this correctly, each cassandra pod will receive its own PV anyway, so there shouldn't be a need for multiple pods to write to the same PV, right? If that's the case, can we have ReadWriteMany removed from the deployer?
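
The mismatch can be illustrated with a simplified model of the access-mode part of claim binding: a claim can only bind to a volume that offers every access mode the claim requests. The sketch below is an illustration added for clarity, not the Kubernetes implementation; it ignores capacity and every other matching criterion, and the function name can_bind is hypothetical.

```python
def can_bind(claim_modes, volume_modes):
    """Return True if the volume offers every access mode the claim requests.

    Simplified model: only access modes are compared.
    """
    return set(claim_modes) <= set(volume_modes)

if __name__ == "__main__":
    # Deployer claim (ReadWriteMany) against an EBS volume declared ReadWriteOnce.
    print(can_bind(["ReadWriteMany"], ["ReadWriteOnce"]))
    # Claim edited by hand to request ReadWriteOnce instead.
    print(can_bind(["ReadWriteOnce"], ["ReadWriteOnce"]))
```

Under this model the claim generated by the deployer never matches a volume declared as ReadWriteOnce, which is consistent with the claim having to be edited by hand before the EBS PV would mount.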

Comment 3 Stefanie Forrester 2016-02-16 15:26:20 UTC

After re-creating my EBS PVs using ReadWriteMany, metrics now deploys successfully, even with the signed certs and custom --sink option in the heapster rc. Since EBS doesn't actually support ReadWriteMany, this could cause some issues later on. It would be a better long-term solution to remove ReadWriteMany from the PV requirements instead.

Comment 4 Matt Wringe 2016-02-26 16:34:40 UTC

Fixed in our 3.2 containers.

Comment 5 Xia Zhao 2016-02-29 13:03:48 UTC

Got the following error messages while testing with images built from https://github.com/openshift/origin-metrics, using an NFS PV:

oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6q47v   0/1       Pending            0          22m
hawkular-metrics-7yer2       0/1       CrashLoopBackOff   6          22m
heapster-bxeod               0/1       CrashLoopBackOff   6          22m
metrics-deployer-bnhx1       0/1       Completed          0          22m

oc logs -f heapster-bxeod
Endpoint Check in effect. Checking https://hawkular-metrics:443/hawkular/metrics/status
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 28. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 28]. Retrying.

@mwringe Could you please confirm whether the bug fix PR has been merged? I couldn't find it in the list of recently merged PRs: https://github.com/openshift/origin-metrics/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged

When creating my NFS PV I used accessMode ReadWriteOnce, and I noticed that accessMode ReadWriteMany still exists in the metrics deployer. Will this issue be addressed by removing it here? https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41

I am in the process of setting up an environment with an EBS PV to try this as well, and will update you later.

Thanks,
Xia

Comment 7 Xia Zhao 2016-03-01 03:11:33 UTC

@mwringe Thanks a lot for the info. I tested with these OSE 3.2.0 images using an NFS PV: the original "closed network connection" error no longer appears, and CPU and memory metrics are shown in the web console. Closing this issue as fixed.

Here are the images tested:
openshift3/metrics-hawkular-metrics    d1fe5a5605da
openshift3/metrics-cassandra    d01f8f782def
openshift3/metrics-deployer    a7099fb4216c
openshift3/metrics-heapster    341bad0bb73f

Comment 9 errata-xmlrpc 2016-05-12 16:28:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
