Bug 1306805
Summary: | Metrics updates fail with 'closed network connection'
Product: | OpenShift Container Platform
Component: | Hawkular
Version: | 3.1.0
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Reporter: | Stefanie Forrester <dakini>
Assignee: | Matt Wringe <mwringe>
QA Contact: | chunchen <chunchen>
CC: | agrimm, aos-bugs, bleanhar, cryan, jokerman, mwringe, tdawson, wsun, xiazhao
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2016-05-12 16:28:45 UTC
Bug Blocks: | 1303130
Description
Stefanie Forrester
2016-02-11 18:13:51 UTC
This is actually looking like an issue with the backing storage. When I deploy metrics using USE_PERSISTENT_STORAGE=false (and self-signed certs), the connection issues between the metrics pods disappear, and metrics function normally. When deploying with EBS PV storage and self-signed certs, network connection errors appear in the logs.

From ops osevg cluster:

    E0211 16:16:36.042692 1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.113.205:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.113.205:9042] Operation timed out))

From ops hackathon cluster:

    E0211 16:15:17.083519 1 driver.go:234] Could not update tags: Put https://hawkular-metrics:443/hawkular/metrics/counters/ddtest%2F73c08667-d029-11e5-8a25-12a2c75626b3%2Fmemory%2Fmajor_page_faults/tags: net/http: request canceled while waiting for connection

In order to get EBS PVs to mount on the hawkular-cassandra pod, I have to edit the pvc by hand and remove the line 'ReadWriteMany'. That might have something to do with the problem.

I think this is an issue of the hard-coded value ReadWriteMany in the deployer:
https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41

This prevents usage with EBS Persistent Volumes, which are ReadWriteOnce. If I'm understanding this correctly, each cassandra pod will receive its own PV anyway, so there shouldn't be a need for multiple pods to write to the same PV, right? If that's the case, can we have ReadWriteMany removed from the deployer?

After re-creating my EBS PVs using ReadWriteMany, metrics now deploys successfully, even with the signed certs and custom --sink option in the heapster rc. Since EBS doesn't actually support RWM, this could cause some issues later on.
It would be a better long-term solution to remove ReadWriteMany from the PV requirements instead.

Fixed in our 3.2 containers.

Got the following error messages while testing with images built from https://github.com/openshift/origin-metrics using an NFS PV:

    oc get po
    NAME                         READY   STATUS             RESTARTS   AGE
    hawkular-cassandra-1-6q47v   0/1     Pending            0          22m
    hawkular-metrics-7yer2       0/1     CrashLoopBackOff   6          22m
    heapster-bxeod               0/1     CrashLoopBackOff   6          22m
    metrics-deployer-bnhx1       0/1     Completed          0          22m

    oc logs -f heapster-bxeod
    Endpoint Check in effect. Checking https://hawkular-metrics:443/hawkular/metrics/status
    Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 28. Status Code 000
    'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 28]. Retrying.

@mwringe Could you please help confirm whether the bug fix PR has been merged? I couldn't find it in the recently merged PR list:
https://github.com/openshift/origin-metrics/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged

When creating my NFS PV I used accessMode ReadWriteOnce, and I noticed that accessMode ReadWriteMany still exists in the metrics deployer. Will we address this issue by deleting it here?
https://github.com/openshift/origin-metrics/blob/354fc55eb4eeb140891b8240bce0f313833b1796/deployer/templates/hawkular-cassandra-node-pv.yaml#L41

I am in the process of setting up an environment with an EBS PV to try this, and will update you later.

Thanks, Xia

@mwringe Thanks a lot for the info. Tested with these OSE 3.2.0 images with an NFS PV: the original error message about "closed network connection" disappeared, and CPU and memory metrics are shown on the web console. Closing this issue as fixed.
Here are the images tested:

    openshift3/metrics-hawkular-metrics   d1fe5a5605da
    openshift3/metrics-cassandra          d01f8f782def
    openshift3/metrics-deployer           a7099fb4216c
    openshift3/metrics-heapster           341bad0bb73f

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2016:1064
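
For readers following the access-mode discussion in this report, the mismatch can be illustrated with a small sketch. It assumes the usual Kubernetes rule that a claim binds only to a volume whose access modes include every mode the claim requests. The function `can_bind` and the constant names are hypothetical, written for illustration only; the real binder also considers capacity and other fields that are ignored here.

```python
# Minimal sketch of access-mode matching between a claim (PVC) and a volume (PV).
# Assumption: a claim can bind only if the volume offers every mode it requests.
# `can_bind` and the constants below are hypothetical names, not a real API.
from typing import Iterable


def can_bind(claim_modes: Iterable[str], volume_modes: Iterable[str]) -> bool:
    """Return True if every access mode the claim requests is offered by the volume."""
    return set(claim_modes) <= set(volume_modes)


# Modes as described in the report.
DEPLOYER_CLAIM_MODES = ["ReadWriteMany"]   # value hard-coded in the deployer template
EDITED_CLAIM_MODES = ["ReadWriteOnce"]     # claim after removing ReadWriteMany by hand
EBS_PV_MODES = ["ReadWriteOnce"]           # what an EBS-backed PV normally declares
RELABELED_PV_MODES = ["ReadWriteMany"]     # PV re-created with ReadWriteMany as a workaround

if __name__ == "__main__":
    # The deployer's claim cannot bind to an ordinary EBS PV.
    print(can_bind(DEPLOYER_CLAIM_MODES, EBS_PV_MODES))
    # Editing the claim by hand makes the modes compatible.
    print(can_bind(EDITED_CLAIM_MODES, EBS_PV_MODES))
    # Relabeling the PV also satisfies the check, although EBS itself
    # still cannot be attached read-write to multiple nodes.
    print(can_bind(DEPLOYER_CLAIM_MODES, RELABELED_PV_MODES))
```

The third case shows why the relabeling workaround binds yet remains risky: the check only compares declared modes, not what the storage backend can actually do.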