| Summary: | Metrics only captured briefly. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Christophe Augello <caugello> |
| Component: | Hawkular | Assignee: | Matt Wringe <mwringe> |
| Status: | CLOSED NOTABUG | QA Contact: | chunchen <chunchen> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.1.0 | CC: | aos-bugs, caugello, erich, pep, rhowe, shilpa.padgaonkar, wsun, xiazhao |
| Target Milestone: | --- | | |
| Target Release: | 3.2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-17 20:22:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1267746 | | |
| Attachments: | | | |
Description
Christophe Augello
2016-04-15 11:47:25 UTC
The issue is not reproducible with the steps given; this is the standard installation procedure that we use for testing internally and we do not see this issue. A few follow-up questions:

1) Is there anything obvious in the logs for Heapster, Hawkular Metrics, or Cassandra?
2) Are the containers stable, or are they restarting after these 10-15 minutes?
3) If you scale the Heapster pod to 0 and then back to 1, do metrics start to appear in the console again? Or are they still blank even after a restart?

If after following step 3 the metrics start to appear again in the console, then it could be an issue with the Heapster pod requiring more resources than what the current node has available. If this is the case, can they place the Heapster pod on its own node and see if that resolves the issue?

1) Is there anything obvious in the logs for Heapster, Hawkular Metrics, or Cassandra? No, the logs look clean.
2) Are the containers stable, or are they restarting after these 10-15 minutes? Yes, no restarts from any of the three.
3) If you scale the Heapster pod to 0 and then back to 1, do metrics start to appear in the console again? Yes, tried that; it does not change anything.

We have 2 environments currently, both installed in exactly the same way, and the version is provided below:

    oc version
    oc v3.1.1.6-21-gcd70c35
    kubernetes v1.1.0-origin-1107-g4c8e6f4

In one environment the metrics are working fine and can be seen in the console; in the other no metrics are seen.

If it's still the issue that metrics only appear briefly, please see https://access.redhat.com/solutions/2158481. Otherwise, is there anything different between these two clusters? Is one much bigger than the other? For the cluster which does not display any metrics, could you ever see metrics in the console, or has it always been empty?

No, they are both the same size.
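For step 3 above, restarting Heapster by scaling its replication controller can be sketched as follows. This assumes the default `openshift-infra` project and an rc named `heapster`; verify the actual names for your deployment.

```shell
# Scale the Heapster rc down to 0 and back up to 1 to force a fresh pod
# (project and rc names are assumed defaults; check with "oc get rc -n openshift-infra")
oc scale rc heapster --replicas=0 -n openshift-infra
oc scale rc heapster --replicas=1 -n openshift-infra

# Watch the new pod come up, then check the console for metrics again
oc get pods -n openshift-infra -w
```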
I am not sure if there were any metrics at the very beginning, since our ops team member installed it yesterday evening and I only checked it today. I made the edit in the rc as you have mentioned below but nothing changed. I then made the Heapster logs more verbose and found this:

    I0426 11:19:30.001461 1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout

Could the problem be https://github.com/kubernetes/heapster/issues/127 ?

OK, I am going to assume this is a new install then, and not the original issue about metrics only appearing briefly. A couple of quick checks:

- the url in the master-config.yaml looks like https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics [the /hawkular/metrics part at the end is important and is easy to forget]
- you can access https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics in the same browser as the console, and there are no certificate issues

As for the message you are seeing in the Heapster logs: how often are you seeing it, and is it only for that one particular cakephp-example container or also for all the containers running on the node? Is that cakephp-example container currently running, or is it in a completed state? I am not sure why exactly it would be giving a timeout unless there was a problem with accessing the node over the network. Can you directly access https://172.16.8.2:10250 ?

1. The master config looks good.
2.
I can access https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics from the browser after accepting the warning for the self-signed certificate, as below:

    Your connection is not private
    Attackers might be trying to steal your information from hawkular-metrics.xxx.com (for example, passwords, messages, or credit cards). NET::ERR_CERT_AUTHORITY_INVALID
    This server could not prove that it is hawkular-metrics.xxx.com; its security certificate is not trusted by your computer's operating system. This may be caused by a misconfiguration or an attacker intercepting your connection.
    Proceed to hawkular-metrics.xxx.com (unsafe)

However, this is the same for the other environment where metrics are working fine, so I am not sure this is the issue.

3. I see timeouts for other pods as well.
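The two checks above (metrics URL in the master config, and the certificate the route serves) can be done from the command line. This is a sketch assuming the default master-config path `/etc/origin/master/master-config.yaml` and with the placeholder hostname `hawkular-metrics.xxx.com` from the warning above substituted for the real route:

```shell
# 1. Confirm the console metrics URL ends in /hawkular/metrics (easy to forget)
grep metricsPublicURL /etc/origin/master/master-config.yaml
#    expected form:
#    metricsPublicURL: "https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics"

# 2. Inspect the certificate the route actually serves; a self-signed issuer
#    here would explain the browser warning seen in both environments
echo | openssl s_client -connect hawkular-metrics.xxx.com:443 \
    -servername hawkular-metrics.xxx.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```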
These are pods from 2 different projects:

    I0426 11:19:30.001461 1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout
    I0426 11:19:30.001699 1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test/cakephp-example-1-fh98h/57a2d6a5-0b8c-11e6-a0fd-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test/cakephp-example-1-fh98h/57a2d6a5-0b8c-11e6-a0fd-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout

Both of these pods, cakephp-example-1-foc53 and cakephp-example-1-fh98h, are running:

    # oc get pods -n test-thomas
    NAME                      READY     STATUS      RESTARTS   AGE
    cakephp-example-1-build   0/1       Completed   0          8d
    cakephp-example-1-foc53   1/1       Running     0          8d

    # oc get pods -n test
    NAME                      READY     STATUS      RESTARTS   AGE
    cakephp-example-1-build   0/1       Completed   0          9h
    cakephp-example-1-fh98h   1/1       Running     0          9h

4. I can access the node using both telnet on port 10250 and also using curl:

    curl https://172.16.8.2:10250 --insecure
    404 page not found

5. No chance of this being https://github.com/kubernetes/heapster/issues/127 ?

I am not sure why you think it's related to https://github.com/kubernetes/heapster/issues/127. What are you seeing that makes you believe that? It doesn't appear to match the type of errors you are seeing, and the Heapster version being used already has the fix applied. Are the messages you are seeing in the Heapster logs occurring with the default log levels, or did you edit the logging levels in the Heapster rc to be more verbose? Also, are you seeing that message for all containers or just the cakephp-example ones?
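Since the bare port answers ("404 page not found" only means there is no handler at `/`), a closer reproduction of what Heapster actually does is to request one of the exact stats URLs from its log, from the node where the Heapster pod runs. The pod path below is copied verbatim from the log line above; depending on kubelet auth settings the request may additionally need a client certificate or token.

```shell
# Request the same kubelet stats URL Heapster timed out on; run this from the
# node hosting the Heapster pod so it traverses the same network path.
curl --insecure --max-time 10 \
  "https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example"
```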
Can you try running the following curl command to see if it returns something:

    curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: openshift-infra" -X GET https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics/gauges/data?tags=container_name:hawkular-metrics\&buckets=1

Created attachment 1151207 [details]
heapster log on behalf of cust
Sorry, I forgot to mention above that it's not directly issue 127. If you see the comments in the issue, there is one person complaining about a kubelet i/o timeout and then the heapster developer mentions this problem being fixed on head by #4784. I could see these messages only after I made the logs more verbose. No, it's also seen for other containers:

    heapster.log:22852:I0426 14:53:01.252686 1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/shilpa-test/mysql-1-pt0a5/55f288b0-0bdf-11e6-a0fd-00505604764d/mysql - Get https://172.16.8.2:10250/stats/shilpa-test/mysql-1-pt0a5/55f288b0-0bdf-11e6-a0fd-00505604764d/mysql: dial tcp 172.16.8.2:10250: i/o timeout

    oc get pods -n shilpa-test
    NAME            READY     STATUS    RESTARTS   AGE
    mysql-1-pt0a5   1/1       Running   0          11h

Yes, it returns NaN:

    curl -H "Authorization: Bearer xxxxxxxxxxxxxxxxx" -H "Hawkular-tenant: openshift-infra" -H "Accept: application/json" -X GET https://172.30.196.143/hawkular/metrics/gauges/data?'tags=container_name:hawkular-metrics-nmslu&buckets=1' --insecure
    [{"start":1461714634506,"end":1461743434506,"min":"NaN","avg":"NaN","median":"NaN","max":"NaN","percentile95th":"NaN","samples":0,"empty":true}]

From the logs I can see that we are gathering metrics from the router, the registry, hawkular-metrics, cassandra, heapster, etc. Can you please verify that you are not seeing metrics for those containers in the console?
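A related check, not requested in the thread but consistent with its curl examples: pod metrics for user projects are stored under a Hawkular tenant named after the project (the `openshift-infra` tenant only holds the infra pods), so querying with the project name as the tenant shows whether any data for those pods ever reached Cassandra. The service IP and container name below are taken from the output above.

```shell
# Query gauges for the customer's mysql container under its project tenant.
# "empty":true in the response means no data was ever stored for it.
curl --insecure \
  -H "Authorization: Bearer `oc whoami -t`" \
  -H "Hawkular-tenant: shilpa-test" \
  -H "Accept: application/json" \
  "https://172.30.196.143/hawkular/metrics/gauges/data?tags=container_name:mysql&buckets=1"
```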
Also, can you please run the curl command from https://bugzilla.redhat.com/show_bug.cgi?id=1327558#c10 (please do not add the "-nmlsu" after the hawkular-metrics container name)?

    curl -H "Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxx" -H "Hawkular-tenant: openshift-infra" -H "Accept: application/json" -X GET https://172.30.196.143/hawkular/metrics/gauges/data?'tags=container_name:hawkular-metrics&buckets=1' --insecure
    [{"start":1461739054504,"end":1461767854504,"min":-1.0,"avg":5.2182521667085904E8,"median":7.665096137944177E8,"max":1.125175296E9,"percentile95th":1.123234304854552E9,"samples":3816,"empty":false}]

OK, so we know that Heapster is gathering metrics for your container, that they are being properly processed by Hawkular Metrics, and that they are being stored in Cassandra. Can you see the graphs in the console for the Hawkular Metrics container?

Yes, I do see the graphs for the openshift-infra pods. But I only see them for these openshift-infra pods; for the pods in other projects I see no metrics.

From the logs this looks like an issue where we cannot connect to the 172.16.8.3 and 172.16.8.2 nodes. The node running the OpenShift infra components (172.16.4.2) does appear to be functioning properly. I am not sure what exactly is causing the timeout on those nodes. Can you please check and make sure that the clocks across these nodes are all synchronized? Are there any firewall issues trying to access port 10250 on these nodes?

I checked the rules, and the problem was that they were only in one direction. Thanks for your support; we can now close this ticket.

Set to verified according to comment #19

(In reply to Xia Zhao from comment #27)
> Set to verified according to comment #19

If I read comment #19 correctly, it says that the problem there was a firewall misconfiguration, no? If that's the case we should probably close this bz as notabug...
(In reply to Josep 'Pep' Turro Mauri from comment #28)
> (In reply to Xia Zhao from comment #27)
> > Set to verified according to comment #19
>
> If I read comment #19 correctly, it says that the problem there was a
> firewall misconfiguration, no?
>
> If that's the case we should probably close this bz as notabug...

Yes, comment #19 confirmed that the root cause was misconfiguration. As QE, I'm not allowed to close this directly; you may need to find another appropriate person to help close it if you want...

(In reply to Josep 'Pep' Turro Mauri from comment #28)
> (In reply to Xia Zhao from comment #27)
> > Set to verified according to comment #19
>
> If I read comment #19 correctly, it says that the problem there was a
> firewall misconfiguration, no?
>
> If that's the case we should probably close this bz as notabug...

I agree, and will close this as not a bug. Feel free to re-open if new information is identified.
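For reference, the root cause identified in the thread (firewall rules for port 10250 open in only one direction) and the clock-sync question can both be checked on each node with a sketch like the following. Rule formats vary between iptables setups, and `ntpstat`/`chronyc` availability depends on the host, so treat the commands as illustrative.

```shell
# Inbound: is the kubelet port accepted on this node?
iptables -L -n | grep 10250

# Reply traffic: established/related connections must be allowed back out
iptables -L -n | grep -i established

# Clock sync: check against NTP (whichever tool the host provides)
ntpstat || chronyc tracking

# End-to-end: can this node reach the kubelet port on the other nodes?
for node in 172.16.8.2 172.16.8.3 172.16.4.2; do
  timeout 3 bash -c "</dev/tcp/${node}/10250" 2>/dev/null \
    && echo "${node}:10250 reachable" \
    || echo "${node}:10250 NOT reachable"
done
```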