Bug 1327558

Summary: Metrics only captured briefly.
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.1.0
Target Release: 3.2.1
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: urgent
Status: CLOSED NOTABUG
Type: Bug
Doc Type: Bug Fix
Reporter: Christophe Augello <caugello>
Assignee: Matt Wringe <mwringe>
QA Contact: chunchen <chunchen>
CC: aos-bugs, caugello, erich, pep, rhowe, shilpa.padgaonkar, wsun, xiazhao
Bug Blocks: 1267746
Last Closed: 2016-05-17 20:22:01 UTC
Attachments: heapster log on behalf of cust

Description Christophe Augello 2016-04-15 11:47:25 UTC
Description of problem:
Metrics are only captured briefly, for approximately 10-15 minutes.

Version-Release number of selected component (if applicable):

atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-clients-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-master-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-node-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64

How reproducible:


Steps to Reproduce:
1. Deploy metrics as per the docs:
# oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.1/infrastructure-templates/enterprise/metrics-deployer.yaml -v HAWKULAR_METRICS_HOSTNAME=<hostnameHere>,IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=latest,REDEPLOY=true | oc create -f -

Actual results:

Metrics were only captured for approx 10 - 15 mins.

Expected results:

Metrics should be captured continuously.

Additional info:

Comment 4 Matt Wringe 2016-04-15 13:40:42 UTC
The issue is not reproducible with the steps given; this is the standard installation procedure that we use for testing internally, and we do not see this issue.

A few follow up questions:
1) is there anything obvious in the logs for Heapster, Hawkular Metrics, or Cassandra?

2) are the containers stable, or are they restarting after these 10-15 minutes?

3) if you scale the Heapster pod to 0 and then back to 1, do metrics start to appear in the console again? Or are they still blank even after a restart?

If after following step 3 the metrics start to appear again in the console, then it could be an issue with the Heapster pod requiring more resources than what the current node has available. If this is the case, can they place the Heapster pod on its own node and see if that resolves the issue?
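Step 3 above can be sketched as follows (a minimal sketch; the replication controller name `heapster` and the `openshift-infra` namespace are the defaults used by the metrics deployer and may differ in a customized install):

```shell
# Scale the Heapster replication controller down to 0 and back to 1,
# forcing a fresh pod to be scheduled.
oc scale rc heapster --replicas=0 -n openshift-infra
oc scale rc heapster --replicas=1 -n openshift-infra

# Watch the new pod come up, then re-check the console graphs.
oc get pods -n openshift-infra -w
```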

Comment 5 spadgaon 2016-04-26 10:24:33 UTC
1) is there anything obvious in the logs for Heapster, Hawkular Metrics, or Cassandra?

No, the logs look clean.

2) are the containers stable, or are they restarting after these 10-15 minutes?

Yes, the containers are stable; no restarts from any of the three.

3) if you scale the Heapster pod to 0 and then back to 1, do metrics start to appear in the console again? Or are they still blank even after a restart?

No, that does not change anything.


We currently have two environments, both installed in exactly the same way; the version is provided below.

oc version
oc v3.1.1.6-21-gcd70c35
kubernetes v1.1.0-origin-1107-g4c8e6f4

In one environment the metrics work fine and are visible in the console; in the other, no metrics are seen.

Comment 6 Matt Wringe 2016-04-26 13:44:42 UTC
If it's still the case that metrics only appear briefly, please see https://access.redhat.com/solutions/2158481

Otherwise, is there anything different between these two clusters? Is one much bigger than the other?

For the cluster which does not display any metrics, could you ever see metrics in the console? Or has it always been empty?

Comment 7 spadgaon 2016-04-26 15:30:16 UTC
No, they are both the same size.

I am not sure if there were any metrics at the very beginning, since our ops team member installed it yesterday evening and I only checked it today.

I made the edit in the rc as you mentioned, but nothing changed.

I then made the Heapster logs more verbose and found this:

I0426 11:19:30.001461       1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout


Could the problem be https://github.com/kubernetes/heapster/issues/127 ?

Comment 8 Matt Wringe 2016-04-26 17:45:24 UTC
OK, I am going to assume this is a new install then, and not the original issue about metrics only appearing briefly.

A couple of quick checks:
- the url in the master-config.yaml looks like https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics [the Hawkular Metrics part at the end is important and is easy to forget]

- you can access https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics in the same browser as the console, and there are no certificate issues
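The certificate check in the second bullet above can also be done from the command line (a sketch; substitute the real route hostname for the `${HAWKULAR_METRICS_HOSTNAME}` placeholder):

```shell
# Show the certificate the route presents: subject, issuer, and
# validity dates. A self-signed certificate will show the same
# subject and issuer; a hostname mismatch shows up in the subject CN.
openssl s_client -connect ${HAWKULAR_METRICS_HOSTNAME}:443 \
  -servername ${HAWKULAR_METRICS_HOSTNAME} </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```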


As for the message you are seeing in the Heapster logs: how often are you seeing it, and is it only for that one particular cakephp-example container or for all the containers running on the node? Is that cakephp-example container currently running, or is it in a completed state? I am not sure why it would be giving a timeout unless there was a problem accessing the node over the network. Can you directly access https://172.16.8.2:10250 ?

Comment 9 spadgaon 2016-04-26 18:37:34 UTC
1. The master config looks good.
2. I can access https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics from the browser after accepting the warning for the self-signed certificate, as below:
----------------------------------------------------------------------
Your connection is not private

Attackers might be trying to steal your information from hawkular-metrics.xxx.com (for example, passwords, messages, or credit cards).
NET::ERR_CERT_AUTHORITY_INVALID

This server could not prove that it is hawkular-metrics.xxx.com; its security certificate is not trusted by your computer's operating system. This may be caused by a misconfiguration or an attacker intercepting your connection.
----------------------------------------------------------------------
However, this is the same for the other environment where metrics are working fine, so I am not sure if this is the issue.


3. I see timeouts for other pods as well. These are pods from two different projects:
I0426 11:19:30.001461       1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test-thomas/cakephp-example-1-foc53/32a4a90d-0557-11e6-a5e6-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout

I0426 11:19:30.001699       1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/test/cakephp-example-1-fh98h/57a2d6a5-0b8c-11e6-a0fd-00505604764d/cakephp-example - Get https://172.16.8.2:10250/stats/test/cakephp-example-1-fh98h/57a2d6a5-0b8c-11e6-a0fd-00505604764d/cakephp-example: dial tcp 172.16.8.2:10250: i/o timeout

Both of these pods, cakephp-example-1-foc53 and cakephp-example-1-fh98h, are running:

#oc get pods -n test-thomas
NAME                      READY     STATUS      RESTARTS   AGE
cakephp-example-1-build   0/1       Completed   0          8d
cakephp-example-1-foc53   1/1       Running     0          8d

#oc get pods -n test
NAME                      READY     STATUS      RESTARTS   AGE
cakephp-example-1-build   0/1       Completed   0          9h
cakephp-example-1-fh98h   1/1       Running     0          9h


4. I can access the node using both telnet on port 10250 and curl:
curl https://172.16.8.2:10250 --insecure
404 page not found


5. Is there no chance of this being https://github.com/kubernetes/heapster/issues/127 ?

Comment 10 Matt Wringe 2016-04-26 22:06:57 UTC
I am not sure why you think it's related to https://github.com/kubernetes/heapster/issues/127. What are you seeing that makes you believe that?

It doesn't appear to match the type of errors you are seeing, and the Heapster version being used already has that fix applied.



Are the messages you are seeing in the Heapster logs occurring with the default log levels, or did you edit the logging levels in the heapster rc to be more verbose? 

Also are you seeing that message for all containers or just the cakephp-example ones?

Can you try running the following curl command to see if it returns something:
curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: openshift-infra" -X GET https://${HAWKULAR_METRICS_HOSTNAME}/hawkular/metrics/gauges/data?tags=container_name:hawkular-metrics\&buckets=1

Comment 11 Christophe Augello 2016-04-27 06:53:22 UTC
Created attachment 1151207 [details]
heapster log on behalf of cust

Comment 12 spadgaon 2016-04-27 07:53:33 UTC
Sorry, I forgot to mention above that it's not directly issue 127. If you look at the comments in that issue, one person complains about a kubelet i/o timeout, and the Heapster developer then mentions that problem being fixed on HEAD by #4784.


I could see these messages only after I made the logs more verbose.

No, it is also seen for other containers:


heapster.log:22852:I0426 14:53:01.252686       1 kubelet.go:96] failed to get stats from kubelet url: https://172.16.8.2:10250/stats/shilpa-test/mysql-1-pt0a5/55f288b0-0bdf-11e6-a0fd-00505604764d/mysql - Get https://172.16.8.2:10250/stats/shilpa-test/mysql-1-pt0a5/55f288b0-0bdf-11e6-a0fd-00505604764d/mysql: dial tcp 172.16.8.2:10250: i/o timeout

oc get pods -n shilpa-test
NAME            READY     STATUS    RESTARTS   AGE
mysql-1-pt0a5   1/1       Running   0          11h


Yes, it returns NaN:

curl -H "Authorization: Bearer xxxxxxxxxxxxxxxxx " -H "Hawkular-tenant: openshift-infra" -H "Accept: application/json" -X GET https://172.30.196.143/hawkular/metrics/gauges/data?'tags=container_name:hawkular-metrics-nmslu&buckets=1' --insecure


[{"start":1461714634506,"end":1461743434506,"min":"NaN","avg":"NaN","median":"NaN","max":"NaN","percentile95th":"NaN","samples":0,"empty":true}]

Comment 13 Matt Wringe 2016-04-27 14:31:55 UTC
From the logs I can see that we are gathering metrics from the router, the registry, hawkular-metrics, cassandra, heapster, etc.

Can you please verify that you are not seeing metrics for those containers in the console?

Also, can you please run the curl command from https://bugzilla.redhat.com/show_bug.cgi?id=1327558#c10 (please do not add the "-nmslu" suffix after the hawkular-metrics container name)?

Comment 14 spadgaon 2016-04-27 14:38:14 UTC
curl -H "Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxx " -H "Hawkular-tenant: openshift-infra" -H "Accept: application/json" -X GET https://172.30.196.143/hawkular/metrics/gauges/data?'tags=container_name:hawkular-metrics&buckets=1' --insecure


[{"start":1461739054504,"end":1461767854504,"min":-1.0,"avg":5.2182521667085904E8,"median":7.665096137944177E8,"max":1.125175296E9,"percentile95th":1.123234304854552E9,"samples":3816,"empty":false}]

Comment 15 Matt Wringe 2016-04-27 16:04:53 UTC
OK, so we know that Heapster is gathering metrics for your container, they are being properly processed by Hawkular Metrics, and they are being stored in Cassandra.

Can you see the graphs in the console for the Hawkular Metrics container?

Comment 16 spadgaon 2016-04-27 16:55:52 UTC
Yes, I do see the graphs for the openshift-infra pods.

Comment 17 spadgaon 2016-04-27 16:57:05 UTC
But I only see them for these openshift-infra pods. For the pods in other projects, I see no metrics.

Comment 18 Matt Wringe 2016-04-27 17:29:14 UTC
From the logs this looks like an issue where we cannot connect to the 172.16.8.3 and the 172.16.8.2 nodes. The node running the OpenShift infra components (172.16.4.2) does appear to be functioning properly.

I am not sure what exactly is causing the timeout on those nodes.

Can you please check and make sure that the clocks across these nodes are all synchronized? Are there any firewall issues when trying to access port 10250 on these nodes?
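The two checks above can be sketched from the command line (a sketch; it assumes `ntpstat` or `chronyc` is available for the clock check and `nc` for the port probe, and uses the node address 172.16.8.2 from the logs above as an example):

```shell
# 1) Clock synchronization: either command reports NTP sync status,
#    depending on whether the node runs ntpd or chronyd.
ntpstat || chronyc tracking

# 2) Firewall: probe the kubelet port (10250) from the node where
#    Heapster runs; a hang or timeout here matches the
#    "dial tcp ...:10250: i/o timeout" errors in the Heapster logs.
nc -zv -w 5 172.16.8.2 10250

# Inspect local iptables rules that might drop traffic on 10250.
iptables -L -n | grep 10250
```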

Comment 19 spadgaon 2016-04-28 06:39:45 UTC
I checked the firewall rules, and the problem was that they were only in one direction.

Thanks for your support; we can now close this ticket.

Comment 27 Xia Zhao 2016-05-17 02:31:04 UTC
Set to verified according to comment #19

Comment 28 Josep 'Pep' Turro Mauri 2016-05-17 07:54:43 UTC
(In reply to Xia Zhao from comment #27)
> Set to verified according to comment #19

If I read comment #19 correctly, it says that the problem there was a firewall misconfiguration, no?

If that's the case we should probably close this bz as notabug...

Comment 29 Xia Zhao 2016-05-17 08:08:22 UTC
(In reply to Josep 'Pep' Turro Mauri from comment #28)
> (In reply to Xia Zhao from comment #27)
> > Set to verified according to comment #19
> 
> If I read comment #19 correctly, it says that the problem there was a
> firewall misconfiguration, no?
> 
> If that's the case we should probably close this bz as notabug...

Yes, comment #19 confirmed that the root cause was misconfiguration. 

As QE, I'm not allowed to close this directly; you may need to find the proper person to help close it if you want.

Comment 30 Eric Rich 2016-05-17 20:22:01 UTC
(In reply to Josep 'Pep' Turro Mauri from comment #28)
> (In reply to Xia Zhao from comment #27)
> > Set to verified according to comment #19
> 
> If I read comment #19 correctly, it says that the problem there was a
> firewall misconfiguration, no?
> 
> If that's the case we should probably close this bz as notabug...

I agree, and will close this as not a bug. Feel free to re-open if new information is identified.