Bug 1421060

Summary:	Could not get application level metrics with error 'Failed to collect endpoint'
Product:	OpenShift Container Platform	Reporter:	Peng Li <penli>
Component:	Hawkular	Assignee:	Matt Wringe <mwringe>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Peng Li <penli>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	3.5.0	CC:	aos-bugs, mazz, tdawson
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-03-02 22:22:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Peng Li 2017-02-10 09:16:41 UTC

Description of problem:
Could not get application-level metrics because the agent pods report the error 'Failed to collect endpoint'. The agent pod's log shows a connection timeout, even though the Jolokia pod's log shows that the service is up and the endpoint is available.

Version-Release number of selected component (if applicable):
OCP 3.5
Metrics 3.5.0
Hawkular OpenShift Agent: Version: 1.2.0.Final

How reproducible:
Always

Steps to Reproduce:
1. Deploy 3.5.0 Metrics and the agent using the files here (https://github.com/openshift/origin-metrics/tree/enterprise/hawkular-openshift-agent).
This cluster has one master and one node, so there are 2 agent pods, and the log shows they are running Hawkular OpenShift Agent: Version: 1.2.0.Final
 
# oc get pod
NAME                             READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-z7pn0       1/1       Running     0          3h
hawkular-metrics-gdkp7           1/1       Running     0          3h
hawkular-openshift-agent-2lvgf   1/1       Running     0          2h
hawkular-openshift-agent-n2dcg   1/1       Running     0          2h
heapster-d4svp                   1/1       Running     0          5h
metrics-deployer-x4n6w           0/1       Completed   0          5h

2. Deploy the example(https://github.com/hawkular/hawkular-openshift-agent/tree/master/examples/jolokia-wildfly-example)

$ oc new-project jolokia
$ make openshift-deploy
$ oc get pod
NAME                                                     READY     STATUS    RESTARTS   AGE
hawkular-openshift-agent-example-jolokia-wildfly-hmrjs   1/1       Running   0          1h

From the above pod's log we can see:
Jolokia: Agent started with URL https://10.129.0.29:8778/jolokia/

3. Check the agent pod's log; it shows this error:
err=Failed to collect metrics from Jolokia endpoint [https://10.129.0.29:8778/jolokia/]. err=Post https://10.129.0.29:8778/jolokia/: dial tcp 10.129.0.29:8778: getsockopt: connection timed out

Actual results:
Can't get Jolokia info from the metrics endpoint.

Expected results:
Should be able to get jolokia metrics.

Additional info:
A similar result occurs when testing the Prometheus endpoint:
err=Failed to collect Prometheus metrics from [http://10.129.0.30:8181/]. err=Cannot scrape Prometheus URL [http://10.129.0.30:8181/]: err=Get http://10.129.0.30:8181/: dial tcp 10.129.0.30:8181: getsockopt: connection timed out
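
Both errors above are TCP-level dial failures, so it can help to test the endpoint from the agent's side before looking at the agent itself. The following is a minimal sketch, not part of the agent: the function name `probe` and the 3-second timeout are assumptions made for illustration. It only separates a timed-out connection (packets dropped somewhere on the path) from a refused one (host reachable, nothing listening on the port).

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: <detail>".
    The name and the default timeout are illustrative assumptions.
    """
    try:
        # create_connection performs the TCP handshake and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: typical of dropped packets or network isolation.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Any other socket-level failure (unreachable network, DNS, ...).
        return f"error: {exc}"


# Example call, using the Jolokia address from the log above:
#   print(probe("10.129.0.29", 8778))
```

Run from inside an agent pod, a "timeout" result for the addresses in the logs above would point at the network path between the pods rather than at the Jolokia or Prometheus service.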

Comment 4 John Mazzitelli 2017-02-10 11:52:07 UTC

Are you seeing metrics for the agent itself?

I see this error, too, with the Jolokia example endpoint (not the Prometheus example), but I don't think it is an agent problem. If you navigate to the Jolokia example pod in the OpenShift UI and click the "Open Java Console" link, OpenShift gets the exact same connection error as the agent does. You will see this in a popup dialog box:

===
The connection to jolokia failed!

The connection to jolokia has failed with the following error, also check the javascript console for more details.

Error: 'dial tcp 172.17.0.6:8778: getsockopt: connection refused'
Trying to reach: 'https://172.17.0.6:8778/jolokia/?maxDepth=7&maxCollectionSize=500&ignoreErrors=true&canonicalNaming=false'
===

So for some reason, OpenShift sometimes cannot expose that Jolokia endpoint, and when that happens, no client can connect to it.

My versions:

oc v1.5.0-alpha.0+3b2bbe5
kubernetes v1.4.0+776c994
openshift v1.5.0-alpha.0+3b2bbe5

Comment 6 Peng Li 2017-02-10 13:28:55 UTC

$ oc project test1
$ oc new-app amq62-basic
for this pod broker-amq, I can click the "Open Java Console" link, and see some JMX info.

Comment 7 John Mazzitelli 2017-02-10 13:31:07 UTC

(In reply to Peng Li from comment #6)
> $ oc project test1
> $ oc new-app amq62-basic
> for this pod broker-amq, I can click the "Open Java Console" link, and see
> some JMX info.

Then it sounds like it might be some kind of permission issue? What roles/permissions are the agent given? We might need Matt to take a look at the setup.

Comment 16 Peng Li 2017-02-17 13:24:17 UTC

verified

Version:
metrics-hawkular-metrics            3.5.0    b50862a32dd6        14 hours ago        1.508 GB
metrics-hawkular-openshift-agent    3.5.0    a66118961a69        20 hours ago        234.8 MB
metrics-heapster                    3.5.0    03d0a94d4bd2 
metrics-cassandra                   3.5.0    aa7e5b2b7210 

Steps:
1. deploy Metrics 3.5.0 using ansible.

2. deploy the agent
oc create -f hawkular-openshift-agent-configmap.yaml -n default
oc process -f hawkular-openshift-agent.yaml | oc create -n default -f -
oc adm policy add-cluster-role-to-user hawkular-openshift-agent system:serviceaccount:default:hawkular-openshift-agent

3. check metrics and agent pods running.
4. check from console that Prometheus and Jolokia example pod endpoints could be gathered.

Comment 17 Peng Li 2017-02-17 13:29:08 UTC

The above test was run on an OCP cluster with ovs-multitenant enabled.

# openshift version
openshift v3.5.0.19+199197c

# oc get netnamespace | grep openshift-infra
openshift-infra    11939989
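
The NETID matters because, as far as I understand the ovs-multitenant plugin, pods in projects with different NETIDs are isolated from each other, while a project with NETID 0 is global and can reach every project. This bug does not state that isolation was the root cause, so treat the following as a rough sketch under that assumption. The helper names are hypothetical, and only the openshift-infra NETID comes from the output above; the other rows in the usage note are made up.

```python
from typing import Dict


def parse_netnamespaces(text: str) -> Dict[str, int]:
    """Parse `oc get netnamespace` output into {project: netid}.

    Assumes the first two whitespace-separated columns are NAME and NETID;
    the header row is skipped because its second column is not numeric.
    """
    netids = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            netids[fields[0]] = int(fields[1])
    return netids


def can_reach(netids: Dict[str, int], src: str, dst: str) -> bool:
    """Assumed multitenant rule: same NETID, or either side is global (0)."""
    a, b = netids[src], netids[dst]
    return a == 0 or b == 0 or a == b
```

For example, with hypothetical rows `default 0` and `jolokia 5` next to the real `openshift-infra 11939989`, `can_reach(ids, "default", "jolokia")` is True and `can_reach(ids, "openshift-infra", "jolokia")` is False under this rule.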

Comment 18 Troy Dawson 2017-03-02 22:22:46 UTC

Since this bug never reached customers, I am closing it.