Bug 1421060 - Could not get application level metrics with error 'Failed to collect endpoint'
Summary: Could not get application level metrics with error 'Failed to collect endpoint'
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Matt Wringe
QA Contact: Peng Li
Depends On:
Reported: 2017-02-10 09:16 UTC by Peng Li
Modified: 2017-03-02 22:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-03-02 22:22:46 UTC
Target Upstream Version:

Description Peng Li 2017-02-10 09:16:41 UTC
Description of problem:
Could not get application-level metrics because the agent pods report the error "Failed to collect endpoint". The agent pod's log shows a connection timeout, even though the Jolokia pod's log shows that the service is up and the endpoint is available.

Version-Release number of selected component (if applicable):
OCP 3.5
Metrics 3.5.0
Hawkular OpenShift Agent: Version: 1.2.0.Final

How reproducible:

Steps to Reproduce:
1. Deploy 3.5.0 Metrics and the agent using the files here (https://github.com/openshift/origin-metrics/tree/enterprise/hawkular-openshift-agent).
This cluster has one master and one node, so there are 2 agent pods, and the log shows they are running Hawkular OpenShift Agent: Version: 1.2.0.Final
# oc get pod
NAME                             READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-z7pn0       1/1       Running     0          3h
hawkular-metrics-gdkp7           1/1       Running     0          3h
hawkular-openshift-agent-2lvgf   1/1       Running     0          2h
hawkular-openshift-agent-n2dcg   1/1       Running     0          2h
heapster-d4svp                   1/1       Running     0          5h
metrics-deployer-x4n6w           0/1       Completed   0          5h

2. Deploy the example (https://github.com/hawkular/hawkular-openshift-agent/tree/master/examples/jolokia-wildfly-example)

$ oc new-project jolokia
$ make openshift-deploy
$ oc get pod
NAME                                                     READY     STATUS    RESTARTS   AGE
hawkular-openshift-agent-example-jolokia-wildfly-hmrjs   1/1       Running   0          1h

From the above pod's log we can see:
Jolokia: Agent started with URL

3. Check the agent pod's log; it shows this error:
err=Failed to collect metrics from Jolokia endpoint []. err=Post dial tcp getsockopt: connection timed out

Actual results:
Can't get Jolokia info from the metrics endpoint.

Expected results:
Should be able to get Jolokia metrics.

Additional info:
A similar result occurs when testing the Prometheus endpoint:
err=Failed to collect Prometheus metrics from []. err=Cannot scrape Prometheus URL []: err=Get dial tcp 10.129.0.30:8181: getsockopt: connection timed out
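
Triage aid (not part of the original report): the logs above show "connection timed out", which usually means packets are being dropped on the way, whereas "connection refused" means the host answered but nothing is listening on the port. The following is a minimal Python sketch, assuming it is run from a host or pod that should be able to reach the endpoint, that makes this distinction explicit. The function name probe_tcp is hypothetical and is not part of the agent.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within `timeout` seconds: packets are likely dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        # Kernel-level timeout (ETIMEDOUT) on interpreters where it is
        # not already covered by socket.timeout.
        if exc.errno == errno.ETIMEDOUT:
            return "timeout"
        return f"error: {exc}"
```

For example, probe_tcp("10.129.0.30", 8181) targets the Prometheus example address from the log line above. A "timeout" result from the agent's node but "open" from inside the example pod's own project would point at the network layer rather than at the agent.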

Comment 4 John Mazzitelli 2017-02-10 11:52:07 UTC
Are you seeing metrics for the agent itself?

I see this error, too, with the Jolokia example endpoint (not the Prometheus example), but I don't think it is an agent problem. Because if you navigate to the Jolokia example pod in the OpenShift UI, and click the "Open Java Console" link, OpenShift gets the exact same connection error as the agent does - you will see this in a popup dialogue box:

The connection to jolokia failed!

The connection to jolokia has failed with the following error, also check the javascript console for more details.

Error: 'dial tcp getsockopt: connection refused'
Trying to reach: ''

So for some reason, OpenShift cannot expose that Jolokia endpoint sometimes and when that happens, no client can connect to it.

My versions:

oc v1.5.0-alpha.0+3b2bbe5
kubernetes v1.4.0+776c994
openshift v1.5.0-alpha.0+3b2bbe5

Comment 6 Peng Li 2017-02-10 13:28:55 UTC
$ oc project test1
$ oc new-app amq62-basic
for this pod broker-amq, I can click the "Open Java Console" link, and see some JMX info.

Comment 7 John Mazzitelli 2017-02-10 13:31:07 UTC
(In reply to Peng Li from comment #6)
> $ oc project test1
> $ oc new-app amq62-basic
> for this pod broker-amq, I can click the "Open Java Console" link, and see
> some JMX info.

Then it sounds like it might be some kind of permission issue? What roles/permissions are the agent given? We might need Matt to take a look at the setup.

Comment 16 Peng Li 2017-02-17 13:24:17 UTC

metrics-hawkular-metrics            3.5.0    b50862a32dd6        14 hours ago        1.508 GB
metrics-hawkular-openshift-agent    3.5.0    a66118961a69        20 hours ago        234.8 MB
metrics-heapster                    3.5.0    03d0a94d4bd2 
metrics-cassandra                   3.5.0    aa7e5b2b7210 

1. Deploy Metrics 3.5.0 using Ansible.

2. Deploy the agent:
oc create -f hawkular-openshift-agent-configmap.yaml -n default
oc process -f hawkular-openshift-agent.yaml | oc create -n default -f -
oc adm policy add-cluster-role-to-user hawkular-openshift-agent system:serviceaccount:default:hawkular-openshift-agent

3. Check that the metrics and agent pods are running.
4. Check from the console that the Prometheus and Jolokia example pod endpoints can be gathered.

Comment 17 Peng Li 2017-02-17 13:29:08 UTC
The above test was run on an OCP cluster with ovs-multitenant enabled.

# openshift version
openshift v3.5.0.19+199197c

# oc get netnamespace | grep openshift-infra
openshift-infra    11939989
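
Triage aid (not part of the original comment): this thread does not state the root cause, so the following is only a sketch for checking one hypothesis, namely project network isolation. It assumes the simplified ovs-multitenant rule that pods can talk when either project has NETID 0 (global, as the default project normally does) or when both projects share a NETID. It also assumes the NAME and NETID column layout shown above. The function names and the "jolokia" NETID in the usage note are hypothetical.

```python
from typing import Dict


def parse_netnamespaces(text: str) -> Dict[str, int]:
    """Parse NAME/NETID pairs from `oc get netnamespace` text output."""
    result: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts, and the header row
        # (anything whose second column is not a number).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        result[fields[0]] = int(fields[1])
    return result


def can_reach(netids: Dict[str, int], src: str, dst: str) -> bool:
    """Simplified ovs-multitenant rule (an assumption, see the note above):
    traffic is allowed if either project is global (NETID 0)
    or both projects have the same NETID.
    """
    a, b = netids[src], netids[dst]
    return a == 0 or b == 0 or a == b
```

With the output above plus a hypothetical line "jolokia 123", can_reach(ids, "openshift-infra", "jolokia") would be False, while an agent in a project with NETID 0 could reach the example pod.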

Comment 18 Troy Dawson 2017-03-02 22:22:46 UTC
Since this bug never reached customers, I am closing it.