Description of problem:
Application-level metrics cannot be collected because the agent pods report a "Failed to collect metrics" error for the endpoint. The agent pod's log shows a connection timeout, although the Jolokia pod's log shows that the service is up and the endpoint is available.

Version-Release number of selected component (if applicable):
OCP 3.5
Metrics 3.5.0
Hawkular OpenShift Agent: Version: 1.2.0.Final

How reproducible:
Always

Steps to Reproduce:
1. Deploy Metrics 3.5.0 and the agent using the files here (https://github.com/openshift/origin-metrics/tree/enterprise/hawkular-openshift-agent). This cluster has one master and one node, so there are 2 agent pods, and the log shows they are running Hawkular OpenShift Agent: Version: 1.2.0.Final

# oc get pod
NAME                             READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-z7pn0       1/1       Running     0          3h
hawkular-metrics-gdkp7           1/1       Running     0          3h
hawkular-openshift-agent-2lvgf   1/1       Running     0          2h
hawkular-openshift-agent-n2dcg   1/1       Running     0          2h
heapster-d4svp                   1/1       Running     0          5h
metrics-deployer-x4n6w           0/1       Completed   0          5h

2. Deploy the example (https://github.com/hawkular/hawkular-openshift-agent/tree/master/examples/jolokia-wildfly-example)

$ oc new-project jolokia
$ make openshift-deploy
$ oc get pod
NAME                                                     READY     STATUS    RESTARTS   AGE
hawkular-openshift-agent-example-jolokia-wildfly-hmrjs   1/1       Running   0          1h

The log of the above pod shows:
Jolokia: Agent started with URL https://10.129.0.29:8778/jolokia/

3. Check the agent pod's log, which shows this error:
err=Failed to collect metrics from Jolokia endpoint [https://10.129.0.29:8778/jolokia/]. err=Post https://10.129.0.29:8778/jolokia/: dial tcp 10.129.0.29:8778: getsockopt: connection timed out

Actual results:
Can't get Jolokia info from the metrics endpoint.

Expected results:
Should be able to get Jolokia metrics.

Additional info:
A similar result occurs when testing the Prometheus endpoint:
err=Failed to collect Prometheus metrics from [http://10.129.0.30:8181/]. err=Cannot scrape Prometheus URL [http://10.129.0.30:8181/]: err=Get http://10.129.0.30:8181/: dial tcp 10.129.0.30:8181: getsockopt: connection timed out
Are you seeing metrics for the agent itself?

I see this error, too, with the Jolokia example endpoint (not the Prometheus example), but I don't think it is an agent problem. If you navigate to the Jolokia example pod in the OpenShift UI and click the "Open Java Console" link, OpenShift gets the exact same connection error as the agent does. You will see this in a popup dialog box:

===
The connection to jolokia failed!
The connection to jolokia has failed with the following error, also check the javascript console for more details.
Error: 'dial tcp 172.17.0.6:8778: getsockopt: connection refused'
Trying to reach: 'https://172.17.0.6:8778/jolokia/?maxDepth=7&maxCollectionSize=500&ignoreErrors=true&canonicalNaming=false'
1
===

So for some reason, OpenShift sometimes cannot expose that Jolokia endpoint, and when that happens, no client can connect to it.

My versions:
oc v1.5.0-alpha.0+3b2bbe5
kubernetes v1.4.0+776c994
openshift v1.5.0-alpha.0+3b2bbe5
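The two failures reported so far are different at the TCP level: "connection timed out" usually means the packets were dropped or never answered, while "connection refused" means the host was reached but nothing accepted the connection on that port. To tell them apart from the host or pod where the agent runs, a small probe like the following may help. This is only a sketch; the function name `probe` and the default timeout are assumptions made for illustration, and it is not part of the agent.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: host reachable, nothing listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds: packets are likely dropped on the way.
        return "timeout"
    except OSError as exc:
        # The kernel's own connect timeout surfaces as ETIMEDOUT on some platforms.
        if exc.errno == errno.ETIMEDOUT:
            return "timeout"
        return f"error: {exc}"
```

For example, `probe("10.129.0.29", 8778)` targets the Jolokia address from the agent log above, and `probe("10.129.0.30", 8181)` targets the Prometheus one. A result of "timeout" from the agent's side combined with "open" from inside the example pod's own project would point at the network layer rather than at the endpoint.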
$ oc project test1
$ oc new-app amq62-basic

For the resulting broker-amq pod, I can click the "Open Java Console" link and see some JMX info.
(In reply to Peng Li from comment #6)
> $ oc project test1
> $ oc new-app amq62-basic
> for this pod broker-amq, I can click the "Open Java Console" link, and see
> some JMX info.

Then it sounds like it might be some kind of permission issue? What roles/permissions is the agent given? We might need Matt to take a look at the setup.
Verified.

Version:
metrics-hawkular-metrics           3.5.0   b50862a32dd6   14 hours ago   1.508 GB
metrics-hawkular-openshift-agent   3.5.0   a66118961a69   20 hours ago   234.8 MB
metrics-heapster                   3.5.0   03d0a94d4bd2
metrics-cassandra                  3.5.0   aa7e5b2b7210

Steps:
1. Deploy Metrics 3.5.0 using Ansible.
2. Deploy the agent:
   oc create -f hawkular-openshift-agent-configmap.yaml -n default
   oc process -f hawkular-openshift-agent.yaml | oc create -n default -f -
   oc adm policy add-cluster-role-to-user hawkular-openshift-agent system:serviceaccount:default:hawkular-openshift-agent
3. Check that the metrics and agent pods are running.
4. Check from the console that the Prometheus and Jolokia example pod endpoints can be gathered.
The above test was run on an OCP cluster with ovs-multitenant enabled.

# openshift version
openshift v3.5.0.19+199197c

# oc get netnamespace | grep openshift-infra
openshift-infra 11939989
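For context on why the network plugin is worth noting: as I understand the ovs-multitenant plugin, each project gets its own VNID (the number shown by `oc get netnamespace`), pods can only exchange traffic with pods that share their VNID, and projects with VNID 0 (such as `default`) can talk to every project. The sketch below models that rule only. It is a simplification and not how the SDN is implemented; the function name and the VNID 424242 for the `jolokia` project are made up for illustration, and only the `openshift-infra` value comes from the output above.

```python
GLOBAL_VNID = 0  # assumption: VNID 0 marks a project that can reach all others


def can_communicate(vnid_a: int, vnid_b: int) -> bool:
    """Return True if pods with these VNIDs may exchange traffic under the simplified rule."""
    # Same project network, or at least one side is a global (VNID 0) project.
    return vnid_a == vnid_b or GLOBAL_VNID in (vnid_a, vnid_b)


# VNIDs per project: openshift-infra is from the report, jolokia is hypothetical.
vnids = {"default": 0, "openshift-infra": 11939989, "jolokia": 424242}

# An agent in `default` could reach the example pod; one in `openshift-infra` could not.
agent_in_default = can_communicate(vnids["default"], vnids["jolokia"])
agent_in_infra = can_communicate(vnids["openshift-infra"], vnids["jolokia"])
```

Under this model, an agent deployed in `default` (as in the verification steps) would reach pods in any project, which may be why the endpoints could be gathered in that setup.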
Since this bug never reached customers, I am closing it.