Created attachment 1137422 [details]
log from the heapster and hawkular pods.

Description of problem:
Metrics stop working after deploying a new pod

Version-Release number of selected component (if applicable):
3.1.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy a 3.1.1.6 cluster with 3.1.1 metrics
2. Deploy a pod
3. Deploy metrics
4. Notice metrics are working
5. Deploy a new pod

Actual results:
Metrics stop working

Expected results:
Metrics to keep working

Additional info:
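For anyone trying to reproduce this, the steps above can be sketched roughly as follows. This is an assumption-laden sketch, not the reporter's exact commands: the hostname, project, and image are placeholders, and the metrics deployment is abbreviated to the standard OSE 3.1 metrics-deployer flow.

```shell
# Hypothetical reproduction sketch (placeholders, not from the report).

# Deploy the metrics stack via the 3.1 metrics deployer template
# (flow abbreviated; see the OSE 3.1 metrics docs for the full setup).
oc project openshift-infra
oc process metrics-deployer-template \
    -v HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
    | oc create -f -

# Confirm metrics are working (CPU/memory graphs on the pod page),
# then deploy any new pod; per the report, metrics then stop.
oc new-app --docker-image=docker.io/openshift/hello-openshift
```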
Is this related to any pod? Or will any pod reproduce this issue?
Any pod will reproduce it from what I have seen.
I cannot reproduce. Using OSE v3.1.1.6 and metrics components at version 3.1.1, I deployed the hello-openshift pod as a test pod. Is there anything else you can tell us about your setup? The deployment options you used when deploying the metrics components may help. The logs don't indicate any errors that would point to a cause. @chunchen, are you able to reproduce this?
Created attachment 1137485 [details]
heapster log with verbose logging
Progress update: I've got the OSE 3.1 env installed and the logging components deployed there, but I couldn't get the fluentd pod running. I need some more time to adjust the environment and then try to reproduce this issue. Thanks, Xia
@Xia this has nothing to do with logging components. Why are you trying to test with logging here?
@mwringe, Oops, my mistake. I mixed things up when setting up environments earlier. I will switch to deploying the metrics components on OSE 3.1 there.
@mwringe, I didn't get it reproduced. Here are my test steps:
1. Deployed the metrics stack in the "metrics" namespace on OSE 3.1.1.6 --> worked fine
2. Deployed the logging stack in the "logging" namespace --> metrics are available for the logging stack
3. Deployed a new pod into the "metrics" namespace:
   oc new-app --docker-image=docker.io/chunyunchen/java-mainclass:2.2.94-SNAPSHOT -n metrics
   --> metrics are available for the java pod inside the "metrics" namespace

Tested with the latest images pulled from the brew registry:
openshift3/metrics-deployer         d3b5bd02c6ad
openshift3/metrics-hawkular-metrics 0d825e62d05a
openshift3/metrics-heapster         9a6aa3a55a44
openshift3/metrics-cassandra        2f9af4d01e97
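Beyond the console graphs, one way to double-check whether metrics are actually being written is to query the Hawkular Metrics REST API on the exposed route. A sketch, with assumptions: the route hostname is a placeholder, and the Hawkular-Tenant header is set to the project name, as the OpenShift metrics integration does.

```shell
# Query Hawkular Metrics directly for metric definitions in a project.
# Hostname is a placeholder; the tenant is the project (namespace) name.
TOKEN=$(oc whoami -t)
curl -k \
  -H "Authorization: Bearer $TOKEN" \
  -H "Hawkular-Tenant: metrics" \
  "https://hawkular-metrics.example.com/hawkular/metrics/metrics"
```

If new datapoints stop appearing here after a pod is deployed, the problem is on the collection (heapster) side rather than in the console UI.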
Just to keep track of a few observations here:
1) It doesn't necessarily look like deploying a new pod causes metrics collection to stop completely; it looks more like an intermittent failure to gather metrics.
2) Even when it is functioning, the graphs are not right. The most recent values are all zero or near zero even when we are getting metrics. This appears to be caused by a time-synchronization issue.
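The near-zero "most recent" datapoints in observation 2 are consistent with clock skew between hosts. A quick sanity check, sketched here with placeholder node names, is to compare clocks and NTP sync state across the cluster:

```shell
# Compare UTC epoch time and NTP state on each host (names are
# placeholders); more than a few seconds of divergence will skew
# the "latest" buckets in the metrics graphs.
for node in master.example.com node1.example.com node2.example.com; do
  echo "== $node =="
  ssh "$node" 'date -u +%s; ntpstat || true'
done
```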
@mwringe, by saying the issue was reproduced, I meant that I saw metrics stop working after a new pod was deployed to the same project. And you are right: the router in the default namespace went to Pending because the node label region=primary was somehow missing. I'm working on fixing this and will see how metrics behave afterwards. BTW, I saw correct metrics charts with exact stats displayed on the web console before deploying the camel-spring pod into the metrics project, and the metrics service URL https://hawkular-metrics.0318-gtf.qe.rhcloud.com/hawkular/metrics was also accessible at that time.
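The Pending router described above is typically a scheduling problem: the router deployment's nodeSelector (region=primary) matches no node. A sketch of the fix, with a placeholder node name:

```shell
# Re-apply the missing label so the router's nodeSelector matches a node.
oc label node node1.example.com region=primary

# Confirm the router pod gets scheduled and becomes Running.
oc get pods -n default -o wide
```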
@mwringe, after fixing the router issue, the metrics components continued working fine in my OSE 3.1 environment. The CPU and memory stats are visible in the web console UI. So the original issue, "metrics stopped working after a new pod is deployed", is still not reproduced.
@jdyson: any ideas about what might be causing this? I can't tell whether something is wrong with heapster or with something else in the system.
Verified this as not reproduced according to my comment in #15.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days