Bug 1474099 - 30+ minutes for metrics to re-stabilize after heapster restart @ 15K pods
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.6.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Stefan Negrea
QA Contact: Vikas Laad
Whiteboard: aos-scalability-36
Depends On:
Blocks:
 
Reported: 2017-07-23 17:13 EDT by Mike Fiedler
Modified: 2017-08-22 10:06 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
heapster log after restart with 15K pods running on 100 nodes (7.13 KB, text/plain)
2017-07-23 17:15 EDT, Mike Fiedler

Description Mike Fiedler 2017-07-23 17:13:46 EDT
Description of problem:

- metrics are stable at 15K pods - no push failures and no missing metrics. Retrieving metrics via the Hawkular REST API shows no empty buckets

- stop heapster, wait a few minutes and verify the most recent buckets are empty

- start heapster.   Watch heapster logs for push failures and check metrics for missing data

It takes 23 minutes before any data is pushed to Hawkular->Cassandra again and 30 minutes before pushes stabilize completely and no more holes appear in the metrics data.


Version-Release number of selected component (if applicable):  Metrics container version 3.6.152


How reproducible:  Always when restarting heapster with a large number of pods.  I have not determined the threshold.


Steps to Reproduce:
1.  Deploy metrics
2.  Start 15K pods and verify metrics are being collected for all pods - stable metrics with no errors in the heapster logs
3.  Stop heapster, verify metrics are no longer collected
4.  Start heapster, watch the heapster error logs for " Failed to push data to sink: Hawkular-Metrics Sink" to stop occurring
5.  Watch the OpenShift UI or Hawkular REST API for metrics collection to restart successfully (a polling sketch follows this list)
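
For steps 3-5, a small polling script against the Hawkular REST API can show how long it takes for non-empty buckets to reappear after the restart. This is only a sketch: the route URL, project/tenant, token and metric id below are placeholders rather than values from this cluster, and the exact data endpoint and bucket payload fields (for example the "empty" flag) can differ between Hawkular Metrics versions, so adjust to what the deployment actually exposes.

#!/usr/bin/env python
# Sketch: poll Hawkular Metrics for one pod gauge and report when fresh
# (non-empty) buckets reappear after a heapster restart. All identifiers
# below are placeholders, not values taken from this bug.
import time
import requests

HAWKULAR_URL = "https://hawkular-metrics.example.com/hawkular/metrics"  # metrics route (assumed)
TENANT = "my-project"                         # OpenShift project that owns the pod (assumed)
TOKEN = "<openshift-bearer-token>"            # token with view access to the project
METRIC_ID = "pod/<pod-uid>/memory/usage"      # heapster-style gauge id (assumed)

HEADERS = {
    "Hawkular-Tenant": TENANT,
    "Authorization": "Bearer " + TOKEN,
}

def recent_buckets(minutes=10, buckets=5):
    # Ask for the last `minutes` of data split into `buckets` buckets.
    now_ms = int(time.time() * 1000)
    params = {"start": now_ms - minutes * 60 * 1000, "end": now_ms, "buckets": buckets}
    resp = requests.get(
        "%s/gauges/%s/data" % (HAWKULAR_URL, requests.utils.quote(METRIC_ID, safe="")),
        headers=HEADERS, params=params, verify=False)  # test routes often use self-signed certs
    resp.raise_for_status()
    return resp.json() if resp.status_code == 200 else []  # 204 means no data at all

if __name__ == "__main__":
    started = time.time()
    while True:
        points = recent_buckets()
        empty = [b for b in points if b.get("empty")]
        print("%4ds elapsed: %d of %d recent buckets empty"
              % (time.time() - started, len(empty), len(points)))
        if points and not empty:
            print("metrics look stable again")
            break
        time.sleep(60)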

Actual results:

30+ minutes for metrics to be collected in a stable fashion.

Expected results:

Heapster collects metrics and pushes them to Hawkular + Cassandra within 1 or 2 intervals of restarting.

Additional info:   Heapster logs attached.  Let me know what further info is required.   The configuration is the same as  https://bugzilla.redhat.com/show_bug.cgi?id=1465532
Comment 1 Mike Fiedler 2017-07-23 17:15 EDT
Created attachment 1303352 [details]
heapster log after restart with 15K pods running on 100 nodes
Comment 2 Matt Wringe 2017-07-24 12:30:28 EDT
Heapster will try to update the metric definitions when the server starts; this means reading metric definitions from Hawkular Metrics and potentially writing back updates.
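
To get a feel for the read side of that startup work, the metric definitions (and their tags) that Heapster has to reconcile can be listed per project via the Hawkular REST API. A minimal sketch, reusing the placeholder HAWKULAR_URL and HEADERS from the reproduction sketch above; the /metrics listing endpoint and its type filter are assumed here, so verify against the Hawkular Metrics version actually deployed.

import requests

def list_gauge_definitions(hawkular_url, headers):
    # One project's gauge definitions; with 15K pods and several metrics per
    # pod this list gets large, which is the backlog heapster works through
    # (and potentially writes tag updates back for) when it starts.
    resp = requests.get("%s/metrics" % hawkular_url,
                        headers=headers, params={"type": "gauge"}, verify=False)
    resp.raise_for_status()
    return resp.json() if resp.status_code == 200 else []  # 204 means no definitions

# e.g. print(len(list_gauge_definitions(HAWKULAR_URL, HEADERS)))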

Is this something new in 3.6 or something we also experienced in 3.5?

Were any other updates made to the default configurations for Heapster or Hawkular Metrics? Could you please attach the output of 'oc get pods -n openshift-infra -o yaml'?

@miburman: any thoughts here?
Comment 3 Mike Fiedler 2017-07-24 12:37:12 EDT
Not sure about 3.5 - we'd have to set up an environment to test that.

The only Heapster configuration modification is to remove the memory limit.

I hit an instance today where heapster seems never to reconnect.   Re-testing that now.  Let me know if there is any extra logging or other info you want to see.
Comment 4 Mike Fiedler 2017-07-24 12:44:41 EDT
Going to try deploying 3.5 in this cluster and attempt to reproduce.

Also, restoring the needinfo for miburman that I inadvertently cleared (@miburman see comment 2)
Comment 5 Michael Burman 2017-07-24 12:52:00 EDT
There's probably too big a backlog of updates to the definitions that can't be handled (or, in other words, it takes too much time). This is the same effect as when there are too many new pods. We haven't made changes to this between 3.5 and 3.6 on the Heapster side, so the behavior should be the same in 3.5 - that is, assuming we would see the same behavior in normal operation.
Comment 6 Mike Fiedler 2017-07-24 13:00:11 EDT
Correction to comment 3 where I said it never reconnects. That is false: it did eventually pick up and started pushing metrics to Hawkular->Cassandra.
Comment 7 Mike Fiedler 2017-07-24 13:54:36 EDT
I deployed metrics 3.5 in the same cluster in the exact same state (15K running pods) and metrics started showing up within 2 minutes and by 12 minutes after heapster start all 5 of the 2-minute buckets I was requesting had good data.

I re-deployed 3.6 and saw the same behavior - no metrics data for an extended period of time after restarting heapster. I understand the frustration with nothing changing in those areas between releases - I am just reporting what is happening.

We have reproducers of both the good (3.5) and the reported behaviors on this 100-node AWS cluster. Let us know if we can get any additional data.
Comment 8 Michael Burman 2017-07-24 14:50:14 EDT
Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5 Hawkular-Metrics would be very interesting.

Maybe the problem is somewhere else in Heapster and not anywhere we're currently looking (we've been concentrating on the sink<->Hawkular-Metrics integration, but perhaps that's not the biggest pain point).
Comment 9 Matt Wringe 2017-07-24 14:52:17 EDT
Would the compression job start up right after the pods are restarted as well?
Comment 10 Vikas Laad 2017-07-24 15:01:15 EDT
(In reply to Michael Burman from comment #8)
> Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination
> of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5
> Hawkular-Metrics would be very interesting.
> 
> Maybe the problem is somewhere else in the Heapster and not in anywhere
> we're currently looking at (we've been concentrating on the
> sink<->Hawkular-Metrics integration, but perhaps that's not the biggest
> pain).

I am going to try this, will update the bug.
Comment 11 Vikas Laad 2017-07-24 15:09:01 EDT
(In reply to Matt Wringe from comment #9)
> Would the compression job startup right after the pods are restarted as well?

Compression is disabled on this cluster.
Comment 12 Vikas Laad 2017-07-24 16:42:22 EDT
(In reply to Michael Burman from comment #8)

Just tried with the 3.5 version of heapster; the result is the same. It took around 32 mins for metrics to become stable.
Comment 13 Vikas Laad 2017-07-24 16:43:37 EDT
(In reply to Vikas Laad from comment #12)

Just tried with the 3.5 version of heapster; the result is the same. It took around 32 mins for metrics to become stable. All the other components installed are 3.6; only the heapster rc was updated to use the 3.5 image and scaled.
Comment 14 Matt Wringe 2017-07-25 14:03:26 EDT
The 3.5 version of Cassandra was also tried and it still had the same effect.

At this point it looks like a change in Hawkular Metrics has caused this regression.

We did make some changes which could degrade performance, but this was a required change to prevent metric tags data from getting lost. We are investigating whether or not this is the cause of this issue.
Comment 15 Vikas Laad 2017-07-26 21:41:37 EDT
This is not a regression. I deployed metrics a few times on 3.5 and compared with the 3.6 performance. I do not see any difference; it takes almost the same time, 24 secs, to start loading metrics. This is still a performance issue and needs to be fixed.
Comment 16 Vikas Laad 2017-07-26 21:48:02 EDT
(In reply to Vikas Laad from comment #15)
correction: 24 mins to start loading metrics.
Comment 17 Michael Burman 2017-07-27 05:54:45 EDT
There's an improvement for this in Heapster's master version. I'll make a backport for our current Heapster version as well and submit it to origin-metrics for testing.
Comment 18 Matt Wringe 2017-07-27 07:46:14 EDT
(In reply to Vikas Laad from comment #16)
> (In reply to Vikas Laad from comment #15)
> correction: 24 mins to start loading metrics.

You got my hopes up :)
Comment 20 Michael Burman 2017-08-06 09:46:39 EDT
This is expected behavior though, as we have to refresh all the tags in Hawkular-Metrics when the restart happens (the initial caching only covers the _system tenant).

The only solutions are a) Heapster should request metric definitions for all active projects, or b) speed up tag updates on the Hawkular-Metrics side. Option a) requires that we would somehow know all the active projects in Heapster, or that we maintain yet another cache of projects seen and request the definitions from Hawkular-Metrics when a project is seen for the first time (a rough sketch of that idea is below). I'm not sure the latter would speed up the loading that much, though.
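
A rough sketch of the option a) idea (this is not actual Heapster code, which is Go; all names here are hypothetical): remember which projects have been seen and pull definitions from Hawkular-Metrics only the first time a project shows up, instead of refreshing every tenant's tags at startup.

class DefinitionCache(object):
    def __init__(self, fetch_definitions):
        # fetch_definitions(project) -> {metric_id: tags}, e.g. backed by a
        # GET /hawkular/metrics/metrics call with the project as the tenant
        self._fetch = fetch_definitions
        self._seen = {}  # project -> {metric_id: tags}

    def definitions_for(self, project):
        if project not in self._seen:
            # first sighting of this project: one bulk read instead of a
            # full refresh of all tenants up front
            self._seen[project] = self._fetch(project)
        return self._seen[project]

    def needs_tag_update(self, project, metric_id, tags):
        cached = self.definitions_for(project).get(metric_id)
        return cached != tags  # only write tags back to Hawkular when they drift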
Comment 21 Vikas Laad 2017-08-22 10:06:37 EDT
With the latest image provided in bz https://bugzilla.redhat.com/show_bug.cgi?id=1465532 this time has been reduced to 16 mins for 15K pods.
