Bug 1474099 - 30+ minutes for metrics to re-stabilize after heapster restart @ 15K pods
30+ minutes for metrics to re-stabilize after heapster restart @ 15K pods
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics (Show other bugs)
3.6.0
x86_64 Linux
unspecified Severity high
: ---
: 3.7.z
Assigned To: Stefan Negrea
Vikas Laad
aos-scalability-37
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-23 17:13 EDT by Mike Fiedler
Modified: 2017-11-17 06:53 EST (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
heapster log after restart with 15K pods running on 100 nodes (7.13 KB, text/plain)
2017-07-23 17:15 EDT, Mike Fiedler
no flags Details
cassandra, hawkular, heapster logs for 3.7.0-0.158.0 (39.64 KB, application/x-gzip)
2017-10-17 16:15 EDT, Mike Fiedler
no flags Details
Logs with 2 hawkular, 2 cassandra and concurrencyLimit 32 (95.71 KB, application/x-gzip)
2017-10-19 18:34 EDT, Mike Fiedler
no flags Details
heapster logs pre and post restart (2.38 KB, application/x-gzip)
2017-10-25 22:14 EDT, Mike Fiedler
no flags Details
heapster logs, latest patch (1.3.0-5) (4.28 KB, text/plain)
2017-10-26 14:23 EDT, Mike Fiedler
no flags Details
hawkular-metrics pod log - deployment failure for ear (38.63 KB, text/plain)
2017-11-13 15:13 EST, Mike Fiedler
no flags Details

  None (edit)
Description Mike Fiedler 2017-07-23 17:13:46 EDT
Description of problem:

- metrics are stable at 15K pods - no push failures and no missing metrics.   
Retrieving metrics via the Hawkular REST API and there are no empty buckets

- stop heapster, wait a few minutes and verify the most recent buckets are empty

- start heapster.   Watch heapster logs for push failures and check metrics for missing data

It takes 23 minutes for any data to push to Hawkular->Cassandra again and 30 minutes before pushes completely stabilize and no holes appear in metrics data


Version-Release number of selected component (if applicable):  Metrics container version 3.6.152


How reproducible:  Always when restarting heapster with a large number of pods.  I have not determined the threshold


Steps to Reproduce:
1.  Deploy metrics
2.  Start 15K pods and verify metrics are being collected for all pods - stable metrics with no errors in the heapster logs
3.  Stop heapster, verify metrics are no longer collected
4.  Start heapster, watch the heapster error logs for " Failed to push data to sink: Hawkular-Metrics Sink" to stop occurring
5.  Watch the OpenShift UI or Hawkular REST API for metrics collection to successfully restart

Actual results:

30 + minutes for metrics to be collected in a stable fashion

Expected results:

Heapster collects metrics and pushes them to Hawkular + Cassandra within 1 or 2 intervals of restarting.

Additional info:   Heapster logs attached.  Let me know what further info is required.   The configuration is the same as  https://bugzilla.redhat.com/show_bug.cgi?id=1465532
Comment 1 Mike Fiedler 2017-07-23 17:15 EDT
Created attachment 1303352 [details]
heapster log after restart with 15K pods running on 100 nodes
Comment 2 Matt Wringe 2017-07-24 12:30:28 EDT
Heapster will try and and update the metric definitions when the server starts, this means reading metric definitions from Hawkular Metrics and potentially writing back updates.

Is this something new in 3.6 or something we also experienced in 3.5?

Were any other updates made to the default configurations for Heapster or Hawkular Metrics? Could you please attach the output of 'oc get pods -n openshift-infra -o yaml'

@miburman: any thoughts here?
Comment 3 Mike Fiedler 2017-07-24 12:37:12 EDT
Not sure about 3.5 - we'd have to set up an environment to test that.

The only Heapster configuration modification is to remove the memory limit.

I hit an instance today where heapster seems never to reconnect.   Re-testing that now.  Let me know if there is any extra logging or other info you want to see.
Comment 4 Mike Fiedler 2017-07-24 12:44:41 EDT
Going to try deploying 3.5 in this cluster and attempt to reproduce.

Also, restoring the needinfo for miburman that I inadvertently cleared (@miburman see comment 2)
Comment 5 Michael Burman 2017-07-24 12:52:00 EDT
There's probably too big backlog of updates to the definitions that can't be handled (or takes too much time in other words). This is the same effect as when there's too many new pods. We haven't made changes to it between 3.5 and 3.6 in Heapster side, so the behavior should be the same in 3.5. That is, if we would see the same behavior in normal operations.
Comment 6 Mike Fiedler 2017-07-24 13:00:11 EDT
Correction comment 3 where I said it never reconnects.   That is false - it did eventually pick up and started pushing metrics -> hawkular->Cassandra
Comment 7 Mike Fiedler 2017-07-24 13:54:36 EDT
I deployed metrics 3.5 in the same cluster in the exact same state (15K running pods) and metrics started showing up within 2 minutes and by 12 minutes after heapster start all 5 of the 2-minute buckets I was requesting had good data.

I re-deployed 3.6 and saw the same behavior - no metrics data for an extended period of time after restarting heapster.  Understand the frustration with nothing 
changing in those areas between releases - reporting what's happening.   

We have reproducers of both the good/3.5 and reported behaviors on this 100 node AWS cluster.  Let us know if we can get any additional data.
Comment 8 Michael Burman 2017-07-24 14:50:14 EDT
Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5 Hawkular-Metrics would be very interesting.

Maybe the problem is somewhere else in the Heapster and not in anywhere we're currently looking at (we've been concentrating on the sink<->Hawkular-Metrics integration, but perhaps that's not the biggest pain).
Comment 9 Matt Wringe 2017-07-24 14:52:17 EDT
Would the compression job startup right after the pods are restarted as well?
Comment 10 Vikas Laad 2017-07-24 15:01:15 EDT
(In reply to Michael Burman from comment #8)
> Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination
> of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5
> Hawkular-Metrics would be very interesting.
> 
> Maybe the problem is somewhere else in the Heapster and not in anywhere
> we're currently looking at (we've been concentrating on the
> sink<->Hawkular-Metrics integration, but perhaps that's not the biggest
> pain).

I am going to try this, will update the bug.
Comment 11 Vikas Laad 2017-07-24 15:09:01 EDT
(In reply to Matt Wringe from comment #9)
> Would the compression job startup right after the pods are restarted as well?

Compression is disabled on this cluster.
Comment 12 Vikas Laad 2017-07-24 16:42:22 EDT
(In reply to Michael Burman from comment #8)

Just tried with 3.5 version of heapster, result is same. It took around 32 mins for metrics to get stable.
Comment 13 Vikas Laad 2017-07-24 16:43:37 EDT
(In reply to Vikas Laad from comment #12)

Just tried with 3.5 version of heapster, result is same. It took around 32 mins for metrics to get stable. All the other components installed are 3.6 only heapster rc was updated to use 3.5 image and scaled.
Comment 14 Matt Wringe 2017-07-25 14:03:26 EDT
The 3.5 version of Cassandra was also tried and it still had the same effect.

At this point it looking like its a change in Hawkular Metrics which has caused this regression.

We did make some changes which could cause performance to degrade, but this was a required change to prevent metric tags data from getting lost. We are investigating if this is cause or not to this issue.
Comment 15 Vikas Laad 2017-07-26 21:41:37 EDT
This is not a regression, I deployed metrics few times on 3.5 and compared with 3.6 performance. I do not see any difference, its almost same time 24 secs to start loading metrics. This is still a performance issue and needs to be fixed.
Comment 16 Vikas Laad 2017-07-26 21:48:02 EDT
(In reply to Vikas Laad from comment #15)
correction: 24 mins to start loading metrics.
Comment 17 Michael Burman 2017-07-27 05:54:45 EDT
There's an improvement in the Heapster's master version for this. I'll make a backport for our current Heapster version also and submit to origin-metrics for testing.
Comment 18 Matt Wringe 2017-07-27 07:46:14 EDT
(In reply to Vikas Laad from comment #16)
> (In reply to Vikas Laad from comment #15)
> correction: 24 mins to start loading metrics.

You got my hopes up :)
Comment 20 Michael Burman 2017-08-06 09:46:39 EDT
This is expected behavior though, as we have to refresh all the tags in the Hawkular-Metrics when the restart happens (the initial caching only takes care of the _system tenant caching).

The only solutions are a) Heapster should request metric definitions for all active projects, b) speedup tags updates on the Hawkular-Metrics side. The a) requires that we would somehow know all the active projects in Heapster / we would maintain yet another cache of projects seen and request caches from Hawkular-Metrics when seen first time. I'm not sure if the latter would speed up the loading that much though.
Comment 21 Vikas Laad 2017-08-22 10:06:37 EDT
With the latest image provided im bz https://bugzilla.redhat.com/show_bug.cgi?id=1465532 this time has been reduced to 16 mins for 15K pods.
Comment 26 Mike Fiedler 2017-10-17 16:13:59 EDT
Re-tested this scenario with 22500 pods

OCP:  v3.7.0-0.143.2

Logging:

registry.ops.openshift.com/openshift3/metrics-cassandra           v3.7.0-0.158.0      e64c5c6107ca

registry.ops.openshift.com/openshift3/metrics-hawkular-metrics    v3.7.0-0.158.0      35127125228a 

registry.ops.openshift.com/openshift3/metrics-heapster            v3.7.0-0.158.0      3e2f71da3b69 

No limits on any of the RCs.  Neither cpu nor memory.

1. Scaled up to 22,500 pods - scale up went great.   No dropout in the metrics on the way up at all.   bug 1465532 can be marked verified.
2. Took pbench data for 10 minutes of steady state
3. Stopped heapster
4. Waited 5 minutes to capture 5 minutes of steady state with heapster down
5. Started heapster

At this point, watch the heapster logs for manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink  and use the SVT metrics verification tool to watch for empty buckets (-10 minute start time, 2 minute duration).   https://github.com/openshift/svt/blob/master/logging_metrics_performance/hawkular_metrics/scripts/metrics_checker.py

After heapster restarted, it took ~45 minutes for the first metrics to start to be populated and 53 minutes for the push failure messages to disappear completely.

In the heapster logs (all pod logs attached) you will see node not ready messages mixed in.   Some nodes were flapping NotReady due to bug 1483182 and bug 1451902/bug 1486914).

Let me know if there is a higher logging level which might help.  Shutting this env down for today but I can keep it around for a few more days.

I will link the pbench data in a private comment.
Comment 27 Mike Fiedler 2017-10-17 16:15 EDT
Created attachment 1339862 [details]
cassandra, hawkular, heapster logs for 3.7.0-0.158.0
Comment 28 Mike Fiedler 2017-10-17 16:18:14 EDT
Bad mention in comment 26.  First linked bug causing NotReady nodes should be bug 1493182
Comment 29 Matt Wringe 2017-10-17 16:21:55 EDT
For this many pods, I would highly suggest running more than just 1 Hawkular Metrics and 1 Cassandra instance :)

I think this is why its taking so long for those error messages to go away in Heapster, we are overloading the system way too much with just a single instance of Hawkular Metrics and Cassandra.

I would start running 2 of each and see how that affects things.

There is also another property that we might also need to tweak on the Heapster side which is how many concurrent connections it makes to Hawkular Metrics.

We have 22.5K pods, do we know how many projects there are?
Comment 31 Mike Fiedler 2017-10-17 20:37:52 EDT
metrics components are running on m4.2xlarge instances with no cpu or memory limits.  8 vCPU and 32GB available per infra node.
Comment 35 Matt Wringe 2017-10-18 09:52:46 EDT
Can you try setting the 'concurrencylimit' for the Hawkular sink parameter in Heapster?

In the Heapster RC there should be a line under the container's command which starts with "--sink:hawkular:https:....." at the end of this line can you add "&concurrencyLimit=32"

Then restart Heapster and see if that brings down the wait time that you have? We might also want to test with 64 and 128 to see if higher values give us better results.

The reason why its taking so long for Heapster to stop showing errors about writing to the Hawkular Sink is that when Heapster is restarted its sending both the metric definitions (eg metadata) and the metric data to Hawkular, and this is taking a long time.

The low cpu usage for the components is probably due to it not being CPU bound, but most likely IO bound.

I would first try setting the concurrencylimits for Heapster to see if that works, but I still think we should run with 2 Hawkular Metrics and Cassandra instances to see if that also helps to improve things.
Comment 36 Mike Fiedler 2017-10-18 09:55:50 EDT
I will re-test today with different concurrencyLimit values suggested in comment 35 on a deployment with 2 Cassandra and 2 Hawkular.
Comment 37 Mike Fiedler 2017-10-19 18:33:37 EDT
Ran the scenario with concurrencyLimit 32 and 2 Hawkular/2 Cassandra pods and there was no change.   Still ~45 minutes for first metrics and 58 minutes for the "push failed" messages to go away.   The pbench data shows the same pattern of low cpu for Heapster until the metrics fully start flowing.

Heapster is red:  http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-14-208.us-west-2.compute.internal/pidstat/cpu_usage.html

I checked iostat for the Cassandra nodes and there is definitely nothing crazy going on there.   IOPS are under 100 until the metrics start fully flowing.

http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/iostat/disk.html


The base for all pbench data is http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/

1 cassandra and 1 hawkular here:  http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/

1 cassandra here:  http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/

heapster and 1 hawkular here:  http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/

I will attach the logs for all pods
Comment 38 Mike Fiedler 2017-10-19 18:34 EDT
Created attachment 1340983 [details]
Logs with 2 hawkular, 2 cassandra and concurrencyLimit 32
Comment 40 Matt Wringe 2017-10-20 09:11:22 EDT
@micke: any idea if maybe Heapster is blocking waiting for all the metric definitions to be written first? Does the metric definitions part use the same Hawkular client as the metric data?
Comment 41 Michael Burman 2017-10-23 03:19:54 EDT
Yes, it uses the same connection pool and same Heapster path. So Heapster will wait until the client has written all those definitions (and that's where the errors will come from as that takes a while).
Comment 42 Matt Wringe 2017-10-23 09:25:44 EDT
@micke: Do we have to do it this way? I think that probably needs to change. We shouldn't have to wait for all the metric definitions to be uploaded before we can start to push out metric data. Can we not have two separate clients to do this?
Comment 43 Michael Burman 2017-10-23 10:06:07 EDT
Then it wouldn't ensure that all the data is searchable either and we had a ticket of that kind of behavior too.

Caching all the tenants would be a more performant option as then we wouldn't do unnecessary updates. I don't know if Heapster has access to the /tenants endpoint however in the current configurations.
Comment 44 Matt Wringe 2017-10-23 10:29:17 EDT
(In reply to Michael Burman from comment #43)
> Then it wouldn't ensure that all the data is searchable either and we had a
> ticket of that kind of behavior too.

Lets say we have 20k+ pods and we are installing things the first time.

We have a choice of either waiting 30+ minutes before any metric data is collected by the system. Or waiting longer than 30+ minutes for some queries to work, but we should have metrics data since the beginning.

What happens in the case where its not a new installation and its just Heapster which has been restarted? So now we have a 30+ minute gap in metric collections instead of having potentially losing one or two metric collection intervals.

That is a huge difference.

> Caching all the tenants would be a more performant option as then we
> wouldn't do unnecessary updates. I don't know if Heapster has access to the
> /tenants endpoint however in the current configurations.

It does have access to the /tenants endpoint, its always had access to this endpoint.

Are we still writing single metric definitions in one REST call? or has it been updated to push those out in a batch operation?
Comment 45 Matt Wringe 2017-10-23 10:33:21 EDT
I am still a bit surprised why increasing the concurrency setting of Heapster didn't help here. I would have expected that to have had a big change.

@mifiedle can we try with something like 100 or 150 to see if that makes any difference? The 512 probably put too much load on Cassandra and is why we were seeing the 500 errors.

If changing the concurrency setting doesn't help, then what else would be the bottleneck here?
Comment 46 Michael Burman 2017-10-24 07:05:56 EDT
There is no multi-tags-update interface in origin-metrics-3.7. Theoretically speaking, it is updating the metric data also to the Hawkular-Metrics instance, but Heapster does not understand that (and tries to reupload stuff).

Even a dual connection would be quite silly as it could spawn so many goroutines that it might make Heapster a lot of resources in the end.

I would assume the best idea is to fetch all tenants and for every tenant, fetch all metrics with descriptor_tag and then cache them correctly. It should take far less than 30 mins to do that and then we wouldn't need to bomb the metrics too much after that.

And we could push metrics to return all of this in a single call if we go and change metrics too - this way it would probably take a minute or so at most for Heapster to start.
Comment 47 Matt Wringe 2017-10-24 09:17:55 EDT
(In reply to Michael Burman from comment #46)
> There is no multi-tags-update interface in origin-metrics-3.7. Theoretically
> speaking, it is updating the metric data also to the Hawkular-Metrics
> instance, but Heapster does not understand that (and tries to reupload
> stuff).
> 
> Even a dual connection would be quite silly as it could spawn so many
> goroutines that it might make Heapster a lot of resources in the end.

I don't follow. So we have two connections, one for metric definitions and one for metric data. And lets say the concurrent connections are capped at 5 and 10 each respectively.

How does running two of them spawn too many goroutines while running one is just fine?

> I would assume the best idea is to fetch all tenants and for every tenant,
> fetch all metrics with descriptor_tag and then cache them correctly. It
> should take far less than 30 mins to do that and then we wouldn't need to
> bomb the metrics too much after that.
> 
> And we could push metrics to return all of this in a single call if we go
> and change metrics too - this way it would probably take a minute or so at
> most for Heapster to start.
Comment 48 Matt Wringe 2017-10-24 10:16:45 EDT
@mifiedle: would it be possible to start up 2 more Cassandra instances and see if this makes a difference?
Comment 49 Michael Burman 2017-10-24 10:45:40 EDT
Matt, because we couldn't block the Heapster processing. That means we would spawn each collection interval a new goroutine. And inside that goroutine, we would have to spawn a goroutine for each tags processing, because we wouldn't want to block the original goroutine that needs to process each datapoint also - and our cache wouldn't be up yet so that we could quickly check if processing is necessary.

It makes far more sense to just improve the caching procedure.
Comment 50 Mike Fiedler 2017-10-25 07:38:17 EDT
I tried with a concurrencyLimit of 100 and it did not speed things up.   Still roughly 1 hour for the push errors to go away.   I will see if I can try 4 cassandras today if you still think that would make a difference
Comment 51 Michael Burman 2017-10-25 10:16:38 EDT
I'll try to provide you with a image today if I make it before I have to go. It should have "alternate" way of doing things between Heapster & Hawkular-Metrics.
Comment 52 Matt Wringe 2017-10-25 10:23:08 EDT
(In reply to Michael Burman from comment #49)
> Matt, because we couldn't block the Heapster processing. That means we would
> spawn each collection interval a new goroutine. And inside that goroutine,
> we would have to spawn a goroutine for each tags processing, because we
> wouldn't want to block the original goroutine that needs to process each
> datapoint also - and our cache wouldn't be up yet so that we could quickly
> check if processing is necessary.

Could you not just have one go routine responsible for handling the metric definitions and give that goroutine its own Hawkular client (with its own concurrency setting)

The metric definition goroutine would have a queue where we push metric definitions.

So for each metric interval, if the metric definition has not already been handled, we push it to the queue instead of using the main Hawkular client to do the update (which can block other metric data from being updated)

> 
> It makes far more sense to just improve the caching procedure.

My above example is just a quick thought on this. I don't understand what the best option would be here. If its better to handle the caching differently, then lets go with that route.

Please keep us updated with the progress as we have the testing environment available to us now which we might not have available in the near future.

We also need to be aware that we will probably have to backport any changes here to some of the older versions.
Comment 53 Matt Wringe 2017-10-25 10:23:27 EDT
(In reply to Michael Burman from comment #51)
> I'll try to provide you with a image today if I make it before I have to go.
> It should have "alternate" way of doing things between Heapster &
> Hawkular-Metrics.

awesome, thanks
Comment 54 Michael Burman 2017-10-25 15:48:25 EDT
You can try now with burmanm/origin-metrics-heapster:1.3.0-4 from Docker Hub
Comment 55 Mike Fiedler 2017-10-25 22:12:25 EDT
Tested with the heapster image from comment 54

docker.io/burmanm/origin-metrics-heapster                         1.3.0-4             33aedfd31e20        10 hours ago        819.9 MB

There was some improvement for restarting heapster with ~20K pods.  In comment 37 with concurrencyLimit=32, it was 45 minutes for first data and 58 minutes for push failures to go away.

This time there were some push failures on the scale up which I had not seen previously in 3.7 (see logs).   Not sure if this is expected - did not see holes in the metrics via regular random sample.

On heapster restart it was ~35 minutes for first data and 47 minutes for the push failure messages to stop.  

Heapster logs pre and post restart attached.
Comment 56 Mike Fiedler 2017-10-25 22:14 EDT
Created attachment 1343479 [details]
heapster logs pre and post restart
Comment 57 Michael Burman 2017-10-26 06:13:22 EDT
I need to add more logging to see what is happening, I'll create a new image today. I was hoping those failed to push wouldn't be visible as the sink shouldn't be ready yet. Maybe I forgot to add a lock somewhere that I missed in the unit tests.
Comment 58 Matt Wringe 2017-10-26 08:55:32 EDT
@micke going from 45 minutes to 35 minutes is an improvement, is the PR you attached all that you were planning? or was that just an initial step and we should be expecting more improvements?
Comment 59 Michael Burman 2017-10-26 10:33:34 EDT
There was a compilation failure in my origin-metrics building, so actually my changes weren't even there.

I've submitted v1.3.0-5 to Mike, which should have more changes (+ logging to indicate how many tags we're updating). I would expect a slight reduction in Heapster memory usage & faster startup time.
Comment 60 Mike Fiedler 2017-10-26 10:50:12 EDT
Previous faster time must have been an anomaly.   In any case, first run with new image from this morning

docker.io/burmanm/origin-metrics-heapster                         v1.3.0-5            465e42d9ec0f

is definitely much better.  26 minutes for metrics to flow and the push error stops at the same time.   One difference I can see is that metrics seem to start back up much quicker (i.e. shorter elapsed time until there are no empty buckets for my "previous 10 minutes" sampling interval).

Next steps:

- run it multiple times with pbench to confirm the timing and compare memory usage to previous runs.

- my recent runs have been 20K pods, but this bz was originally reported on 15K pods.   I will get a timing at that level to compare to the original report which was 23 minutes for initial metrics and 30 for errors to go away.
Comment 61 Mike Fiedler 2017-10-26 14:21:43 EDT
20K pods

1st heapster restart after scale up - 26 minutes for metrics
2nd restart (no other changes) - 15 minutes
3rd restart (no other changes) - 14 minutes

Deleted 5K pods to get down the level this bz originally opened on

1st heapster restart - 11 minutes for metrics
2nd restart - 9 minutes


Will add pbench data in a private comment.   Heapster log attached.
Comment 62 Mike Fiedler 2017-10-26 14:23 EDT
Created attachment 1343927 [details]
heapster logs, latest patch (1.3.0-5)
Comment 66 Mike Fiedler 2017-10-27 13:21:48 EDT
I confirmed with additional runs that the CPU is roughly 2x with the patch.   1 core steady state for heapster for 20K pods before and 2 cores after.   


I was unable to get pprof working against heapster running in a pod in OpenShift

I disabled TLS and allowed_users in the replication config - before that I couldn't get to the heapster service at all.

After that, I can get to the pprof URL and heap works (sort of, it only sees overall usage, not broken out by object

root@ip-172-31-15-244: ~/svt/openshift_scalability # go tool pprof  http://172.24.34.114:8082/debug/pprof/heap                              
Fetching profile from http://172.24.34.114:8082/debug/pprof/heap
Saved profile in /root/pprof/pprof.172.24.34.114:8082.inuse_objects.inuse_space.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top20
2.61GB of 2.61GB total (  100%)
      flat  flat%   sum%        cum   cum%
    2.61GB   100%   100%     2.61GB   100%  


Trying to profile cpu gives me a 500 with nothing corresponding to it in the heapster log:

root@ip-172-31-15-244: ~ # go tool pprof http://172.24.34.114:8082/debug/pprof/profile                                                      
Fetching profile from http://172.24.34.114:8082/debug/pprof/profile
Please wait... (30s)
server response: 500 Internal Server Error

If anyone knows how to get this working, I can try again.  Otherwise - maybe try to replicate locally.
Comment 67 Matt Wringe 2017-10-27 13:44:24 EDT
(In reply to Mike Fiedler from comment #66)
> I confirmed with additional runs that the CPU is roughly 2x with the patch. 
> 1 core steady state for heapster for 20K pods before and 2 cores after.   
> 
> 
> I was unable to get pprof working against heapster running in a pod in
> OpenShift
> 
> I disabled TLS and allowed_users in the replication config - before that I
> couldn't get to the heapster service at all.

Heapster is configured to be only accessible via the api proxy.

You should be able to access its endpoint by doing something like:

curl -k -H "Authorization: Bearer `oc whoami -t`" -X GET https://localhost:8443/api/v1/proxy/namespaces/openshift-infra/services/https:heapster:/

But if disabling it is easier for the tests, then it shouldn't cause any harm.

> 
> After that, I can get to the pprof URL and heap works (sort of, it only sees
> overall usage, not broken out by object
> 
> root@ip-172-31-15-244: ~/svt/openshift_scalability # go tool pprof 
> http://172.24.34.114:8082/debug/pprof/heap                              
> Fetching profile from http://172.24.34.114:8082/debug/pprof/heap
> Saved profile in
> /root/pprof/pprof.172.24.34.114:8082.inuse_objects.inuse_space.001.pb.gz
> Entering interactive mode (type "help" for commands)
> (pprof) top20
> 2.61GB of 2.61GB total (  100%)
>       flat  flat%   sum%        cum   cum%
>     2.61GB   100%   100%     2.61GB   100%  
> 
> 
> Trying to profile cpu gives me a 500 with nothing corresponding to it in the
> heapster log:
> 
> root@ip-172-31-15-244: ~ # go tool pprof
> http://172.24.34.114:8082/debug/pprof/profile                               
> 
> Fetching profile from http://172.24.34.114:8082/debug/pprof/profile
> Please wait... (30s)
> server response: 500 Internal Server Error
> 
> If anyone knows how to get this working, I can try again.  Otherwise - maybe
> try to replicate locally.
Comment 68 Michael Burman 2017-10-28 05:48:21 EDT
I have created some usage graphs from my benchmark-application, I'll try to work with that and provide an image with hopefully some optimizations (I have couple of ideas - but I don't know if they'll transfer to real world as well as in the benchmark application).
Comment 69 Michael Burman 2017-10-28 07:33:24 EDT
As far as I can tell from my CPU profiling, it's the "recent"/"caching" checkings that take most of the time in our sink.

That is, 43% of time was spent in "registerLabeledIfNecessary", splitted by 10,6% of total time was taken by createDefinitionFromModel and 28,9% by hash method. 

In the hash method, we spend ~50% of the time to counter the "one great Golang feature", which is randomized map iteration order, thus we need to resort them.

Note, these are only times from our sink - nothing from inside Heapster is evaluated here.

Rest of time is spent in stuff like JSON encoding, which could be sped up by replacing Golang's JSON implementations with something better. 

I'll spend more time on Monday to think of something to get this time down.
Comment 70 Michael Burman 2017-10-28 10:59:56 EDT
There's now docker.io/burmanm/origin-metrics-heapster:v1.3.0-7 uploaded with a modified hash function.

I haven't extensively tested the correctness (so that we don't have incorrect caching), only preliminary - but it should work a lot faster.
Comment 71 Mike Fiedler 2017-10-30 13:22:55 EDT
Thew new patch helps some, but probably not enough.   Here is a comparison of 15 minutes of steady-state collection (no heapster restart, no scale up/down or project changes) for 3.7.0-0.158.0,  the v1.3.0-5 patch and the new v1.3.0-7 patch:

                        avg heapster cpu%     avt heapster rss GB
3.7.0-0.158.0                  138                 5.7
v1.3.0-5                       212                 4.4
v1.3.0-7                       181                 3.9
Comment 73 Michael Burman 2017-11-08 08:42:45 EST
Inbound new images.. this time the Heapster will require new Hawkular-Metrics image also.

burmanm/origin-metrics-hawkular-metrics:v3.7.0-rc.0-1
burmanm/origin-metrics-heapster:v1.3.0-8

Expected results:

 * Faster restart time
 * CPU usage is slightly lower (it will not reach the 3.7.0.0-158.0 but it should be lower than -7)
Comment 74 Michael Burman 2017-11-08 09:02:05 EST
And here's an update to Cassandra also (if you wish to run this version):

burmanm/origin-metrics-cassandra:v3.0.15
Comment 75 Mike Fiedler 2017-11-13 15:12:58 EST
Trying to run the hawkular-metrics from comment 73 fails - there's a deployment failure for hawkular-metrics-openshift-dist-0.29.0-SNAPSHOT.ear.

Search the attached log for WFLYCTL0013 - there are a few different errors starting there, including an IOException for a missing mount point.   Maybe need a newer RC/Volume definition?

docker.io/burmanm/origin-metrics-hawkular-metrics                         v3.7.0-rc.0-1       58bed5a105d4        5 days ago          909.2 MB
Comment 76 Mike Fiedler 2017-11-13 15:13 EST
Created attachment 1351675 [details]
hawkular-metrics pod log  - deployment failure for ear
Comment 77 Michael Burman 2017-11-14 12:23:54 EST
Although mentioned in the IRC, here is the new metrics version:

burmanm/origin-metrics-hawkular-metrics:v3.7.0-rc.0-2
Comment 78 Mike Fiedler 2017-11-15 15:09:22 EST
I tested the new images (comment 73, comment 74, comment 77) in a smaller cluster (20 nodes, 5000 pods).   I ran a comparison of the new images against 3.7.7.   Below is the summary, will link the detailed data in a followup comment.

2 tests:

#1 heapster cpu and rss for 20 minutes of steady state data collection for 5K pods on 20 nodes:
                   3.7.7         new heapster
avg cpu %          21.9%           31.7%   
rss in GB          1.44             1.0

3.7.7 heapster spikes to ~65% during collection and new heapster spikes to ~90%

#2 Heapster restart - how long it takes for push failures to go away an metrics to flow again (i.e. the subject of this bz)

3.7.7:   13 minutes 30 seconds
new img: 4 minutes

Note You need to log in before you can comment on or make changes to this bug.