Red Hat Bugzilla – Bug 1474099
30+ minutes for metrics to re-stabilize after heapster restart @ 15K pods
Last modified: 2017-11-17 06:53:18 EST
Description of problem:
- metrics are stable at 15K pods - no push failures and no missing metrics.
Retrieving metrics via the Hawkular REST API and there are no empty buckets
- stop heapster, wait a few minutes and verify the most recent buckets are empty
- start heapster. Watch heapster logs for push failures and check metrics for missing data
It takes 23 minutes for any data to push to Hawkular->Cassandra again and 30 minutes before pushes completely stabilize and no holes appear in metrics data
Version-Release number of selected component (if applicable): Metrics container version 3.6.152
How reproducible: Always when restarting heapster with a large number of pods. I have not determined the threshold
Steps to Reproduce:
1. Deploy metrics
2. Start 15K pods and verify metrics are being collected for all pods - stable metrics with no errors in the heapster logs
3. Stop heapster, verify metrics are no longer collected
4. Start heapster, watch the heapster error logs for " Failed to push data to sink: Hawkular-Metrics Sink" to stop occurring
5. Watch the OpenShift UI or Hawkular REST API for metrics collection to successfully restart
30 + minutes for metrics to be collected in a stable fashion
Heapster collects metrics and pushes them to Hawkular + Cassandra within 1 or 2 intervals of restarting.
Additional info: Heapster logs attached. Let me know what further info is required. The configuration is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1465532
Created attachment 1303352 [details]
heapster log after restart with 15K pods running on 100 nodes
Heapster will try and and update the metric definitions when the server starts, this means reading metric definitions from Hawkular Metrics and potentially writing back updates.
Is this something new in 3.6 or something we also experienced in 3.5?
Were any other updates made to the default configurations for Heapster or Hawkular Metrics? Could you please attach the output of 'oc get pods -n openshift-infra -o yaml'
@miburman: any thoughts here?
Not sure about 3.5 - we'd have to set up an environment to test that.
The only Heapster configuration modification is to remove the memory limit.
I hit an instance today where heapster seems never to reconnect. Re-testing that now. Let me know if there is any extra logging or other info you want to see.
Going to try deploying 3.5 in this cluster and attempt to reproduce.
Also, restoring the needinfo for miburman that I inadvertently cleared (@miburman see comment 2)
There's probably too big backlog of updates to the definitions that can't be handled (or takes too much time in other words). This is the same effect as when there's too many new pods. We haven't made changes to it between 3.5 and 3.6 in Heapster side, so the behavior should be the same in 3.5. That is, if we would see the same behavior in normal operations.
Correction comment 3 where I said it never reconnects. That is false - it did eventually pick up and started pushing metrics -> hawkular->Cassandra
I deployed metrics 3.5 in the same cluster in the exact same state (15K running pods) and metrics started showing up within 2 minutes and by 12 minutes after heapster start all 5 of the 2-minute buckets I was requesting had good data.
I re-deployed 3.6 and saw the same behavior - no metrics data for an extended period of time after restarting heapster. Understand the frustration with nothing
changing in those areas between releases - reporting what's happening.
We have reproducers of both the good/3.5 and reported behaviors on this 100 node AWS cluster. Let us know if we can get any additional data.
Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5 Hawkular-Metrics would be very interesting.
Maybe the problem is somewhere else in the Heapster and not in anywhere we're currently looking at (we've been concentrating on the sink<->Hawkular-Metrics integration, but perhaps that's not the biggest pain).
Would the compression job startup right after the pods are restarted as well?
(In reply to Michael Burman from comment #8)
> Can we run a mixture of 3.5 & 3.6 images? If that's possible, a combination
> of 3.5 Heapster + 3.6 Hawkular-Metrics and 3.6 Heapster + 3.5
> Hawkular-Metrics would be very interesting.
> Maybe the problem is somewhere else in the Heapster and not in anywhere
> we're currently looking at (we've been concentrating on the
> sink<->Hawkular-Metrics integration, but perhaps that's not the biggest
I am going to try this, will update the bug.
(In reply to Matt Wringe from comment #9)
> Would the compression job startup right after the pods are restarted as well?
Compression is disabled on this cluster.
(In reply to Michael Burman from comment #8)
Just tried with 3.5 version of heapster, result is same. It took around 32 mins for metrics to get stable.
(In reply to Vikas Laad from comment #12)
Just tried with 3.5 version of heapster, result is same. It took around 32 mins for metrics to get stable. All the other components installed are 3.6 only heapster rc was updated to use 3.5 image and scaled.
The 3.5 version of Cassandra was also tried and it still had the same effect.
At this point it looking like its a change in Hawkular Metrics which has caused this regression.
We did make some changes which could cause performance to degrade, but this was a required change to prevent metric tags data from getting lost. We are investigating if this is cause or not to this issue.
This is not a regression, I deployed metrics few times on 3.5 and compared with 3.6 performance. I do not see any difference, its almost same time 24 secs to start loading metrics. This is still a performance issue and needs to be fixed.
(In reply to Vikas Laad from comment #15)
correction: 24 mins to start loading metrics.
There's an improvement in the Heapster's master version for this. I'll make a backport for our current Heapster version also and submit to origin-metrics for testing.
(In reply to Vikas Laad from comment #16)
> (In reply to Vikas Laad from comment #15)
> correction: 24 mins to start loading metrics.
You got my hopes up :)
This is expected behavior though, as we have to refresh all the tags in the Hawkular-Metrics when the restart happens (the initial caching only takes care of the _system tenant caching).
The only solutions are a) Heapster should request metric definitions for all active projects, b) speedup tags updates on the Hawkular-Metrics side. The a) requires that we would somehow know all the active projects in Heapster / we would maintain yet another cache of projects seen and request caches from Hawkular-Metrics when seen first time. I'm not sure if the latter would speed up the loading that much though.
With the latest image provided im bz https://bugzilla.redhat.com/show_bug.cgi?id=1465532 this time has been reduced to 16 mins for 15K pods.
Re-tested this scenario with 22500 pods
registry.ops.openshift.com/openshift3/metrics-cassandra v3.7.0-0.158.0 e64c5c6107ca
registry.ops.openshift.com/openshift3/metrics-hawkular-metrics v3.7.0-0.158.0 35127125228a
registry.ops.openshift.com/openshift3/metrics-heapster v3.7.0-0.158.0 3e2f71da3b69
No limits on any of the RCs. Neither cpu nor memory.
1. Scaled up to 22,500 pods - scale up went great. No dropout in the metrics on the way up at all. bug 1465532 can be marked verified.
2. Took pbench data for 10 minutes of steady state
3. Stopped heapster
4. Waited 5 minutes to capture 5 minutes of steady state with heapster down
5. Started heapster
At this point, watch the heapster logs for manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink and use the SVT metrics verification tool to watch for empty buckets (-10 minute start time, 2 minute duration). https://github.com/openshift/svt/blob/master/logging_metrics_performance/hawkular_metrics/scripts/metrics_checker.py
After heapster restarted, it took ~45 minutes for the first metrics to start to be populated and 53 minutes for the push failure messages to disappear completely.
In the heapster logs (all pod logs attached) you will see node not ready messages mixed in. Some nodes were flapping NotReady due to bug 1483182 and bug 1451902/bug 1486914).
Let me know if there is a higher logging level which might help. Shutting this env down for today but I can keep it around for a few more days.
I will link the pbench data in a private comment.
Created attachment 1339862 [details]
cassandra, hawkular, heapster logs for 3.7.0-0.158.0
Bad mention in comment 26. First linked bug causing NotReady nodes should be bug 1493182
For this many pods, I would highly suggest running more than just 1 Hawkular Metrics and 1 Cassandra instance :)
I think this is why its taking so long for those error messages to go away in Heapster, we are overloading the system way too much with just a single instance of Hawkular Metrics and Cassandra.
I would start running 2 of each and see how that affects things.
There is also another property that we might also need to tweak on the Heapster side which is how many concurrent connections it makes to Hawkular Metrics.
We have 22.5K pods, do we know how many projects there are?
metrics components are running on m4.2xlarge instances with no cpu or memory limits. 8 vCPU and 32GB available per infra node.
Can you try setting the 'concurrencylimit' for the Hawkular sink parameter in Heapster?
In the Heapster RC there should be a line under the container's command which starts with "--sink:hawkular:https:....." at the end of this line can you add "&concurrencyLimit=32"
Then restart Heapster and see if that brings down the wait time that you have? We might also want to test with 64 and 128 to see if higher values give us better results.
The reason why its taking so long for Heapster to stop showing errors about writing to the Hawkular Sink is that when Heapster is restarted its sending both the metric definitions (eg metadata) and the metric data to Hawkular, and this is taking a long time.
The low cpu usage for the components is probably due to it not being CPU bound, but most likely IO bound.
I would first try setting the concurrencylimits for Heapster to see if that works, but I still think we should run with 2 Hawkular Metrics and Cassandra instances to see if that also helps to improve things.
I will re-test today with different concurrencyLimit values suggested in comment 35 on a deployment with 2 Cassandra and 2 Hawkular.
Ran the scenario with concurrencyLimit 32 and 2 Hawkular/2 Cassandra pods and there was no change. Still ~45 minutes for first metrics and 58 minutes for the "push failed" messages to go away. The pbench data shows the same pattern of low cpu for Heapster until the metrics fully start flowing.
Heapster is red: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-14-208.us-west-2.compute.internal/pidstat/cpu_usage.html
I checked iostat for the Cassandra nodes and there is definitely nothing crazy going on there. IOPS are under 100 until the metrics start fully flowing.
The base for all pbench data is http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/
1 cassandra and 1 hawkular here: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/
1 cassandra here: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/
heapster and 1 hawkular here: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-15-244/20171019_25K_start_heapster_a/tools-default/ip-172-31-62-104.us-west-2.compute.internal/pidstat/
I will attach the logs for all pods
Created attachment 1340983 [details]
Logs with 2 hawkular, 2 cassandra and concurrencyLimit 32
@micke: any idea if maybe Heapster is blocking waiting for all the metric definitions to be written first? Does the metric definitions part use the same Hawkular client as the metric data?
Yes, it uses the same connection pool and same Heapster path. So Heapster will wait until the client has written all those definitions (and that's where the errors will come from as that takes a while).
@micke: Do we have to do it this way? I think that probably needs to change. We shouldn't have to wait for all the metric definitions to be uploaded before we can start to push out metric data. Can we not have two separate clients to do this?
Then it wouldn't ensure that all the data is searchable either and we had a ticket of that kind of behavior too.
Caching all the tenants would be a more performant option as then we wouldn't do unnecessary updates. I don't know if Heapster has access to the /tenants endpoint however in the current configurations.
(In reply to Michael Burman from comment #43)
> Then it wouldn't ensure that all the data is searchable either and we had a
> ticket of that kind of behavior too.
Lets say we have 20k+ pods and we are installing things the first time.
We have a choice of either waiting 30+ minutes before any metric data is collected by the system. Or waiting longer than 30+ minutes for some queries to work, but we should have metrics data since the beginning.
What happens in the case where its not a new installation and its just Heapster which has been restarted? So now we have a 30+ minute gap in metric collections instead of having potentially losing one or two metric collection intervals.
That is a huge difference.
> Caching all the tenants would be a more performant option as then we
> wouldn't do unnecessary updates. I don't know if Heapster has access to the
> /tenants endpoint however in the current configurations.
It does have access to the /tenants endpoint, its always had access to this endpoint.
Are we still writing single metric definitions in one REST call? or has it been updated to push those out in a batch operation?
I am still a bit surprised why increasing the concurrency setting of Heapster didn't help here. I would have expected that to have had a big change.
@mifiedle can we try with something like 100 or 150 to see if that makes any difference? The 512 probably put too much load on Cassandra and is why we were seeing the 500 errors.
If changing the concurrency setting doesn't help, then what else would be the bottleneck here?
There is no multi-tags-update interface in origin-metrics-3.7. Theoretically speaking, it is updating the metric data also to the Hawkular-Metrics instance, but Heapster does not understand that (and tries to reupload stuff).
Even a dual connection would be quite silly as it could spawn so many goroutines that it might make Heapster a lot of resources in the end.
I would assume the best idea is to fetch all tenants and for every tenant, fetch all metrics with descriptor_tag and then cache them correctly. It should take far less than 30 mins to do that and then we wouldn't need to bomb the metrics too much after that.
And we could push metrics to return all of this in a single call if we go and change metrics too - this way it would probably take a minute or so at most for Heapster to start.
(In reply to Michael Burman from comment #46)
> There is no multi-tags-update interface in origin-metrics-3.7. Theoretically
> speaking, it is updating the metric data also to the Hawkular-Metrics
> instance, but Heapster does not understand that (and tries to reupload
> Even a dual connection would be quite silly as it could spawn so many
> goroutines that it might make Heapster a lot of resources in the end.
I don't follow. So we have two connections, one for metric definitions and one for metric data. And lets say the concurrent connections are capped at 5 and 10 each respectively.
How does running two of them spawn too many goroutines while running one is just fine?
> I would assume the best idea is to fetch all tenants and for every tenant,
> fetch all metrics with descriptor_tag and then cache them correctly. It
> should take far less than 30 mins to do that and then we wouldn't need to
> bomb the metrics too much after that.
> And we could push metrics to return all of this in a single call if we go
> and change metrics too - this way it would probably take a minute or so at
> most for Heapster to start.
@mifiedle: would it be possible to start up 2 more Cassandra instances and see if this makes a difference?
Matt, because we couldn't block the Heapster processing. That means we would spawn each collection interval a new goroutine. And inside that goroutine, we would have to spawn a goroutine for each tags processing, because we wouldn't want to block the original goroutine that needs to process each datapoint also - and our cache wouldn't be up yet so that we could quickly check if processing is necessary.
It makes far more sense to just improve the caching procedure.
I tried with a concurrencyLimit of 100 and it did not speed things up. Still roughly 1 hour for the push errors to go away. I will see if I can try 4 cassandras today if you still think that would make a difference
I'll try to provide you with a image today if I make it before I have to go. It should have "alternate" way of doing things between Heapster & Hawkular-Metrics.
(In reply to Michael Burman from comment #49)
> Matt, because we couldn't block the Heapster processing. That means we would
> spawn each collection interval a new goroutine. And inside that goroutine,
> we would have to spawn a goroutine for each tags processing, because we
> wouldn't want to block the original goroutine that needs to process each
> datapoint also - and our cache wouldn't be up yet so that we could quickly
> check if processing is necessary.
Could you not just have one go routine responsible for handling the metric definitions and give that goroutine its own Hawkular client (with its own concurrency setting)
The metric definition goroutine would have a queue where we push metric definitions.
So for each metric interval, if the metric definition has not already been handled, we push it to the queue instead of using the main Hawkular client to do the update (which can block other metric data from being updated)
> It makes far more sense to just improve the caching procedure.
My above example is just a quick thought on this. I don't understand what the best option would be here. If its better to handle the caching differently, then lets go with that route.
Please keep us updated with the progress as we have the testing environment available to us now which we might not have available in the near future.
We also need to be aware that we will probably have to backport any changes here to some of the older versions.
(In reply to Michael Burman from comment #51)
> I'll try to provide you with a image today if I make it before I have to go.
> It should have "alternate" way of doing things between Heapster &
You can try now with burmanm/origin-metrics-heapster:1.3.0-4 from Docker Hub
Tested with the heapster image from comment 54
docker.io/burmanm/origin-metrics-heapster 1.3.0-4 33aedfd31e20 10 hours ago 819.9 MB
There was some improvement for restarting heapster with ~20K pods. In comment 37 with concurrencyLimit=32, it was 45 minutes for first data and 58 minutes for push failures to go away.
This time there were some push failures on the scale up which I had not seen previously in 3.7 (see logs). Not sure if this is expected - did not see holes in the metrics via regular random sample.
On heapster restart it was ~35 minutes for first data and 47 minutes for the push failure messages to stop.
Heapster logs pre and post restart attached.
Created attachment 1343479 [details]
heapster logs pre and post restart
I need to add more logging to see what is happening, I'll create a new image today. I was hoping those failed to push wouldn't be visible as the sink shouldn't be ready yet. Maybe I forgot to add a lock somewhere that I missed in the unit tests.
@micke going from 45 minutes to 35 minutes is an improvement, is the PR you attached all that you were planning? or was that just an initial step and we should be expecting more improvements?
There was a compilation failure in my origin-metrics building, so actually my changes weren't even there.
I've submitted v1.3.0-5 to Mike, which should have more changes (+ logging to indicate how many tags we're updating). I would expect a slight reduction in Heapster memory usage & faster startup time.
Previous faster time must have been an anomaly. In any case, first run with new image from this morning
docker.io/burmanm/origin-metrics-heapster v1.3.0-5 465e42d9ec0f
is definitely much better. 26 minutes for metrics to flow and the push error stops at the same time. One difference I can see is that metrics seem to start back up much quicker (i.e. shorter elapsed time until there are no empty buckets for my "previous 10 minutes" sampling interval).
- run it multiple times with pbench to confirm the timing and compare memory usage to previous runs.
- my recent runs have been 20K pods, but this bz was originally reported on 15K pods. I will get a timing at that level to compare to the original report which was 23 minutes for initial metrics and 30 for errors to go away.
1st heapster restart after scale up - 26 minutes for metrics
2nd restart (no other changes) - 15 minutes
3rd restart (no other changes) - 14 minutes
Deleted 5K pods to get down the level this bz originally opened on
1st heapster restart - 11 minutes for metrics
2nd restart - 9 minutes
Will add pbench data in a private comment. Heapster log attached.
Created attachment 1343927 [details]
heapster logs, latest patch (1.3.0-5)
I confirmed with additional runs that the CPU is roughly 2x with the patch. 1 core steady state for heapster for 20K pods before and 2 cores after.
I was unable to get pprof working against heapster running in a pod in OpenShift
I disabled TLS and allowed_users in the replication config - before that I couldn't get to the heapster service at all.
After that, I can get to the pprof URL and heap works (sort of, it only sees overall usage, not broken out by object
root@ip-172-31-15-244: ~/svt/openshift_scalability # go tool pprof http://172.24.34.114:8082/debug/pprof/heap
Fetching profile from http://172.24.34.114:8082/debug/pprof/heap
Saved profile in /root/pprof/pprof.172.24.34.114:8082.inuse_objects.inuse_space.001.pb.gz
Entering interactive mode (type "help" for commands)
2.61GB of 2.61GB total ( 100%)
flat flat% sum% cum cum%
2.61GB 100% 100% 2.61GB 100%
Trying to profile cpu gives me a 500 with nothing corresponding to it in the heapster log:
root@ip-172-31-15-244: ~ # go tool pprof http://172.24.34.114:8082/debug/pprof/profile
Fetching profile from http://172.24.34.114:8082/debug/pprof/profile
Please wait... (30s)
server response: 500 Internal Server Error
If anyone knows how to get this working, I can try again. Otherwise - maybe try to replicate locally.
(In reply to Mike Fiedler from comment #66)
> I confirmed with additional runs that the CPU is roughly 2x with the patch.
> 1 core steady state for heapster for 20K pods before and 2 cores after.
> I was unable to get pprof working against heapster running in a pod in
> I disabled TLS and allowed_users in the replication config - before that I
> couldn't get to the heapster service at all.
Heapster is configured to be only accessible via the api proxy.
You should be able to access its endpoint by doing something like:
curl -k -H "Authorization: Bearer `oc whoami -t`" -X GET https://localhost:8443/api/v1/proxy/namespaces/openshift-infra/services/https:heapster:/
But if disabling it is easier for the tests, then it shouldn't cause any harm.
> After that, I can get to the pprof URL and heap works (sort of, it only sees
> overall usage, not broken out by object
> root@ip-172-31-15-244: ~/svt/openshift_scalability # go tool pprof
> Fetching profile from http://172.24.34.114:8082/debug/pprof/heap
> Saved profile in
> Entering interactive mode (type "help" for commands)
> (pprof) top20
> 2.61GB of 2.61GB total ( 100%)
> flat flat% sum% cum cum%
> 2.61GB 100% 100% 2.61GB 100%
> Trying to profile cpu gives me a 500 with nothing corresponding to it in the
> heapster log:
> root@ip-172-31-15-244: ~ # go tool pprof
> Fetching profile from http://172.24.34.114:8082/debug/pprof/profile
> Please wait... (30s)
> server response: 500 Internal Server Error
> If anyone knows how to get this working, I can try again. Otherwise - maybe
> try to replicate locally.
I have created some usage graphs from my benchmark-application, I'll try to work with that and provide an image with hopefully some optimizations (I have couple of ideas - but I don't know if they'll transfer to real world as well as in the benchmark application).
As far as I can tell from my CPU profiling, it's the "recent"/"caching" checkings that take most of the time in our sink.
That is, 43% of time was spent in "registerLabeledIfNecessary", splitted by 10,6% of total time was taken by createDefinitionFromModel and 28,9% by hash method.
In the hash method, we spend ~50% of the time to counter the "one great Golang feature", which is randomized map iteration order, thus we need to resort them.
Note, these are only times from our sink - nothing from inside Heapster is evaluated here.
Rest of time is spent in stuff like JSON encoding, which could be sped up by replacing Golang's JSON implementations with something better.
I'll spend more time on Monday to think of something to get this time down.
There's now docker.io/burmanm/origin-metrics-heapster:v1.3.0-7 uploaded with a modified hash function.
I haven't extensively tested the correctness (so that we don't have incorrect caching), only preliminary - but it should work a lot faster.
Thew new patch helps some, but probably not enough. Here is a comparison of 15 minutes of steady-state collection (no heapster restart, no scale up/down or project changes) for 3.7.0-0.158.0, the v1.3.0-5 patch and the new v1.3.0-7 patch:
avg heapster cpu% avt heapster rss GB
3.7.0-0.158.0 138 5.7
v1.3.0-5 212 4.4
v1.3.0-7 181 3.9
Inbound new images.. this time the Heapster will require new Hawkular-Metrics image also.
* Faster restart time
* CPU usage is slightly lower (it will not reach the 18.104.22.168-158.0 but it should be lower than -7)
And here's an update to Cassandra also (if you wish to run this version):
Trying to run the hawkular-metrics from comment 73 fails - there's a deployment failure for hawkular-metrics-openshift-dist-0.29.0-SNAPSHOT.ear.
Search the attached log for WFLYCTL0013 - there are a few different errors starting there, including an IOException for a missing mount point. Maybe need a newer RC/Volume definition?
docker.io/burmanm/origin-metrics-hawkular-metrics v3.7.0-rc.0-1 58bed5a105d4 5 days ago 909.2 MB
Created attachment 1351675 [details]
hawkular-metrics pod log - deployment failure for ear
Although mentioned in the IRC, here is the new metrics version:
I tested the new images (comment 73, comment 74, comment 77) in a smaller cluster (20 nodes, 5000 pods). I ran a comparison of the new images against 3.7.7. Below is the summary, will link the detailed data in a followup comment.
#1 heapster cpu and rss for 20 minutes of steady state data collection for 5K pods on 20 nodes:
3.7.7 new heapster
avg cpu % 21.9% 31.7%
rss in GB 1.44 1.0
3.7.7 heapster spikes to ~65% during collection and new heapster spikes to ~90%
#2 Heapster restart - how long it takes for push failures to go away an metrics to flow again (i.e. the subject of this bz)
3.7.7: 13 minutes 30 seconds
new img: 4 minutes