Created attachment 1292371 [details]
Heapster (loglevel 5), Hawkular and Cassandra pod logs.

Description of problem:

- single cassandra pod (memory limit 4GB)
- cassandra storage is a non-dynamic 395 GB cinder pv/pvc
- single hawkular-metrics pod (memory limit 4GB) - see inventory below
- HA - 3 masters, 2 infra, 100 compute nodes
- metrics is 3.6.122

In OCP 3.5 we could reliably reach 20-25K pods on 100 compute nodes in a cluster without dropping any metrics. Beyond that, 3.5 would start seeing empty buckets in the metric data.

In OCP 3.6, after we pass 3.5K-4K pods started, the following message starts showing up in the heapster logs:

manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink

But there does not seem to be a corresponding error in the hawkular-metrics log. I turned heapster logging up to 5 (attached) but there is nothing to indicate what the actual push error is.

There are some Cassandra timeout messages in the hawkular logs (attached), but they do not correspond 1:1 to the heapster push errors. Example:

Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

The Cassandra logs (attached) are clean - just the usual statistics and gc chatter.

The problem symptom is that the heapster logs and random checks of metric data show success and good data up until ~3.5-4K pods are created on 100 nodes. After that point we start seeing hawkular push failures in a pattern like this:

W0627 13:23:27.254891 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:23:57.298169 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:24:27.370221 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:24:57.286841 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:25:27.324231 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:25:57.726255 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:26:20.557520 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:26:54.871042 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
W0627 13:27:27.352292 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:27:37.318129 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:28:10.550017 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:28:44.764977 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:29:18.995184 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:29:53.684293 1 manager.go:116] Data push completed: Hawkular-Metrics Sink
W0627 13:30:27.293908 1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:30:37.356895 1 manager.go:116] Data push completed: Hawkular-Metrics Sink

Additionally, random checks of bucket data (2 minute duration, 5 intervals checked) show patterns of empty buckets.

Version-Release number of selected component (if applicable):

metrics 3.6.122:

registry.ops.openshift.com/openshift3/metrics-heapster          v3.6.122   f3e12da8e7e3   6 days ago   274.4 MB
registry.ops.openshift.com/openshift3/metrics-cassandra         v3.6.122   5f1061b30752   6 days ago   572.8 MB
registry.ops.openshift.com/openshift3/metrics-hawkular-metrics  v3.6.122   992323342eea   6 days ago   1.293 GB

How reproducible:
Always, when scaling up to 5K pods

Steps to Reproduce:
1. Create an HA cluster with 100 compute nodes capable of running 25K total hello-openshift pods
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) to create 250 projects with 20 pods each - 5K total pods.
3. As cluster-loader runs, monitor the heapster logs and randomly check the metrics for pods

Actual results:
Around the 4K pod mark, errors will pop up in the Heapster log:

Failed to push data to sink: Hawkular-Metrics Sink

Hawkular logs will show a message that does not correlate 1:1 with the heapster errors:

Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

Spot checks of metrics data show empty buckets.

Expected results:
The scalability numbers from OCP 3.5 can be reached without loss of data: 25K hello-openshift pods running on 100 compute nodes.

Additional info:

Inventory:

[oo_first_master]
192.1.0.8

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0
openshift_metrics_install_metrics=true
openshift_metrics_hawkular_hostname=hawkular-metrics.0615-yzo.qe.rhcloud.com
openshift_metrics_project=openshift-infra
openshift_metrics_image_prefix=registry.ops.openshift.com/openshift3/
openshift_metrics_image_version=v3.6.122
openshift_metrics_cassandra_replicas=1
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_storage_type=pv
openshift_metrics_cassandra_pvc_size=395Gi
This is a regression going from 25K pods to 4K pods. Shouldn't it be targeted for 3.6?

Marking as TestBlocker for metrics scalability testing.
(In reply to Mike Fiedler from comment #1)
> This is a regression going from 25K pods to 4K pods. Shouldn't it be
> targeted for 3.6?
> 
> Marking as TestBlocker for metrics scalability testing.

I reset the target to 3.6. I was going through some of the changes since 3.5 to see what may have caused the regression. How easily/quickly could you test again with an updated hawkular-metrics?
Mike,

You can retest with the current images and set the following environment variables for the hawkular-metrics pod:

METRICS_EXPIRATION_JOB_ENABLED=false
COMPRESSION_JOB_ENABLED=false

I do not think that these background jobs would cause this sort of regression, but it cannot hurt to rule them out.
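For reference, one way to set these on the deployed metrics components is with oc set env (a sketch, assuming the default openshift-infra project and the hawkular-metrics replication controller created by openshift-ansible):

# set the variables on the hawkular-metrics replication controller
oc set env rc/hawkular-metrics -n openshift-infra \
    METRICS_EXPIRATION_JOB_ENABLED=false \
    COMPRESSION_JOB_ENABLED=false

The running hawkular-metrics pod then needs to be recreated (scaled down/up or deleted) to pick up the new environment, since the replication controller will not restart it on its own.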
The pod creation rate for this test is:

- 20 pods per namespace
- 3 namespaces at a time (batches of 60 pods at a time, which is < 1 pod/node)
- wait for all pods to go fully into the Running state (takes about 15 seconds)
- move on to the next batch of 3 namespaces

This is a fairly typical pod creation rate for our scalability tests. If there is a suspicion that the rate is a factor, I can try slowing things down.

The test is run with cluster-loader (https://github.com/openshift/svt/blob/master/openshift_scalability/README.md) using the input configuration below. Translated, it is "Create 250 projects, each with 20 hello-openshift pods (in pod-default.json)."

I run it with:

./cluster-loader.py -p 3 -f <filename>.yaml

-p 3 means run 3 simultaneous threads, so for each burst 3 x 20 = 60 pods are created and come fully Running before the next set of three. Hope that makes sense.

projects:
  - num: 250
    basename: svt-1-
    tuning: default
    pods:
      - total: 20
      - num: 100
        basename: hello
        file: content/pod-default.json
        storage:
          - type: none
quotas:
  - name: default
    file: default
tuningsets:
  - name: default
    pods:
      stepping:
        stepsize: 50
        pause: 90 s
      rate_limit:
        delay: 0 ms
I believe this is caused by the fact that the image is shipped with the rate limiter on (max concurrency is set to 1). Due to changes between 3.5 and 3.6, the tag updates will take slightly longer and will block everything else behind the limiter. Increasing the concurrency to 4 or 8 should increase the throughput, as it will no longer block the datapoint writes. But the tag updates will still happen with a concurrency of just one.

That's because when we created the change, we also made changes to the Heapster sink to make it work better with the updated procedure, but the Heapster changes were never merged (they're still waiting to be merged). For that reason, the shipped Heapster sink works inefficiently with the current Hawkular-Metrics. The Heapster PR is #1666 and the important change is in the mutex handling of the "registerLabeledIfNecessary" function.

If we need to "work around" this Heapster sink issue in Hawkular-Metrics, then a background job updating the tags is pretty much the only way to fix it (so that Heapster gets a 200 response back instantly). That way we could also increase the concurrency of the sink for tag updates. This change would have other implications, however.

This limitation only becomes apparent when lots of pods are launched at the same time - in a steady state (without an increasing number of pods) the concurrency setting should do the job, since it also functions correctly for datapoints in the older Heapster version.
Do you have details on what I need to do to change max concurrency? I don't see it exposed currently in our deployment config and a quick search of the hawkular-metrics docs didn't find it.
Well, the current concurrency setting is 5 (and not 1 like I said originally - but 1 is the effective concurrency for tag updates as long as the PR isn't merged). It's not exposed in the configuration, unfortunately; you would need to modify Heapster's starting parameters directly to change the value (it's a sink setting). It seems the API for modifying a running sink was removed at some point.
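For reference, the value comes from the upstream Heapster Hawkular sink's concurrencyLimit option, which is passed as a query parameter on the --sink URL. A rough sketch of the change (the exact sink URL in the deployed heapster rc will differ; only the added parameter matters here):

oc edit rc/heapster -n openshift-infra

# then append the parameter to the existing Hawkular sink argument, e.g.
--sink=hawkular:https://hawkular-metrics:443?...&concurrencyLimit=8

The heapster pod has to be recreated afterwards for the replication controller change to take effect.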
I still see this in the official 3.6.136 image:

registry.ops.openshift.com/openshift3/metrics-heapster   v3.6.136   a4b52c5e2110   7 hours ago   274.4 MB

Does the hash match what you pushed? Moving to MODIFIED until an image is available in the OCP puddles.
Tried verifying with the latest image out of brew:

brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster   latest   5549c67d8607   42 hours ago   274.4 MB

and I still see the issue. Specifically, while scaling up from 5K to 10K pods, I start seeing

Failed to push data to sink: Hawkular-Metrics Sink

in the heapster logs, and "holes"/empty buckets start showing up in the data. The issue seems to go away when the cluster sits idle and stabilizes. I see the failures mainly during periods of adding pods.

Attaching the heapster log. The end of the log is when I was seeing consecutive push failures. Note there are some nodes down/NotReady in this cluster.
Created attachment 1296635 [details] Heapster logs running latest image
@miburman: can you please take a look?
Mike, Can you test again using images from origin-metrics? We know for certain that the latest origin-metrics images have the heapster changes discussed earlier in the ticket.
I switched all replication controllers (hawkular-cassandra-1, hawkular-metrics, heapster) to use the latest images from origin:

root@svt-j-1: ~/mffiedler/logs # ssh 172.16.0.7 docker images | grep origin
docker.io/openshift/origin-metrics-cassandra          latest   797dbea9e930   7 days ago   769.7 MB
docker.io/openshift/origin-metrics-hawkular-metrics   latest   3147880c9a30   7 days ago   886.1 MB
docker.io/openshift/origin-metrics-heapster           latest   6649a39ef5c1   7 days ago   768.7 MB

I still see the error in the heapster logs - in fact this time I saw it even with the cluster at rest. I did not use a full new deployment from origin - I can't do that in this big environment. I hope it was OK to run with the origin images using the same general configuration laid down by openshift-ansible.

Logs attached. If I need to do it from scratch with the origin deployer, that will probably have to wait until next week - we only have 1 day left in this big cluster.
Created attachment 1297800 [details] All metrics logs running origin images
I need to push out a new Heapster image to docker hub which should now hopefully have the patch required.
A newer Heapster image has been pushed out to docker hub - can you see if that helps? The previous one didn't have the patch properly applied.
Testing the new heapster:

root@svt-j-1: ~/mffiedler/logs # ssh 172.16.0.18 docker images | grep heapster
docker.io/openshift/origin-metrics-heapster   latest   2ff34583737a   2 hours ago   819.9 MB
I tested with the new heapster from comment 25 and comment 26. I am now running :latest origin from dockerhub for cassandra, hawkular-metrics and heapster. I had problems running the origin heapster with the OCP hawkular, so I switched to using only origin containers.

The symptoms are the same, though: everything runs fine up through approximately the creation of 5K pods, and then I start seeing the Hawkular-Metrics Sink push failures and holes in the metrics. The test creates pods in batches of 60, waits for them all to go into Running state and then moves on to a new batch of 60.

In the attached logs, there are a couple of failures right after heapster starts and then none until 15:06, which was approximately the 5K pod mark. At that point there start to be more failures than successes. Around 15:20 I stopped creating new pods. At 15:28 there is a mix of successful and failed pushes, and around 15:35 we are mostly back to successful pushes.

This is a ~450 node cluster which we are losing access to. I will try to come up with a reproducer in AWS or GCE.
Created attachment 1298433 [details] heapster logs showing full run including re-stabilization
Do we know what exact setup was used before and what the new one is? I am wondering whether we are still running with the same resources and configuration or not.

What is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1465532#c0 does not match what the logs are saying. For instance, Cassandra has access to less than 2GB, while the description says it has 4GB.
That original cluster is gone. I will match the current cluster to the specs in the original description and run the test again using the origin images and capture all 3 logs.
We need to make sure that the setup being used today is comparable to the old tests, which had much better results. I believe the old setup needed additional configuration changes to get it to scale as high as it did.

Can you install the 3.5 images in this setup (they should just work on OCP 3.6) and see if you see the same regression? We need to confirm that it's not an environment issue and then narrow down which of the metrics components caused the regression.
Set hawkular and cassandra limits to 4GB. Note: during this run heapster and hawkular-metrics both restarted once. Logs attached - you can see the push failures throughout. Next up: will try on 3.5 images if I can get them running before the cluster goes away today. Otherwise A/B testing will have to be in AWS next week.
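For reference, a possible way to apply those limits to the existing replication controllers (a sketch, assuming the default openshift-infra project and the rc names used by openshift-ansible; the pods need to be recreated for the change to take effect):

oc set resources rc/hawkular-metrics -n openshift-infra --limits=memory=4Gi
oc set resources rc/hawkular-cassandra-1 -n openshift-infra --limits=memory=4Gi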
Created attachment 1298557 [details] hawkular and cassandra memory limits at 4GB. cpu is unlimited
I retested on the same cluster with metrics 3.5-018 deployed using the 3.5 openshift-ansible playbook - same configuration and same workload. Testing on 3.5, there are no holes in the metrics data. I'm up over 10K pods and seeing 0 empty buckets. My tool continuously checks the most recent 5 x 2 minute buckets for randomly selected pods that have been up more than 10 minutes. I also verified random pods in the web UI.

For 3.5, I see no holes or empty buckets. From the web console, the metrics graphs are unbroken and continuous. Heapster on 3.5 does not log as much - I could not see messages related to pushing to hawkular. Metrics have no need to recover - at this level they are not missing.

For 3.6, I start seeing many holes, and for some pods all buckets are empty. From the web console, graphs are broken lines or altogether empty. The heapster logs show many push failures to hawkular. Metrics seem to recover when the cluster idles.

I collected pbench (system monitoring) data for both the 3.5 and 3.6 runs and will share it when I've processed it. I think this test shows there is definitely a difference in 3.6.

Sample output for 3.5:

Test of metrics for: svt-b-42.pausepods5 found empty: 0, not empty: 5
Test of metrics for: svt-b-63.pausepods4 found empty: 0, not empty: 5
Test of metrics for: svt-b-72.pausepods17 found empty: 0, not empty: 5
Test of metrics for: svt-b-63.pausepods3 found empty: 0, not empty: 5
Test of metrics for: svt-b-32.pausepods16 found empty: 0, not empty: 5
Test of metrics for: svt-b-20.pausepods7 found empty: 0, not empty: 5
Test of metrics for: svt-b-58.pausepods8 found empty: 0, not empty: 5
Test of metrics for: svt-b-46.pausepods17 found empty: 0, not empty: 5
Test of metrics for: svt-b-2.pausepods3 found empty: 0, not empty: 5
Test of metrics for: svt-b-105.pausepods10 found empty: 0, not empty: 5
Test of metrics for: svt-b-13.pausepods11 found empty: 0, not empty: 5
Test of metrics for: svt-b-70.pausepods11 found empty: 0, not empty: 5
Test of metrics for: svt-b-16.pausepods18 found empty: 0, not empty: 5
Test of metrics for: svt-b-102.pausepods17 found empty: 0, not empty: 5

For 3.6 there will be various non-0 values in the "found empty" field.
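For anyone wanting to reproduce the spot checks, this is roughly the kind of query such a check boils down to against the Hawkular Metrics REST API (a sketch only - the route hostname, tenant/project, and tag values below are illustrative and not taken from the actual tool). Buckets with no samples come back flagged as "empty": true.

# last 10 minutes, 5 buckets, for one pod's memory/usage gauge
END=$(($(date +%s) * 1000))
START=$((END - 10*60*1000))
curl -k -H "Authorization: Bearer $(oc whoami -t)" \
     -H "Hawkular-Tenant: svt-b-42" \
     "https://hawkular-metrics.example.com/hawkular/metrics/gauges/data?tags=pod_name:pausepods5,descriptor_name:memory/usage&buckets=5&start=$START&end=$END"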
Pbench data is available but inconclusive due to other activity in the cluster. Next step is to get a true A/B comparison of scaling up to 6000 pods for 3.6 and 3.5 on AWS
I set up a 20 node, 10,000 pod cluster in AWS and was able to reproduce the issue with 3.6.141 but I was not able to reproduce it with origin. In the 450 node cluster I no longer have access to, I could reproduce with origin. I am going to make one more run tonight with origin images and then will consult with you on what step we want to take next. If there are test images you want me to try please add links here or send directly.
I've been unable to reproduce this issue on a 20 node/10K pod cluster on EC2. 3.6.152 metrics is stable all of the way up to 10K pods. The missing dimension from the previous environments is the number of compute nodes. I am going to scale the current 20 node environment to 100 nodes and make one more attempt at reproducing this, and also at collecting metrics for 25K pods - the goal for 3.6. This cluster will be short-lived, though, to save on expense.

pidstat and sar data from running both 3.5 and 3.6 in the 20 node/10K pod environment show:

- significant cpu improvement in 3.6 for hawkular-metrics
- modest memory improvement in 3.6 for hawkular-metrics
- network utilization improvement in 3.6 for hawkular-metrics
- unchanged cpu and network for heapster in 3.6
- modest cpu improvement in 3.6 for cassandra
- unchanged memory for cassandra
- modest network improvement in 3.6 for cassandra
- modest increase in IOPS in 3.6 for cassandra

I'll link the data internally in a follow-up comment.
I hit this issue on EC2 - it required higher scale, though. I hit it around 16,500 pods created. This time Hawkular restarted, triggering a heapster restart, after which heapster could not push data to hawkular. I scaled hawkular and heapster down to 0 and back up - both come back to Ready state, but the heapster logs still have the push error. Logs attached.

Config changes: Hawkular memory raised to 4GB, Cassandra memory raised to 4GB and heapster memory raised to 6GB. I'll check the performance data and see if the hawkular issue might have been OOM. But even after both restarted and came to Ready state, the errors in the heapster log (and missing metrics data) continued.

Is there any way to get details of what is failing with the push?
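For reference, the scale down/up of the metrics components can be done with oc scale, e.g. (a sketch, assuming the default rc names and project):

oc scale rc/heapster -n openshift-infra --replicas=0
oc scale rc/hawkular-metrics -n openshift-infra --replicas=0
# wait for the pods to terminate, then
oc scale rc/hawkular-metrics -n openshift-infra --replicas=1
oc scale rc/heapster -n openshift-infra --replicas=1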
Created attachment 1301996 [details] July 20 logs from ec2 - failure started around 16K pods
The log is lost when the pod restarts. Let me extract all docker messages (which will include pod messages) from journald
Created attachment 1302092 [details]
docker logs from journald with pod logging

I tried to grep some docker spam out, but there will still be other docker messages mixed in.
The compression job is supposed to run every two hours, but if it falls behind schedule, it will run more frequently until it catches up. It looks like at some points it is running continuously until the container is killed. At this point I think it would be worth disabling the job. It can be disabled by setting this environment variable: COMPRESSION_JOB_ENABLED=false.
Scaling up to 25K pods was going well until around 20K pods, when heapster was OOMKilled (with a 6GB limit). I'm taking the limits off of heapster and hawkular for now to measure actual usage at this scale.

More concerning, once heapster was restarted it began getting the push failures and never recovered. I tried restarting both hawkular and heapster and it never started pushing again. What could put heapster into a permanently failing state? The hawkular logs did not indicate errors.

Plan: re-run with no memory limits on hawkular or heapster and COMPRESSION_JOB_ENABLED=false.
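For reference, one way to drop the memory limits from the existing replication controllers is a JSON patch (a sketch, assuming a single container at index 0 and the default rc names/project; the pods need to be recreated afterwards):

oc patch rc/heapster -n openshift-infra --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'
oc patch rc/hawkular-metrics -n openshift-infra --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'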
The combination of COMPRESSION_JOB_ENABLED=false and removing the memory limits for hawkular and heapster allowed metrics on the EC2 cluster to scale stably to 25K pods. From the performance data, COMPRESSION_JOB_ENABLED=false freed up enough cpu on the hawkular node to avoid readiness-failure restarts, and removing the memory limits prevented OOMKills of both hawkular and heapster. I'll report 3.5 vs 3.6 memory usage for cassandra/hawkular/heapster during the 25K pod run separately.

Were there changes in the compression area for 3.6? That setting was not required in 3.5 scale tests. What is the next step for this bz with respect to that?

Additionally, as mentioned in comment 47, I did see an issue where once heapster restarted it could never push successfully again. Working on a reproducer for that for a separate bz.

I'll link the internal perf data in a follow-up comment.
I have been checking and not seeing anything. There is a job scheduler component that runs the compression job, and I was wondering if that might be part of the problem. Looking through git history, it looks like the last changes for the job scheduler as well as for the compression job went into hawkular-metrics 0.23.4, which is what should be used in the latest OCP 3.5 release.

Mickie, were there any changes to the compression code between 3.5 and 3.6?
@Mike, removing the limits is unstable, no? In environments where hawkular and heapster have competing memory resource demands, removing the limits will actually allow them to be OOM killed for other reasons, I believe. It seems like we need to find out what the actual memory needs are and request that as the minimum. We have the same problem on the logging side.

Adding Eric Paris to comment.
When a node runs low on actual free RAM, pods with no limit set are likely to be the first pods OOM'd; these are called BestEffort pods. Pods with a limit larger than their request by some integer multiple (I'd have to look up the multiple, but it is something like 4) will be OOM'd next; these are called Burstable pods. Pods with a very large multiple between request and limit go in the BestEffort bucket. Pods with request==limit will go last; these are called Guaranteed pods.

Removing limits entirely will make things "work" for some period of time. But it means that if the system comes under any memory pressure, you will be the first thing killed. Actually knowing exactly what your pods need (while hard) makes it very unlikely they will get killed. Picking a reasonable min/max range is likely the right thing to do now, if possible.

The logging team should actively look at how much RAM the pods are using in production and tune those numbers over time.
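A minimal illustration of those three classes as container resource stanzas (the values are arbitrary and for illustration only):

# Guaranteed: requests == limits, killed last under memory pressure
resources:
  requests:
    memory: 4Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "1"

# Burstable: limits > requests; a very large limit/request ratio is treated like BestEffort
resources:
  requests:
    memory: 1Gi
  limits:
    memory: 4Gi

# BestEffort: no requests or limits at all - first to be OOM killed
resources: {}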
The purpose of removing the limits was to determine actual usage for this workload, not to recommend removing them in production.

Scale up to 20K pods with:

Cassandra - 3.3 GB peak RSS, regular 3.5 core cpu spikes and max 220 IOPS (4.5 cores in 3.5 - memory comparison n/a)
Hawkular - 1.9 GB peak RSS, regular 2 core cpu spikes and rare 3.5 core cpu spikes (2.5 cores and 2.5 GB RSS in 3.5)
Heapster - 7.2 GB peak RSS, regular 2 core cpu spikes (1 core and 7GB in 3.5)

Full data: http://pbench.perf.lab.eng.bos.redhat.com/results/EC2::ip-172-31-35-191/20170721_0_20k_metrics_a/tools-default/
There were no changes between 3.5 and 3.6 in the compression code. Only 3.7 will have the updates that improve performance. I guess we could technically backport the range scanning feature, which should reduce the load on Cassandra & Hawkular-Metrics during the compression job.

As for Heapster, we didn't make any changes to the sink between 3.5 and 3.6 either (all the updates are for 3.7), so any changes there are internal to Heapster itself.
We could be hitting https://issues.jboss.org/browse/HWKMETRICS-707, which is a bug in the retry logic for inserting data points that has actually been around since OCP 3.4. I think it might be good to do another test run with a build of hawkular-metrics that has the retry logic turned off. In OCP 3.4 and 3.5 it could essentially be turned off with an environment variable. With some of the changes in 3.6, that configuration option is ignored, so we will need a new build of hawkular-metrics with which to test.
I really doubt that we're hitting that bug at all, since the retry logic is not used in the initial processing phases (which might be the issue here - updating the tags, fetching the tags, etc.). If that retryWhen were the problematic part, the processing would never stabilize - but it does. And it wouldn't systematically happen at startup, since tag updates are done in a serial fashion in 3.4-3.6 (and in parallel in the next Heapster update). There's no initial burst of additional datapoints at Heapster startup.

Should we wish to speed up that processing, we should backport the parallel tag-updating procedure.
I did another round of testing with the cluster setup mentioned in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1474099

1 m4.xlarge master
2 m4.xlarge infra nodes
100 m4.xlarge compute nodes

With 15K pods I can still see the following errors in the heapster logs:

Failed to push data to sink: Hawkular-Metrics Sink

But the time to get stable metrics has been reduced from 22 mins to 17 mins. I used the image provided by Michael Burman from docker.io/burmanm/origin-metrics-heapster:v1.3.0-3
Another experiment with the Hawkular Metrics image provided by @miburman:

Scaled pods to 25K, without any limits on heapster, hawkular and cassandra.

Cassandra - 3G RSS
Hawkular - 1G RSS
Heapster - 6.6G RSS

There is a big difference from comment #52: that experiment was done with 20K pods and this one was done with 25K pods. Please let me know if I need to run any other experiments.
*** Bug 1388815 has been marked as a duplicate of this bug. ***
*** Bug 1391996 has been marked as a duplicate of this bug. ***
After reaching around 20K pods I started getting the following errors.

In Heapster:

E0818 17:36:47.932557 1 client.go:203] Could not update tags: Hawkular returned status code 500, error message: Reply could not be parsed: invalid character '<' looking for beginning of value

In Hawkular:

2017-08-18 17:36:47,935 INFO [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-2) Could not determine a namespace id for namespace null. Cannot process request. Returning an INTERNAL_SERVER_ERROR

Attaching detailed logs.
Created attachment 1315348 [details] logs after the testing on 08-18
(In reply to Vikas Laad from comment #68)

Please note the errors happen for some time and then they go away.
PR to fix the latest issue: https://github.com/hawkular/hawkular-metrics/pull/875
After running a few more iterations I see the following error in the heapster logs after creating 13,000 pods, vs. 4,000 pods when the bug was created. This is with the modified code.

Failed to push data to sink: Hawkular-Metrics Sink
I think the 13k value is what we were getting before with 3.5 without any changes to the Heapster concurrency value. Is this now similar to what you have been seeing with 3.5? The error message in the Heapster logs (Failed to push data to sink: Hawkular-Metrics Sink) usually indicates that Heapster was not able to push out metrics within its collection time window, so it will skip the next time window. Can we get the logs for Heapster attached here?
Moving this back to assigned while waiting on final PR/fix in this area.
The v3.6.173.0.5 in brew should now have this fix applied.
Created attachment 1317172 [details]
cassandra, metrics and heapster logs

These logs are from before today's fix. I am going to test one more time with the image which has today's fix.
I tested with the provided image and I don't see the errors mentioned in comment #68 anymore. Setting the bug to verified. The initial errors now start after 13K pods are created, which is improved in 3.6 and comparable with 3.5. I also scaled the cluster to 25K pods and metrics was stable. Will test again with the official puddle when available.
Sorry for the typo - I will be setting the bug to verified once the official puddle is available.
Verified on metrics images version 3.7.0-0.158.0.

Scaled up to 22,500 pods on a 100 node AWS cluster at the same rate used to report this bug. No metrics were lost during the scaleup - everything was stable. Could not go higher due to various bugs around nodes flapping NotReady under load. See:

https://bugzilla.redhat.com/show_bug.cgi?id=1451902
https://bugzilla.redhat.com/show_bug.cgi?id=1486914
https://bugzilla.redhat.com/show_bug.cgi?id=1493182

But the problems originally reported in this bz are fixed. Marking VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3389