Bug 1465532 - Heapster fails to push to Hawkular-Metrics sink starting around 4K pods in 3.6
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.6.0
Severity: high
Target Release: 3.6.z
Assigned To: Michael Burman
QA Contact: Vikas Laad
Whiteboard: aos-scalability-36
Keywords: Regression
Duplicates: 1388815, 1391996
Type: Bug
Reported: 2017-06-27 10:59 EDT by Mike Fiedler
Modified: 2017-12-07 02:10 EST
Last Closed: 2017-12-07 02:10:26 EST
Attachments
- Heapster (loglevel 5), Hawkular and Cassandra pod logs (972.85 KB, application/x-gzip), 2017-06-27 10:59 EDT, Mike Fiedler
- Heapster logs running latest image (333.15 KB, text/plain), 2017-07-11 21:12 EDT, Mike Fiedler
- All metrics logs running origin images (58.44 KB, application/x-gzip), 2017-07-13 14:41 EDT, Mike Fiedler
- heapster logs showing full run including re-stabilization (500.83 KB, application/x-gzip), 2017-07-14 11:42 EDT, Mike Fiedler
- hawkular and cassandra memory limits at 4GB, cpu is unlimited (230.68 KB, application/x-gzip), 2017-07-14 13:38 EDT, Mike Fiedler
- July 20 logs from ec2, failure started around 16K pods (24.22 KB, application/x-gzip), 2017-07-20 16:32 EDT, Mike Fiedler
- docker logs from journald with pod logging (804.69 KB, application/x-gzip), 2017-07-20 21:50 EDT, Mike Fiedler
- logs after the testing on 08-18 (37.84 KB, application/zip), 2017-08-18 13:57 EDT, Vikas Laad
- cassandra, metrics and heapster log (50.09 KB, application/zip), 2017-08-23 13:14 EDT, Vikas Laad


External Trackers
- JBoss Issue Tracker HWKMETRICS-728 (Major, Resolved): Namespace Listener can return errors on namespace miss. Last updated 2018-06-22 10:49 EDT.
Description Mike Fiedler 2017-06-27 10:59:06 EDT
Created attachment 1292371 [details]
Heapster (loglevel 5), Hawkular and Cassandra pod logs.

Description of problem:

- single cassandra pod (memory limit 4GB)
  - cassandra storage is a non-dynamic 395 GB cinder pv/pvc
- single hawkular-metrics (memory limit 4GB)
- see inventory below

HA - 3 masters, 2 infra, 100 compute nodes - metrics is 3.6.1.122

In OCP 3.5 we could reliably reach 20-25K pods on 100 compute nodes in a cluster without dropping any metrics.   Beyond that, 3.5 would start seeing empty buckets in the metric data.

In OCP 3.6, after roughly 3.5K to 4K pods have started, the following message starts showing up in the heapster logs:

manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink

But there does not seem to be a corresponding error in the hawkular-metrics log.  I turned heapster logging up to 5 (attached) but there is nothing to indicate what the actual push error is.   There are some Cassandra timeout messages in the hawkular logs (attached) , but they do not correspond 1:1 to the heapster push errors.   Example:

Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

The Cassandra logs (attached) are clean.  Just the usual statistics and gc chatter.

The problem symptom is that the heapster logs and random checks of metric data show success and good data up until ~3.5-4K pods are created on 100 nodes.   After that point we start seeing hawkular push failures in a pattern like this:

W0627 13:23:27.254891       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:23:57.298169       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:24:27.370221       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:24:57.286841       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:25:27.324231       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
W0627 13:25:57.726255       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:26:20.557520       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:26:54.871042       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
W0627 13:27:27.352292       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:27:37.318129       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:28:10.550017       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:28:44.764977       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:29:18.995184       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
I0627 13:29:53.684293       1 manager.go:116] Data push completed: Hawkular-Metrics Sink
W0627 13:30:27.293908       1 manager.go:119] Failed to push data to sink: Hawkular-Metrics Sink
I0627 13:30:37.356895       1 manager.go:116] Data push completed: Hawkular-Metrics Sink


Additionally, random checks of bucket data (2 minute duration, 5 intervals checked) shows patterns of empty buckets.

Version-Release number of selected component (if applicable): metrics 3.6.122:

registry.ops.openshift.com/openshift3/metrics-heapster           v3.6.122            f3e12da8e7e3        6 days ago          274.4 MB
registry.ops.openshift.com/openshift3/metrics-cassandra          v3.6.122            5f1061b30752        6 days ago          572.8 MB
registry.ops.openshift.com/openshift3/metrics-hawkular-metrics    v3.6.122            992323342eea        6 days ago          1.293 GB



How reproducible:  Always, when scaling up to 5K pods


Steps to Reproduce:
1.  Create HA cluster with 100 compute nodes capable of running 25K total hello-openshift pods
2.  Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) to create 250 projects with 20 pods each - 5K total pods.
3.  As cluster-loader runs, monitor the heapster logs and randomly check the metrics for pods

Actual results:

Around the 4K pod mark, errors will pop in the Heapster log:

Failed to push data to sink: Hawkular-Metrics Sink

Hawkular logs will show a message that does not correlate 1:1 with heapster errors:

Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

Spot checks of metrics data shows empty buckets


Expected results:

Expected is that the scalability numbers from OCP 3.5 can be reached without loss of data:  25K hello-openshift pods running on 100 compute nodes.

Additional info:

Inventory:

[oo_first_master]
192.1.0.8

[oo_first_master:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.6.0

openshift_metrics_install_metrics=true
openshift_metrics_hawkular_hostname=hawkular-metrics.0615-yzo.qe.rhcloud.com
openshift_metrics_project=openshift-infra
openshift_metrics_image_prefix=registry.ops.openshift.com/openshift3/
openshift_metrics_image_version=v3.6.122
openshift_metrics_cassandra_replicas=1
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_storage_type=pv
openshift_metrics_cassandra_pvc_size=395Gi
Comment 1 Mike Fiedler 2017-06-28 11:12:50 EDT
This is a regression going from 25K pods to 4K pods.  Shouldn't it be targeted for 3.6?

Marking as TestBlocker for metrics scalability testing.
Comment 2 John Sanda 2017-06-28 13:02:03 EDT
(In reply to Mike Fiedler from comment #1)
> This is a regression going from 25K pods to 4K pods.  Shouldn't it be
> targeted for 3.6?
> 
> Marking as TestBlocker for metrics scalability testing.

I reset for 3.6. I was going through some of the changes since 3.5 to see what may have caused the regression. How easily/quickly could you test again with an updated hawkular-metrics?
Comment 4 John Sanda 2017-06-28 21:15:22 EDT
Mike,

You can retest with the current images and set the following environment variables for the hawkular-metrics pod:

METRICS_EXPIRATION_JOB_ENABLED=false
COMPRESSION_JOB_ENABLED=false

I do not think that these background jobs would cause this sort of regression, but it cannot hurt to rule them out.
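For reference, a minimal sketch of applying those variables with `oc set env`; the rc name and namespace are assumptions based on the standard openshift-ansible metrics deployment, so adjust them to match your cluster:

```shell
# Assumes metrics were deployed into openshift-infra with a
# ReplicationController named hawkular-metrics (adjust as needed).
oc set env rc/hawkular-metrics -n openshift-infra \
    METRICS_EXPIRATION_JOB_ENABLED=false \
    COMPRESSION_JOB_ENABLED=false

# Bounce the pod so it picks up the new environment:
oc scale rc/hawkular-metrics -n openshift-infra --replicas=0
oc scale rc/hawkular-metrics -n openshift-infra --replicas=1
```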
Comment 5 Mike Fiedler 2017-06-28 22:04:25 EDT
The pod creation rate for this test is:

- 20 pods per namespace
- 3 namespaces at a time (batches of 60 pods at a time  which is < 1 pod/node)
- wait for all pods to go fully into the Running state (takes about 15 seconds)
- move on to the next batch of 3 namespaces

This is a fairly typical pod creation rate for our scalability tests.  If there is a suspicion that the rate is a factor, I can try slowing things down.   

The test is run with cluster-loader:  https://github.com/openshift/svt/blob/master/openshift_scalability/README.md

using the input configuration below.   Translated, it is "Create 250 projects, each with 20 hello-openshift pods (in pod-default.json)."

I run it with ./cluster-loader.py -p 3 -f <filename>.yaml

-p 3 means to run 3 simultaneous threads, so for each burst, 3 x 20 = 60 pods are created and come fully Running before the next set of three.  Hope that makes sense.

projects:
  - num: 250
    basename: svt-1-
    tuning: default
    pods:
      - total: 20 
      - num: 100
        basename: hello
        file: content/pod-default.json
        storage:
          - type: none
          
quotas:
  - name: default
    file: default

tuningsets:
  - name: default
    pods:
      stepping:
        stepsize: 50
        pause: 90 s
      rate_limit:
        delay: 0 ms
Comment 6 Michael Burman 2017-06-29 06:40:38 EDT
I believe this is caused by the fact that the image is shipped with rate limiter on (max concurrency is set to 1). Due to changes between 3.5 and 3.6, the tags update will take slightly longer and it will block everything else with the limiter.

Increasing the concurrency to 4 or 8 should increase the throughput, as it will no longer block the datapoint writes. But the tags update will still happen with a concurrency of just one.

That's because when we created the change, we also made changes to the Heapster sink to make it work better with the updated procedure, but the Heapster changes were never merged (they're still waiting to be merged). For that reason, the shipped Heapster sink will work inefficiently with the current Hawkular-Metrics.

The Heapster PR is #1666 and the important change is in the mutex handling of "registerLabeledIfNecessary" function.

If we need to "workaround" this Heapster sink issue in Hawkular-Metrics, then a background job updating the tags is pretty much the only way to fix it (so that Heapster gets response 200 back instantly). That way we could increase the concurrency of the sink for tag updates also. This change would have other implications however.

This limitation only becomes apparent in cases where lots of pods are launched at the same time; in a steady state (without an increasing amount of pods) the concurrency setting should do the job, as it will function correctly for datapoints in the older Heapster version as well.
Comment 7 Mike Fiedler 2017-06-29 07:48:33 EDT
Do you have details on what I need to do to change max concurrency?  I don't see it exposed currently in our deployment config and a quick search of the hawkular-metrics docs didn't find it.
Comment 8 Michael Burman 2017-06-29 08:16:21 EDT
Well, the current concurrency setting is 5 (and not one like I said originally - but one is the effective value for tag updates as long as the PR isn't merged). It's not exposed in the configuration, unfortunately; you would need to modify Heapster's starting parameters directly to change the value (it's a sink setting).

It seems the API for modifying a running sink was removed at some point.
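For context, the sink concurrency is an option on Heapster's `--sink` URL. A hedged sketch follows; the `concurrencyLimit` parameter name is taken from the Heapster hawkular sink documentation and should be verified against the deployed Heapster version, and the URL values are illustrative:

```shell
# Edit the heapster rc and extend the hawkular sink URL:
oc edit rc/heapster -n openshift-infra
# then change the --sink argument to something like (illustrative):
#   --sink=hawkular:https://hawkular-metrics:443?tenant=_system&concurrencyLimit=8
# and redeploy the heapster pod so the new argument takes effect.
```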
Comment 13 Mike Fiedler 2017-07-06 10:34:49 EDT
I still see this in the official 3.6.136 image:

registry.ops.openshift.com/openshift3/metrics-heapster            v3.6.136            a4b52c5e2110        7 hours ago         274.4 MB

Does the hash match what you pushed?

Moving to MODIFIED until an image is available in the OCP puddles.
Comment 18 Mike Fiedler 2017-07-11 21:11:47 EDT
Tried verifying with the latest image out of brew:

brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-heapster   latest              5549c67d8607        42 hours ago        274.4 MB


and I still see the issue.   Specifically, while scaling up from 5K to 10K pods, I start seeing 

Failed to push data to sink: Hawkular-Metrics Sink

in the heapster logs and "holes"/empty buckets start showing up in the data.   The issue seems to go away when the cluster sits idle and stabilizes.   I see the failures mainly during periods of adding pods.

Attaching the heapster log.   The end of the log is when I was seeing consecutive push failures.   Note there are some nodes down/NotReady in this cluster.
Comment 19 Mike Fiedler 2017-07-11 21:12 EDT
Created attachment 1296635 [details]
Heapster logs running latest image
Comment 20 Matt Wringe 2017-07-12 09:01:22 EDT
@miburman: can you please take a look?
Comment 21 John Sanda 2017-07-13 09:44:44 EDT
Mike,

Can you test again using images from origin-metrics? We know for certain that the latest origin-metrics images have the heapster changes discussed earlier in the ticket.
Comment 22 Mike Fiedler 2017-07-13 14:40:16 EDT
I switched all replication controllers (hawkular-cassandra-1, hawkular-metrics, heapster) to use the latest images for origin:

root@svt-j-1: ~/mffiedler/logs # ssh 172.16.0.7 docker images | grep origin
docker.io/openshift/origin-metrics-cassandra                  latest              797dbea9e930        7 days ago          769.7 MB
docker.io/openshift/origin-metrics-hawkular-metrics           latest              3147880c9a30        7 days ago          886.1 MB
docker.io/openshift/origin-metrics-heapster                   latest              6649a39ef5c1        7 days ago          768.7 MB


I still see the error in the heapster logs - in fact this time I saw it at rest (idle).   I did not use a full new deployment from origin - I can't do that in this big environment.  I hope it was OK to run with the origin images using the same general configuration laid down by openshift-ansible.

Logs attached.

If I need to do it from scratch with the origin deployer, that will probably have to wait until next week - we only have 1 day left in this big cluster.
Comment 23 Mike Fiedler 2017-07-13 14:41 EDT
Created attachment 1297800 [details]
All metrics logs running origin images
Comment 24 Matt Wringe 2017-07-13 15:33:42 EDT
I need to push out a new Heapster image to docker hub which should now hopefully have the patch required.
Comment 25 Matt Wringe 2017-07-13 17:44:14 EDT
A newer Heapster image has been pushed out to docker hub, can you see if that helps? The previous one didn't have the patch properly applied.
Comment 26 Mike Fiedler 2017-07-13 20:32:56 EDT
Testing new heapster:

root@svt-j-1: ~/mffiedler/logs # ssh 172.16.0.18 docker images | grep heapster
docker.io/openshift/origin-metrics-heapster                                        latest              2ff34583737a        2 hours ago         819.9 MB
Comment 27 Mike Fiedler 2017-07-14 11:41:10 EDT
I tested with the new heapster from comment 25 and comment 26.    I am now running :latest origin from dockerhub for cassandra, hawkular-metrics and heapster.  I had problems running the origin heapster with the OCP hawkular, so I just switched to using only origin containers.

The symptoms are the same though, everything runs fine up through approximately the creation of 5K pods and then I start seeing the Hawkular-Metrics Sink push failures and holes in the metrics.

The test creates pods in batches of 60, waits for them all to go into Running state and then moves on to a new batch of 60.

In the attached logs, there are a couple of failures right after heapster starts and then none until 15:06, which was approximately the 5K pod mark.   At that point there start to be more failures than successes.

Around 15:20 I stopped creating new pods.   At 15:28 there is a mix of successful and failed pushes and around 15:35 we are mostly back to successful pushes.

This is a ~450 node cluster which we are losing access to.  I will try to come up with a reproducer in AWS or GCE.
Comment 28 Mike Fiedler 2017-07-14 11:42 EDT
Created attachment 1298433 [details]
heapster logs showing full run including re-stabilization
Comment 29 Matt Wringe 2017-07-14 12:06:05 EDT
Do we know what the exact setup was used before and what the new one is? I am wondering if we are still running with the same resources or configurations or not.

What is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1465532#c0 does not match what the logs are saying. For instance, Cassandra has access to less than 2GB, while the description says it has 4GB.
Comment 30 Mike Fiedler 2017-07-14 12:29:51 EDT
That original cluster is gone.  I will match the current cluster to the specs in the original description and run the test again using the origin images and capture all 3 logs.
Comment 31 Matt Wringe 2017-07-14 13:26:34 EDT
We need to make sure that the setup being done today is similar in comparison to the old tests which had much greater results.

I believe in the old setup there were additional configuration changes which were needed to get it to scale as high as they got it.

Can you install the 3.5 images in this setup (it should just work on OCP 3.6) and see if you are seeing the same regression? We need to confirm that it's not an environment issue and then narrow down which of the metrics components has caused the regression.
Comment 32 Mike Fiedler 2017-07-14 13:37:42 EDT
Set hawkular and cassandra limits to 4GB.   Note:  during this run heapster and hawkular-metrics both restarted once.

Logs attached - you can see the push failures throughout.

Next up:  will try on 3.5 images if I can get them running before the cluster goes away today.  Otherwise A/B testing will have to be in AWS next week.
Comment 33 Mike Fiedler 2017-07-14 13:38 EDT
Created attachment 1298557 [details]
hawkular and cassandra memory limits at 4GB.  cpu is unlimited
Comment 34 Mike Fiedler 2017-07-14 16:38:22 EDT
I retested on the same cluster with metrics 3.5-018 deployed using the 3.5 openshift-ansible playbook.   Same configuration and same workload.

Testing on 3.5 there are no holes in the metrics data.  I'm up over 10K pods and seeing 0 empty buckets.   My tool continuously checks the most recent 5 x 2 minute buckets for randomly selected pods that have been up more than 10 minutes.  I also verified for random pods in the web UI.  

For 3.5, I see no holes or empty buckets.   From the web console, the metrics graphs are unbroken and continuous.   Heapster on 3.5 does not log as much - I could not see messages related to pushing to hawkular.   Metrics have no need to recover - at this level they are not missing.

For 3.6, I start seeing many holes and for some pods all buckets are empty.  From the web console, graphs are broken lines or altogether empty.   The heapster logs show many push failures to hawkular.   Metrics seem to recover when the cluster idles.

I collected pbench (system monitoring) data for both the 3.5 and 3.6 runs.   I'll share it once I've processed it.

I think this test shows there is definitely a difference in 3.6.


Sample output for 3.5:

Test of metrics for: svt-b-42.pausepods5 found empty: 0, not empty: 5
Test of metrics for: svt-b-63.pausepods4 found empty: 0, not empty: 5
Test of metrics for: svt-b-72.pausepods17 found empty: 0, not empty: 5
Test of metrics for: svt-b-63.pausepods3 found empty: 0, not empty: 5
Test of metrics for: svt-b-32.pausepods16 found empty: 0, not empty: 5
Test of metrics for: svt-b-20.pausepods7 found empty: 0, not empty: 5
Test of metrics for: svt-b-58.pausepods8 found empty: 0, not empty: 5
Test of metrics for: svt-b-46.pausepods17 found empty: 0, not empty: 5
Test of metrics for: svt-b-2.pausepods3 found empty: 0, not empty: 5
Test of metrics for: svt-b-105.pausepods10 found empty: 0, not empty: 5
Test of metrics for: svt-b-13.pausepods11 found empty: 0, not empty: 5
Test of metrics for: svt-b-70.pausepods11 found empty: 0, not empty: 5
Test of metrics for: svt-b-16.pausepods18 found empty: 0, not empty: 5
Test of metrics for: svt-b-102.pausepods17 found empty: 0, not empty: 5


For 3.6 there will be various non-0 values in the "found empty" field.
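The kind of spot check described above can be sketched against the Hawkular Metrics REST API. This is an assumption-laden sketch: the metric id format (`pod/<pod-uid>/cpu/usage_rate`), the `stats` endpoint, and the `bucketDuration` parameter follow the Hawkular Metrics documentation of this era, and the hostname and pod UID are placeholders:

```shell
# The tenant is the pod's namespace; <pod-uid> stands in for the real pod UID.
TOKEN=$(oc whoami -t)
curl -ks -H "Authorization: Bearer $TOKEN" \
     -H "Hawkular-Tenant: svt-b-42" \
     "https://hawkular-metrics.example.com/hawkular/metrics/gauges/pod%2F<pod-uid>%2Fcpu%2Fusage_rate/stats?bucketDuration=2mn" \
  | python -c 'import json,sys; print(sum(1 for b in json.load(sys.stdin) if b.get("empty")))'
# Prints the count of empty buckets in the returned window.
```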
Comment 35 Mike Fiedler 2017-07-17 10:27:28 EDT
Pbench data is available but inconclusive due to other activity in the cluster.   Next step is to get a true A/B comparison of scaling up to 6000 pods for 3.6 and 3.5 on AWS
Comment 36 Mike Fiedler 2017-07-18 16:42:59 EDT
I set up a 20 node, 10,000 pod cluster in AWS and was able to reproduce the issue with 3.6.141 but I was not able to reproduce it with origin.   In the 450 node cluster I no longer have access to, I could reproduce with origin.

I am going to make one more run tonight with origin images and then will consult with you on what step we want to take next.   If there are test images you want me to try please add links here or send directly.
Comment 37 Mike Fiedler 2017-07-20 07:52:34 EDT
I've been unable to reproduce this issue on a 20 node/10K pod cluster on EC2.   3.6.152 metrics is stable all of the way up to 10K pods.    The missing dimension from the previous environments is the number of compute nodes.   I am going to scale the current 20 node environment to 100 nodes and make one more attempt at reproducing this and also at collecting metrics for 25K pods - the goal for 3.6.   This cluster will be short-lived though to save on expense.

pidstat and sar data from running both 3.5 and 3.6 in the 20 node/10K pod environment show:

- significant cpu improvement in 3.6 for hawkular-metrics
- modest memory improvement in 3.6 for hawkular-metrics
- network utilization improvement in 3.6 for hawkular-metrics
- unchanged cpu and network for heapster in 3.6
- modest cpu improvement in 3.6 for cassandra
- unchanged memory for cassandra
- modest network improvement in 3.6 for cassandra
- modest increase in IOPS in 3.6 for cassandra


I'll link the data internally in a follow-up comment
Comment 39 Mike Fiedler 2017-07-20 16:30:54 EDT
I hit this issue on EC2 - it required higher scale though.   I hit it around 16,500 pods created.   

This time Hawkular restarted triggering a heapster restart after which heapster could not push data to hawkular.   I scaled hawkular and heapster down to 0 and back up - both come to ready state but the heapster logs have the push error.   Logs attached.

Config changes are Hawkular memory raised to 4GB, Cassandra memory raised to 4GB and heapster memory raised to 6GB.

I'll check the performance data and see if the hawkular issue might have been OOM.  But even after both restarted and came to Ready state, the errors in the heapster log (and missing metrics data) continued.   

Is there any way to get details of what is failing with the push?
Comment 40 Mike Fiedler 2017-07-20 16:32 EDT
Created attachment 1301996 [details]
July 20 logs from ec2 - failure started around 16K pods
Comment 43 Mike Fiedler 2017-07-20 21:22:14 EDT
The log is lost when the pod restarts.   Let me extract all docker messages (which will include pod messages) from journald
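A sketch of that extraction on the node, assuming docker logs through the journald log driver; the time window and the grep patterns used to strip noise are illustrative:

```shell
# On the node that ran the pod: pull docker's journal entries for the
# window of interest and strip some known-noisy message types.
journalctl -u docker.service --since "2017-07-20 00:00:00" -o cat \
  | grep -v -e 'devmapper' -e 'Handler for GET' \
  > docker-journal-with-pod-logs.log
```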
Comment 44 Mike Fiedler 2017-07-20 21:50 EDT
Created attachment 1302092 [details]
docker logs from journald with pod logging

I tried to grep some docker spam out, but there will still be other docker messages mixed in
Comment 46 John Sanda 2017-07-20 22:50:16 EDT
The compression job is supposed to run every two hours, but if it falls behind schedule, it will run more frequently until it catches up. It looks like at some points it is running continuously until the container is killed. At this point I think it would be worth disabling the job. It can be disabled by setting the environment variable COMPRESSION_JOB_ENABLED=false.
Comment 47 Mike Fiedler 2017-07-21 12:51:21 EDT
Scaling up to 25K pods was going well until around 20K pods when heapster was OOMKilled (with 6GB limit).  I'm taking the limits off of heapster and hawkular for now to measure actuals at this scale.

More concerning, once heapster was restarted it began getting the push failures and never recovered.  I tried restarting both hawkular and heapster and it never started pushing again.   What could put heapster into a permfail state?   hawkular logs did not indicate errors.

Plan, re-run with no memory limits on hawkular or heapster and COMPRESSION_JOB_ENABLED=false
Comment 48 Mike Fiedler 2017-07-21 16:05:39 EDT
The combination of COMPRESSION_JOB_ENABLED=false and removing the memory limits for hawkular and heapster allowed metrics on the EC2 cluster to scale stably to 25K pods.

From the performance data, COMPRESSION_JOB_ENABLED=false freed up enough cpu on the hawkular node to avoid readiness-failure restarts, and removing the memory limits prevented OOMKills of both hawkular and heapster.  I'll report 3.5 vs 3.6 memory usage for cassandra/hawkular/heapster during the 25K pods run separately.

Were there changes in the compression area for 3.6?  That setting was not required in 3.5 scale tests.   What is the next step for this bz with respect to that?

Additionally, as mentioned in comment 47 I did see an issue where once heapster restarted it could never push successfully again.   Working on a reproducer for that for a separate bz.   I'll link the perf data internal links in a followup comment.
Comment 49 John Sanda 2017-07-21 16:53:39 EDT
I have been checking and not seeing anything. There is a job scheduler component that runs the compression job. I was wondering if that might be part of the problem. I am looking through git history, and it looks like the last changes for the job scheduler as well as for the compression job went into hawkular-metrics 0.23.4 which is what should be used in the latest OCP 3.5 release.

Mickie, were there any changes to the compression code between 3.5 and 3.6?
Comment 50 Peter Portante 2017-07-22 05:19:50 EDT
@Mike, removing the limits is unstable, no?  In environments where hawkular and heapster have competing memory resource demands, removing the limits will actually allow them to be OOM killed for other reasons, I believe.

Seems like we need to find what the actual memory limits are, and request that as the minimum.

We have the same problem with the logging side.  Adding Eric Paris to comment.
Comment 51 Eric Paris 2017-07-22 12:48:05 EDT
When a node runs low on actual free RAM, pods with no limit set are likely to be the first pods OOM'd; these are called BestEffort pods. Pods with a limit larger than their request by some integer multiple (I'd have to look up the multiple, but it is something like 4) will be OOM'd next; these are called Burstable pods. Pods with a very large multiple between request and limit go in the BestEffort bucket. Pods with request==limit will go last; these are called Guaranteed pods.

Removing limits entirely will make things "work" for some period of time. But it means if the system comes under any memory pressure you will be the first thing killed.

Actually knowing exactly what your pods needs (while hard) makes it very unlikely they will get killed.

Picking a reasonable min/max range is likely the right thing to do now if possible

The logging team should actively look at how much RAM the pods are using in production and tune those numbers over time.
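Following the QoS rules above, pinning request equal to limit puts the metrics pods in the Guaranteed class. A sketch with illustrative values, not a recommendation; the rc names assume the standard openshift-infra deployment:

```shell
# request == limit -> Guaranteed QoS, last to be OOM-killed under pressure.
oc set resources rc/hawkular-metrics -n openshift-infra \
    --requests=memory=4Gi --limits=memory=4Gi
oc set resources rc/heapster -n openshift-infra \
    --requests=memory=6Gi --limits=memory=6Gi
```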
Comment 52 Mike Fiedler 2017-07-23 14:20:45 EDT
The purpose of removing the limits was to determine actual usage for this workload, not to recommend removing them in production.

Scale up to 20K pods with:  

Cassandra - 3.3 GB peak RSS, regular 3.5 core cpu spikes and max 220 IOPS   (4.5 cores in 3.5 - memory comparison n/a)

Hawkular - 1.9 GB peak RSS, regular 2 core cpu spikes and rare 3.5 core cpu spikes (2.5 cores and 2.5 GB RSS in 3.5)

Heapster - 7.2 GB peak RSS, regular 2 core cpu spikes (1 core and 7GB in 3.5)

Full data:  http://pbench.perf.lab.eng.bos.redhat.com/results/EC2::ip-172-31-35-191/20170721_0_20k_metrics_a/tools-default/
Comment 53 Michael Burman 2017-07-24 12:27:24 EDT
There were no changes between 3.5 and 3.6 for the compression code. Only 3.7 will have the updates that improve performance. I guess we could technically backport the range scanning feature, which should reduce the load on Cassandra & Hawkular-Metrics during the compression job.

As for Heapster, we didn't do any changes to the sink between 3.5 and 3.6 either (all the updates are for 3.7), so all those changes are internal to the Heapster.
Comment 54 John Sanda 2017-07-26 00:13:36 EDT
We could be hitting https://issues.jboss.org/browse/HWKMETRICS-707 which is a bug in the retry logic for inserting data points that has actually been around since OCP 3.4. I think it might be good to do another test run with a build of hawkular-metrics that has the retry logic turned off. In OCP 3.4 and 3.5 it could essentially be turned off with an environment variable. With some of the changes in 3.6, the configuration option is ignored so we will need a new build of hawkular-metrics with which to test.
Comment 55 Michael Burman 2017-07-27 03:02:56 EDT
I really doubt that we're hitting that bug at all, since it's not used in the initial processing (which might be updating the tags, fetching the tags, etc.).

If that retryWhen would be the problematic part, the processing would never stabilize - but it does. And it wouldn't systematically happen at the startup, since tags updates are done in a serial fashion in 3.4-3.6 (and parallel in next Heapster update). There's no initial burst of additional datapoints in the Heapster startup.

Should we wish to speed up that processing, we should backport the parallel tags updating procedure.
Comment 56 Vikas Laad 2017-07-27 16:04:29 EDT
I did another round of testing with cluster setup mentioned in this bug https://bugzilla.redhat.com/show_bug.cgi?id=1474099

1 m4.xlarge master 
2 m4.xlarge infra nodes
100 m4.xlarge compute nodes

With 15K pods I can still see following errors in the heapster logs 

Failed to push data to sink: Hawkular-Metrics Sink

But the time to get stable metrics has reduced from 22 mins to 17 mins. I used image provided by Michael Burman from docker.io/burmanm/origin-metrics-heapster:v1.3.0-3
Comment 59 Vikas Laad 2017-08-01 15:51:05 EDT
Another experiment with the image Hawkular Metrics provided by @miburman

Scaled pods to 25K, without any limits on heapster, hawkular and cassandra

Cassandra - 3G RSS
Hawkular - 1G RSS 
Heapster - 6.6G RSS

There is a big difference from comment #52, but that experiment was done with 20K pods and this one with 25K pods. Please let me know if I need to do any other experiment.
Comment 65 Stefan Negrea 2017-08-04 11:47:06 EDT
*** Bug 1388815 has been marked as a duplicate of this bug. ***
Comment 66 Stefan Negrea 2017-08-04 11:47:56 EDT
*** Bug 1391996 has been marked as a duplicate of this bug. ***
Comment 68 Vikas Laad 2017-08-18 13:56:20 EDT
After reaching around 20K pods started getting following errors

in Heapster
E0818 17:36:47.932557       1 client.go:203] Could not update tags: Hawkular returned status code 500, error message: Reply could not be parsed: invalid character '<' looking for beginning of value

in Hawkular
2017-08-18 17:36:47,935 INFO  [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-2) Could not determine a namespace id for namespace null. Cannot process request. Returning an INTERNAL_SERVER_ERROR

Attaching detailed logs.
Comment 69 Vikas Laad 2017-08-18 13:57 EDT
Created attachment 1315348 [details]
logs after the testing on 08-18
Comment 70 Vikas Laad 2017-08-18 14:10:25 EDT
(In reply to Vikas Laad from comment #68)
Please note errors happen for some time and they go away.
Comment 71 Matt Wringe 2017-08-18 17:24:47 EDT
PR to fix the latest issue: https://github.com/hawkular/hawkular-metrics/pull/875
Comment 72 Vikas Laad 2017-08-21 16:17:43 EDT
After running a few more iterations I see the following error in the heapster logs after creating 13000 pods, vs 4000 pods when the bug was created. This is with the modified code.

Failed to push data to sink: Hawkular-Metrics Sink
Comment 73 Matt Wringe 2017-08-22 16:41:07 EDT
I think the 13k value is what we were getting before with 3.5 without any changes to the Heapster concurrency value. Is this now similar to what you have been seeing with 3.5?

The error message in the Heapster logs (Failed to push data to sink: Hawkular-Metrics Sink) usually indicates that Heapster was not able to push out metrics within its collection time window, so it will skip the next time window.

Can we get the logs for Heapster attached here?
Comment 74 Mike Fiedler 2017-08-23 09:36:17 EDT
Moving this back to assigned while waiting on final PR/fix in this area.
Comment 75 Matt Wringe 2017-08-23 11:37:59 EDT
The v3.6.173.0.5 in brew should now have this fix applied.
Comment 76 Vikas Laad 2017-08-23 13:14 EDT
Created attachment 1317172 [details]
cassandra, metrics and heapster log

These logs are from before today's fix; I am going to test one more time with the image that has today's fix.
Comment 78 Vikas Laad 2017-08-24 16:11:02 EDT
I tested with the provided image; I don't see the errors mentioned in comment #68 anymore.

Setting bug to verified; the initial errors now happen after 13K pods are created, which is improved in 3.6 and comparable with 3.5.

Also Scaled cluster to 25K pods and metrics was stable.

Will test again with official puddle when available.
Comment 79 Vikas Laad 2017-08-24 16:13:12 EDT
Sorry for the typo - I will set the bug to verified once the official puddle is available.
Comment 84 Mike Fiedler 2017-10-17 16:22:56 EDT
Verified on metrics images version 3.7.0-0.158.0

Scaled up to 22,500 pods on a 100 node AWS cluster at the same rate used to report this bug.   No metrics lost during the scaleup - everything was stable.   Could not go higher due to various bugs around nodes flapping NotReady under load.   See:

https://bugzilla.redhat.com/show_bug.cgi?id=1451902
https://bugzilla.redhat.com/show_bug.cgi?id=1486914
https://bugzilla.redhat.com/show_bug.cgi?id=1493182

But the problems originally reported in this bz are fixed.  Marking VERIFIED.
Comment 87 errata-xmlrpc 2017-12-07 02:10:26 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389
