Created attachment 1279215 [details]
Metrics web console UI

Description of problem:
While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1439912 on one m4.xlarge master and two m4.xlarge nodes, Vlaad created 6500 pods under one project and then deleted them. Afterwards the hawkular-metrics pod log contained an error trace indicating "Pool is busy (no available connection and the queue has reached its max size 256)". The same error shows up in the web console. This is a performance issue; it does not happen during our daily functional testing.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# docker images | grep metrics
openshift3/metrics-cassandra          3.5.0   309234b6f5fe   3 days ago   539.5 MB
openshift3/metrics-heapster           3.5.0   525312ae7d60   3 days ago   317.9 MB
openshift3/metrics-hawkular-metrics   3.5.0   fe477ed220e1   3 days ago   1.269 GB

How reproducible:
Found in performance testing.

Steps to Reproduce:
1. Deploy Metrics 3.5.0
2. Create one project, then create 6500 pods and delete them under this project
3. Check the metrics pod logs and the web console UI

Actual results:
3. The hawkular-metrics pod log shows "Pool is busy (no available connection and the queue has reached its max size 256)"; see the attached file for more details.

Expected results:
3. No errors in the pod logs, and CPU, memory and network usage can be checked in the web console.

Additional info:
Attached metrics pod logs and web console UI.
Created attachment 1279216 [details] hawkular metric pod log
Created attachment 1279217 [details] heapster pod log
Created attachment 1279218 [details] hawkular cassandra pod log
@jsanda: can you please take a look?
This issue started to become more prevalent when we upgraded the Cassandra driver from 3.0.1 to 3.0.5. Some of the internal driver code for obtaining a connection was refactored to be non-blocking, making it easier for the client, i.e., Hawkular Metrics, to saturate the driver to the point where it throws a BusyPoolException. We may need to implement a throttling or back pressure mechanism in Hawkular Metrics since it does not look like it is going to be done in the Cassandra driver.
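As a rough illustration of the kind of throttling being discussed (this is not the actual Hawkular Metrics code), a client can cap the number of in-flight async requests with a semaphore so the driver's per-host wait queue (256 by default, per the error above) is never overrun. The ThrottledWriter class and maxInFlight parameter below are hypothetical names; the only real APIs used are the 3.x Java driver's Session.executeAsync() and Guava's Futures.addCallback(), both of which ship with the driver.

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.concurrent.Semaphore;

/**
 * Illustrative sketch: caps the number of in-flight async requests so the
 * driver's wait queue is never pushed past its limit.
 */
public class ThrottledWriter {

    private final Session session;
    private final Semaphore permits;

    public ThrottledWriter(Session session, int maxInFlight) {
        this.session = session;
        this.permits = new Semaphore(maxInFlight);
    }

    public void write(Statement statement) throws InterruptedException {
        permits.acquire(); // blocks the producer instead of saturating the driver
        ResultSetFuture future = session.executeAsync(statement);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet rows) {
                permits.release();
            }

            @Override
            public void onFailure(Throwable t) {
                permits.release(); // always release, or the writer eventually stalls
            }
        });
    }
}

Blocking on permits.acquire() pushes back on the producer (here, the ingestion path) rather than letting requests pile up inside the driver until it throws BusyPoolException.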
I'm seeing the same behavior on my lab cluster, but I'm not doing performance testing. At the moment I have around 100 pods:

[root@ocp-master-1 ~]# oc get pods --all-namespaces | wc -l
106

This is my environment version:

[root@ocp-master-1 ~]# openshift version
openshift v3.5.5.26
kubernetes v1.5.2+43a9be4
etcd 3.1.0

This is the log from my Hawkular pod:

2017-07-28 14:45:27,549 WARN [org.hawkular.metrics.scheduler.impl.LockManager] (RxComputationScheduler-1) There was an error renewing locks: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.243.48:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [hawkular-cassandra/172.30.243.48] Pool is busy (no available connection and the queue has reached its max size 256)))
(...)

These are my metrics component versions:

[root@ocp-node-1 ~]# docker images | grep metrics
registry.access.redhat.com/openshift3/metrics-hawkular-metrics   3.5.0    e0d108bd9b0c   10 days ago   1.27 GB
registry.access.redhat.com/openshift3/metrics-heapster           3.5.0    4e29df6bda85   4 weeks ago   318.5 MB
registry.access.redhat.com/openshift3/metrics-hawkular-metrics   <none>   6b8ebf578b45   4 weeks ago   1.27 GB
@jsanda: do you think this is a bug somewhere, or is it just that the connection pool is oversaturated and we need to deploy an additional Hawkular Metrics instance to handle the extra load?
From the user's perspective this should be considered a bug. The real problem is in the Cassandra driver. The driver developers encourage users to use the async API for performance, but doing so requires client-side throttling to avoid saturating the driver. Some people, including myself, have advocated for the driver handling this instead of every client having to deal with it; the driver devs disagree. We are looking to address this in https://issues.jboss.org/browse/HWKMETRICS-597.
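Until HWKMETRICS-597 lands, the only knobs on the client side are the driver's pooling options. This is not the fix, and which setters are available depends on the exact 3.x driver release bundled (setMaxQueueSize in particular is a newer 3.x addition), but a sketch of giving the pool more headroom would look roughly like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;

public class PoolTuning {

    public static void main(String[] args) {
        // Raise the limits the BusyPoolException is hitting.
        // Values are illustrative, not recommendations.
        PoolingOptions pooling = new PoolingOptions()
                .setConnectionsPerHost(HostDistance.LOCAL, 2, 4)
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 2048)
                .setMaxQueueSize(1024); // newer 3.x drivers only; default queue is 256

        Cluster cluster = Cluster.builder()
                .addContactPoint("hawkular-cassandra") // service name taken from the logs above
                .withPoolingOptions(pooling)
                .build();
        Session session = cluster.connect();
        // ... issue queries ...
        cluster.close();
    }
}

Raising the limits only buys headroom; without throttling, a large enough burst of pods will still fill whatever queue size is configured, which is why the real fix is back pressure inside Hawkular Metrics itself.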
This may also be related to https://issues.jboss.org/browse/HWKMETRICS-710 but we need to investigate further.
The 3.7 release should have the update that includes https://issues.jboss.org/browse/HWKMETRICS-597.

Marking this as ON_QA.
@Matt
Created 20 projects with one pod each and ran for about 9 hours. I checked the hawkular-metrics pod logs during this period and did not see the error "Pool is busy (no available connection and the queue has reached its max size 256)". However, the hawkular-metrics pod log has only one line after 9 hours, and the hawkular-cassandra and heapster pods also have only a few log lines; see the attached file.

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-swl4q   1/1       Running   0          10h
hawkular-metrics-x06v8       1/1       Running   0          10h
heapster-x6qpd               1/1       Running   0          10h

# oc logs hawkular-metrics-x06v8
2017-10-12 21:00:08,917 INFO [org.hawkular.metrics.core.jobs.TempDataCompressor] (RxComputationScheduler-4) Finished processing data in 2273 ms

Is this normal?
Created attachment 1338011 [details] metrics pods log
I believe the TempDataCompressor INFO-level message in the logs is normal, but I will let jsanda respond in case it's not. In general, INFO-level logs are nothing to be concerned about; anything that should be taken more seriously would be logged at the WARN or ERROR level.
(In reply to Junqi Zhao from comment #15)
> @Matt
> Created 20 projects with one pod each and ran for about 9 hours. I checked
> the hawkular-metrics pod logs during this period and did not see the error
> "Pool is busy (no available connection and the queue has reached its max
> size 256)". However, the hawkular-metrics pod log has only one line after
> 9 hours, and the hawkular-cassandra and heapster pods also have only a few
> log lines; see the attached file.
>
> # oc get po
> NAME                         READY     STATUS    RESTARTS   AGE
> hawkular-cassandra-1-swl4q   1/1       Running   0          10h
> hawkular-metrics-x06v8       1/1       Running   0          10h
> heapster-x6qpd               1/1       Running   0          10h
>
> # oc logs hawkular-metrics-x06v8
> 2017-10-12 21:00:08,917 INFO
> [org.hawkular.metrics.core.jobs.TempDataCompressor]
> (RxComputationScheduler-4) Finished processing data in 2273 ms
>
> Is this normal?

The message from TempDataCompressor is new in 3.7, and it is normal.

In the initial description you said that you ran 6500 pods. Could you test with that again, or at least with a larger number of pods that will put more load on hawkular-metrics? If I read correctly, you are only testing with 20 pods currently.
(In reply to John Sanda from comment #18)
> The message from TempDataCompressor is new in 3.7, and it is normal.

What I meant is that the hawkular-metrics pod has only one line of log output; it should not have so few logs.

> In the initial description you said that you ran 6500 pods. Could you test
> with that again, or at least with a larger number of pods that will put
> more load on hawkular-metrics? If I read correctly, you are only testing
> with 20 pods currently.

Will re-test with more than 6500 pods.
Verified on:
registry.ops.openshift.com/openshift3/metrics-hawkular-metrics   v3.7.0-0.158.0   35127125228a

1. Created 1250 projects with 25K pods
2. Created a single project with 10K pods
3. Verified in both cases that metrics are flowing successfully

No pool busy (or other) errors in the pod logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188