Bug 1451209 - [Performance issue] hawkular-cassandra pool is busy, there is no available connection and the queue has reached its max size 256
Summary: [Performance issue] hawkular-cassandra pool is busy, there is no available connection and the queue has reached its max size 256
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: John Sanda
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-16 06:50 UTC by Junqi Zhao
Modified: 2017-11-28 21:54 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 21:54:33 UTC
Target Upstream Version:
Embargoed:


Attachments
Metrics web console UI (98.14 KB, image/png), 2017-05-16 06:50 UTC, Junqi Zhao
hawkular metric pod log (9.51 MB, text/plain), 2017-05-16 06:52 UTC, Junqi Zhao
heapster pod log (10.89 MB, text/plain), 2017-05-16 06:52 UTC, Junqi Zhao
hawkular cassandra pod log (48.17 KB, text/plain), 2017-05-16 06:53 UTC, Junqi Zhao
metrics pods log (12.67 KB, text/plain), 2017-10-13 00:53 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker HWKMETRICS-597 0 Major Resolved Investigate adding backpressure for Cassandra driver 2020-09-14 06:51:55 UTC
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Junqi Zhao 2017-05-16 06:50:45 UTC
Created attachment 1279215 [details]
Metrics web console UI

Description of problem:
When verifying https://bugzilla.redhat.com/show_bug.cgi?id=1439912 on one m4.xlarge master and two m4.xlarge nodes, Vlaad created 6500 pods under one project and then deleted them. Afterwards I checked the hawkular-metrics pod log, and there was an error trace indicating "Pool is busy (no available connection and the queue has reached its max size 256)".
The same error appears on the web console. This is a performance issue; it did not happen during our daily functional testing.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# docker images | grep metrics
openshift3/metrics-cassandra                   3.5.0               309234b6f5fe        3 days ago          539.5 MB
openshift3/metrics-heapster                    3.5.0               525312ae7d60        3 days ago          317.9 MB
openshift3/metrics-hawkular-metrics            3.5.0               fe477ed220e1        3 days ago          1.269 GB

How reproducible:
Found during performance testing.

Steps to Reproduce:
1. Deploy Metrics 3.5.0
2. Create one project, then create 6500 pods and delete them under this project
3. Check metrics logs and web console UI.

Actual results:
3. The hawkular-metrics pod log shows "Pool is busy (no available connection and the queue has reached its max size 256)"; see the attached file for more information.

Expected results:
3. No errors in the pod logs, and CPU, memory and network usage can be checked in the web console.

Additional info:
Attached metrics pod logs and web console UI.

Comment 1 Junqi Zhao 2017-05-16 06:52:01 UTC
Created attachment 1279216 [details]
hawkular metric pod log

Comment 2 Junqi Zhao 2017-05-16 06:52:44 UTC
Created attachment 1279217 [details]
heapster pod log

Comment 3 Junqi Zhao 2017-05-16 06:53:19 UTC
Created attachment 1279218 [details]
hawkular cassandra pod log

Comment 4 Matt Wringe 2017-05-16 07:14:53 UTC
@jsanda: can you please take a look?

Comment 6 John Sanda 2017-05-22 21:53:06 UTC
This issue started to become more prevalent when we upgraded the Cassandra driver from 3.0.1 to 3.0.5. Some of the internal driver code for obtaining a connection was refactored to be non-blocking, making it easier for the client, i.e., Hawkular Metrics, to saturate the driver to the point where it throws a BusyPoolException. We may need to implement a throttling or back pressure mechanism in Hawkular Metrics since it does not look like it is going to be done in the Cassandra driver.
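
For illustration, here is a minimal sketch of the kind of throttling described above, assuming the DataStax Java driver 3.x async API (Session.executeAsync). This is not the actual Hawkular Metrics code; the ThrottledSession class and the maxInFlightRequests limit are hypothetical names chosen for the example:

import java.util.concurrent.Semaphore;

import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

/**
 * Hypothetical sketch of client-side throttling around the async driver API.
 * A semaphore caps the number of in-flight requests so the per-host request
 * queue (max size 256 in the logs above) never fills up and the driver never
 * has to throw BusyPoolException.
 */
public class ThrottledSession {

    private final Session session;
    private final Semaphore permits;

    public ThrottledSession(Session session, int maxInFlightRequests) {
        this.session = session;
        this.permits = new Semaphore(maxInFlightRequests);
    }

    public ResultSetFuture executeAsync(Statement statement) throws InterruptedException {
        // Block the producer instead of letting the driver's queue overflow.
        permits.acquire();
        ResultSetFuture future = session.executeAsync(statement);
        // Release the permit when the request completes, whether it succeeded
        // or failed; Runnable::run executes the release on the completing thread.
        future.addListener(permits::release, Runnable::run);
        return future;
    }
}

Blocking the producer on the semaphore applies back pressure before the driver's queue fills, which is the behavior the driver developers expect each client to provide for itself.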

Comment 7 Davi Garcia 2017-07-28 14:55:39 UTC
I'm seeing the same behavior on my lab cluster, but I'm not doing performance testing. At the moment I have around 100 pods:

[root@ocp-master-1 ~]# oc get pods --all-namespaces | wc -l
106

This is my environment version:

[root@ocp-master-1 ~]# openshift version
openshift v3.5.5.26
kubernetes v1.5.2+43a9be4
etcd 3.1.0

This is the log from my Hawkular pod:

2017-07-28 14:45:27,549 WARN  [org.hawkular.metrics.scheduler.impl.LockManager] (RxComputationScheduler-1) There was an error renewing locks: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.243.48:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [hawkular-cassandra/172.30.243.48] Pool is busy (no available connection and the queue has reached its max size 256)))
(...)

These are my metrics component versions:

[root@ocp-node-1 ~]# docker images | grep metrics
registry.access.redhat.com/openshift3/metrics-hawkular-metrics    3.5.0               e0d108bd9b0c        10 days ago         1.27 GB
registry.access.redhat.com/openshift3/metrics-heapster            3.5.0               4e29df6bda85        4 weeks ago         318.5 MB
registry.access.redhat.com/openshift3/metrics-hawkular-metrics    <none>              6b8ebf578b45        4 weeks ago         1.27 GB

Comment 10 Matt Wringe 2017-08-03 13:41:45 UTC
@jsanda: do you think this is a bug somewhere, or is it just that the connection pool is oversaturated and we need to deploy an additional Hawkular Metrics instance to handle more load?

Comment 11 John Sanda 2017-08-03 14:35:41 UTC
From a user perspective this should be considered a bug. The real problem is the Cassandra driver: its developers encourage users to use the async API for performance, but that requires throttling to avoid saturating the driver. Some people, including myself, have advocated for the driver handling this instead of each client dealing with it; the driver developers disagree. We are looking to address this in https://issues.jboss.org/browse/HWKMETRICS-597.
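
To illustrate the failure mode, here is a hypothetical sketch (not code taken from Hawkular Metrics): an unbounded async write loop has nothing limiting the number of in-flight requests, so once every pooled connection is at its request limit and the per-host queue (max size 256) is full, the driver fails fast with BusyPoolException, which callers see wrapped in a NoHostAvailableException as in the log lines quoted above:

import java.util.List;

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

class UnthrottledWriter {
    // Hypothetical helper: fires every statement with no back pressure at all.
    static void writeAll(Session session, List<Statement> statements) {
        for (Statement statement : statements) {
            // Each call queues another request on the connection pool; the
            // returned future is ignored, so nothing slows the loop down.
            session.executeAsync(statement);
        }
    }
}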

Comment 12 Matt Wringe 2017-08-04 15:28:08 UTC
This may also be related to https://issues.jboss.org/browse/HWKMETRICS-710 but we need to investigate further.

Comment 14 Matt Wringe 2017-10-05 18:56:46 UTC
The 3.7 release should have the update that includes https://issues.jboss.org/browse/HWKMETRICS-597

Marking this as ON_QA

Comment 15 Junqi Zhao 2017-10-13 00:52:35 UTC
@Matt
Created 20 projects, each with one pod, and ran for about 9 hours. I checked the hawkular-metrics pod logs during this period and did not see the error "Pool is busy (no available connection and the queue has reached its max size 256)". However, after 9 hours the hawkular-metrics pod log has only one line, and the hawkular-cassandra and heapster pods also have only a few log lines. See the attached file.

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-swl4q   1/1       Running   0          10h
hawkular-metrics-x06v8       1/1       Running   0          10h
heapster-x6qpd               1/1       Running   0          10h

# oc logs hawkular-metrics-x06v8
2017-10-12 21:00:08,917 INFO  [org.hawkular.metrics.core.jobs.TempDataCompressor] (RxComputationScheduler-4) Finished processing data in 2273 ms

Is it normal?

Comment 16 Junqi Zhao 2017-10-13 00:53:23 UTC
Created attachment 1338011 [details]
metrics pods log

Comment 17 Matt Wringe 2017-10-13 12:53:56 UTC
I believe the TempDataCompressor INFO level message in the logs is normal, but I will let jsanda respond in case it's not.

Normally INFO level messages are not something to be concerned about; anything that should be taken more seriously would be logged at the WARN or ERROR level.

Comment 18 John Sanda 2017-10-13 13:33:55 UTC
(In reply to Junqi Zhao from comment #15)
> @Matt
> Created 20 projects, each with one pod, and ran for about 9 hours. I checked
> the hawkular-metrics pod logs during this period and did not see the error
> "Pool is busy (no available connection and the queue has reached its max
> size 256)". However, after 9 hours the hawkular-metrics pod log has only one
> line, and the hawkular-cassandra and heapster pods also have only a few log
> lines. See the attached file.
> 
> # oc get po
> NAME                         READY     STATUS    RESTARTS   AGE
> hawkular-cassandra-1-swl4q   1/1       Running   0          10h
> hawkular-metrics-x06v8       1/1       Running   0          10h
> heapster-x6qpd               1/1       Running   0          10h
> 
> # oc logs hawkular-metrics-x06v8
> 2017-10-12 21:00:08,917 INFO 
> [org.hawkular.metrics.core.jobs.TempDataCompressor]
> (RxComputationScheduler-4) Finished processing data in 2273 ms
> 
> Is it normal?

The message from TempDataCompressor is new in 3.7, and it is normal.

In the initial description you said that you ran 6500 pods. Could you test with that again, or at least with a larger number of pods that will put more load on hawkular-metrics? If I read correctly, you are currently only testing with 20 pods.

Comment 19 Junqi Zhao 2017-10-17 00:51:29 UTC
(In reply to John Sanda from comment #18)
> 
> The message from TempDataCompressor is new in 3.7, and it is normal.
 What I want to say is that the hawkular-metrics pod has only one line of log output; it should not have so few logs.

> In the initial description you said that you ran 6500 pods. Could you test
> with that again or at least a larger number of pods that will put more load
> on hawkular-metrics. If I read correctly you are only testing with 20 pods
> currently.
 Will re-test with more than 6500 pods.

Comment 20 Mike Fiedler 2017-10-25 14:10:53 UTC
Verified on:   registry.ops.openshift.com/openshift3/metrics-hawkular-metrics    v3.7.0-0.158.0      35127125228a

1.  Created 1250 projects with 25K pods
2.  Created a single project with 10K pods
3.  Verified in both cases that metrics are flowing successfully

No pool busy (or other) errors in the pod logs.

Comment 24 errata-xmlrpc 2017-11-28 21:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

