Bug 1439912 - Large partitions make Cassandra unstable and cause requests to fail in Hawkular Metric [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.z
Assignee: Matt Wringe
QA Contact: Liming Zhou
URL:
Whiteboard:
Depends On: 1422271
Blocks: 1439910
Reported: 2017-04-06 20:02 UTC by John Sanda
Modified: 2017-05-18 09:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1422271
Environment:
Last Closed: 2017-05-18 09:28:10 UTC
Target Upstream Version:
mwringe: needinfo? (jsanda)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker HWKMETRICS-606 0 Major Closed Large (> 100 MB) partitions in metrics_idx table make Cassandra unstable 2020-05-21 16:00:12 UTC
Red Hat Product Errata RHBA-2017:1235 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5, 3.4, 3.3, and 3.1 bug fix update 2017-05-18 13:15:52 UTC

Comment 5 Junqi Zhao 2017-05-09 09:21:22 UTC
@Matt,

We have used Ansible to deploy metrics since 3.5.0.
According to https://issues.jboss.org/browse/HWKMETRICS-606, there should be an openshift-ansible parameter for setting the partition threshold. Do you know where we can find this parameter?

Comment 6 Matt Wringe 2017-05-09 13:31:23 UTC
(In reply to Junqi Zhao from comment #5)
> @Matt,
> 
> We use ansible to deploy metrics since 3.5.0,
> from https://issues.jboss.org/browse/HWKMETRICS-606, we should have one
> openshift ansible parameter to set partition threshold, do you know where we
> can find this parameter?

The problem is that with larger partition sizes we were running into issues because the compaction strategy used to handle those partitions was not working very well. We have moved to a different compaction strategy, which should work better with the types of data that we are storing. No extra parameter or other setting is needed.

Comment 7 Junqi Zhao 2017-05-10 01:45:47 UTC
(In reply to Matt Wringe from comment #6)

> The problem is that with larger partition sizes we have been running into
> issues because the compaction strategy to handle those partitions were not
> working very well. We have moved to a different compaction strategy which
> should work better with the types of data that we are storing. There is no
> extra parameter or anything else which needs to be set.


Thanks a lot. I see compaction_large_partition_warning_threshold_mb=100 in the hawkular-cassandra pod log, so I think we can verify this fix with the following steps:

1. Create a lot of projects to consume memory, CPU, and network resources, so that data accumulates in the Cassandra partitions.

2. Check the hawkular-cassandra and hawkular-metrics pod logs and make sure they contain no warnings like:
"WARN  18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)"

Do you think these steps are sufficient to verify this defect?
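
A minimal sketch of the log check in step 2, assuming the pod log has been saved as text and that the warning keeps the layout quoted above. The function name and the byte count in the usage example are made up for illustration; the quoted warning hides the real size.

```python
import re

# Pattern modeled on the warning quoted in this comment. The exact layout in
# other Cassandra versions is an assumption, not something confirmed here.
LARGE_PARTITION_RE = re.compile(
    r"WARN\s+(?P<time>\d{2}:\d{2}:\d{2})\s+Writing large partition\s+"
    r"(?P<keyspace>[^/\s]+)/(?P<table>[^:\s]+):(?P<key>.+?)\s+"
    r"\((?P<size>\d+) bytes\)"
)


def find_large_partition_warnings(log_text):
    """Return one dict per 'Writing large partition' warning in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = LARGE_PARTITION_RE.search(line)
        if match:
            hit = match.groupdict()
            hit["size"] = int(hit["size"])  # partition size in bytes
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Hypothetical sample: the byte count is invented for the example.
    sample = (
        "WARN  18:29:53 Writing large partition "
        "hawkular_metrics/metrics_idx:ops-health-monitoring:2 (123456789 bytes)\n"
        "INFO  18:29:54 some unrelated log line\n"
    )
    for hit in find_large_partition_warnings(sample):
        print(hit["keyspace"], hit["table"], hit["key"], hit["size"])
```

The text passed in could be the saved output of `oc logs` for the hawkular-cassandra pod; an empty result would mean no large-partition warnings were logged in that capture.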

Comment 10 Junqi Zhao 2017-05-16 06:56:06 UTC
Vlaad (vlaad@redhat.com) created 6500 pods under one project and then deleted them. I checked the hawkular-cassandra and hawkular-metrics pod logs and found no warnings like:
"WARN  18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)"

We did find another performance issue: https://bugzilla.redhat.com/show_bug.cgi?id=1451209. That defect is not related to BZ #1439912, so I am closing this one.

docker images | grep metrics
openshift3/metrics-cassandra                   3.5.0               309234b6f5fe        3 days ago          539.5 MB
openshift3/metrics-heapster                    3.5.0               525312ae7d60        3 days ago          317.9 MB
openshift3/metrics-hawkular-metrics            3.5.0               fe477ed220e1        3 days ago          1.269 GB

Comment 12 errata-xmlrpc 2017-05-18 09:28:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1235

