Bug 1439912

Summary:	Large partitions make Cassandra unstable and cause requests to fail in Hawkular Metrics
Product:	OpenShift Container Platform	Reporter:	John Sanda <jsanda>
Component:	Hawkular	Assignee:	Matt Wringe <mwringe>
Status:	CLOSED ERRATA	QA Contact:	Liming Zhou <lizhou>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.5.0	CC:	aos-bugs, bmorriso, gburges, jforrest, jgoulding, jsanda, juzhao, mmahut, mwringe, pdwyer, penli, sten, tdawson, whearn, wsun, xiazhao, zhiwliu, zhizhang
Target Milestone:	---	Keywords:	OpsBlocker
Target Release:	3.5.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1422271	Environment:
Last Closed:	2017-05-18 09:28:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1422271
Bug Blocks:	1439910

Comment 5 Junqi Zhao 2017-05-09 09:21:22 UTC

@Matt,

We have used Ansible to deploy metrics since 3.5.0.
According to https://issues.jboss.org/browse/HWKMETRICS-606, there should be an openshift-ansible parameter for setting the partition threshold. Do you know where we can find this parameter?

Comment 6 Matt Wringe 2017-05-09 13:31:23 UTC

(In reply to Junqi Zhao from comment #5)
> @Matt,
> 
> We have used Ansible to deploy metrics since 3.5.0.
> According to https://issues.jboss.org/browse/HWKMETRICS-606, there should be
> an openshift-ansible parameter for setting the partition threshold. Do you
> know where we can find this parameter?

The problem is that with larger partition sizes we ran into issues because the compaction strategy used to handle those partitions was not working very well. We have moved to a different compaction strategy, which should work better with the types of data that we store. No extra parameter or other setting needs to be configured.

Comment 7 Junqi Zhao 2017-05-10 01:45:47 UTC

(In reply to Matt Wringe from comment #6)

> The problem is that with larger partition sizes we ran into issues because
> the compaction strategy used to handle those partitions was not working very
> well. We have moved to a different compaction strategy, which should work
> better with the types of data that we store. No extra parameter or other
> setting needs to be configured.


Thanks a lot. I see compaction_large_partition_warning_threshold_mb=100 in the hawkular-cassandra pod log. I think we can verify this fix with the following steps:

1. Create a lot of projects to consume memory, CPU and network resources, so that metrics data is written to Cassandra partitions.

2. Check the hawkular-cassandra and hawkular-metrics pod logs and make sure there are no warnings such as:
"WARN  18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)"

Do you think these steps are sufficient to verify this defect?
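
The log check in step 2 could be scripted roughly as follows. This is only a sketch: the regular expression is modeled on the single warning line quoted above, so the exact log layout is an assumption, and the function name and the sample log text (including the byte count) are hypothetical.

```python
import re

# Pattern modeled on the warning quoted in this bug; the real log layout
# may differ, so treat this expression as an assumption.
LARGE_PARTITION_RE = re.compile(
    r"WARN\s+(?P<time>\d{2}:\d{2}:\d{2})\s+Writing large partition\s+"
    r"(?P<partition>\S+)\s+\((?P<size>[^)]*) bytes\)"
)


def find_large_partition_warnings(log_text):
    """Return (time, partition, size) string tuples for each matching line."""
    hits = []
    for line in log_text.splitlines():
        match = LARGE_PARTITION_RE.search(line)
        if match:
            hits.append(
                (match.group("time"), match.group("partition"), match.group("size"))
            )
    return hits


if __name__ == "__main__":
    # Hypothetical sample; the byte count is made up for illustration.
    sample = (
        "INFO  18:29:50 some unrelated message\n"
        "WARN  18:29:53 Writing large partition "
        "hawkular_metrics/metrics_idx:ops-health-monitoring:2 (104857601 bytes)\n"
    )
    for hit in find_large_partition_warnings(sample):
        print(hit)
```

The saved output of `oc logs` for the hawkular-cassandra and hawkular-metrics pods could be passed in as `log_text`; an empty result would mean no large-partition warnings were found.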

Comment 10 Junqi Zhao 2017-05-16 06:56:06 UTC

Vlaad (vlaad) created 6500 pods under one project and then deleted them. I checked the hawkular-cassandra and hawkular-metrics pod logs, and there were no warnings such as:
"WARN  18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)"

However, we found another performance issue: https://bugzilla.redhat.com/show_bug.cgi?id=1451209. Since that issue is not related to BZ #1439912, I am closing this bug.

docker images | grep metrics
openshift3/metrics-cassandra                   3.5.0               309234b6f5fe        3 days ago          539.5 MB
openshift3/metrics-heapster                    3.5.0               525312ae7d60        3 days ago          317.9 MB
openshift3/metrics-hawkular-metrics            3.5.0               fe477ed220e1        3 days ago          1.269 GB

Comment 12 errata-xmlrpc 2017-05-18 09:28:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1235

Comment 13 Red Hat Bugzilla 2023-09-14 03:56:05 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days