Bug 1439912
Summary: | Large partitions make Cassandra unstable and cause requests to fail in Hawkular Metrics | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | John Sanda <jsanda> |
Component: | Hawkular | Assignee: | Matt Wringe <mwringe> |
Status: | CLOSED ERRATA | QA Contact: | Liming Zhou <lizhou> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.5.0 | CC: | aos-bugs, bmorriso, gburges, jforrest, jgoulding, jsanda, juzhao, mmahut, mwringe, pdwyer, penli, sten, tdawson, whearn, wsun, xiazhao, zhiwliu, zhizhang |
Target Milestone: | --- | Keywords: | OpsBlocker |
Target Release: | 3.5.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1422271 | Environment: | |
Last Closed: | 2017-05-18 09:28:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1422271 | ||
Bug Blocks: | 1439910 |
Comment 5
Junqi Zhao
2017-05-09 09:21:22 UTC
(In reply to Junqi Zhao from comment #5)
> @Matt,
>
> We use ansible to deploy metrics since 3.5.0,
> from https://issues.jboss.org/browse/HWKMETRICS-606, we should have one
> openshift ansible parameter to set partition threshold, do you know where we
> can find this parameter?

The problem is that with larger partition sizes we have been running into issues because the compaction strategy used to handle those partitions was not working very well. We have moved to a different compaction strategy, which should work better with the types of data that we are storing. There is no extra parameter or anything else that needs to be set.

(In reply to Matt Wringe from comment #6)
> The problem is that with larger partition sizes we have been running into
> issues because the compaction strategy used to handle those partitions was
> not working very well. We have moved to a different compaction strategy,
> which should work better with the types of data that we are storing. There
> is no extra parameter or anything else that needs to be set.

Thanks a lot. I see compaction_large_partition_warning_threshold_mb=100 in the hawkular-cassandra pod log. I think we can verify this fix with the following steps:

1. Create a lot of projects to consume memory, CPU and network resources, so that data accumulates in the Cassandra partitions.
2. Check the hawkular-cassandra and hawkular-metrics pod logs and make sure there are no warnings such as "WARN 18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)".

Do you think this approach is good enough to verify this defect?
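The log check in step 2 above could be scripted roughly as follows. This is only a sketch: the regular expression is modeled on the single warning line quoted in this report, the function name is made up for illustration, and the real log layout may differ between Cassandra versions.

```python
import re

# Assumption: the pattern follows the one warning line quoted in this bug;
# other Cassandra versions may format the message differently.
LARGE_PARTITION_RE = re.compile(
    r"WARN\s+(?P<time>\d{2}:\d{2}:\d{2})\s+Writing large partition\s+"
    r"(?P<partition>\S+)\s+\((?P<size>[^)]*) bytes\)"
)


def find_large_partition_warnings(log_text):
    """Return a (time, partition, size) tuple for each matching warning."""
    return [
        (m.group("time"), m.group("partition"), m.group("size"))
        for m in LARGE_PARTITION_RE.finditer(log_text)
    ]


# Small demonstration on a made-up log excerpt (the byte count is invented).
sample_log = (
    "INFO 18:29:52 Compacting some sstables\n"
    "WARN 18:29:53 Writing large partition "
    "hawkular_metrics/metrics_idx:ops-health-monitoring:2 (123456 bytes)\n"
)
for time, partition, size in find_large_partition_warnings(sample_log):
    print(f"{time} {partition} ({size} bytes)")
```

To use it against a live deployment, the pod log text (for example, saved from `oc logs` for the hawkular-cassandra pod) would be passed to the function; an empty result means no large-partition warning was found.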
Vlaad(vlaad) created 6500 pods under one project and then deleted them. I checked the hawkular-cassandra and hawkular-metrics pod logs, and there were no warnings such as "WARN 18:29:53 Writing large partition hawkular_metrics/metrics_idx:ops-health-monitoring:2 (****** bytes)".

We did find another performance issue: https://bugzilla.redhat.com/show_bug.cgi?id=1451209. That issue is not related to BZ #1439912, so this bug can be closed.

docker images | grep metrics
openshift3/metrics-cassandra          3.5.0   309234b6f5fe   3 days ago   539.5 MB
openshift3/metrics-heapster           3.5.0   525312ae7d60   3 days ago   317.9 MB
openshift3/metrics-hawkular-metrics   3.5.0   fe477ed220e1   3 days ago   1.269 GB

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1235

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.