Bug 1422271
| Summary: | Large partitions make Cassandra unstable and cause requests to fail in Hawkular Metrics | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | John Sanda <jsanda> |
| Component: | Hawkular | Assignee: | Matt Wringe <mwringe> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3.0 | CC: | aos-bugs, bmorriso, gburges, jforrest, jgoulding, jsanda, lizhou, mmahut, pdwyer, pweil, smunilla, sten, trankin, vlaad, whearn, xtian, zhiwliu, zhizhang |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1439910, 1439912 (view as bug list) | Environment: | |
| Last Closed: | 2017-08-10 05:18:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1439910, 1439912 | | |
|
Description
John Sanda
2017-02-14 22:48:10 UTC
This issue is addressed upstream by https://issues.jboss.org/browse/HWKMETRICS-613. 3.6 builds should have this new functionality.

---

Stress testing for this defect should be done on the LB instance. There is currently a related defect, https://bugzilla.redhat.com/show_bug.cgi?id=1468113; this defect will be verified after BZ #1468113 is fixed.

---

@jsanda, I tested and attached the hawkular-cassandra pod log. It does log a warning when a large partition exceeds the 100 MB threshold; see the following messages (106891573 bytes = 101.94 MB):

```
2017-07-15 02:44:11,061 BigTableWriter.java:171 - Writing large partition hawkular_metrics/metrics_idx:clusterproject:0 (106891573 bytes to sstable /cassandra_data/data/hawkular_metrics/metrics_idx-7701404068a611e795d11b216051c746/mc-67-big-Data.db)
WARN  [SharedPool-Worker-20] 2017-07-15 02:44:35,286 NoSpamLogger.java:94 - Unlogged batch covering 12 partitions detected against table [hawkular_metrics.data]. You should use a logged batch for atomicity, or asynchronous writes for performance.
```

But I also see warning messages for 165956826 bytes = 158.27 MB. My question: if the large partition keeps growing, say to 200 or 300 MB, does the program do nothing except log warnings indicating that the threshold has been exceeded?

```
WARN  [CompactionExecutor:491] 2017-07-15 08:20:45,238 BigTableWriter.java:171 - Writing large partition hawkular_metrics/metrics_idx:clusterproject:0 (165956826 bytes to sstable /cassandra_data/data/hawkular_metrics/metrics_idx-7701404068a611e795d11b216051c746/mc-99-big-Data.db)
WARN  [SharedPool-Worker-1] 2017-07-15 08:21:05,760 NoSpamLogger.java:94 - Unlogged batch covering 16 partitions detected against table [hawkular_metrics.data]. You should use a logged batch for atomicity, or asynchronous writes for performance.
```

Images from the ops registry:

```
metrics-hawkular-metrics   v3.6.140   3a5bebd0476a   6 days ago   1.293 GB
metrics-heapster           v3.6.140   5549c67d8607   6 days ago   274.4 MB
metrics-cassandra          v3.6.140   9644ec21e399   6 days ago   573.2 MB
```

---

Junqi Zhao (comment #66):

Vlaad (vlaad) created 6500 pods and deleted them under one project, and I checked the hawkular-cassandra and hawkular-metrics pod logs. There were warning messages when a large partition grew beyond the compaction_large_partition_warning_threshold_mb setting. For OCP 3.6 we have introduced a background job in hawkular-metrics that cleans up the index tables, removing rows for the deleted pods. This should help prevent those partitions from constantly getting bigger.

Images from the ops registry:

```
metrics-hawkular-metrics   v3.6.140   3a5bebd0476a   6 days ago   1.293 GB
metrics-heapster           v3.6.140   5549c67d8607   6 days ago   274.4 MB
metrics-cassandra          v3.6.140   9644ec21e399   6 days ago   573.2 MB
```

---

John Sanda (comment #67):

(In reply to Junqi Zhao from comment #66)
> Vlaad (vlaad) created 6500 pods and deleted them under one project, and I
> checked the hawkular-cassandra and hawkular-metrics pod logs [...]

The deletion job only runs once a week by default. It can be scheduled to run more frequently by setting the METRICS_EXPIRATION_JOB_FREQUENCY envar, whose value is interpreted in days.
---

Matt Wringe (comment #68):

(In reply to John Sanda from comment #67)
> The deletion job only runs once a week by default. It can be scheduled to
> run more frequently by setting the METRICS_EXPIRATION_JOB_FREQUENCY envar,
> whose value is interpreted in days.

Should we set this default lower than 7 days? Or expose this parameter in Ansible?

---

(In reply to Matt Wringe from comment #68)
> Should we set this default lower than 7 days? Or expose this parameter in
> Ansible?

I think the default of 7 days was based on the default data retention of 7 days. The job will not delete any metric definitions if they still have live data points. To keep things consistent, it probably makes sense to expose the setting in Ansible. There are one or two other properties that might need to be exposed as well. I will take a look and create a ticket and PR for the changes.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
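As a footnote to the partition-size warnings quoted in the comments above: the byte counts in the Cassandra "Writing large partition" log lines can be converted to MB to compare against the 100 MB compaction_large_partition_warning_threshold_mb default. A quick sketch (the helper function is illustrative, not part of any tooling in this bug):

```shell
# Convert the byte counts from the Cassandra warnings to MB (1 MB = 1024*1024
# bytes here, matching the figures quoted in the comments above).
bytes_to_mb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024) }'
}

bytes_to_mb 106891573   # first warning:  ~101.9 MB, just over the threshold
bytes_to_mb 165956826   # later warning:  ~158.3 MB, the partition kept growing
```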