Bug 1559440

Summary: Hawkular Metrics crashes with OutOfMemoryError under moderate load
Product: OpenShift Container Platform Reporter: John Sanda <jsanda>
Component: Hawkular Assignee: John Sanda <jsanda>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.9.0 CC: aos-bugs, jsanda
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1559443 (view as bug list) Environment:
Last Closed: 2018-07-30 19:10:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1558677, 1559443, 1559448, 1559450    

Description John Sanda 2018-03-22 14:47:31 UTC
Description of problem:
DeleteExpiredMetrics is a background job that runs in the hawkular-metrics server. It was introduced in OCP 3.6 to clean up index tables by removing rows for metrics/pods that no longer exist. This was needed because the indexes grow without bound, which led to other problems. The queries that the job performs can pull back a large amount of data, which makes hawkular-metrics very susceptible to OOMEs under a relatively modest load. I first observed the problem in a cluster of about 9k pods, but I also observed it in a cluster with around 2k pods. Increasing the heap size of hawkular-metrics could alleviate the issue; however, there is no need for the job to query as aggressively as it does, since it is not latency sensitive.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Configure hawkular-metrics with 3 GB of memory. This should result in a JVM heap size of around 1300 MB.
2. Create about 2k pods
3. Let the DeleteExpiredMetrics job run
4. The job is only scheduled to run every 7 days. I can provide additional details to make the job run more frequently to assist with testing

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2018-05-17 07:14:16 UTC
@John

We don't want to let metrics run for 7 days; that would take too long to verify. Is there a better way to test this defect?

And could you share the details about the DeleteExpiredMetrics job?

Thanks

Comment 3 John Sanda 2018-05-17 13:21:03 UTC
For this bug we removed the job altogether, so it no longer executes and is no longer in the code base. There are a few things that can be checked to verify that the removal is complete.

1) Verify that the job is not scheduled

$ oc -n openshift-infra exec <any_cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx" | grep DELETE_EXPIRED_METRICS

No matches should be returned.

2) Verify that the metrics_expiration_idx table has been dropped

$ oc -n openshift-infra exec <any_cassandra_pod> -- cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'" | grep metrics_expiration_idx

No matches should be returned

3) Verify that the job configuration has been removed from cassandra

$ oc -n openshift-infra exec <any_cassandra_pod> -- cqlsh --ssl -e "select * from hawkular_metrics.sys_config where config_id = 'org.hawkular.metrics.jobs.DELETE_EXPIRED_METRICS'"

This should return an empty result set
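
The three checks above could be combined into one helper along these lines. This is only a rough sketch: the function names run_cql and verify_removal are made up for illustration, and it assumes that cqlsh prints "(0 rows)" for an empty result set and that the oc client is already logged in to the cluster.

```shell
# Sketch only: run_cql and verify_removal are hypothetical helper names.

# Run one CQL statement inside the given Cassandra pod.
run_cql() {
  oc -n openshift-infra exec "$1" -- cqlsh --ssl -e "$2"
}

# Usage: verify_removal <any_cassandra_pod>
# Returns 0 if all checks pass, 1 if any check fails, 2 if a query fails.
verify_removal() {
  local pod="$1" out failed=0

  # 1) The job must not be scheduled.
  out="$(run_cql "$pod" "select * from hawkular_metrics.scheduled_jobs_idx")" || return 2
  if printf '%s\n' "$out" | grep -q -F "DELETE_EXPIRED_METRICS"; then
    echo "FAIL: DELETE_EXPIRED_METRICS is still scheduled"
    failed=1
  fi

  # 2) The metrics_expiration_idx table must have been dropped.
  out="$(run_cql "$pod" "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'")" || return 2
  if printf '%s\n' "$out" | grep -q -F "metrics_expiration_idx"; then
    echo "FAIL: metrics_expiration_idx table still exists"
    failed=1
  fi

  # 3) The job configuration must be gone.
  #    Assumption: cqlsh reports an empty result set as "(0 rows)".
  out="$(run_cql "$pod" "select * from hawkular_metrics.sys_config where config_id = 'org.hawkular.metrics.jobs.DELETE_EXPIRED_METRICS'")" || return 2
  if ! printf '%s\n' "$out" | grep -q -F "(0 rows)"; then
    echo "FAIL: job configuration is still present in sys_config"
    failed=1
  fi

  if [ "$failed" -eq 0 ]; then
    echo "OK: DeleteExpiredMetrics removal looks complete"
  fi
  return "$failed"
}
```

After sourcing the functions, call verify_removal <any_cassandra_pod>. Each query's output is captured before it is inspected, so a failed oc or cqlsh call is reported as an error (return code 2) instead of being mistaken for "no matches".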

Comment 4 Junqi Zhao 2018-05-17 14:26:17 UTC
(In reply to John Sanda from comment #3)

From your comments, I think these steps are enough to verify this defect, and we don't need to let metrics run for a few days, since DeleteExpiredMetrics has already been dropped from the code. Am I right?

Comment 5 John Sanda 2018-05-17 14:37:15 UTC
(In reply to Junqi Zhao from comment #4)
> From your comments, I think these steps are enough to verify this defect,
> and we don't need to let metrics run for a few days, since
> DeleteExpiredMetrics has already been dropped from the code. Am I right?

Yes, that is correct.

Comment 6 Junqi Zhao 2018-05-18 08:02:59 UTC
For the verification steps, please see Comment 3. The DeleteExpiredMetrics job has already been dropped from the code.


metrics-cassandra/images/v3.10.0-0.47.0.0
metrics-schema-installer/images/v3.10.0-0.47.0.0
metrics-hawkular-metrics/images/v3.10.0-0.47.0.0
metrics-hawkular-metrics/images/v3.10.0-0.47.0.0

# openshift version
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Comment 8 errata-xmlrpc 2018-07-30 19:10:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816