Bug 1558677

Summary: [starter-us-west-1] Hawkular Metrics pod restarted 635 times in 6 days
Product: OpenShift Container Platform
Reporter: John Sanda <jsanda>
Component: Hawkular
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: unspecified
CC: aos-bugs, dma, jcantril, pportant, tkatarki
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 16:30:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1559440, 1559443, 1559448, 1559450    
Bug Blocks:    

Description John Sanda 2018-03-20 18:46:18 UTC
Description of problem:
I was looking at the logs of hawkular-metrics yesterday in starter-us-west-1 and found that one of the hawkular-metrics pods had been restarted 635 times in the past 6 days. The pod was getting OOM killed, and the JVM was throwing OutOfMemoryErrors. I was fortunate enough to capture a heap dump. The heap dump, along with the logs, provided a good indication of what was causing the problem.
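
For reference, the restart count and OOM kills described above can be confirmed with standard oc commands. This is a rough sketch; <hawkular-metrics pod> is a placeholder:

# Restart counts show up in the RESTARTS column
$ oc -n openshift-infra get pods | grep hawkular-metrics

# "Last State: Terminated" with "Reason: OOMKilled" indicates the container was OOM killed
$ oc -n openshift-infra describe pod <hawkular-metrics pod> | grep -A 3 "Last State"

# OutOfMemoryErrors thrown by the JVM appear in the server log
$ oc -n openshift-infra logs <hawkular-metrics pod> | grep -i OutOfMemoryError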

A background job named DeleteExpiredMetrics looks to be the culprit. The OOMEs would occur shortly after the job started. Based on an initial review of the heap dump, there does not appear to be a memory leak; it looks like the job code just pulls back too much data at once with some of the queries it runs. I was able to disable the job, and there have not been any restarts in almost 14 hrs.

There are two other hawkular-metrics pods that have been running without restart issues. The hawkular-metrics pod that executes the job must first acquire a job execution lock, which is maintained in Cassandra and expires via a TTL if it is not explicitly released. Because the job never completed due to the OOME, the same pod continued to own the lock. Upon restart, that pod would rerun the job (which also entails renewing the lock) since it had not finished. This was the cycle the pod was stuck in.
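
To illustrate the TTL-based locking described above, here is a minimal CQL sketch. The table and column names are hypothetical, not the actual hawkular-metrics scheduler schema; the point is that Cassandra drops the lock row automatically once the TTL elapses, so a pod that dies mid-job can re-acquire the lock and rerun the job after it restarts.

-- Hypothetical lock table (the real scheduler table in hawkular-metrics may differ)
create table if not exists hawkular_metrics.job_locks (
    name  text primary key,
    owner text
);

-- Acquire the lock atomically (lightweight transaction) and let it expire after 30 minutes
insert into hawkular_metrics.job_locks (name, owner)
values ('DELETE_EXPIRED_METRICS', '<hawkular-metrics pod>')
if not exists using ttl 1800;

-- If the owning JVM is OOM killed before it releases the lock, the row simply
-- expires after 1800 seconds and the restarted pod acquires it again.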

I will continue to keep an eye on the environment and create another ticket to address the problems with the DeleteExpiredMetrics job.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2018-03-21 14:51:57 UTC
starter-us-east-1 is hitting this issue as well. I will check on other starter clusters and update this ticket with my findings.

Comment 2 John Sanda 2018-03-21 15:33:04 UTC
starter-us-west-2 and starter-us-east-2 have already run into this issue. Based on the cluster size, I expect starter-ca-central-1 to run into this soon as well.

Comment 3 Peter Portante 2018-03-22 01:34:15 UTC
Should this be filed against the product so that we can get these changes into the 3.9 release?

These CPU consuming events can take out logging, preventing the service from collecting logs properly.

Comment 4 John Sanda 2018-03-22 14:12:31 UTC
(In reply to Peter Portante from comment #3)
> Should this be filed against the product so that we can get these changes
> into the 3.9 release?
> 
> These CPU consuming events can take out logging, preventing the service from
> collecting logs properly.

I will create a ticket against the product. After looking at several starter clusters, it is very clear that the DeleteExpiredMetrics job, which runs in the hawkular-metrics server, is the problem. We are going to have to backport the fix as well.

Comment 5 John Sanda 2018-03-27 18:46:58 UTC
There are steps that can be taken to disable the job as a temporary workaround (a rough oc command sketch for steps 1, 2, and 5 follows the list).

1. Edit the hawkular-metrics RC and add the following argument to the hawkular-metrics-wrapper.sh script:

-Dhawkular.metrics.jobs.expiration.enabled=false

2. Scale hawkular-metrics down to zero

3. Run the following command against one of the cassandra pods (can be any of them):

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select time_slice, job_id, job_name from hawkular_metrics.scheduled_jobs_idx"

We are interested in the time_slice and job_id columns for the row whose job_name is DELETE_EXPIRED_METRICS.

4. Run the following command against any cassandra pod to remove the job from the scheduled_jobs_idx table:

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "delete from hawkular_metrics.scheduled_jobs_idx where time_slice = '<time slice>' and job_id = <job id>"

Substitute <time slice> and <job id> with the values from step 3.

5. Scale hawkular-metrics back up
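
Steps 1, 2, and 5 can be done directly with oc. This is a rough sketch, assuming the replication controller is named hawkular-metrics and normally runs a single replica (adjust for your deployment):

# Step 1: append -Dhawkular.metrics.jobs.expiration.enabled=false to the
# hawkular-metrics-wrapper.sh arguments in the container spec
$ oc -n openshift-infra edit rc hawkular-metrics

# Step 2: scale down so the new argument takes effect on the next start
$ oc -n openshift-infra scale rc hawkular-metrics --replicas=0

# Step 5: scale back up after removing the job from scheduled_jobs_idx
$ oc -n openshift-infra scale rc hawkular-metrics --replicas=1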