Bug 1558677
Summary: | [starter-us-west-1] Hawkular Metrics pod restarted 635 times in 6 days | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | John Sanda <jsanda> |
Component: | Hawkular | Assignee: | John Sanda <jsanda> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | unspecified | CC: | aos-bugs, dma, jcantril, pportant, tkatarki |
Target Milestone: | --- | ||
Target Release: | 3.9.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-07-30 16:30:46 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1559440, 1559443, 1559448, 1559450 | ||
Bug Blocks: |
Description
John Sanda
2018-03-20 18:46:18 UTC
starter-us-east-1 is hitting this issue as well. I will check the other starter clusters and update this ticket with my findings.

starter-us-west-2 and starter-us-east-2 have already run into this issue. Based on the cluster size, I expect starter-ca-central-1 to run into it soon as well.

Should this be filed against the product so that we can get these changes into the 3.9 release? These CPU-consuming events can take out logging, preventing the service from collecting logs properly.

(In reply to Peter Portante from comment #3)
> Should this be filed against the product so that we can get these changes
> into the 3.9 release?
>
> These CPU consuming events can take out logging, preventing the service from
> collecting logs properly.

I will create a ticket against the product. After looking at several starter clusters, it is very clear that the DeleteExpiredMetrics job, which runs in the hawkular-metrics server, is the problem. We are going to have to backport the changes as well.

The job can be disabled as a temporary workaround:

1. Edit the hawkular-metrics RC and add the following argument to the hawkular-metrics-wrapper.sh script:

-Dhawkular.metrics.jobs.expiration.enabled=false

2. Scale hawkular-metrics down to zero.

3. Run the following command against one of the cassandra pods (any of them will do):

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select time_slice, job_id, job_name from hawkular_metrics.scheduled_jobs_idx"

We are interested in the time_slice and job_id columns for the row whose job_name is DELETE_EXPIRED_METRICS.

4. Run the following command against any cassandra pod to remove the job from the scheduled_jobs_idx table:

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "delete from hawkular_metrics.scheduled_jobs_idx where time_slice = '<time slice>' and job_id = <job id>"

Substitute <time slice> and <job id> with the values from step 3.

5. Scale hawkular-metrics back up.
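
Steps 2 through 5 of the workaround could be scripted roughly as follows. This is only a sketch under several assumptions that are not stated in this ticket: the replication controller is assumed to be named hawkular-metrics, one replica is assumed when scaling back up, and cqlsh is assumed to print its usual pipe-separated table. The function names are hypothetical helpers, and step 1 (editing the RC) still has to be done by hand first.

```shell
#!/bin/sh
# Sketch of workaround steps 2-5. Helper names are hypothetical.

NAMESPACE="openshift-infra"
JOB_NAME="DELETE_EXPIRED_METRICS"

# Read cqlsh's pipe-separated table on stdin and print
# "<time_slice>|<job_id>" for the first row whose job_name is $JOB_NAME.
extract_job_row() {
    awk -F'|' -v job="$JOB_NAME" '
        NF == 3 {
            # Trim the padding cqlsh puts around each column.
            for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($3 == job) { print $1 "|" $2; exit }
        }'
}

# Build the CQL delete statement from step 4.
# $1 = time slice, $2 = job id
build_delete_cql() {
    printf "delete from hawkular_metrics.scheduled_jobs_idx where time_slice = '%s' and job_id = %s" "$1" "$2"
}

# Run steps 2-5 against the given cassandra pod.
# $1 = cassandra pod name
remove_expiration_job() {
    pod="$1"

    # Step 2: scale hawkular-metrics down (RC name is an assumption).
    oc -n "$NAMESPACE" scale rc hawkular-metrics --replicas=0 || return 1

    # Step 3: look up the time_slice and job_id of the scheduled job.
    row=$(oc -n "$NAMESPACE" exec "$pod" -- cqlsh --ssl -e \
        "select time_slice, job_id, job_name from hawkular_metrics.scheduled_jobs_idx" \
        | extract_job_row)

    if [ -z "$row" ]; then
        echo "no $JOB_NAME row found; nothing to delete" >&2
    else
        time_slice=${row%%|*}
        job_id=${row#*|}
        # Step 4: remove the job from scheduled_jobs_idx.
        oc -n "$NAMESPACE" exec "$pod" -- cqlsh --ssl -e \
            "$(build_delete_cql "$time_slice" "$job_id")"
    fi

    # Step 5: scale back up (replica count of 1 is an assumption).
    oc -n "$NAMESPACE" scale rc hawkular-metrics --replicas=1
}
```

After sourcing the file, it would be invoked as `remove_expiration_job <cassandra pod>`. Only the first matching row is handled, so the output of step 3 should be checked by hand if more than one DELETE_EXPIRED_METRICS row could exist.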