Bug 1425008 - [ded-stg2-aws]Error occurred getting metrics
Summary: [ded-stg2-aws]Error occurred getting metrics
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Devan Goodwin
QA Contact: Li Zhe
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-02-20 11:19 UTC by Li Zhe
Modified: 2018-07-26 19:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-05 20:52:04 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from failing cassandra pod (150.53 KB, text/plain)
2017-03-07 12:54 UTC, Devan Goodwin
private files from mounted failing cassandra volume (340 bytes, application/x-gzip)
2017-03-07 15:49 UTC, Devan Goodwin

Comment 2 Li Zhe 2017-02-22 10:55:52 UTC
Metrics can be seen now.

Comment 3 Li Zhe 2017-03-07 07:51:33 UTC
Reproduced on 2017-03-07 at 07:51 UTC.
Version:
OpenShift Master: v3.4.1.8
Kubernetes Master: v1.4.0+776c994

Comment 4 Devan Goodwin 2017-03-07 12:52:31 UTC
[root@ded-stg2-aws-master-6759f ~]# oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-adgyf   0/1       CrashLoopBackOff   78         6h
hawkular-cassandra-2-9bddo   1/1       Running            0          4d
hawkular-metrics-exabn       0/1       CrashLoopBackOff   109        6h
heapster-i975z               0/1       Running            43         6h


Attaching logs from the failing cassandra pod.

Comment 5 Devan Goodwin 2017-03-07 12:54:46 UTC
Created attachment 1260792 [details]
logs from failing cassandra pod

Comment 6 Devan Goodwin 2017-03-07 15:49:33 UTC
Created attachment 1260875 [details]
private files from mounted failing cassandra volume

Comment 7 Matt Wringe 2017-03-08 16:24:22 UTC
The issue is that an SSTable became corrupted:

Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /cassandra_data/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/mc-354-big-Data.db

Unfortunately, because the file is corrupted and replication is not used by default, the only way forward is to delete the corrupted file. The data contained in the file is lost.

Once you delete the corrupted file(s), Cassandra will be able to start up again. Once Cassandra starts, you will then need to run "nodetool repair hawkular_metrics".
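As a rough sketch of those steps, assuming a POSIX shell and that the Cassandra log is available as text (for example from `oc logs`, or from a saved log file like the attachment on this bug): the helper names below are hypothetical, and the pod name has to be whatever `oc get pods` reports at the time. How the corrupted file is reached and deleted depends on how the Cassandra volume is mounted, so that step is left out.

```shell
#!/bin/sh
# Print each distinct SSTable path that Cassandra reported as corrupted.
# Reads log text on stdin and matches lines of the form
#   ...CorruptSSTableException: Corrupted: <path>
# (hypothetical helper, not part of any OpenShift or Cassandra tooling)
list_corrupt_sstables() {
    sed -n 's|^.*CorruptSSTableException: Corrupted: \(/[^[:space:]]*\).*$|\1|p' | sort -u
}

# Run the repair mentioned above inside a running Cassandra pod.
# $1 is the pod name; it is an input here, not something the script discovers.
repair_hawkular_metrics() {
    oc exec "$1" -- nodetool repair hawkular_metrics
}
```

Possible usage: `oc logs <cassandra-pod> | list_corrupt_sstables` to see which files to delete, then, once the files are gone and the pod is Running again, `repair_hawkular_metrics <cassandra-pod>`.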

This type of error should not occur under normal circumstances. It can occur if the pod was forcefully killed (e.g. using 'docker kill', 'kill', etc.) rather than shut down gracefully with the oc command. It could also occur if the machine was abruptly terminated or if the pod itself failed (for example, with an OOME).

Is there anything you can think of that could have caused this problem?

Comment 8 Devan Goodwin 2017-03-08 16:30:18 UTC
Leaving needinfo to see if zhezli knows of any reason this would have happened (we were unable to find evidence of any of the known potential causes).

Metrics has now been reinstalled on the cluster.

Comment 9 Li Zhe 2017-03-09 02:07:30 UTC
The metrics on ded-stg2-aws are normal now. I will check them again after 24 hours and 48 hours.

