
Bug 1425008

Summary: [ded-stg2-aws]Error occurred getting metrics
Product: OpenShift Container Platform
Reporter: Li Zhe <zhezli>
Component: Hawkular
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED CURRENTRELEASE
QA Contact: Li Zhe <zhezli>
Severity: medium
Docs Contact:
Priority: high
Version: unspecified
CC: aos-bugs, dgoodwin, jgoulding, jokerman, mmccomas, zhezli
Target Milestone: ---
Keywords: OpsBlocker
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-05 20:52:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  logs from failing cassandra pod (flags: none)
  private files from mounted failing cassandra volume (flags: none)

Comment 2 Li Zhe 2017-02-22 10:55:52 UTC
Metrics can be seen now.

Comment 3 Li Zhe 2017-03-07 07:51:33 UTC
Reproduced on 2017-03-07 at 07:51 UTC.
Version:
OpenShift Master: v3.4.1.8
Kubernetes Master: v1.4.0+776c994

Comment 4 Devan Goodwin 2017-03-07 12:52:31 UTC
[root@ded-stg2-aws-master-6759f ~]# oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-adgyf   0/1       CrashLoopBackOff   78         6h
hawkular-cassandra-2-9bddo   1/1       Running            0          4d
hawkular-metrics-exabn       0/1       CrashLoopBackOff   109        6h
heapster-i975z               0/1       Running            43         6h


Attaching logs from the failing cassandra pod.

Comment 5 Devan Goodwin 2017-03-07 12:54:46 UTC
Created attachment 1260792
logs from failing cassandra pod

Comment 6 Devan Goodwin 2017-03-07 15:49:33 UTC
Created attachment 1260875
private files from mounted failing cassandra volume

Comment 7 Matt Wringe 2017-03-08 16:24:22 UTC
The issue is that an SSTable has become corrupted:

Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /cassandra_data/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/mc-354-big-Data.db

Unfortunately, because the file is corrupted and we do not use replication by default, the only way forward is to delete the corrupted file. The data contained in the file is lost.

Once you delete the corrupted file(s), Cassandra will be able to start up again. Once Cassandra starts, you will then need to run "nodetool repair hawkular_metrics".
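A minimal Python sketch of the file-removal step follows. It is not the procedure that was actually run on this cluster. It assumes the Cassandra data volume is mounted somewhere the script can reach, that Cassandra is not running against it, and that you prefer to move the files aside rather than delete them outright. The helper names (`corrupted_sstable_paths`, `sstable_prefix`, `quarantine_sstable`) and the quarantine directory are hypothetical. Moving every component file that shares the SSTable's `mc-354-big-` prefix, rather than only the `-Data.db` file named in the exception, is also an assumption.

```python
import re
import shutil
from pathlib import Path

# Matches the path reported in Cassandra's CorruptSSTableException message.
CORRUPT_RE = re.compile(r"CorruptSSTableException: Corrupted: (\S+)")


def corrupted_sstable_paths(log_text):
    """Return the unique file paths reported as corrupted, in order of appearance."""
    seen = []
    for match in CORRUPT_RE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def sstable_prefix(data_file):
    """Return the prefix shared by one SSTable's component files, e.g. 'mc-354-big-'."""
    name = Path(data_file).name
    # Component files are named <version>-<generation>-<format>-<Component>.
    prefix, sep, _component = name.rpartition("-")
    if not sep:
        raise ValueError(f"not an SSTable component file name: {name}")
    return prefix + "-"


def quarantine_sstable(data_file, quarantine_dir):
    """Move all components of the SSTable into quarantine_dir; return the moved file names.

    Assumption: moving the files aside (instead of deleting them) is acceptable,
    and Cassandra is stopped while this runs.
    """
    data_file = Path(data_file)
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    prefix = sstable_prefix(data_file)
    moved = []
    for component in sorted(data_file.parent.glob(prefix + "*")):
        shutil.move(str(component), str(quarantine_dir / component.name))
        moved.append(component.name)
    return moved
```

Feeding the attached pod log to `corrupted_sstable_paths` should yield the `mc-354-big-Data.db` path quoted above, which can then be passed to `quarantine_sstable`. The repair step described in this comment still has to be run once Cassandra is back up.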

This type of error should not occur under normal circumstances. It can occur if the pod was forcefully killed (e.g. using 'docker kill', 'kill', etc.) instead of being shut down gracefully with the oc command. It could also occur if the machine was abruptly terminated or if the pod itself failed (such as with an OOME).

Can you think of anything that could have caused this problem?

Comment 8 Devan Goodwin 2017-03-08 16:30:18 UTC
Leaving needinfo to see if zhezli knows of any reason this would have happened (we were unable to find evidence of any of the known potential causes).

Metrics has now been reinstalled on the cluster.

Comment 9 Li Zhe 2017-03-09 02:07:30 UTC
Metrics on ded-stg2-aws are normal now. I will check again after 24 hours and 48 hours.