Metrics can be seen now.
Reproduced on 2017-03-07 07:51 UTC.
Version: OpenShift Master: v3.4.1.8, Kubernetes Master: v1.4.0+776c994
[root@ded-stg2-aws-master-6759f ~]# oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-adgyf   0/1       CrashLoopBackOff   78         6h
hawkular-cassandra-2-9bddo   1/1       Running            0          4d
hawkular-metrics-exabn       0/1       CrashLoopBackOff   109        6h
heapster-i975z               0/1       Running            43         6h

Attaching logs from the failing cassandra pod.
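For reference, logs from a crash-looping container can be pulled with something like the commands below (the openshift-infra project is an assumption for this deployment; --previous returns the log of the last terminated container rather than the current attempt):

# log of the container that just crashed
oc logs hawkular-cassandra-1-adgyf -n openshift-infra --previous

# recent events for the pod (restart reasons, probe failures, OOM kills)
oc describe pod hawkular-cassandra-1-adgyf -n openshift-infra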
Created attachment 1260792 [details] logs from failing cassandra pod
Created attachment 1260875 [details] private files from mounted failing cassandra volume
The issue is that an SSTable has become corrupted:

Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /cassandra_data/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/mc-354-big-Data.db

Unfortunately, since the file is corrupted and we do not use replication by default, the only way forward is to delete the corrupted file; the data it contains is lost. Once you delete the corrupted file(s), Cassandra will be able to start up again. After Cassandra starts, you will then need to run "nodetool repair hawkular_metrics".

This type of error should not occur under normal circumstances. It can occur if the pod was forcefully killed (e.g. with 'docker kill', 'kill', etc.) instead of being shut down gracefully with the oc command. It could also occur if the machine was abruptly terminated or if the pod itself failed (such as with an OOME).

Is there any information you can think of that would explain what caused this?
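A rough sketch of the recovery steps described above, as shell commands (the openshift-infra namespace, the pod name, and the exact set of SSTable component files are assumptions and will differ per cluster; adjust before running):

# shell into the failing cassandra pod (or access its mounted volume from the node
# if the container exits too quickly to exec into)
oc rsh -n openshift-infra hawkular-cassandra-1-adgyf

# remove every component file of the corrupted SSTable generation (mc-354);
# the data held in that SSTable is lost
rm /cassandra_data/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/mc-354-big-*

# once cassandra is Running again, repair the metrics keyspace
oc exec -n openshift-infra hawkular-cassandra-1-adgyf -- nodetool repair hawkular_metrics

When the pod does need to be stopped, it should be stopped via oc (for example by scaling its replication controller down) rather than with docker kill, so Cassandra gets a chance to shut down cleanly.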
Leaving needinfo to see if zhezli knows of any reason this would have happened (we were unable to find evidence of any of the known potential causes). Metrics has now been reinstalled on the cluster.
Metrics on ded-stg2-aws are normal now. I will check again after 24 and 48 hours.
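A quick way to re-check at that point (the route hostname below is a placeholder; the status endpoint is part of the Hawkular Metrics REST API):

# all metrics pods should be 1/1 Running with a stable restart count
oc get pods -n openshift-infra

# the metrics service itself should report STARTED
curl -k https://hawkular-metrics.example.com/hawkular/metrics/status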