Description of problem:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. At that point the Metrics Pod is scheduled to a different node and Metrics no longer works. The logs say:

ERROR 07:11:56 Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 818339 in CommitLog-5-1470234746867.log

We already tried the tips at https://access.redhat.com/solutions/2475241, but they do not help. Every time we delete the commit log mentioned in the error, another commit log fails. Only deleting all the data and starting from scratch helps, but as you can guess, that is not an option.

Some background information about our Metrics deployment:
* 2 Cassandra Pods
* 1 Metrics Pod
* 1 Heapster Pod

The persistent data of Cassandra is stored on GlusterFS. How can we make sure that we have a stable Metrics deployment?

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2.0

How reproducible:
Always on the customer end

Steps to Reproduce:
1. Mentioned in the description

Actual results:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. The Metrics Pod is then scheduled to a different node and Metrics no longer works.

Expected results:
After the node is set unschedulable and the Pods are evacuated, the Metrics Pod is rescheduled to a different node and Metrics keeps working.

Additional info:
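For reference, the maintenance procedure on our side is roughly the following (the node name is a placeholder):

  oadm manage-node <node> --schedulable=false
  oadm manage-node <node> --evacuate
  # ... perform maintenance on the node ...
  oadm manage-node <node> --schedulable=true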
Can you check if any of the SSTables are corrupt as well? You can check this with <CASSANDRA_HOME>/bin/sstableverify hawkular_metrics data
Can we get the full Cassandra logs? They may help to determine what is going on here. The bin directory for Cassandra is already in the path, so they can just run 'sstableverify hawkular_metrics data' directly without knowing where it is installed.
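For example, a minimal check from inside one of the Cassandra pods could look like this (the pod name is just an example):

  $ oc rsh hawkular-cassandra-1-axq8k
  sh-4.2$ sstableverify hawkular_metrics data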
Created attachment 1196612 [details] hawkular-cassandra-1-axq8k
Created attachment 1196613 [details] hawkular-cassandra-2-hghdu
Logs, see attachments.

Current state of Pods:

% oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-axq8k   0/1       CrashLoopBackOff   6          10m
hawkular-cassandra-2-hghdu   0/1       CrashLoopBackOff   5          10m
hawkular-metrics-c1t81       0/1       Running            3          10m
heapster-d3jn2               0/1       Running            4          10m

Here is the output of the command:

% oc debug pod hawkular-cassandra-1-axq8k
Debugging with pod/hawkular-cassandra-1-axq8k-debug, original command: /opt/apache-cassandra/bin/cassandra-docker.sh --cluster_name=hawkular-metrics --data_volume=/cassandra_data --internode_encryption=all --require_node_auth=true --enable_client_encryption=true --require_client_auth=true --keystore_file=/secret/cassandra.keystore --keystore_password_file=/secret/cassandra.keystore.password --truststore_file=/secret/cassandra.truststore --truststore_password_file=/secret/cassandra.truststore.password --cassandra_pem_file=/secret/cassandra.pem
Waiting for pod to start ...
Hit enter for command prompt

sh-4.2$ sstableverify hawkular_metrics data
Unknown keyspace/table hawkular_metrics.data
sh-4.2$
We are already performing a 'nodetool drain' as part of the normal shutdown procedure, which means this should not occur in normal situations in the future (power failures and forceful kills could still cause this to pop up in some situations, though).

We are also asking QE to test the maintenance scenarios described in the issue to see if it can be reproduced in those cases.

@jsanda: if QE cannot reproduce, is there any other piece of information we could gather to debug the root cause of this? Or is our only option to close as non-reproducible?
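As an aside, if an extra flush is wanted before a node is taken down (e.g. before an evacuation), the same drain can be run by hand against a Cassandra pod. This is only an illustration and the pod name is an example:

  oc exec hawkular-cassandra-1-axq8k -- nodetool drain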
(In reply to Matt Wringe from comment #23)
> We are already performing a 'nodetool drain' as part of the normal shutdown
> procedure, which means this should not occur in normal situations in the
> future (power failures and forceful kills could still cause this to pop up
> in some situations, though).
>
> We are also asking QE to test the maintenance scenarios described in the
> issue to see if it can be reproduced in those cases.
>
> @jsanda: if QE cannot reproduce, is there any other piece of information we
> could gather to debug the root cause of this? Or is our only option to
> close as non-reproducible?

If we could enable debug logging in Cassandra for org.apache.cassandra.db.commitlog, that might provide some more insight.

I know that this ticket has been open for a while, but I am not entirely comfortable with closing it as non-reproducible, considering this problem has occurred multiple times. Maybe we close this with 'nodetool drain' as the solution, plus a docs update noting that if Cassandra does not shut down cleanly, the maintenance involved can lead to commit log corruption. Then let's create a separate ticket for engineering to investigate further. I need to better understand what is involved in pod evacuation.
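One way to turn on that debug logging at runtime, without rebuilding the image, is via nodetool. This is only a sketch; it assumes the shipped Cassandra version supports setlogginglevel, and the pod name is an example:

  oc exec hawkular-cassandra-1-axq8k -- nodetool setlogginglevel org.apache.cassandra.db.commitlog DEBUG

The equivalent persistent change would be a DEBUG-level logger entry for org.apache.cassandra.db.commitlog in logback.xml.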
Tested the scenario below, and there appears to be no such issue with Metrics 3.2.1; you can check the test log I attached.

1. Install OSE 3.2 with 2 nodes and configure multiple PVs.
2. Mark node#1 as unschedulable, so all the pods are deployed on node#2:
   oadm manage-node <node> --schedulable=false
3. Deploy Metrics 3.2 with PV, CASSANDRA_NODES=2.
4. Mark node#1 as schedulable and node#2 as unschedulable, then evacuate:
   oadm manage-node <node>
   oadm manage-node <node> --evacuate
5. Check status.

There are some minor differences:
1. I used NFS PVs.
2. There is no v3.2 tag on brew, so I used 3.2.1.
There are some comments in https://bugzilla.redhat.com/show_bug.cgi?id=1385427#c25 which may provide a workaround for now while we continue to figure out the root cause and try to reproduce the problem.

Additional information about why the Cassandra pod is being terminated, and the logs of the terminated pod, would be very useful.
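If the workaround amounts to keeping the commit log on a volume other than the Cassandra data PV (as in the verification test below), in cassandra.yaml that comes down to something like the following; the commit log path here is only illustrative:

  data_file_directories:
      - /cassandra_data/data
  commitlog_directory: /cassandra_commitlog    # a separate volume, not the GlusterFS-backed data PV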
The test 'move commit log to another volume (not the Cassandra PV)' has passed in OCP 3.4 with Metrics 3.4.0. Can I set the status to Verified now? Thanks.

[root@host-8-174-32 ~]# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066