Bug 1371937 - OpenShift Metrics fail with "Mutation checksum failure"
Summary: OpenShift Metrics fail with "Mutation checksum failure"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: John Sanda
QA Contact: Peng Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-31 13:16 UTC by Miheer Salunke
Modified: 2020-01-17 15:54 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:53:01 UTC
Target Upstream Version:
Embargoed:


Attachments
hawkular-cassandra-1-axq8k (36.69 KB, text/plain), attached 2016-09-01 07:54 UTC by Tobias Brunner
hawkular-cassandra-2-hghdu (1.30 KB, text/plain), attached 2016-09-01 07:54 UTC by Tobias Brunner


Links
Red Hat Product Errata RHBA-2017:0066 (SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.4 RPM Release Advisory, last updated 2017-01-18 17:23:26 UTC

Description Miheer Salunke 2016-08-31 13:16:49 UTC
Description of problem:


When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. At this time, the Metrics Pod is scheduled to a different node and Metrics won't work anymore. The logs say:

ERROR 07:11:56 Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 818339 in CommitLog-5-1470234746867.log

We already tried the tips at https://access.redhat.com/solutions/2475241, but they don't help. Every time we delete the CommitLog mentioned in the error, another CommitLog fails. Only deleting all of the data and starting from scratch helps, but as you can guess, that is not an option.
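For reference, the maintenance procedure that triggers this is the usual unschedule-and-evacuate sequence, roughly as follows (oadm syntax as used later in comment 25; the node name is a placeholder):

  oadm manage-node <node> --schedulable=false
  oadm manage-node <node> --evacuate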

Some background information about our Metrics deployment:
* 2 Cassandra Pods
* 1 Metrics Pod
* 1 Heapster Pod

The persistent data of Cassandra is saved on GlusterFS.

How can we make sure that we have a stable Metrics deployment?


Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2.0

How reproducible:
Always on the customer's end

Steps to Reproduce:
1. As mentioned in the description.

Actual results:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. At this time, the Metrics Pod is scheduled to a different node and Metrics won't work anymore.

Expected results:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. The Metrics Pod should then be scheduled to a different node and Metrics should keep working.

Additional info:

Comment 1 John Sanda 2016-08-31 20:40:08 UTC
Can you check if any of the SSTables are corrupt as well? You can check this with 

<CASSANDRA_HOME>/bin/sstableverify hawkular_metrics data

Comment 2 Matt Wringe 2016-08-31 20:51:22 UTC
Can we get the full Cassandra logs? It may help to determine what is going on here.

The bin directory for Cassandra is already in the path, they can just run 'sstableverify hawkular_metrics data' directly without knowing where it is installed.
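A minimal sketch of running the same check from outside the pod (the pod name is taken from comment 5 below; oc exec is one option, any way of getting a shell in the container works just as well):

  oc exec hawkular-cassandra-1-axq8k -- sstableverify hawkular_metrics data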

Comment 3 Tobias Brunner 2016-09-01 07:54:30 UTC
Created attachment 1196612 [details]
hawkular-cassandra-1-axq8k

Comment 4 Tobias Brunner 2016-09-01 07:54:44 UTC
Created attachment 1196613 [details]
hawkular-cassandra-2-hghdu

Comment 5 Tobias Brunner 2016-09-01 07:57:07 UTC
Logs, see attachments. Current state of Pods:

% oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-axq8k   0/1       CrashLoopBackOff   6          10m
hawkular-cassandra-2-hghdu   0/1       CrashLoopBackOff   5          10m
hawkular-metrics-c1t81       0/1       Running            3          10m
heapster-d3jn2               0/1       Running            4          10m

Here is the output of the command:

% oc debug pod hawkular-cassandra-1-axq8k
Debugging with pod/hawkular-cassandra-1-axq8k-debug, original command: /opt/apache-cassandra/bin/cassandra-docker.sh --cluster_name=hawkular-metrics --data_volume=/cassandra_data --internode_encryption=all --require_node_auth=true --enable_client_encryption=true --require_client_auth=true --keystore_file=/secret/cassandra.keystore --keystore_password_file=/secret/cassandra.keystore.password --truststore_file=/secret/cassandra.truststore --truststore_password_file=/secret/cassandra.truststore.password --cassandra_pem_file=/secret/cassandra.pem
Waiting for pod to start ...

Hit enter for command prompt

sh-4.2$ sstableverify hawkular_metrics data
Unknown keyspace/table hawkular_metrics.data
sh-4.2$
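One way to cross-check the "Unknown keyspace/table" result is to list what is actually present on the mounted data volume; a sketch, assuming the usual layout under the --data_volume=/cassandra_data path from the debug command above:

  ls /cassandra_data/data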

Comment 23 Matt Wringe 2016-10-31 20:35:32 UTC
We are already performing a 'nodetool drain' as part of the normal shutdown procedure, which means this should not occur in normal situations in the future (power failures and forceful kills could still cause this to pop up in some situations, though).
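For nodes being taken down by hand, or on images that predate the automatic drain, the same step can be run manually; a sketch, using the pod name from this report as an example:

  # flush memtables and stop accepting writes before the pod goes away
  oc exec hawkular-cassandra-1-axq8k -- nodetool drain
  oc delete pod hawkular-cassandra-1-axq8k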

We are also asking that QE test the maintenance scenarios described in the issue to see if it can be reproduced in those cases.

@jsanda: if QE cannot reproduce, is there any other piece of information that we could potentially gather to debug what the root cause of this is? Or is our only option to close as non-reproducible?

Comment 24 John Sanda 2016-11-01 13:29:31 UTC
(In reply to Matt Wringe from comment #23)
> We are already performing a 'nodetool drain' as part of the normal shutdown
> procedure, which means this should not occur in normal situations in the
> future (power failures and forceful kills could still cause this to pop up
> in some situations, though).
> 
> We are also asking that QE test the maintenance scenarios described in the
> issue to see if it can be reproduced in those cases.
> 
> @jsanda: if QE cannot reproduce, is there any other piece of information
> that we could potentially gather to debug what the root cause of this is?
> Or is our only option to close as non-reproducible?

If we could enable debug logging in Cassandra for org.apache.cassandra.db.commitlog that might provide some more insight.
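A sketch of turning that on against a running node (this assumes nodetool setlogginglevel is available in the shipped Cassandra version; the same logger can otherwise be raised to DEBUG in logback.xml):

  oc exec hawkular-cassandra-1-axq8k -- nodetool setlogginglevel org.apache.cassandra.db.commitlog DEBUG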

I know that this ticket has been open for a while, but I am not entirely comfortable with closing it as non-reproducible considering this problem has occurred multiple times. Maybe we close this with nodetool drain as the solution, plus a docs update noting that if Cassandra does not shut down cleanly, the maintenance involved can lead to commit log corruption. Then let's create a separate ticket for engineering to investigate further. I need to better understand what is involved with pod evacuation.

Comment 25 Peng Li 2016-11-02 09:43:00 UTC
I tested the scenario below, and it looks like there is no such issue with Metrics 3.2.1; you can check the test log I attached.


1. Install OSE 3.2 with 2 nodes and configure multiple PVs.

2. Mark node#1 as unschedulable, so all the pods should be deployed on node#2:

  oadm manage-node <node> --schedulable=false

3. Deploy metrics 3.2 with PV, CASSANDRA_NODES=2.

4. Mark node#1 as schedulable and node#2 as unschedulable, then evacuate:

  oadm manage-node <node> --schedulable=true
  oadm manage-node <node> --schedulable=false
  oadm manage-node <node> --evacuate

5. Check status (see the sketch at the end of this comment).

There are some minor differences:
1. I used NFS PVs.
2. There is no v3.2 tag on brew, so I used 3.2.1.
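A sketch of the status check in step 5 (the openshift-infra project name is an assumption about where metrics was deployed; the grep is just a quick way to spot commit log replay errors):

  oc get pods -n openshift-infra
  oc logs <cassandra-pod-name> | grep -i checksum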

Comment 28 Matt Wringe 2016-11-02 22:38:54 UTC
There are some comments in https://bugzilla.redhat.com/show_bug.cgi?id=1385427#c25 which may provide a workaround for now while we continue to figure out the root cause and try to reproduce the problem.

Additional information about why the Cassandra pod is being terminated and the logs of the terminated pod would be very useful.

Comment 31 Peng Li 2016-11-08 08:36:11 UTC
The test 'move commit log to other volume (not the Cassandra PV)' has passed in OCP 3.4 with Metrics 3.4.0. Can I set the status to Verified now? Thanks.

[root@host-8-174-32 ~]# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
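For context, "move commit log to other volume" maps to Cassandra's commitlog_directory setting; a minimal sketch of verifying it inside a Cassandra pod (the conf path is an assumption based on the /opt/apache-cassandra install location shown in comment 5):

  oc exec <cassandra-pod-name> -- grep commitlog_directory /opt/apache-cassandra/conf/cassandra.yaml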

Comment 34 errata-xmlrpc 2017-01-18 12:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

