Bug 1371937 - OpenShift Metrics fail with "Mutation checksum failure"
Summary: OpenShift Metrics fail with "Mutation checksum failure"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: John Sanda
QA Contact: Peng Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-31 13:16 UTC by Miheer Salunke
Modified: 2020-01-17 15:54 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:53:01 UTC
Target Upstream Version:
Embargoed:


Attachments
hawkular-cassandra-1-axq8k (36.69 KB, text/plain), attached 2016-09-01 07:54 UTC by Tobias Brunner
hawkular-cassandra-2-hghdu (1.30 KB, text/plain), attached 2016-09-01 07:54 UTC by Tobias Brunner


Links
Red Hat Product Errata RHBA-2017:0066 (SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.4 RPM Release Advisory, last updated 2017-01-18 17:23:26 UTC

Description Miheer Salunke 2016-08-31 13:16:49 UTC
Description of problem:


When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. At this time, the Metrics Pod is scheduled to a different node and Metrics won't work anymore. The logs say:

ERROR 07:11:56 Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 818339 in CommitLog-5-1470234746867.log

We already tried the tips at https://access.redhat.com/solutions/2475241, but they don't help. Every time we delete the CommitLog mentioned in the error, another CommitLog fails. Only deleting all of the data and starting from scratch helps, but as you can guess, that is not an option.
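For reference, the maintenance procedure that triggers this is the usual unschedule-and-evacuate sequence, roughly as follows (oadm syntax as used later in comment 25; the node name is a placeholder):

  oadm manage-node <node> --schedulable=false
  oadm manage-node <node> --evacuate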

Some background information about our Metrics deployment:
* 2 Cassandra Pods
* 1 Metrics Pod
* 1 Heapster Pod

The persistent data of Cassandra is saved on GlusterFS.

How can we make sure that we have a stable Metrics deployment?


Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2.0

How reproducible:
Always on the customer's end

Steps to Reproduce:
1. As mentioned in the description.

Actual results:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. At this time, the Metrics Pod is scheduled to a different node and Metrics won't work anymore.

Expected results:
When we do maintenance on the OpenShift cluster, we set a node unschedulable and evacuate all Pods. The Metrics Pod should then be scheduled to a different node and Metrics should keep working.

Additional info:

Comment 1 John Sanda 2016-08-31 20:40:08 UTC
Can you check if any of the SSTables are corrupt as well? You can check this with 

<CASSANDRA_HOME>/bin/sstableverify hawkular_metrics data

Comment 2 Matt Wringe 2016-08-31 20:51:22 UTC
Can we get the full Cassandra logs? It may help to determine what is going on here.

The bin directory for Cassandra is already in the path, they can just run 'sstableverify hawkular_metrics data' directly without knowing where it is installed.
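A minimal sketch of running the same check from outside the pod (the pod name is taken from comment 5 below; oc exec is one option, any way of getting a shell in the container works just as well):

  oc exec hawkular-cassandra-1-axq8k -- sstableverify hawkular_metrics data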

Comment 3 Tobias Brunner 2016-09-01 07:54:30 UTC
Created attachment 1196612 [details]
hawkular-cassandra-1-axq8k

Comment 4 Tobias Brunner 2016-09-01 07:54:44 UTC
Created attachment 1196613 [details]
hawkular-cassandra-2-hghdu

Comment 5 Tobias Brunner 2016-09-01 07:57:07 UTC
Logs, see attachments. Current state of Pods:

% oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-axq8k   0/1       CrashLoopBackOff   6          10m
hawkular-cassandra-2-hghdu   0/1       CrashLoopBackOff   5          10m
hawkular-metrics-c1t81       0/1       Running            3          10m
heapster-d3jn2               0/1       Running            4          10m

Here is the output of the command:

% oc debug pod hawkular-cassandra-1-axq8k
Debugging with pod/hawkular-cassandra-1-axq8k-debug, original command: /opt/apache-cassandra/bin/cassandra-docker.sh --cluster_name=hawkular-metrics --data_volume=/cassandra_data --internode_encryption=all --require_node_auth=true --enable_client_encryption=true --require_client_auth=true --keystore_file=/secret/cassandra.keystore --keystore_password_file=/secret/cassandra.keystore.password --truststore_file=/secret/cassandra.truststore --truststore_password_file=/secret/cassandra.truststore.password --cassandra_pem_file=/secret/cassandra.pem
Waiting for pod to start ...

Hit enter for command prompt

sh-4.2$ sstableverify hawkular_metrics data
Unknown keyspace/table hawkular_metrics.data
sh-4.2$
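One way to cross-check the "Unknown keyspace/table" result is to list what is actually present on the mounted data volume; a sketch, assuming the usual layout under the --data_volume=/cassandra_data path from the debug command above:

  ls /cassandra_data/data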

Comment 23 Matt Wringe 2016-10-31 20:35:32 UTC
We are already performing a 'nodetool drain' as part of the normal shutdown procedure, which means this should not occur in normal situations in the future (power failures and forceful kills could still cause this to pop up in some situations, though).
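For nodes being taken down by hand, or on images that predate the automatic drain, the same step can be run manually; a sketch, using the pod name from this report as an example:

  # flush memtables and stop accepting writes before the pod goes away
  oc exec hawkular-cassandra-1-axq8k -- nodetool drain
  oc delete pod hawkular-cassandra-1-axq8k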

We are also asking that QE test the maintenance scenarios described in the issue to see if it can be reproduced in those cases.

@jsanda: if QE cannot reproduce, is there any other piece of information that we could potentially gather to debug what the root cause of this is? Or is our only option to close as non-reproducible?

Comment 24 John Sanda 2016-11-01 13:29:31 UTC
(In reply to Matt Wringe from comment #23)
> We are already performing a 'nodetool drain' as part of the normal shutdown
> procedure, which means this should not occur in normal situations in the
> future (power failures and forceful kills could still cause this to pop up
> in some situations, though).
> 
> We are also asking that QE test the maintenance scenarios described in the
> issue to see if it can be reproduced in those cases.
> 
> @jsanda: if QE cannot reproduce, is there any other piece of information
> that we could potentially gather to debug what the root cause of this is?
> Or is our only option to close as non-reproducible?

If we could enable debug logging in Cassandra for org.apache.cassandra.db.commitlog that might provide some more insight.
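A sketch of turning that on against a running node (this assumes nodetool setlogginglevel is available in the shipped Cassandra version; the same logger can otherwise be raised to DEBUG in logback.xml):

  oc exec hawkular-cassandra-1-axq8k -- nodetool setlogginglevel org.apache.cassandra.db.commitlog DEBUG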

I know that this ticket has been open for a while, but I am not entirely comfortable with closing it as non-reproducible considering this problem has occurred multiple times. Maybe we close this with nodetool drain as the solution, plus a docs update noting that if Cassandra does not shut down cleanly, the maintenance involved can lead to commit log corruption. Then let's create a separate ticket for engineering to investigate further. I need to better understand what is involved with pod evacuation.

Comment 25 Peng Li 2016-11-02 09:43:00 UTC
I tested the scenario below, and it looks like there is no such issue with Metrics 3.2.1; you can check the test log I attached.


1. Install OSE 3.2 with 2 nodes and configure multiple PVs.

2. Mark node#1 as unschedulable, so all the pods should be deployed on node#2:

  oadm manage-node <node> --schedulable=false

3. Deploy metrics 3.2 with PV, CASSANDRA_NODES=2.

4. Mark node#1 as schedulable and node#2 as unschedulable, then evacuate:

  oadm manage-node <node> --schedulable=true
  oadm manage-node <node> --schedulable=false
  oadm manage-node <node> --evacuate

5. Check status (see the sketch at the end of this comment).

There are some minor differences:
1. I used NFS PVs.
2. There is no v3.2 tag on brew, so I used 3.2.1.
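A sketch of the status check in step 5 (the openshift-infra project name is an assumption about where metrics was deployed; the grep is just a quick way to spot commit log replay errors):

  oc get pods -n openshift-infra
  oc logs <cassandra-pod-name> | grep -i checksum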

Comment 28 Matt Wringe 2016-11-02 22:38:54 UTC
There are some comments in https://bugzilla.redhat.com/show_bug.cgi?id=1385427#c25 which may provide a workaround for now while we continue to figure out the root cause and try to reproduce the problem.

Additional information about why the Cassandra pod is being terminated and the logs of the terminated pod would be very useful.

Comment 31 Peng Li 2016-11-08 08:36:11 UTC
The test 'move commit log to other volume (not the Cassandra PV)' has passed in OCP 3.4 with Metrics 3.4.0. Can I set the status to Verified now? Thanks.

[root@host-8-174-32 ~]# openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
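For context, "move commit log to other volume" maps to Cassandra's commitlog_directory setting; a minimal sketch of verifying it inside a Cassandra pod (the conf path is an assumption based on the /opt/apache-cassandra install location shown in comment 5):

  oc exec <cassandra-pod-name> -- grep commitlog_directory /opt/apache-cassandra/conf/cassandra.yaml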

Comment 34 errata-xmlrpc 2017-01-18 12:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

