Description of problem: === - After running cassandra several days, cassandra CrashLooping with following errors: INFO 19:33:25 Initializing system.sstable_activity INFO 19:33:38 Opening /cassandra_data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/la-340-big (35865 bytes) INFO 19:33:38 Opening /cassandra_data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/la-311-big (34049 bytes) ... INFO 19:38:41 Replaying /cassandra_data/commitlog/CommitLog-5-1474325237113.log (CL version 5, messaging version 9, compression null) INFO 19:38:42 Finished reading /cassandra_data/commitlog/CommitLog-5-1474325237113.log INFO 19:38:42 Replaying /cassandra_data/commitlog/CommitLog-5-1474325237114.log (CL version 5, messaging version 9, compression null) ERROR 19:38:42 Exiting due to error while processing commit log during initialization. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 33296351 in CommitLog-5-1474325237114.log at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:622) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:492) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:388) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:266) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:488) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:595) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] Version-Release number of selected component (if applicable): === - registry.access.redhat.com/openshift3/metrics-cassandra:3.2.1 How reproducible: Steps to Reproduce: - 1. Running cassandra several days. Additional info: - We would like to know how to resume this status.
@jsanda is this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1377239 ? or would this be something new?
(In reply to Matt Wringe from comment #2) > @jsanda is this a duplicate of > https://bugzilla.redhat.com/show_bug.cgi?id=1377239 ? or would this be > something new? I do not think it is a duplicate of bug 1377239, nor do I think it is a new issue. The problem here is corrupt commit log segment files. We have seen this in one or two other tickets. I know that changes have been made to perform a drain on shutdown to help prevent these kinds of problems. I need to research this a bit more though to see if we can understand how we can into these situations.
(In reply to Kenjiro Nakayama from comment #0) > Description of problem: > === > - After running cassandra several days, cassandra CrashLooping with > following errors: > > INFO 19:33:25 Initializing system.sstable_activity > INFO 19:33:38 Opening > /cassandra_data/data/system/sstable_activity- > 5a1ff267ace03f128563cfae6103c65e/la-340-big (35865 bytes) > INFO 19:33:38 Opening > /cassandra_data/data/system/sstable_activity- > 5a1ff267ace03f128563cfae6103c65e/la-311-big (34049 bytes) > ... > INFO 19:38:41 Replaying > /cassandra_data/commitlog/CommitLog-5-1474325237113.log (CL version 5, > messaging version 9, compression null) > INFO 19:38:42 Finished reading > /cassandra_data/commitlog/CommitLog-5-1474325237113.log > INFO 19:38:42 Replaying > /cassandra_data/commitlog/CommitLog-5-1474325237114.log (CL version 5, > messaging version 9, compression null) > ERROR 19:38:42 Exiting due to error while processing commit log during > initialization. > > org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: > Mutation checksum failure at 33296351 in CommitLog-5-1474325237114.log > at > org.apache.cassandra.db.commitlog.CommitLogReplayer. > handleReplayError(CommitLogReplayer.java:622) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.db.commitlog.CommitLogReplayer. > replaySyncSection(CommitLogReplayer.java:492) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.db.commitlog.CommitLogReplayer. > recover(CommitLogReplayer.java:388) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.db.commitlog.CommitLogReplayer. > recover(CommitLogReplayer.java:147) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:266) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java: > 488) [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:595) > [apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2] > > Version-Release number of selected component (if applicable): > === > - registry.access.redhat.com/openshift3/metrics-cassandra:3.2.1 > > How reproducible: > Steps to Reproduce: > - 1. Running cassandra several days. > > Additional info: > - We would like to know how to resume this status. You will have to delete the files referenced in the stack trace.
I am lowering the priority on this bugzilla as we cannot reproduce and it is not something we are seeing others encountering. Note we will continue to look into this issue and take it seriously, but there is not much more we can do unless we get more debugging information such as the logs of the Cassandra pod which has failed, and the events at the time of failure. Having the logs after the commit logs have been corrupted is not of much use, and in this case I don't believe there is much which can be done after this corruption has occurred. @jsanda: is there any other information or comments you can make about this issue? Or any other information which would be useful to help determine the cause? Could you also please write a comment on the proper way to handle when the commitlogs are corrupted these situations? We should probably have that documented somewhere. The ability of commit logs being able to be corrupted is of great concern, but its not something we can reproduce and we don't know what is the root of this cause. If its something like the pod is being forcefully killed like in a power outage situation, or something like the disk space getting full then this might be a consequence of these error conditions. It its happening under more normal situations, then we definitely need to be able to bump up the priority on this issue.
Kenjiro can you provide us details about the customer environment so that we can try to set up a similar environment to test and hopefully reproduce? Are NFS mounts being used? I assume yes but want to confirm. Please provide details about the host file system as well. Is Docker running in a virtual machine? If so what virtualization software is being used? Please provide any other relevant info about the environment that might help with reproducing.
It appears that commit logs can get corrupted if Cassandra is killed when its writing the files, which shouldn't be occurring in most situations. So if we could get more information over what is killing the pod, and how its getting killed that would be useful. If it is being killed due to a timeout after the pod receives the termination signal, we can increase the timeout value by running: oc patch rc hawkular-cassandra-1 -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":"1800"}}}}' [note: we already set this in our 3.3+ images where we are also performing a nodetool drain, which further reduces the chances of commitlog corruption] This still wont fix the issue if someone is forcefully killing the pod, or in the case of something like a power outage. But these should be more rare conditions in which we cannot really guard against. What we can do (and will be doing for our 3.4 images) is to ignore the commit log failures if we encounter them. There doesn't appear to be anything we can do to handle these files once they are corrupted, and I believe we only lose 10 seconds of data in these cases anyways (@jsanda: please correct if I am wrong). To skip the commit log failures, you can run: oc patch rc hawkular-cassandra-1 -p '{"spec":{"template":{"spec":{"containers":[{"name":"hawkular-cassandra-1", "env": [{"name": "JVM_OPTS", "value":"-Dcassandra.commitlog.ignorereplayerrors=true"}]}]}}}}' This will allow Cassandra to start up if these errors are encountered.
We should probably push out a fix for 3.2 with these changes, but without knowing the root cause of why the Cassandra pods are failing in the first place, there is no good way to be able to reproduce and make sure this will actually fix the problem.
(In reply to Matt Wringe from comment #25) > > What we can do (and will be doing for our 3.4 images) is to ignore the > commit log failures if we encounter them. There doesn't appear to be > anything we can do to handle these files once they are corrupted, and I > believe we only lose 10 seconds of data in these cases anyways (@jsanda: > please correct if I am wrong). > Correct on both. If a commit log file is corrupted the only recourse is to delete the file. The 10 seconds refers to the interval at which the commit log gets fsync'd. It is configurable; however, lowering it can adversely effect performance. The other safeguard for this situation is replication, but that is outside the scope of this ticket.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2924
*** Bug 1417729 has been marked as a duplicate of this bug. ***