Description of problem:
We have a cluster where we re-scaled metrics, and now metrics won't start up again because the commit log can't be parsed within the 2GB memory limit applied to Cassandra. Cassandra gets OOM-killed and starts the whole thing over, forever.

Version-Release number of selected component (if applicable):
atomic-openshift-3.4.1.18-1.git.0.0f9d380.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Build up large commit logs
2. Scale down metrics
3. Scale up metrics

Actual results:
Metrics never starts because Cassandra gets OOM-killed while parsing the commit log.

Expected results:
Cassandra should eventually be able to catch up and reach a state where the rest of metrics can start.

Additional info:
I think the only way around this will be to increase the memory limit for the Cassandra pod. I don't believe there is any way to stop Cassandra from loading all of these into memory when processing them at startup. The commit logs should have been processed when the server was shut down, though; there shouldn't have been any commit logs to process at startup. Do you know how the pod was originally shut down? @jsanda: any comments about this?
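For reference, a hedged sketch of raising the pod's memory limit. The rc name and the 4Gi value are illustrative, not taken from this bug, and `oc set resources` requires a reasonably recent client:

```shell
# Illustrative only: bump the Cassandra memory limit on the replication
# controller. The rc name and the 4Gi value are assumptions.
NS=openshift-infra
RC=hawkular-cassandra-1
if command -v oc >/dev/null 2>&1; then
    oc set resources rc/"$RC" -n "$NS" --limits=memory=4Gi
else
    # Print the plan when no oc client is available.
    echo "would run: oc set resources rc/$RC -n $NS --limits=memory=4Gi"
fi
```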
There are 1,738 commit log files by my count. Can we see how much space the /cassandra_data/commitlog directory is using? That seems like a really large number of commit log files. And as Matt pointed out, the commit log should get drained on shutdown so that there are no commit log files to replay at startup.

Has Cassandra been getting OOM-killed a lot? I am wondering whether the drain operation at shutdown is happening. If it is happening, then we should see a statement like this in the Cassandra log:

INFO [RMI TCP Connection(234)-127.0.0.1] 2017-07-19 17:54:08,311 StorageService.java:1160 - DRAINED

I am not sure why the commit log replay would cause Cassandra to get OOM-killed. I would like to try the following. Start Cassandra with the cassandra.replaylist system property. It takes a comma-separated list of commit log files to replay. Just specify a handful (note there must be no spaces between entries), e.g.,

-Dcassandra.replaylist=/cassandra_data/commitlog/CommitLog-6-1497956111366.log,/cassandra_data/commitlog/CommitLog-6-1497956111367.log,/cassandra_data/commitlog/CommitLog-6-1497956111368.log,/cassandra_data/commitlog/CommitLog-6-1497956111369.log,/cassandra_data/commitlog/CommitLog-6-1497956111370.log

If Cassandra starts up and gets into a normal state, could you please stop the pod? I would like to look at the Cassandra logs afterwards to see whether the drain operation actually happens. Lastly, I would like to know the contents of /cassandra_data/commitlog after the shutdown.
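To avoid hand-typing that property value, a small sketch that assembles the comma-separated list from the first five segments and reports the directory size. The path and file pattern are assumed from the log output in this report:

```shell
# Sketch: build the value for -Dcassandra.replaylist from the first five
# commit log segments. COMMITLOG_DIR defaults to the path from this report.
COMMITLOG_DIR="${COMMITLOG_DIR:-/cassandra_data/commitlog}"
if [ -d "$COMMITLOG_DIR" ]; then
    # paste -sd, joins the filenames with commas and no spaces.
    REPLAY_LIST=$(ls "$COMMITLOG_DIR"/CommitLog-*.log 2>/dev/null | head -5 | paste -sd, -)
    echo "-Dcassandra.replaylist=$REPLAY_LIST"
    du -sh "$COMMITLOG_DIR"   # how much space the directory is using
fi
```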
@John Sanda We restarted the gear with a higher memory limit, so the commit logs have already been processed. So I do not think we have any of that information lying around anymore.
@jsanda: is there something we can be doing here? Such as limiting the size of the commit logs that get generated?
(In reply to Matt Wringe from comment #7)
> @jsanda: is there something we can be doing here? Such as limiting the size
> of the commit logs that get generated?

Write performance in Cassandra pretty much comes down to the commit log, so there will likely be a trade-off. Reducing the size might eliminate some of the startup problems at the cost of worse write performance. I think we need to do some testing and take some measurements to better understand those trade-offs.

Here is something else to consider. We use commit log compression, which is disabled by default. The commitlog_segment_size_in_mb property in cassandra.yaml specifies the size of commit log segment files; it is basically an upper bound. When a compressed commit log segment is replayed, it first has to be decompressed, and it will probably be larger than 32 MB. I am pretty sure that the default commit log settings in cassandra.yaml assume a commit log that is *not* compressed. This article https://goo.gl/s3J4iF suggests a couple of configuration changes for write-heavy workloads when using commit log compression.

I am thinking that we might want to do a couple of things for now:

1) Turn off commit log compression until we understand what settings we should be using with it.

2) Configure the commit log size based on the total available memory.

Before we do this, we need to do a lot of testing. We need to make sure we understand what can happen when the overall size of the commit log on disk is larger than the total system memory, and we need to understand the ramifications of using a smaller commit log. Once we understand these things, we might want to revisit commit log compression. For now, I think the safe thing is to focus on 1).
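In cassandra.yaml terms, a sketch of the two changes above (values are illustrative placeholders, not recommendations):

```yaml
# 1) Commit log compression is enabled by a block like this; removing or
#    commenting it out restores Cassandra's uncompressed default:
# commitlog_compression:
#   - class_name: LZ4Compressor

# 2) Bound the commit log by the available memory/disk budget
#    (the values below are placeholders for testing, not recommendations):
commitlog_segment_size_in_mb: 32      # per-segment upper bound (stock default)
commitlog_total_space_in_mb: 1024     # total on-disk commit log space
```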
In 3.7 we turned off commit log compression, which should prevent this problem from occurring.
(In reply to Matt Wringe from comment #9)
> In 3.7 we turned off commit log compression, which should prevent this
> problem from occurring.

It seems we must attach a PV, so that the commit logs are not lost after scaling down and back up.
@Matt,
Is there a simple way to build up large commit logs?

I ran oc rsh on ${hawkular-cassandra-pods} and copied files under /cassandra_data/commitlog/ to build up more than 2G of commit logs, but after scaling up there are errors in the hawkular-cassandra pods, such as:

ERROR [main] 2017-10-16 12:08:53,137 CommitLogReplayer.java:688 - Ignoring commit log replay error
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Segment id mismatch (filename 1508147198335, descriptor 1508147198267) in file /cassandra_data/commitlog/CommitLog-6-1508147198335.log
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:684) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:363) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:145) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:181) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:161) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:295) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
(In reply to Junqi Zhao from comment #11)
> @Matt,
> Is there a simple way to build up large commit logs?
> [stack trace snipped]

The simplest way at the moment would be to just let Cassandra run and monitor the size of the commit log directory.
I can look into writing some automated tests/scripts to generate commit logs, but that will take a little more time.
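A minimal sketch of that monitoring, run inside the Cassandra pod. The path is from this report, and the 2 GiB threshold mirrors the memory limit described above:

```shell
# Sketch: report the commit log directory size and flag when it crosses
# 2 GiB. Intended to be run periodically (e.g. under watch or cron)
# inside the Cassandra pod.
COMMITLOG_DIR="${COMMITLOG_DIR:-/cassandra_data/commitlog}"
LIMIT_KB=$((2 * 1024 * 1024))   # 2 GiB expressed in KiB
if [ -d "$COMMITLOG_DIR" ]; then
    SIZE_KB=$(du -sk "$COMMITLOG_DIR" | cut -f1)
    echo "commitlog size: ${SIZE_KB} KiB (threshold: ${LIMIT_KB} KiB)"
    if [ "$SIZE_KB" -ge "$LIMIT_KB" ]; then
        echo "threshold reached"
    fi
fi
```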
(In reply to John Sanda from comment #12)
> The simplest way at the moment would be to just let Cassandra run and
> monitor the size of the commit log directory. I can look into writing some
> automated tests/scripts to generate commit logs, but that will take a little
> more time.

I attached a PV and created 60 projects, each with one pod, and let these pods run for 16 hours. The max size I see for /cassandra_data/commitlog/ is only about 135M, which is far from 2G.
@jsanda: would getting information from the scalability test environment help with respect to the commit log size?
(In reply to Matt Wringe from comment #14)
> @jsanda: would getting information from the scalability test environment
> help with respect to the commit log size?

What kind of info?

I suspect that the reason there was no success in comment 13 is that most if not all writes went to the data table. The memtable regularly gets flushed, and the commit log segment can then be deleted because all of its mutations have been written out to SSTables. We may have better luck with writes spread across more, smaller tables (in addition to the data table). The data memtables will still get flushed, but the other, smaller tables are small enough to remain on heap, and thus the commit log segments cannot be deleted as quickly. That's the theory, anyway. I can look into writing a test program like this.

The next best option would be to generate the commit log manually using Cassandra's internal APIs.
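A sketch of that test-program idea. The keyspace and table names are made up, and cqlsh availability is an assumption, so the snippet only prints its plan when cqlsh is absent:

```shell
# Sketch: spread writes across many small tables so their memtables stay on
# heap and commit log segments cannot be recycled quickly. Names are invented.
run_cql() {
    if command -v cqlsh >/dev/null 2>&1; then
        cqlsh -e "$1"
    else
        echo "would run: $1"
    fi
}
run_cql "CREATE KEYSPACE IF NOT EXISTS commitlog_test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
for i in $(seq 1 50); do
    run_cql "CREATE TABLE IF NOT EXISTS commitlog_test.small_$i (id int PRIMARY KEY, v text);"
    run_cql "INSERT INTO commitlog_test.small_$i (id, v) VALUES (1, 'payload');"
done
```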
(In reply to John Sanda from comment #15) > (In reply to Matt Wringe from comment #14) > > @jsanda: would getting information from the scalability test environment > > help with respect to the commit log size? > > What kind of info? The size of things like commit logs on a system which is running with a lot more than just 60 pods. > > I suspect that the reason there was no success in comment 13 is that most if > not all writes were to the data table. The memtable is regularly getting > flushed and the commit log segment can be deleted because all mutations have > been written out to SSTables. Comment 13 is only with 60 pods running for 16 hours, I don't think this enough pods or enough time to see if there is a problem here or not. We have seen this memory problem a few times already when restarting the pod and I would like to figure out why its happening exactly and to see if we do something to prevent it from happening.
(In reply to Matt Wringe from comment #16) > (In reply to John Sanda from comment #15) > > (In reply to Matt Wringe from comment #14) > > > @jsanda: would getting information from the scalability test environment > > > help with respect to the commit log size? > > > > What kind of info? > > The size of things like commit logs on a system which is running with a lot > more than just 60 pods. > > > > > I suspect that the reason there was no success in comment 13 is that most if > > not all writes were to the data table. The memtable is regularly getting > > flushed and the commit log segment can be deleted because all mutations have > > been written out to SSTables. > > Comment 13 is only with 60 pods running for 16 hours, I don't think this > enough pods or enough time to see if there is a problem here or not. > > We have seen this memory problem a few times already when restarting the pod > and I would like to figure out why its happening exactly and to see if we do > something to prevent it from happening. Sure, let's see what the commit logs look like in the scalability tests.
(In reply to John Sanda from comment #17) > Sure, let's see what the commit logs look like in the scalability tests. Is someone doing scalability testing?
Mike, are you still doing any scalability testing for 3.7? The volume of data that your testing generates might generate commit logs big enough that we could borrow to verify the fix for this issue.
I was able to generate a 2G commit log today, but here is what happened. I killed the pod using:

oc delete pod -n openshift-infra hawkular-cassandra-1-sn5wx --grace-period=0 --force=true

After the pod was killed it was recreated immediately, but the commit log directory was empty, and nodetool status also did not show any load. I am going to try again, killing the pod in a different way.
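For a different way of stopping the pod, one hedged option is to drain Cassandra first so the flush happens (and the DRAINED message appears in the log) before deletion. The pod name is taken from the command above; nodetool being on the path inside the pod is an assumption:

```shell
# Sketch: drain Cassandra, then delete the pod with the normal grace period
# instead of --grace-period=0 --force=true, so shutdown hooks can run.
NS=openshift-infra
POD=hawkular-cassandra-1-sn5wx
if command -v oc >/dev/null 2>&1; then
    oc rsh -n "$NS" "$POD" nodetool drain   # flush memtables; expect DRAINED in the log
    oc delete pod -n "$NS" "$POD"
else
    # Print the plan when no oc client is available.
    echo "would run: oc rsh -n $NS $POD nodetool drain && oc delete pod -n $NS $POD"
fi
```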
I was able to verify this bug using the following versions: the metrics images were all version v3.7.0-0.190.0. After building up > 2G of commit logs, I scaled the metrics replicas to 0 and back to 1. No OOM kills were encountered in the Cassandra pod. I did this twice to be sure, and everything was good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188