Bug 1473013 - Metrics can't recover when the commit log is too big
Summary: Metrics can't recover when the commit log is too big
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assignee: John Sanda
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-19 20:38 UTC by Sten Turpin
Modified: 2017-11-28 22:04 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:04:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Product Errata RHSA-2017:3188
Private: 0  Priority: normal  Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description Sten Turpin 2017-07-19 20:38:41 UTC
Description of problem: We have a cluster where we re-scaled metrics, and now metrics won't start up again because the commit log can't be replayed within the 2 GB memory limit applied to Cassandra. Cassandra gets oomkilled, and the whole replay starts over, forever.


Version-Release number of selected component (if applicable): atomic-openshift-3.4.1.18-1.git.0.0f9d380.el7.x86_64


How reproducible: always


Steps to Reproduce:
1. build up large commit logs
2. scale down metrics
3. scale up metrics
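
A sketch of steps 2 and 3 using oc, assuming the default replication controller names from the metrics deployer (hypothetical; adjust for your cluster):

# Scale metrics down; Cassandra keeps its commit log on the PV.
oc scale rc hawkular-metrics --replicas=0 -n openshift-infra
oc scale rc hawkular-cassandra-1 --replicas=0 -n openshift-infra

# Scale back up; Cassandra must now replay the accumulated commit log.
oc scale rc hawkular-cassandra-1 --replicas=1 -n openshift-infra
oc scale rc hawkular-metrics --replicas=1 -n openshift-infra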

Actual results: metrics never starts because Cassandra gets oomkilled replaying the commit log


Expected results: Cassandra should eventually be able to catch up and reach a state where the rest of metrics can start.


Additional info:

Comment 3 Matt Wringe 2017-07-19 21:42:46 UTC
I think the only way to get around this will be to increase the memory limit for the Cassandra pod. I don't believe there is any way to keep Cassandra from loading all of these into memory when processing them at startup.
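
For example, something like this could raise the limit (the rc name and value here are assumptions; adjust for the deployment):

# Hypothetical rc name and limit; the pod must be recreated to pick this up.
oc set resources rc hawkular-cassandra-1 -n openshift-infra --limits=memory=4Gi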

The commit logs should have been processed when the server was shut down, though; there shouldn't have been any commit logs to process when it starts up. Do you know how the pod was originally shut down?

@jsanda: any comments about this?

Comment 4 John Sanda 2017-07-19 22:00:30 UTC
There are 1,738 commit log files by my count, which seems like a really large number. Can we see how much space the /cassandra_data/commitlog directory is using? As Matt pointed out, the commit log files should get replayed on shutdown so that there are none left to replay at startup. Has Cassandra been getting oomkilled a lot? I am wondering if the drain operation at shutdown is happening. If it is, then we should be seeing a statement like this in the Cassandra log:

INFO  [RMI TCP Connection(234)-127.0.0.1] 2017-07-19 17:54:08,311 StorageService.java:1160 - DRAINED
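
Something like the following should answer both questions (the pod name is a placeholder):

POD=hawkular-cassandra-1-xxxxx
# Size and file count of the commit log directory
oc exec -n openshift-infra $POD -- du -sh /cassandra_data/commitlog
oc exec -n openshift-infra $POD -- sh -c 'ls /cassandra_data/commitlog | wc -l'
# Look for the DRAINED message in the previous container's log
oc logs -p -n openshift-infra $POD | grep DRAINED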


I am not sure why the commit log replay would cause Cassandra to get oomkilled. I would like to try the following. Start Cassandra with the cassandra.replaylist system property. It takes a comma-separated list (no spaces) of commit log files to replay. Just specify a handful, e.g.,

-Dcassandra.replaylist=/cassandra_data/commitlog/CommitLog-6-1497956111366.log,/cassandra_data/commitlog/CommitLog-6-1497956111367.log,/cassandra_data/commitlog/CommitLog-6-1497956111368.log,/cassandra_data/commitlog/CommitLog-6-1497956111369.log,/cassandra_data/commitlog/CommitLog-6-1497956111370.log

If Cassandra starts up and gets into a normal state, could you please stop the pod? I would like to have a look at the Cassandra logs afterwards to see if the drain operation actually happens. Lastly, I would like to know the contents of /cassandra_data/commitlog after the shutdown.
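
One way to pass the property, assuming the image honors a JVM_OPTS environment variable the way stock Cassandra's cassandra-env.sh does (an assumption about this image):

# Assumption: cassandra-env.sh in the image appends $JVM_OPTS as stock
# Cassandra does. The file list is illustrative, not exhaustive.
oc set env rc hawkular-cassandra-1 -n openshift-infra \
  JVM_OPTS='-Dcassandra.replaylist=/cassandra_data/commitlog/CommitLog-6-1497956111366.log,/cassandra_data/commitlog/CommitLog-6-1497956111367.log'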

Comment 6 Wesley Hearn 2017-07-24 19:56:32 UTC
@John Sanda

We restarted the gear with a higher memory limit, so the commit logs have already been processed. So I do not think we have any of that information lying around anymore.

Comment 7 Matt Wringe 2017-08-03 16:06:29 UTC
@jsanda: is there something we can be doing here? Such as limiting the size of the commit logs that get generated?

Comment 8 John Sanda 2017-08-04 03:02:53 UTC
(In reply to Matt Wringe from comment #7)
> @jsanda: is there something we can be doing here? Such as limiting the size
> of the commit logs that get generated?

Write performance in Cassandra pretty much comes down to the commit log, so there will likely be a trade-off: reducing the size might eliminate some of the startup problems at the cost of worse write performance. I think we need to do some testing and take some measurements to better understand those trade-offs.

Here is something else to consider. We use commit log compression, which is disabled by default in upstream Cassandra. The commitlog_segment_size_in_mb property in cassandra.yaml specifies the size of commit log segment files; this is basically an upper bound. When a compressed commit log segment is replayed, it first has to be decompressed, and the uncompressed data will probably be larger than 32 MB (the default segment size). I am pretty sure that the default commit log settings in cassandra.yaml assume a commit log that is *not* compressed.
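
For reference, a sketch of the relevant cassandra.yaml knobs (the segment size is shown at its upstream default; the compression block illustrates what enabling compression looks like):

# cassandra.yaml (sketch)
commitlog_segment_size_in_mb: 32
# when unset, upstream uses min(8 GB, 1/4 of the commit log volume)
commitlog_total_space_in_mb: 8192
# compression is off unless a block like this is present
commitlog_compression:
  - class_name: LZ4Compressor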

This article, https://goo.gl/s3J4iF, suggests a couple of configuration changes for write-heavy workloads when using commit log compression. I am thinking that we might want to do a couple of things for now.

1) Turn off commit log compression until we understand what settings we should be using with it enabled.

2) Configure the commit log size based on the total available memory. Before we do this, we need to do a lot of testing. We need to make sure we understand what can happen when the overall size of the commit log on disk is larger than total system memory, and the ramifications of using a smaller commit log. Once we understand these things, we might want to revisit commit log compression.

For now, I think the safe thing is to focus on 1).

Comment 9 Matt Wringe 2017-10-05 18:57:46 UTC
In 3.7 we turned off commit log compression, which should prevent this problem from occurring.

Comment 10 Junqi Zhao 2017-10-16 09:09:41 UTC
(In reply to Matt Wringe from comment #9)
> In 3.7 we turned off commit log compression, which should prevent this
> problem from occurring.

It seems we must attach a PV so that the commit logs are not lost after scaling down and up.

Comment 11 Junqi Zhao 2017-10-16 12:34:53 UTC
@Matt,
Is there a simple way to build up large commit logs?
I ran oc rsh into the hawkular-cassandra pod and copied files under /cassandra_data/commitlog/ to build up more than 2 GB of commit logs, but after scaling up, there are errors in the hawkular-cassandra pod, such as:

ERROR [main] 2017-10-16 12:08:53,137 CommitLogReplayer.java:688 - Ignoring commit log replay error
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Segment id mismatch (filename 1508147198335, descriptor 1508147198267) in file /cassandra_data/commitlog/CommitLog-6-1508147198335.log
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:684) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:363) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:145) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:181) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:161) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:295) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.redhat-1.jar:3.0.14.redhat-1]

Comment 12 John Sanda 2017-10-16 13:40:30 UTC
(In reply to Junqi Zhao from comment #11)
> Is there a simple way to build up large commit logs?

The simplest way at the moment would be to just let Cassandra run and monitor the size of the commit log directory. I can look into writing some automated tests/scripts to generate commit logs, but that will take a little more time.

Comment 13 Junqi Zhao 2017-10-27 01:32:45 UTC
(In reply to John Sanda from comment #12) 
> The simplest way at the moment would be to just let Cassandra run and
> monitor the size of the commit log directory. I can look into writing some
> automated tests/scripts to generate commit logs, but that will take a little
> more time.

I attached a PV and created 60 projects, each with one pod, and let these pods run for 16 hours. The maximum size of /cassandra_data/commitlog/ was only about 135 MB, far from 2 GB.

Comment 14 Matt Wringe 2017-10-27 13:02:53 UTC
@jsanda: would getting information from the scalability test environment help with respect to the commit log size?

Comment 15 John Sanda 2017-10-27 13:26:51 UTC
(In reply to Matt Wringe from comment #14)
> @jsanda: would getting information from the scalability test environment
> help with respect to the commit log size?

What kind of info?

I suspect that the reason there was no success in comment 13 is that most if not all writes were to the data table. The memtable is regularly getting flushed and the commit log segment can be deleted because all mutations have been written out to SSTables. 

We may have better luck with writes spread across more, smaller tables (in addition to the data table). The data memtables will still get flushed, but the other, smaller tables are small enough to remain on heap, and thus their commit log segments cannot be deleted as quickly. That's the theory, anyway. I can look into writing a test program like this; the next best option would be to generate the commit log manually using Cassandra's internal APIs.
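
As a starting point, a hypothetical load script along those lines using plain cqlsh (all names here are made up; point it at the Cassandra pod):

#!/bin/bash
# Sketch: spread small writes across many tables so their memtables stay
# on heap and the commit log segments holding their mutations cannot be
# recycled. Watch /cassandra_data/commitlog/ grow in another terminal.
HOST=${1:-localhost}

cqlsh "$HOST" -e "CREATE KEYSPACE IF NOT EXISTS loadtest
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"

for t in $(seq 1 100); do
  cqlsh "$HOST" -e "CREATE TABLE IF NOT EXISTS loadtest.t${t}
    (id uuid PRIMARY KEY, payload text);"
done

while true; do
  payload=$(head -c 512 /dev/urandom | base64 | tr -d '\n')
  for t in $(seq 1 100); do
    cqlsh "$HOST" -e "INSERT INTO loadtest.t${t} (id, payload)
      VALUES (uuid(), '${payload}');"
  done
done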

Comment 16 Matt Wringe 2017-10-27 13:41:10 UTC
(In reply to John Sanda from comment #15)
> (In reply to Matt Wringe from comment #14)
> > @jsanda: would getting information from the scalability test environment
> > help with respect to the commit log size?
> 
> What kind of info?

The size of things like commit logs on a system which is running with a lot more than just 60 pods.

> 
> I suspect that the reason there was no success in comment 13 is that most if
> not all writes were to the data table. The memtable is regularly getting
> flushed and the commit log segment can be deleted because all mutations have
> been written out to SSTables. 

Comment 13 is only 60 pods running for 16 hours; I don't think that is enough pods or enough time to see whether there is a problem here or not.

We have seen this memory problem a few times already when restarting the pod, and I would like to figure out exactly why it's happening and see if we can do something to prevent it.

Comment 17 John Sanda 2017-10-27 14:18:58 UTC
(In reply to Matt Wringe from comment #16)
> We have seen this memory problem a few times already when restarting the
> pod, and I would like to figure out exactly why it's happening and see if
> we can do something to prevent it.

Sure, let's see what the commit logs look like in the scalability tests.

Comment 18 Junqi Zhao 2017-11-02 08:15:32 UTC
(In reply to John Sanda from comment #17)
> Sure, let's see what the commit logs look like in the scalability tests.

Is someone doing scalability testing?

Comment 19 John Sanda 2017-11-02 13:47:54 UTC
Mike, are you still doing any scalability testing for 3.7? The volume of data your testing generates might produce commit logs big enough for us to borrow to verify the fix for this issue.

Comment 20 Vikas Laad 2017-11-03 20:14:43 UTC
I was able to generate a 2 GB commit log today, but here is what happened after I killed the pod using the following:

oc delete pod -n openshift-infra hawkular-cassandra-1-sn5wx --grace-period=0 --force=true

After the pod was killed, it was recreated immediately, but the commit log was empty and nodetool status did not show any load. I am going to try again, killing the pod in a different way.

Comment 21 Vikas Laad 2017-11-06 18:27:00 UTC
I was able to verify this bug using the following version (the metrics images were the same version):

v3.7.0-0.190.0

After building up a commit log larger than 2 GB, I scaled the metrics replicas to 0 and back to 1. No OOM kills occurred in the Cassandra pod. I did this twice to be sure, and everything was good.
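
For the record, a quick check that the container was not OOM-killed across the restart (the label selector is an assumption):

POD=$(oc get pod -n openshift-infra -l name=hawkular-cassandra-1 -o name | head -1)
# Prints "OOMKilled" if the kernel killed the previous container;
# empty output means the container has not been restarted.
oc get $POD -n openshift-infra \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'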

Comment 24 errata-xmlrpc 2017-11-28 22:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

