Bug 1237049 - JustID use case fails under high load
Summary: JustID use case fails under high load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Virtualization 6
Classification: JBoss
Component: ModeShape
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: CR1
Target Release: ---
Assignee: Horia Chiorean
QA Contact:
URL:
Whiteboard:
Depends On: 1202354
Blocks:
 
Reported: 2015-06-30 09:28 UTC by Filip Nguyen
Modified: 2016-02-10 08:49 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Infinispan is not able to acquire a lock under high loads, resulting in errors. As a result, the user will encounter exceptions.
Clone Of:
Environment:
Last Closed: 2016-02-10 08:49:21 UTC
Type: Bug
Embargoed:


Attachments
logs (262.70 KB, application/zip), attached 2015-06-30 09:29 UTC by Filip Nguyen
second try according to comment 11 (1004.21 KB, application/zip), attached 2015-07-01 07:03 UTC by Filip Nguyen
log files 2 (295.17 KB, application/zip), attached 2015-08-17 12:02 UTC by Filip Nguyen


Links
Red Hat Issue Tracker MODE-2484 (Blocker, Resolved): Possible Infinispan deadlock when processing binary values. Last updated 2017-05-15 20:27:01 UTC
Red Hat Issue Tracker MODE-2485 (Blocker, Resolved): Infinispan lock timeout exceptions should not break repository GC jobs. Last updated 2017-05-15 20:27:01 UTC
Red Hat Issue Tracker MODE-2486 (Major, Resolved): Investigate Lucene index memory usage when indexing nodes that have binary values, text extraction is disabled and full ... Last updated 2017-05-15 20:26:59 UTC
Red Hat Issue Tracker MODE-2500 (Critical, Resolved): Indexing remote nodes should not attempt to do so using transactions. Last updated 2017-05-15 20:26:58 UTC

Description Filip Nguyen 2015-06-30 09:28:14 UTC
QA was asked to perform an end-to-end test for the JustID use case. Short description of the use case: multiple clients use the JCR API to write binary nodes containing images; a replicated Infinispan cache is used, backed by an Oracle database.

This usage crashes under high load with the following errors [1] [2], and it seems that the servers cannot recover from it (no more JCR operations can be called). Attaching logs from the 4 servers that simultaneously invoke the JCR operations and form the cluster. Also attaching the clients' log.
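
For reference, the write path exercised by this use case is roughly the following (a minimal sketch only; the method, workspace and node names are assumptions, not the actual test code):

    // Minimal sketch of the JCR binary write path; all names here are assumed.
    void addPhoto(javax.jcr.Repository repository, java.io.InputStream image)
            throws javax.jcr.RepositoryException {
        javax.jcr.Session session = repository.login("default");
        try {
            javax.jcr.Node photoset = session.getRootNode().addNode("fotoset0", "nt:folder");
            javax.jcr.Node file = photoset.addNode("photo.jpg", "nt:file");
            javax.jcr.Node content = file.addNode("jcr:content", "nt:resource");
            // The binary value ends up in the replicated Infinispan cache backed by Oracle.
            content.setProperty("jcr:data", session.getValueFactory().createBinary(image));
            session.save();
        } finally {
            session.logout();
        }
    }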

[1] 
Transaction DummyTransaction{xid=DummyXid{id=237831}, status=1} is not in a valid state to be invoking cache operations on.
        at org.infinispan.interceptors.TxInterceptor.enlist(TxInterceptor.java:272)
        at org.infinispan.interceptors.TxInterceptor.enlistIfNeeded(TxInterceptor.java:236)
        at org.infinispan.interceptors.TxInterceptor.enlistReadAndInvokeNext(TxInterceptor.java:230)
        at org.infinispan.interceptors.TxInterceptor.visitGetKeyValueCommand(TxInterceptor.java:226)
        at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:58)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:134)


[2] 
05:21:18,832 ERROR [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-6,shared=tcp) Exception executing command: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [15 seconds] on key [d7d2e6b317f1e7mode:locks] for requestor [GlobalTransaction:<dvqe12:server12/modeshape>:2281:remote]! Lock held by [GlobalTransaction:<dvqe10:server10/modeshape>:1533:remote]
        at org.infinispan.util.concurrent.locks.LockManagerImpl.lock(LockManagerImpl.java:213)
        at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLock(LockManagerImpl.java:186)
        at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockKeyAndCheckOwnership(AbstractTxLockingInterceptor.java:192)
        at org.infinispan.interceptors.locking.AbstractTxLockingInterceptor.lockAndRegisterBackupLock(AbstractTxLockingInterceptor.java:129)
        at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.visitLockControlCommand(PessimisticLockingInterceptor.java:253)
        at org.infinispan.commands.control.LockControlCommand.acceptVisitor(LockControlCommand.java:131)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:134)
        at org.infinispan.commands.AbstractVisitor.visitLockControlCommand(AbstractVisitor.java:177)
        at org.infinispan.commands.control.LockControlCommand.acceptVisitor(LockControlCommand.java:131)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
        at org.infinispan.interceptors.T

Comment 1 Filip Nguyen 2015-06-30 09:29:11 UTC
Created attachment 1044630 [details]
logs

Comment 3 Horia Chiorean 2015-06-30 12:30:15 UTC
Can these failures be reproduced locally? If yes, can you please provide the relevant test setup & scripts? Thanks.

Comment 4 Filip Nguyen 2015-06-30 12:31:05 UTC
I have not been able to reproduce this locally. Even in the cluster it works under low load; it starts failing under high load.

Comment 5 Horia Chiorean 2015-06-30 12:34:56 UTC
Can you define what high load means? Is it lots of reads, lots of writes, or both?
From the logs I can see the transactions failing because Infinispan locking is failing, but that's about it.

Comment 6 Filip Nguyen 2015-06-30 12:40:08 UTC
Both reads and writes. Multiple threads (web service calls) simultaneously read and write the nodes. The domain is formed by 4 servers: 2 of those servers are used for writing and the other two for reading.

Comment 7 Horia Chiorean 2015-06-30 12:50:13 UTC
Looking at the server logs - 08, 10, 11 & 12 - all the exceptions come from the same source: Infinispan is not able to acquire a lock on an entry due to a timeout.

Did you try changing the config of the ISPN cache and increasing the lock timeout? If not, it's something worth trying out in the EAP ISPN cache configuration:

<cache-container>
  <cache>
    <locking isolation="READ_COMMITTED" acquire-timeout="45000"/>
  </cache>
</cache-container>

Comment 8 Horia Chiorean 2015-06-30 13:02:40 UTC
The exception from the client log is an indication that https://bugzilla.redhat.com/show_bug.cgi?id=1198302 has perhaps not been fixed entirely.

Has this ISPN issue been verified under the same load as the ModeShape JustID use case? If not, it's very important that the same load test (in terms of reading/writing directly to ISPN) is performed.

Comment 9 Filip Nguyen 2015-06-30 13:07:49 UTC
Horia, I will try to increase the acquire timeout. But I am not sure how this could cause the cluster not to recover from such an error. As I said, after the error starts happening the JCR API calls fail every time, and the only thing that helps is to clean and restart the whole domain.

ISPN-4810 has been verified with a higher load in a different automated test, and it appears to be fixed.

Comment 10 Horia Chiorean 2015-06-30 13:08:38 UTC
To elaborate on the above comment: the ISPN fix has to be validated in the same configuration (replicated cache & load) as the ModeShape JustID use case.

Comment 11 Horia Chiorean 2015-06-30 13:15:50 UTC
I have no idea why, when locking fails, the rest of the JCR calls start failing. Perhaps ISPN gets into an invalid state at which point ModeShape cannot function.
Moreover, the exception in the logs is always raised from a background thread - not the main application/code thread.

I cannot find any information in the QE server logs other than the locking timeout exception, which comes from ISPN and is converted by ModeShape.

If testing with an increased locking timeout doesn't change anything, I can try to add a workaround in the code where this exception (in this particular case) is only logged as a warning and the thread retries the operation at some point in the future. This will not change the fact that ISPN can't lock the nodes, however.
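
For illustration only, such a workaround could look roughly like the following sketch (hypothetical; not the actual ModeShape change, and the logger and task names are assumed):

    // Hypothetical sketch: log the lock timeout as a warning and retry later,
    // instead of letting the background thread fail permanently.
    private static final java.util.logging.Logger LOG =
            java.util.logging.Logger.getLogger("lock-retry");

    void runWithRetry(Runnable task, int maxAttempts, long retryDelayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return; // succeeded
            } catch (org.infinispan.util.concurrent.TimeoutException e) {
                // The lock may be free on a later attempt, so only warn here.
                LOG.log(java.util.logging.Level.WARNING,
                        "Lock timeout on attempt " + attempt + ", retrying", e);
                Thread.sleep(retryDelayMillis); // simple fixed backoff
            }
        }
    }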

Comment 12 Filip Nguyen 2015-06-30 13:17:19 UTC
Comment 10: Yes, that makes sense.

Comment 13 Filip Nguyen 2015-06-30 13:21:05 UTC
Comment 11: OK, I have restarted the test with Acquire Timeout (ms) set to 45000.

Comment 14 Anne-Louise Tangring 2015-06-30 15:27:36 UTC
Please also check the similar test case for BZ 1198302.
All of this comes from Infinispan, and that test should tell us whether this runs correctly with Infinispan or not.

Comment 15 Filip Nguyen 2015-07-01 07:03:39 UTC
Created attachment 1044921 [details]
second try according to comment 11

Comment 16 Filip Nguyen 2015-07-01 07:07:15 UTC
Comment 11: It helped. There are still some lock errors, but it seems that it does partially recover. The test is still running (adding 30000 image nodes), but I started hitting an out-of-memory error (see logs in try2.zip). What is this error? It seems that it is not a memory problem of the EAP server itself but an error thrown by Lucene.
 1) Is it possible to increase the amount of memory?
 2) How can this error even come up? The requests are meant to be stateless, and the only limiting factor should be disk space.

Comment 17 Filip Nguyen 2015-07-01 08:42:33 UTC
Comment 14: I have taken a look at the test for BZ 1198302. It currently uses Arquillian and is run on Jenkins. I will have to migrate it to the cluster. This will take 1-2 days, plus the runtime of the test is 24 hours. Since we still have testing to finish by the end of this week, I will have to push this "check" to next week.

Comment 18 Horia Chiorean 2015-07-01 09:54:04 UTC
(In reply to Filip Nguyen from comment #16)

I've opened 2 ModeShape JIRAs for the lock issues. Regardless of these lock issues, I think it's generally good (maybe this should be part of the docs) to increase the lock timeout under heavy load.

Regarding the Lucene/Hibernate Search memory issue, I don't know where that comes from; I've never seen it before. It's even stranger given that you say the server memory appears to be fine. I've opened a separate JIRA for this which I will investigate as well.
I think you can work around this last issue, however, by making sure the node type which holds the binary value is not queryable (see the description of https://issues.jboss.org/browse/MODE-2486).
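
For example, in a ModeShape CND file the node type holding the binary could be declared with the standard CND 'noquery' attribute (a sketch only; the 'acme' namespace and 'acme:photo' type are assumed names):

    // Sketch only: the 'acme' namespace and 'acme:photo' type are assumed names.
    <acme = 'http://example.com/acme'>
    [acme:photo] > nt:file noquery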

In the meantime, I'll ack this BZ and it should be marked as a blocker for DV 6.2.

Comment 19 Horia Chiorean 2015-07-03 13:21:12 UTC
(In reply to Filip Nguyen from comment #16)
I noticed in the log where the OOM is reported by Hibernate Search that the JVM is run with -Xmx256m. 256 MB is nowhere near enough, especially when
  1) running under high load
  2) working with binary values - which have to be read and stored in the Binary Store each time a binary property is created

ModeShape's unit tests (run during a regular build) use -Xmx1524M. Please make sure you're using *at least that* when running the load tests *on each of the servers*.
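
Since the servers run in a domain, the heap settings would typically go in the host controller's JVM configuration. A sketch for EAP 6 host.xml using the value above (the jvm name and surrounding configuration are assumptions):

    <!-- Sketch for EAP 6 domain mode (host.xml); heap sized per the note above -->
    <jvm name="default">
        <heap size="1524m" max-size="1524m"/>
    </jvm>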

Comment 20 Horia Chiorean 2015-07-09 10:14:39 UTC
Updated the DV 6.2 product branch with fixes for the locking issues. 

It should be noted that if the load is high and the same parts of a hierarchy/nodes are being updated simultaneously, **there will be lock contention no matter what happens**. This means that in those particular cases, the ISPN lock timeout should be increased, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c7

Regarding the OOM message: what happens is that if an OOM occurs on the server while the Lucene indexing writer is active - i.e. due to the server instance (EAP) not having enough memory - Lucene will catch it, set a boolean flag and print this message.
The reason the server memory looked fine upon investigation is that there are no memory leaks during indexing. However, because the server does not have enough memory to handle the load (see https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c19), at some point an OOM occurs, Lucene catches it and prints the message, and later on the GC runs and frees up memory (albeit too late).

The solution here (as described in https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c19 and in the corresponding JIRA issue https://issues.jboss.org/browse/MODE-2486) **is to make sure the server instance(s) have enough memory to handle the load test scenario**.

Comment 21 Horia Chiorean 2015-07-28 07:59:42 UTC
Also note that the status of https://bugzilla.redhat.com/show_bug.cgi?id=1202354 indicates that the ISPN bug which caused data corruption in ModeShape appears not to be fully fixed & available in EAP 6.4.

This is a blocking requirement for any JustID testing.

Comment 22 Filip Nguyen 2015-08-17 12:02:04 UTC
Created attachment 1063784 [details]
log files 2

With ER4 I still experience lost nodes. There are no more out-of-memory exceptions or deadlocks (thanks to the adjusted settings indicated earlier in the comments of this BZ).

Comment 23 Horia Chiorean 2015-08-17 12:12:26 UTC
Are the tests removing any nodes, or just adding & searching?

This exception is the main "manifestation" of the ISPN bug which causes corrupt data. See https://bugzilla.redhat.com/show_bug.cgi?id=1202354 (linked to this issue).

Comment 24 Filip Nguyen 2015-08-17 12:20:06 UTC
Both: adding and, in the meantime, searching. The cluster has 4 nodes; 2 nodes are used for writing and 2 for searching. After the test finishes, an assert tries to verify that all nodes have been written. This assert fails under higher loads.

Comment 25 Horia Chiorean 2015-08-17 12:30:12 UTC
I can't say for sure whether this is a ModeShape issue or caused by the ISPN bug.
What I do know for a fact is that the ISPN issue will corrupt data, causing exactly these types of exceptions.

Therefore, I'm reluctant to change anything in the ModeShape codebase until this use case has been run against a server which has the ISPN fix.

Comment 26 Horia Chiorean 2015-08-18 07:56:09 UTC
Filip, in the test case code which is being run, when are the 'Rendities' nodes created?

So in the following sequence of calls:

2015-08-17 07:39:50 [pool-1-thread-4] INFO  GetHttpCall:14 - Getting over SEARCH URL: http://10.16.91.80:8080/justid/photos?photoset=fotoset690

2015-08-17 07:39:50 [pool-1-thread-4] INFO  HttpUtil:34 - Getting url:http://10.16.91.80:8080/justid/photos?photoset=fotoset690

2015-08-17 07:39:50 [pool-1-thread-4] INFO  HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition0] to url: http://10.16.91.79:8080/justid/rendition

2015-08-17 07:39:50 [pool-1-thread-2] INFO  HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition1] to url: http://10.16.91.79:8080/justid/rendition

2015-08-17 07:39:50 [pool-1-thread-1] INFO  HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition2] to url: http://10.16.91.77:8080/justid/rendition

2015-08-17 07:39:50 [pool-1-thread-3] INFO  HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition3] to url: http://10.16.91.77:8080/justid/rendition

when is the /fotoset/2001/2/4/1/fotoset690/Rendities node supposed to be created?

Comment 27 Filip Nguyen 2015-08-18 08:33:03 UTC
I will try to answer by showing parts of the code and commenting on the overall idea. If this is not clear enough, let me know and I will try to approach the explanation from a different angle.

First, all the photosets are created in this loop:

// Adding photos in parallel
for (long i = 0; i < datasetSize; i++) {
    requestTracker.awaitRequests(20);

    String photosetname = NameConstants.PHOTO_SET + i;
    if (i % photosetsInHour == 0) {
        c.add(Calendar.HOUR, 1);
    }

    submitAddPhotoRequest(loadBalancer.getOrefUrl(photosUrl), c, photosetname);
}


Then, after all photosets are uploaded (which takes hours for a big photoset count), a second loop begins and Rendities are appended to those photosets:


for (long i = 0; i < datasetSize; i++) {
    requestTracker.awaitRequests(20);

    String photosetname = NameConstants.PHOTO_SET + i;
    if (i % photosetsInHour == 0) {
        c.add(Calendar.HOUR, 1);
    }

    submitSearchPhotoset(loadBalancer.getOscfUrl(String.format(searchUrl, photosetname)));

    for (int k = 0; k < 4; k++) {
        submitAddRenditionRequest(loadBalancer.getOrefUrl(renditionUrl), c, photosetname, k);
    }
}

Only after this second loop ends and all pending HTTP requests are done does the assert run that checks all photosets and Rendities are in place.

Comment 28 Horia Chiorean 2015-08-18 08:42:00 UTC
So in the above example, first the creation and upload of "/fotoset/2001/2/4/1/fotoset690" takes place, and then during "submitAddRenditionRequest" the "/fotoset/2001/2/4/1/fotoset690/Rendities" node is (presumably) created.

Can you please share (either directly or via a link) the code which performs the "addRenditionRequest"?

The reason for these questions is that, from reading the latest logs, it seems the client test code sends a bunch of async requests to the server, meaning these requests can really be processed by the server in any order. So I'm trying to understand (based on how the code is actually written) why/if exceptions like:

javax.jcr.PathNotFoundException: No item exists at path Rendities relative to /fotoset/2001/2/4/4/fotoset999 in workspace "default"
	at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:847)
	at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:766)
	at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:109)
	at org.jboss.qe.JustIdWriterWS.createFolderNode(JustIdWriterWS.java:339)
	at org.jboss.qe.JustIdWriterWS.addRendition(JustIdWriterWS.java:153) 

can ever happen. 

The "Rendities" node seems to be a collection of "renditions" presumably. Are these types of nodes ever removed by the code - by issuing an explicit remove call ?

Comment 29 Filip Nguyen 2015-08-18 09:42:36 UTC
The reason these exceptions happen is that the client submits parallel search requests. If these fail, it's not taken as a problem; they are just there to simulate traffic. The actual problem is that in the final assert the node is not there.

I will send you the sources through e-mail; I am not sure you have permission to see our Git.

Comment 30 Filip Nguyen 2015-08-18 09:45:07 UTC
And no, there is no explicit remove call on Rendities. Actually, the code never removes any node.

Comment 31 Horia Chiorean 2015-08-20 13:03:53 UTC
Moving back to assigned, because investigating the latest logs and analyzing the code revealed an additional problem that can arise around indexing.

Comment 32 Horia Chiorean 2015-08-24 08:35:32 UTC
Added an additional commit which adds better handling of the indexing of remote changes.

Comment 33 Horia Chiorean 2015-09-14 07:12:26 UTC
"Infinispan is not able to acquire a lock under high loads, resulting in errors. As a result, the user will encounter exceptions."

This is not correct. In high load/high contention cases, the client should change the ISPN configuration and increase the locking timeout, as described in: https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c7

Comment 35 Horia Chiorean 2015-10-05 06:02:02 UTC
No, this has to be retested on EAP 6.4.4, because of a blocking Infinispan fix (https://bugzilla.redhat.com/show_bug.cgi?id=1202354) which was not part of EAP 6.4.3 (on which DV 6.2 was released).

Once EAP 6.4.4 is available, QE should move this to ON_QA and re-run the test.

Comment 36 Horia Chiorean 2015-11-16 10:10:23 UTC
Moving to ON_QA since EAP 6.4.4 is available with the ISPN fix.

Comment 37 Filip Elias 2015-11-16 10:24:28 UTC
The reproducer passed. 
Tested using EAP 6.4.4 and DV 6.2.

