Bug 1237049
Summary: JustID use case fails under high load
Product: [JBoss] JBoss Data Virtualization 6
Reporter: Filip Nguyen <fnguyen>
Component: ModeShape
Assignee: Horia Chiorean <hchiorea>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.2.0
CC: atangrin, dlesage, drieden, felias, fnguyen, hchiorea, mbaluch, vhalbert
Target Milestone: CR1
Keywords: QA-Closed, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Infinispan is not able to acquire a lock under high loads, resulting in errors. As a result, the user will encounter exceptions.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-10 08:49:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1202354
Bug Blocks:
Attachments:
Description
Filip Nguyen
2015-06-30 09:28:14 UTC
Created attachment 1044630 [details]
logs
Can these failures be reproduced locally? If yes, can you please provide the relevant test setup & scripts. Thanks.

I have not been able to reproduce locally. Even in the cluster it works for low load; it starts failing under high load.

Can you define what high load means? Is it lots of reads, lots of writes, or both? From the logs I can see the transactions failing because Infinispan locking is failing, but that's about it.

Both reads and writes. Multiple threads (web service calls) simultaneously read and write the nodes. The domain is formed by 4 servers: 2 of those servers are used for writing and the other two for reading.

Looking at the server logs (08, 10, 11 & 12), all the exceptions come from the same source: Infinispan is not able to acquire a lock on an entry due to a timeout. Did you try changing the config of the ISPN cache and increasing the lock timeout? If not, it's something worth trying out in the EAP ISPN cache configuration:

    <cache-container>
        <cache>
            <locking isolation="READ_COMMITTED" acquire-timeout="45000"/>
        </cache>
    </cache-container>

The exception from the client log is an indication that https://bugzilla.redhat.com/show_bug.cgi?id=1198302 has perhaps not been fixed entirely. Has this ISPN issue been verified under the same load as the ModeShape JustID use case? If not, it's very important that the same load test (in terms of reading/writing directly to ISPN) is performed.

Horia, I will try to increase the acquire timeout, but I am not sure how this could cause the cluster not to recover from such an error. As I said, after the error starts happening the JCR API calls fail every time, and the only thing that helps is to clean and restart the whole domain. ISPN-4810 has been verified with higher load with a different automated test and it appears to be fixed.

To elaborate on the above comment: the ISPN fix has to be validated in the same configuration (replicated cache & load) as the ModeShape JustID use case.
I have no idea why, when locking fails, the rest of the JCR calls start failing. Perhaps ISPN gets into an invalid state, at which point ModeShape cannot function. Moreover, the exception from the logs is always raised from a background thread, not the main application/code thread. I cannot find in the QE server logs any other information except the locking timeout exception, which comes from ISPN and is converted by ModeShape. If testing with an increased locking timeout doesn't change anything, I can try to add a workaround in the code where this exception (in this particular case) is only logged as a warning and the thread retries the code at some point in the future. This will not change the fact that ISPN can't lock the nodes, however.

Comment 10: Yes, makes sense.

Comment 11: OK, I have restarted the test with Acquire Timeout (ms): 45000.

Please also check the similar test case for BZ 1198302. All this comes from Infinispan, and this should tell us whether it runs correctly with Infinispan or not.

Created attachment 1044921 [details]
second try according to comment 11

Comment 11: It helped. There are still some lock errors, but it seems that it partially does recover. The test is still running (adding 30000 image nodes), but I started hitting an out-of-memory error (see logs in try2.zip). What is this error? It seems that it is not a memory problem for the EAP server itself, but an error thrown by Lucene. 1) Is it possible to increase the amount of memory? 2) How come this error even comes up? The requests are meant to be stateless and the only limiting factor should be disk space.

Comment 14: I have taken a look at the test for BZ 1198302. It currently uses Arquillian and is run on Jenkins. I will have to migrate it to the cluster. This will take 1-2 days, plus the runtime of the test is 24 hours. Since we still have testing to finish by the end of this week, I will have to push this "check" to next week.
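The retry workaround mentioned above (log the lock timeout as a warning and try the operation again later) can be sketched in plain Java. This is a minimal, hypothetical illustration with `java.util.concurrent` primitives, not ModeShape's actual code; the method name, backoff policy, and attempt count are all assumptions:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class LockRetry {

    // Hypothetical sketch: instead of failing the whole operation on the
    // first lock timeout, retry the critical section a bounded number of
    // times with a simple linear backoff between attempts.
    static boolean runWithRetry(ReentrantLock lock, Runnable work,
                                int maxAttempts, long timeoutMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                try {
                    work.run();
                    return true; // critical section completed
                } finally {
                    lock.unlock();
                }
            }
            // Acquisition timed out: back off before the next attempt.
            if (attempt < maxAttempts) {
                Thread.sleep(timeoutMs * attempt);
            }
        }
        return false; // caller decides whether exhausting retries is fatal
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        AtomicBoolean ran = new AtomicBoolean(false);
        boolean ok = runWithRetry(lock, () -> ran.set(true), 3, 100);
        System.out.println("succeeded=" + ok + " ran=" + ran.get());
    }
}
```

As noted in the comment, this only masks the symptom: if ISPN genuinely cannot lock the nodes, retries merely delay the failure.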
(In reply to Filip Nguyen from comment #16) I've opened 2 ModeShape JIRAs for the lock issues. Regardless of these lock issues, I think it's good in general (maybe this should be part of the docs) to increase the lock timeout under heavy load.

Regarding Lucene's/Hibernate Search's memory issue, I don't know where that comes from; I've never seen it before. It's even stranger if you say that the server memory appears to be fine. I've opened a separate JIRA for this, which I will investigate as well. I think you can work around this last issue, however, by making sure the node type which holds the binary value is not queryable (see the description of https://issues.jboss.org/browse/MODE-2486). In the meantime, I'll ack this BZ and it should be marked as a blocker for DV 6.2.

(In reply to Filip Nguyen from comment #16) I noticed in the log where the OOM is reported by Hibernate Search that the JVM is run with -Xmx256m. 256 MB is nowhere near enough, especially when 1) running under high load and 2) working with binary values, which have to be read and stored in the Binary Store each time a binary property is created. ModeShape's unit tests (run during a regular build) use -Xmx1524M. Please make sure you're using *at least that* when running the load tests *on each of the servers*.

Updated the DV 6.2 product branch with fixes for the locking issues. It should be noted that if the load is high and the same parts of a hierarchy/nodes are being updated/changed simultaneously, **there will be lock contention no matter what happens**. This means that in those particular cases, the ISPN lock timeout should be increased, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c7

Regarding the OOM message: what happens is that if an OOM occurs on the server while the Lucene indexing writer is active - i.e. due to the server instance (EAP) not having enough memory - Lucene will catch it, set a boolean flag, and print this message.
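The "not queryable" workaround above relies on the standard JCR 2.0 CND `noquery` node type attribute. A minimal, hypothetical CND fragment (the `acme` namespace and names are invented for illustration; they are not taken from the JustID code):

```
<acme = 'http://www.example.com/acme/1.0'>

// Node type holding the large binary value; 'noquery' excludes it from
// the query indexes, so Lucene never has to index the binary content.
[acme:photoContent] > nt:unstructured
  noquery
  - acme:data (binary)
```

Registering such a type keeps the binaries out of Hibernate Search/Lucene indexing, which is the mechanism MODE-2486 describes for avoiding the indexing-related OOM.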
The reason why, upon investigation, the server memory looked fine is that there are no memory leaks during indexing. However, because the server does not have enough memory to handle the load (see https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c19), at some point an OOM occurs, Lucene catches it and prints the message, and later on the GC runs and frees up memory (albeit too late). The solution here (as described in https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c19 and in the corresponding JIRA issue https://issues.jboss.org/browse/MODE-2486) **is to make sure the server instance(s) have enough memory to handle the load test scenario**.

Also note that the status of https://bugzilla.redhat.com/show_bug.cgi?id=1202354 indicates that the ISPN bug which caused data corruption in ModeShape appears to not be fully fixed & available in EAP 6.4. This is a blocking requirement for any JustID testing.

Created attachment 1063784 [details]
log files 2
With ER4 I still experience lost nodes. There are no more out-of-memory exceptions or deadlocks (thanks to the adjusted settings indicated earlier in the comments of this BZ).
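The heap adjustment recommended earlier in this BZ would typically go into EAP's `bin/standalone.conf`. A sketch, assuming the standard EAP 6 layout (the 1524m figure matches what ModeShape's own unit tests use; the right value depends on the actual load):

```shell
# bin/standalone.conf -- give each EAP instance enough heap before load testing
JAVA_OPTS="-Xms1524m -Xmx1524m $JAVA_OPTS"
```

The same change has to be applied on every server in the domain, since each instance runs its own indexing writer.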
Are the tests removing any nodes, or just adding & searching? This exception is the main "manifestation" of the ISPN bug which causes corrupt data. See https://bugzilla.redhat.com/show_bug.cgi?id=1202354 (linked to this issue).

Both: adding and, in the meantime, searching. The cluster has 4 nodes; 2 nodes are used for writing and 2 for searching. After the test finishes, the assert tries to verify that all nodes have been written. This assert fails with higher loads.

I can't say for sure if this is ModeShape's issue or caused by the ISPN bug. What I do know for a fact is that the ISPN issue will corrupt data, causing exactly these types of exceptions. Therefore, I'm reluctant to change anything in the ModeShape codebase until this use case has been run against a server which has the ISPN fix.

Filip, in the test case code which is being run, when are 'Rendities' nodes being created? So in the following sequence of calls:

    2015-08-17 07:39:50 [pool-1-thread-4] INFO GetHttpCall:14 - Getting over SEARCH URL: http://10.16.91.80:8080/justid/photos?photoset=fotoset690
    2015-08-17 07:39:50 [pool-1-thread-4] INFO HttpUtil:34 - Getting url:http://10.16.91.80:8080/justid/photos?photoset=fotoset690
    2015-08-17 07:39:50 [pool-1-thread-4] INFO HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition0] to url: http://10.16.91.79:8080/justid/rendition
    2015-08-17 07:39:50 [pool-1-thread-2] INFO HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition1] to url: http://10.16.91.79:8080/justid/rendition
    2015-08-17 07:39:50 [pool-1-thread-1] INFO HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition2] to url: http://10.16.91.77:8080/justid/rendition
    2015-08-17 07:39:50 [pool-1-thread-3] INFO HttpUtil:17 - Posting:AddRenditionRequest [photosetName=fotoset690, renditionName=rendition3] to url: http://10.16.91.77:8080/justid/rendition

when is the /fotoset/2001/2/4/1/fotoset690/Rendities node supposed
to be created? I will try to answer by showing parts of the code, commented with the overall idea. If this is not clear enough, let me know and I will try to approach the explanation from a different angle. First, all the photosets are created in this loop:

    // Adding photos in parallel
    for (long i = 0; i < datasetSize; i++) {
        requestTracker.awaitRequests(20);
        String photosetname = NameConstants.PHOTO_SET + i;
        if (i % photosetsInHour == 0)
            c.add(Calendar.HOUR, 1);
        submitAddPhotoRequest(loadBalancer.getOrefUrl(photosUrl), c, photosetname);
    }

Then, after all photosets are uploaded (which takes hours for a big photoset count), a second loop begins and Rendities are appended to those photosets:

    for (long i = 0; i < datasetSize; i++) {
        requestTracker.awaitRequests(20);
        String photosetname = NameConstants.PHOTO_SET + i;
        if (i % photosetsInHour == 0)
            c.add(Calendar.HOUR, 1);
        submitSearchPhotoset(loadBalancer.getOscfUrl(String.format(searchUrl, photosetname)));
        for (int k = 0; k < 4; k++) {
            submitAddRenditionRequest(loadBalancer.getOrefUrl(renditionUrl), c, photosetname, k);
        }
    }

Only after this second loop ends and all pending HTTP requests are done does the assert check that all photosets and Rendities are in place.

So in the above example, first the creation and upload of "/fotoset/2001/2/4/1/fotoset690" takes place, and then during "submitAddRenditionRequest" the "/fotoset/2001/2/4/1/fotoset690/Rendities" node is created (presumably). Can you please share (either directly or via a link) the code which performs the "addRenditionRequest"? The reason for these questions is that, from reading the latest logs, it seems the client test code sends a bunch of async requests to the server, meaning these requests can be processed by the server in any order.
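The `requestTracker.awaitRequests(20)` call in the loops above presumably caps the number of in-flight requests. A minimal stdlib sketch of that kind of throttle, assuming a permit-based design (the class and method names are inventions, not the actual JustID test code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RequestThrottle {
    private final Semaphore permits;

    public RequestThrottle(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks until fewer than maxInFlight requests are outstanding, then
    // submits the request; the permit is released when the request finishes.
    public void submit(ExecutorService pool, Runnable request) throws InterruptedException {
        permits.acquire();
        pool.execute(() -> {
            try {
                request.run();
            } finally {
                permits.release();
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        RequestThrottle throttle = new RequestThrottle(20);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            throttle.submit(pool, done::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("completed=" + done.get()); // prints completed=100
    }
}
```

Note that such a throttle limits concurrency but guarantees nothing about the order in which the server processes the requests, which is exactly the concern raised above.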
So I'm trying to understand (based on how the code is actually written) why/if exceptions like:

    javax.jcr.PathNotFoundException: No item exists at path Rendities relative to /fotoset/2001/2/4/4/fotoset999 in workspace "default"
        at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:847)
        at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:766)
        at org.modeshape.jcr.AbstractJcrNode.getNode(AbstractJcrNode.java:109)
        at org.jboss.qe.JustIdWriterWS.createFolderNode(JustIdWriterWS.java:339)
        at org.jboss.qe.JustIdWriterWS.addRendition(JustIdWriterWS.java:153)

can ever happen. The "Rendities" node seems to be a collection of "renditions", presumably. Are these types of nodes ever removed by the code, by issuing an explicit remove call?

The reason these exceptions happen is that the client submits parallel search requests. If these fail, it's not taken as a problem; it's just to simulate traffic. The actual problem is that in the final assert the node is not there. I will send you the sources through e-mail; I am not sure you have permission to see our Git. And no, there is no explicit remove call for Rendities. Actually, the code doesn't ever remove any node.

Moving back to ASSIGNED, because investigating the latest logs and analyzing the code revealed an additional problem that can arise around indexing. Added an additional commit which adds better handling of the indexing of remote changes.

"Infinispan is not able to acquire a lock under high loads, resulting in errors. As a result, the user will encounter exceptions."

This is not correct. In high load/high contention cases, the client should change the ISPN configuration and increase the locking timeout, as described in: https://bugzilla.redhat.com/show_bug.cgi?id=1237049#c7

No, this has to be retested on EAP 6.4.4, because of a blocking Infinispan fix: https://bugzilla.redhat.com/show_bug.cgi?id=1202354 which was not part of EAP 6.4.3 (on which DV 6.2 was released).
Once EAP 6.4.4 is available, QE should move this to ON_QA and re-run the test.

Moving to ON_QA since EAP 6.4.4 is available with the ISPN fix.

The reproducer passed. Tested using EAP 6.4.4 and DV 6.2.
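As an editorial footnote on the PathNotFoundException discussed earlier in this BZ: with async requests processed in arbitrary order, a "check then create the Rendities child" sequence is a classic check-then-act race. A stdlib Java sketch of the hazard and an idempotent get-or-create alternative (this models the pattern with a map, not actual JCR/ModeShape code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GetOrCreate {
    // Models a tree: parent path -> (child name -> child value).
    private final Map<String, Map<String, String>> nodes = new ConcurrentHashMap<>();

    // Racy version: between containsKey and put, a concurrent (or
    // reordered) request may still observe the parent as missing.
    public Map<String, String> racyChild(String parent) {
        if (!nodes.containsKey(parent)) {
            nodes.put(parent, new ConcurrentHashMap<>());
        }
        return nodes.get(parent);
    }

    // Idempotent version: computeIfAbsent creates the entry atomically,
    // so every caller sees a single, fully created child container.
    public Map<String, String> safeChild(String parent) {
        return nodes.computeIfAbsent(parent, k -> new ConcurrentHashMap<>());
    }

    public static void main(String[] args) {
        GetOrCreate store = new GetOrCreate();
        store.safeChild("/fotoset/fotoset690").put("Rendities", "rendition0");
        System.out.println(store.safeChild("/fotoset/fotoset690").get("Rendities")); // prints rendition0
    }
}
```

In JCR terms, the analogous defensive move is to create the parent node within the same session and save before issuing dependent requests, or to retry when the parent is not yet visible.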