REPL non-tx cache with LevelDB JNI, executed in edg-perflab (Red Hat Enterprise Linux Server release 6.5 (Santiago), 2.6.32-431.1.2.el6.x86_64)

<leveldbStore xmlns="urn:infinispan:config:store:leveldb:6.0" implementationType="JNI" location="/home_local/tmp/ispn-leveldb-jni/data" expiredLocation="/home_local/tmp/ispn-leveldb-jni/expired" purgeOnStartup="true" />

I have loaded the cache with 100000 1kB entries, and when cacheManager.stop() was called, I got JVM segfaults/silent terminations with such messages:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f9c753aaf84, pid=21149, tid=140309268805376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libleveldbjni-64-1-1012947400470038599.17-redhat+0x40f84] leveldb::Version::ForEachOverlapping(leveldb::Slice, leveldb::Slice, void*, bool (*)(void*, int, leveldb::FileMetaData*))+0x134
#
# Core dump written. Default location: /home_local/jenkins_tmp/smartfrog/radargun/slave04/core or core.21149
#
# An error report file with more information is saved as:
# /home_local/jenkins_tmp/smartfrog/radargun/slave04/hs_err_pid21149.log
pthread destroy mutex: Device or resource busy

or without segfault:

pthread lock: Invalid argument

or:

pure virtual method called
terminate called without an active exception

I also got segfault with this:

[thread 140284642879232 also had an error]
pthread destroy mutex: Device or resource busy
That looks great with the freshly built http://download.eng.bos.redhat.com/brewroot/repos/jb-edg-6-rhel-6-build/latest/maven/org/fusesource/leveldbjni/leveldbjni-all/1.13-redhat.002/leveldbjni-all-1.13-redhat.002.jar
Job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/
Note: we will need to put the respective JAR file into our zip (used in the job) again once CR3 is out. I am expecting this BZ to be ON_QA for 6.3.0 CR3. Setting target release.
Just CCing Alan :)) (+ thank you Alan for your help with quick pre-CR3 verification)
Brilliantly awesome and quick fix :P CR3 bits are ok, logs are clear as a mountain spring :) VERIFIED
Unfortunately, this is reproducible in JDG 6.3.0 CR3. The previous verification by Tomas did not stop a single node during the test. This job reproduces the segfaults with CR1 and CR3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-radargun-elasticity-repl-leveldb
The new Jenkins job starts a cluster of nodes in library mode, and then tries to stop and start a single node in the cluster 3 times. The crash in the JNI code happens when stopping the node the first time. The test case code is not using JON, it is using the Infinispan/JDG API to stop the cache and cachestore on the single node. This might happen when a node is being removed from the cluster.
Created attachment 925183: crash log
Attaching crash log from one instance of this issue.
I think that LevelDB cannot correctly handle a close running concurrently with operations in other threads. I've assembled https://github.com/rvansa/jdg/tree/BZ1113585/LevelDB_JVM_crash/jdg_6.3.x with a semaphore giving exclusive access for the close operation, and the test which was previously crashing the node now passes.
Divya: It can affect throughput because any thread writing to the store has to acquire a permit from the semaphore. However, writes can proceed concurrently; the only synchronization is an atomic CAS operation inside the semaphore.
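For readers who want to see the idea in isolation, here is a minimal sketch of the pattern, not the actual patch from the branch linked above. The class name GuardedStore, the MAX_PERMITS constant, and the in-memory map standing in for the native LevelDB handle are all hypothetical. Regular operations take one permit each, so they run concurrently; close takes every permit, so it waits for in-flight operations and keeps new ones out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical wrapper illustrating the locking scheme only.
class GuardedStore {
    // Assumed upper bound on concurrently running operations.
    private static final int MAX_PERMITS = 1 << 16;

    private final Semaphore permits = new Semaphore(MAX_PERMITS);
    // Stand-in for the native DB handle (assumption, not the real JNI object).
    private final Map<String, byte[]> db = new ConcurrentHashMap<String, byte[]>();
    private volatile boolean closed;

    void put(String key, byte[] value) {
        permits.acquireUninterruptibly();      // one permit: shared access
        try {
            if (closed) {
                throw new IllegalStateException("store is closed");
            }
            db.put(key, value);
        } finally {
            permits.release();
        }
    }

    byte[] get(String key) {
        permits.acquireUninterruptibly();      // one permit: shared access
        try {
            if (closed) {
                throw new IllegalStateException("store is closed");
            }
            return db.get(key);
        } finally {
            permits.release();
        }
    }

    void close() {
        // All permits: blocks until every in-flight operation has released.
        permits.acquireUninterruptibly(MAX_PERMITS);
        try {
            closed = true;
            db.clear();                        // where the native close would go
        } finally {
            permits.release(MAX_PERMITS);
        }
    }
}
```

After close() returns, later calls fail with IllegalStateException instead of touching a released handle. The cost on the hot path is a single acquire/release pair per operation, which matches the throughput note above.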
Verified that the JVM crash does not exist in JDG 6.3.1 ER1. Performance test with and without this fix is next.
Executed distributed and replicated tests with JDG 6.3.0 and 6.3.1 ER1. No performance regressions for reads or writes were observed. https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/