Bug 1113585 - LevelDBStore.stop() crashes JVM in native code
Summary: LevelDBStore.stop() crashes JVM in native code
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ER1
Target Release: 6.3.1
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-26 13:09 UTC by Radim Vansa
Modified: 2015-01-26 14:05 UTC (History)
CC List: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-01-26 14:05:06 UTC
Type: Bug
Embargoed:


Attachments
crash log (78.93 KB, text/x-log)
2014-08-08 13:48 UTC, Radim Vansa


Links
System ID Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-4651 Major Resolved LevelDB crashes JVM when stop() is called concurrently with write() 2015-05-13 01:56:50 UTC

Description Radim Vansa 2014-06-26 13:09:22 UTC
REPL non-tx cache with LevelDB JNI, executed in edg-perflab (Red Hat Enterprise Linux Server release 6.5 (Santiago), 2.6.32-431.1.2.el6.x86_64)

         <leveldbStore xmlns="urn:infinispan:config:store:leveldb:6.0"
                       implementationType="JNI"
                       location="/home_local/tmp/ispn-leveldb-jni/data"
                       expiredLocation="/home_local/tmp/ispn-leveldb-jni/expired"
                       purgeOnStartup="true" />


I loaded the cache with 100,000 1 kB entries, and when cacheManager.stop() was called, the JVM segfaulted or terminated silently with messages such as:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c753aaf84, pid=21149, tid=140309268805376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-1012947400470038599.17-redhat+0x40f84]  leveldb::Version::ForEachOverlapping(leveldb::Slice, leveldb::Slice, void*, bool (*)(void*, int, leveldb::FileMetaData*))+0x134
#
# Core dump written. Default location: /home_local/jenkins_tmp/smartfrog/radargun/slave04/core or core.21149
#
# An error report file with more information is saved as:
# /home_local/jenkins_tmp/smartfrog/radargun/slave04/hs_err_pid21149.log
pthread destroy mutex: Device or resource busy

or without segfault:

pthread lock: Invalid argument

or:

pure virtual method called
terminate called without an active exception

I also got a segfault with this:
[thread 140284642879232 also had an error]

pthread destroy mutex: Device or resource busy
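
For reference, a minimal, hypothetical sketch of the scenario described above: entries are written to a LevelDB-backed cache while cacheManager.stop() is invoked from another thread. The class name and configuration file name are illustrative only (the XML file is assumed to contain the leveldbStore element quoted above); this is not the actual RadarGun test.

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class StopDuringWriteRepro {
    public static void main(String[] args) throws Exception {
        // "leveldb-config.xml" is assumed to declare the <leveldbStore> shown above.
        final DefaultCacheManager cacheManager = new DefaultCacheManager("leveldb-config.xml");
        final Cache<String, byte[]> cache = cacheManager.getCache();
        final byte[] value = new byte[1024]; // 1 kB entries, as in the report

        // Writer thread keeps putting entries into the store.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    cache.put("key-" + i, value);
                }
            }
        });
        writer.start();

        // Stop the cache manager while writes may still be in flight;
        // with the JNI LevelDB store this raced against write() in native code.
        Thread.sleep(1000);
        cacheManager.stop();
        writer.join();
    }
}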

Comment 2 Tomas Sykora 2014-07-09 10:53:16 UTC
That looks great with freshly built http://download.eng.bos.redhat.com/brewroot/repos/jb-edg-6-rhel-6-build/latest/maven/org/fusesource/leveldbjni/leveldbjni-all/1.13-redhat.002/leveldbjni-all-1.13-redhat.002.jar 

Job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

Note: we will need to put the respective JAR file into our zip (used in the job) again once CR3 is out.

I am expecting this BZ ON_QA for 6.3.0 CR3. Setting target release.

Comment 3 Tomas Sykora 2014-07-09 10:54:40 UTC
Just CCing Alan :))

(+ thank you Alan for your help with quick pre-CR3 verification)

Comment 4 Tomas Sykora 2014-07-10 11:08:32 UTC
Brilliantly awesome and quick fix :P

CR3 bits are ok, logs are clear as a mountain spring :)

VERIFIED

Comment 5 Alan Field 2014-07-14 13:12:20 UTC
Unfortunately, this is still reproducible in JDG 6.3.0 CR3. The previous verification by Tomas did not stop any node during the test. This job reproduces the segfaults with CR1 and CR3:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-radargun-elasticity-repl-leveldb

Comment 7 Alan Field 2014-07-14 15:47:47 UTC
The new Jenkins job starts a cluster of nodes in library mode and then tries to stop and start a single node in the cluster three times. The crash in the JNI code happens when stopping the node for the first time. The test case code is not using JON; it uses the Infinispan/JDG API to stop the cache and cache store on the single node. This might also happen when a node is being removed from the cluster.
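
For illustration, a hedged sketch of the stop/start cycle the job performs on one node in library mode (the class name and configuration file name are assumptions, not the actual RadarGun stages):

import org.infinispan.manager.DefaultCacheManager;

public class NodeRestartCycle {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager manager = new DefaultCacheManager("repl-leveldb-config.xml");
        manager.getCache(); // node joins the cluster and opens the LevelDB store

        for (int i = 0; i < 3; i++) {
            // Stopping the manager removes the node from the cluster and
            // closes the cache store; this is where the JNI crash appeared.
            manager.stop();

            // Start a fresh manager so the node rejoins the cluster.
            manager = new DefaultCacheManager("repl-leveldb-config.xml");
            manager.getCache();
        }
        manager.stop();
    }
}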

Comment 12 Radim Vansa 2014-08-08 13:48:17 UTC
Created attachment 925183 [details]
crash log

Attaching crash log from one instance of this issue.

Comment 13 Radim Vansa 2014-08-19 11:16:14 UTC
I think LevelDB can't correctly handle a close that runs concurrently with operations in other threads. I've assembled https://github.com/rvansa/jdg/tree/BZ1113585/LevelDB_JVM_crash/jdg_6.3.x with a semaphore that gives the close operation exclusive access, and the test that was previously crashing the node now passes.

Comment 15 Radim Vansa 2014-08-19 11:41:29 UTC
Divya: It can affect throughput because any thread writing to the store has to acquire a permit from the semaphore. However, writes can still proceed concurrently; the only synchronization is an atomic CAS operation inside the semaphore.
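
A minimal sketch of the guarding approach described above (an illustration of the idea, not the actual Infinispan patch): writers each take one permit so they can run concurrently, while stop() acquires every permit and therefore waits for in-flight operations before closing the native handle.

import java.util.concurrent.Semaphore;

public class GuardedStore {
    private static final int AVAILABLE = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(AVAILABLE);
    private volatile boolean stopped;

    public void write(Object key, Object value) {
        semaphore.acquireUninterruptibly(); // cheap CAS when uncontended
        try {
            if (stopped) {
                throw new IllegalStateException("Store is stopped");
            }
            // ... call into the LevelDB JNI handle here ...
        } finally {
            semaphore.release();
        }
    }

    public void stop() {
        // Take every permit: no new writer can enter and all in-flight
        // writers have finished, so closing the native handle is safe.
        semaphore.acquireUninterruptibly(AVAILABLE);
        try {
            stopped = true;
            // ... db.close() on the JNI handle here ...
        } finally {
            semaphore.release(AVAILABLE);
        }
    }
}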

Comment 18 Alan Field 2014-08-21 15:15:07 UTC
Verified that the JVM crash no longer occurs in JDG 6.3.1 ER1. Performance testing with and without this fix is next.

Comment 19 Alan Field 2014-08-22 12:01:05 UTC
Executed distributed and replicated tests with JDG 6.3.0 and 6.3.1 ER1. No performance regressions were observed for reads or writes.

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

