Bug 1113585 - LevelDBStore.stop() crashes JVM in native code
Summary: LevelDBStore.stop() crashes JVM in native code
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ER1
Target Release: 6.3.1
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-06-26 13:09 UTC by Radim Vansa
Modified: 2015-01-26 14:05 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously in Red Hat JBoss Data Grid, when a cache using LevelDB cache store was stopped (for example, as a consequence of stopping the cache manager), the LevelDB native implementation caused a segmentation fault in the JVM process. As a result of this segmentation fault, the process crashed. This issue is now fixed in JBoss Data Grid 6.3.1 so that using the LevelDB cache store native implementation works as expected.
Clone Of:
Environment:
Last Closed: 2015-01-26 14:05:06 UTC
Type: Bug
Embargoed:


Attachments
crash log (78.93 KB, text/x-log)
2014-08-08 13:48 UTC, Radim Vansa


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-4651 0 Major Resolved LevelDB crashes JVM when stop() is called concurrently with write() 2015-05-13 01:56:50 UTC

Description Radim Vansa 2014-06-26 13:09:22 UTC
REPL non-tx cache with LevelDB JNI, executed in edg-perflab (Red Hat Enterprise Linux Server release 6.5 (Santiago), 2.6.32-431.1.2.el6.x86_64)

         <leveldbStore xmlns="urn:infinispan:config:store:leveldb:6.0"
                       implementationType="JNI"
                       location="/home_local/tmp/ispn-leveldb-jni/data"
                       expiredLocation="/home_local/tmp/ispn-leveldb-jni/expired"
                       purgeOnStartup="true" />


I loaded the cache with 100000 1kB entries, and when cacheManager.stop() was called, the JVM either segfaulted or terminated silently with messages such as:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c753aaf84, pid=21149, tid=140309268805376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-1012947400470038599.17-redhat+0x40f84]  leveldb::Version::ForEachOverlapping(leveldb::Slice, leveldb::Slice, void*, bool (*)(void*, int, leveldb::FileMetaData*))+0x134
#
# Core dump written. Default location: /home_local/jenkins_tmp/smartfrog/radargun/slave04/core or core.21149
#
# An error report file with more information is saved as:
# /home_local/jenkins_tmp/smartfrog/radargun/slave04/hs_err_pid21149.log
pthread destroy mutex: Device or resource busy

or without segfault:

pthread lock: Invalid argument

or:

pure virtual method called
terminate called without an active exception

I also got a segfault with this:
[thread 140284642879232 also had an error]

pthread destroy mutex: Device or resource busy

Comment 2 Tomas Sykora 2014-07-09 10:53:16 UTC
That looks great with freshly built http://download.eng.bos.redhat.com/brewroot/repos/jb-edg-6-rhel-6-build/latest/maven/org/fusesource/leveldbjni/leveldbjni-all/1.13-redhat.002/leveldbjni-all-1.13-redhat.002.jar 

Job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

Note: we will need to put respective JAR file into our zip (used in job) again, once CR3 is out.

I am expecting this BZ ON_QA for 6.3.0 CR3. Setting target release.

Comment 3 Tomas Sykora 2014-07-09 10:54:40 UTC
Just CCing Alan :))

(+ thank you Alan for your help with quick pre-CR3 verification)

Comment 4 Tomas Sykora 2014-07-10 11:08:32 UTC
Brilliantly awesome and quick fix :P

CR3 bits are ok, logs are clear as a mountain spring :)

VERIFIED

Comment 5 Alan Field 2014-07-14 13:12:20 UTC
Unfortunately, this is reproducible in JDG 6.3.0 CR3. The previous verification by Tomas did not stop a single node during the test. This job reproduces the segfaults with CR1 and CR3:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-radargun-elasticity-repl-leveldb

Comment 7 Alan Field 2014-07-14 15:47:47 UTC
The new Jenkins job starts a cluster of nodes in library mode, and then tries to stop and start a single node in the cluster 3 times. The crash in the JNI code happens when stopping the node the first time. The test case code does not use JON; it uses the Infinispan/JDG API to stop the cache and cache store on the single node. This might happen when a node is being removed from the cluster.

Comment 12 Radim Vansa 2014-08-08 13:48:17 UTC
Created attachment 925183
crash log

Attaching crash log from one instance of this issue.

Comment 13 Radim Vansa 2014-08-19 11:16:14 UTC
I think that LevelDB cannot correctly handle a close running concurrently with operations in other threads. I've assembled https://github.com/rvansa/jdg/tree/BZ1113585/LevelDB_JVM_crash/jdg_6.3.x with a semaphore that gives the close operation exclusive access, and the test that previously crashed the node now passes.

Comment 15 Radim Vansa 2014-08-19 11:41:29 UTC
Divya: It can affect throughput because any thread writing to the store has to acquire a permit from the semaphore. However, writes can still proceed concurrently; the only synchronization is an atomic CAS operation inside the semaphore.
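
For readers who want to see the shape of the semaphore approach described in comments 13 and 15, here is a minimal sketch. It is not the code from the linked branch: the class GuardedStore, its methods, and the counter standing in for the native LevelDB handle are all hypothetical, and the choice of a fair semaphore is an assumption made here so that stop() cannot be starved by a steady stream of writers. Each write takes one permit, so writes run concurrently; stop() takes every permit, so it only runs once no write is in flight.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the actual LevelDBStore code.
final class GuardedStore {
    // stop() must collect all permits, so this is also the cap on concurrent writers.
    private static final int MAX_PERMITS = Integer.MAX_VALUE;

    // Fair ordering is an assumption of this sketch: writers arriving after
    // stop() has queued wait behind it instead of overtaking it.
    private final Semaphore permits = new Semaphore(MAX_PERMITS, true);

    // Stands in for the native store; counts writes that were applied.
    private final AtomicInteger applied = new AtomicInteger();

    private volatile boolean stopped;

    /** Returns true if the write was applied, false if the store is already stopped. */
    boolean write(String key, String value) {
        permits.acquireUninterruptibly();       // one permit per in-flight operation
        try {
            if (stopped) {
                return false;                   // never touch a closed native handle
            }
            applied.incrementAndGet();          // placeholder for the native put
            return true;
        } finally {
            permits.release();
        }
    }

    /** Waits for in-flight writes to finish, then closes with exclusive access. */
    void stop() {
        permits.acquireUninterruptibly(MAX_PERMITS);
        try {
            stopped = true;                     // placeholder for the native close
        } finally {
            permits.release(MAX_PERMITS);
        }
    }

    int appliedWrites() {
        return applied.get();
    }
}
```

In the uncontended case each write pays for one acquire and one release on the semaphore, which matches the cost described in comment 15. A write that arrives after stop() is rejected rather than passed to the closed store.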

Comment 18 Alan Field 2014-08-21 15:15:07 UTC
Verified that the JVM crash does not exist in JDG 6.3.1 ER1. Performance test with and without this fix is next.

Comment 19 Alan Field 2014-08-22 12:01:05 UTC
Executed distributed and replicated tests with JDG 6.3.0 and 6.3.1 ER1. No performance regressions for reads or writes were observed.

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

