Bug 1113585 - LevelDBStore.stop() crashes JVM in native code
Summary: LevelDBStore.stop() crashes JVM in native code
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ER1
Target Release: 6.3.1
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-26 13:09 UTC by Radim Vansa
Modified: 2015-01-26 14:05 UTC (History)
CC List: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-01-26 14:05:06 UTC
Type: Bug
Embargoed:


Attachments
crash log (78.93 KB, text/x-log)
2014-08-08 13:48 UTC, Radim Vansa


Links
System ID Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-4651 Major Resolved LevelDB crashes JVM when stop() is called concurrently with write() 2015-05-13 01:56:50 UTC

Description Radim Vansa 2014-06-26 13:09:22 UTC
REPL non-tx cache with LevelDB JNI, executed in edg-perflab (Red Hat Enterprise Linux Server release 6.5 (Santiago), 2.6.32-431.1.2.el6.x86_64)

         <leveldbStore xmlns="urn:infinispan:config:store:leveldb:6.0"
                       implementationType="JNI"
                       location="/home_local/tmp/ispn-leveldb-jni/data"
                       expiredLocation="/home_local/tmp/ispn-leveldb-jni/expired"
                       purgeOnStartup="true" />


I loaded the cache with 100,000 1 kB entries, and when cacheManager.stop() was called, the JVM segfaulted or terminated silently with messages such as:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c753aaf84, pid=21149, tid=140309268805376
#
# JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-1012947400470038599.17-redhat+0x40f84]  leveldb::Version::ForEachOverlapping(leveldb::Slice, leveldb::Slice, void*, bool (*)(void*, int, leveldb::FileMetaData*))+0x134
#
# Core dump written. Default location: /home_local/jenkins_tmp/smartfrog/radargun/slave04/core or core.21149
#
# An error report file with more information is saved as:
# /home_local/jenkins_tmp/smartfrog/radargun/slave04/hs_err_pid21149.log
pthread destroy mutex: Device or resource busy

or without segfault:

pthread lock: Invalid argument

or:

pure virtual method called
terminate called without an active exception

I also got a segfault with this:
[thread 140284642879232 also had an error]

pthread destroy mutex: Device or resource busy
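
For reference, a minimal, hypothetical sketch of the scenario described above: entries are written to a LevelDB-backed cache while cacheManager.stop() is invoked from another thread. The class name and configuration file name are illustrative only (the XML file is assumed to contain the leveldbStore element quoted above); this is not the actual RadarGun test.

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class StopDuringWriteRepro {
    public static void main(String[] args) throws Exception {
        // "leveldb-config.xml" is assumed to declare the <leveldbStore> shown above.
        final DefaultCacheManager cacheManager = new DefaultCacheManager("leveldb-config.xml");
        final Cache<String, byte[]> cache = cacheManager.getCache();
        final byte[] value = new byte[1024]; // 1 kB entries, as in the report

        // Writer thread keeps putting entries into the store.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    cache.put("key-" + i, value);
                }
            }
        });
        writer.start();

        // Stop the cache manager while writes may still be in flight;
        // with the JNI LevelDB store this raced against write() in native code.
        Thread.sleep(1000);
        cacheManager.stop();
        writer.join();
    }
}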

Comment 2 Tomas Sykora 2014-07-09 10:53:16 UTC
That looks great with freshly built http://download.eng.bos.redhat.com/brewroot/repos/jb-edg-6-rhel-6-build/latest/maven/org/fusesource/leveldbjni/leveldbjni-all/1.13-redhat.002/leveldbjni-all-1.13-redhat.002.jar 

Job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

Note: we will need to put the respective JAR file into our zip (used in the job) again once CR3 is out.

I am expecting this BZ ON_QA for 6.3.0 CR3. Setting target release.

Comment 3 Tomas Sykora 2014-07-09 10:54:40 UTC
Just CCing Alan :))

(+ thank you Alan for your help with quick pre-CR3 verification)

Comment 4 Tomas Sykora 2014-07-10 11:08:32 UTC
Brilliantly awesome and quick fix :P

CR3 bits are ok, logs are clear as a mountain spring :)

VERIFIED

Comment 5 Alan Field 2014-07-14 13:12:20 UTC
Unfortunately, this is still reproducible in JDG 6.3.0 CR3. The previous verification by Tomas did not stop any node during the test. This job reproduces the segfaults with CR1 and CR3:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-radargun-elasticity-repl-leveldb

Comment 7 Alan Field 2014-07-14 15:47:47 UTC
The new Jenkins job starts a cluster of nodes in library mode and then tries to stop and start a single node in the cluster three times. The crash in the JNI code happens when stopping the node for the first time. The test case code is not using JON; it uses the Infinispan/JDG API to stop the cache and cache store on the single node. This might also happen when a node is being removed from the cluster.
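
For illustration, a hedged sketch of the stop/start cycle the job performs on one node in library mode (the class name and configuration file name are assumptions, not the actual RadarGun stages):

import org.infinispan.manager.DefaultCacheManager;

public class NodeRestartCycle {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager manager = new DefaultCacheManager("repl-leveldb-config.xml");
        manager.getCache(); // node joins the cluster and opens the LevelDB store

        for (int i = 0; i < 3; i++) {
            // Stopping the manager removes the node from the cluster and
            // closes the cache store; this is where the JNI crash appeared.
            manager.stop();

            // Start a fresh manager so the node rejoins the cluster.
            manager = new DefaultCacheManager("repl-leveldb-config.xml");
            manager.getCache();
        }
        manager.stop();
    }
}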

Comment 12 Radim Vansa 2014-08-08 13:48:17 UTC
Created attachment 925183 [details]
crash log

Attaching crash log from one instance of this issue.

Comment 13 Radim Vansa 2014-08-19 11:16:14 UTC
I think LevelDB can't correctly handle a close that runs concurrently with operations in other threads. I've assembled https://github.com/rvansa/jdg/tree/BZ1113585/LevelDB_JVM_crash/jdg_6.3.x with a semaphore that gives the close operation exclusive access, and the test that was previously crashing the node now passes.

Comment 15 Radim Vansa 2014-08-19 11:41:29 UTC
Divya: It can affect throughput because any thread writing to the store has to acquire a permit from the semaphore. However, writes can still proceed concurrently; the only synchronization is an atomic CAS operation inside the semaphore.
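
A minimal sketch of the guarding approach described above (an illustration of the idea, not the actual Infinispan patch): writers each take one permit so they can run concurrently, while stop() acquires every permit and therefore waits for in-flight operations before closing the native handle.

import java.util.concurrent.Semaphore;

public class GuardedStore {
    private static final int AVAILABLE = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(AVAILABLE);
    private volatile boolean stopped;

    public void write(Object key, Object value) {
        semaphore.acquireUninterruptibly(); // cheap CAS when uncontended
        try {
            if (stopped) {
                throw new IllegalStateException("Store is stopped");
            }
            // ... call into the LevelDB JNI handle here ...
        } finally {
            semaphore.release();
        }
    }

    public void stop() {
        // Take every permit: no new writer can enter and all in-flight
        // writers have finished, so closing the native handle is safe.
        semaphore.acquireUninterruptibly(AVAILABLE);
        try {
            stopped = true;
            // ... db.close() on the JNI handle here ...
        } finally {
            semaphore.release(AVAILABLE);
        }
    }
}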

Comment 18 Alan Field 2014-08-21 15:15:07 UTC
Verified that the JVM crash no longer occurs in JDG 6.3.1 ER1. Performance testing with and without this fix is next.

Comment 19 Alan Field 2014-08-22 12:01:05 UTC
Executed distributed and replicated tests with JDG 6.3.0 and 6.3.1 ER1. No performance regressions were observed for reads or writes.

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-LIB/job/jdg-radargun-leveldb-jni-test/

