Bug 1529570

Summary: Gfapi segfaults when used in multiple threads
Product: [Community] GlusterFS
Reporter: Chris Holcombe <xfactor973>
Component: libgfapi
Assignee: bugs <bugs>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: mainline
CC: bugs, fw, jthottan, moagrawa, ndevos, rkavunga, skoduri
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-25 04:20:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Chris Holcombe 2017-12-28 18:07:21 UTC
Description of problem: gfapi appears to be thread safe when multiple threads make calls against a single connection. However, I've run into segfaults when multiple threads each open their own connection to the cluster. I've produced a minimal example showing what is going on (see the repository linked in the steps to reproduce below).
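
For reference, here is a minimal C-level sketch of the pattern the reproducer exercises: several threads each create, initialize, and tear down their own glfs_t connection. This is only an illustration using the public libgfapi calls; the actual reproducer is the Rust program linked in the steps below, and the volume name, host, and port here are placeholders matching the setup steps.

/*
 * Minimal C sketch (not the actual Rust reproducer) of the crashing pattern:
 * each thread creates, initializes and tears down its own glfs_t connection.
 * Volume name, host and port are placeholders.
 *
 * Build (assuming glusterfs-api-devel is installed):
 *   gcc -o gfapi_threads gfapi_threads.c -lgfapi -lpthread
 */
#include <stdio.h>
#include <pthread.h>
#include <glusterfs/api/glfs.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    (void)arg;

    /* Each thread gets its own connection to the cluster. */
    glfs_t *fs = glfs_new("test");
    if (!fs)
        return NULL;

    glfs_set_volfile_server(fs, "tcp", "localhost", 24007);

    if (glfs_init(fs) == 0)
        printf("thread %lu connected\n", (unsigned long)pthread_self());

    /* Per-thread teardown; the reported crashes show up in the
     * glfs_new()/mem_pools_init_early() and glfs_fini() paths. */
    glfs_fini(fs);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}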


Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
0. Setup gluster server:
   a. mkdir /mnt/gluster-brick
   b. gluster vol create test $HOSTNAME:/mnt/gluster-brick force
   c. gluster vol start test
   d. 
1. Clone the GitHub repo containing the reproducer: https://github.com/cholcombe973/gfapi_bug/blob/master/src/main.rs
2. Compile the binary: 
   a. Install rustup: https://rustup.rs/
   b. run: `cargo build` in the gfapi_bug directory
3. Run the binary on any Gluster server.
   a. run: `gdb ./target/debug/gfapi_bug`
   b. Keep running it until it segfaults. It does not crash every time under gdb, but it crashes every time without gdb.


Actual results: 
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/chris/scratch/./target/debug/gfapi_bug 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff53ff700 (LWP 11991)]
[New Thread 0x7ffff51fe700 (LWP 11992)]
[New Thread 0x7ffff4bff700 (LWP 11993)]
[New Thread 0x7ffff43fe700 (LWP 11994)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff51fe700 (LWP 11992)]
0x00007ffff669ec2e in _gf_log () from /lib64/libglusterfs.so.0
(gdb) where
#0  0x00007ffff669ec2e in _gf_log () from /lib64/libglusterfs.so.0
#1  0x00007ffff66c6c71 in mem_pools_init_early () from /lib64/libglusterfs.so.0
#2  0x00007ffff7bbb3ff in glfs_new () from /lib64/libgfapi.so.0
#3  0x0000555555565430 in gfapi_sys::gluster::{{impl}}::connect (volume_name=..., server=..., port=24007)
    at /home/chris/.cargo/registry/src/github.com-1ecc6299db9ec823/gfapi-sys-0.3.6/src/gluster.rs:257
#4  0x0000555555563e7a in gfapi_bug::main::{{closure}} () at src/main.rs:16
#5  0x000055555555fbdd in std::sys_common::backtrace::__rust_begin_short_backtrace<closure,()> (f=...)
    at /checkout/src/libstd/sys_common/backtrace.rs:134
#6  0x0000555555564a15 in std::thread::{{impl}}::spawn::{{closure}}::{{closure}}<closure,()> () at /checkout/src/libstd/thread/mod.rs:400
#7  0x000055555556367d in std::panic::{{impl}}::call_once<(),closure> (self=..., _args=0) at /checkout/src/libstd/panic.rs:296
#8  0x00005555555637b2 in std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,()> (data=0x7ffff51fdba8 "")
    at /checkout/src/libstd/panicking.rs:480
#9  0x000055555558ba9d in panic_unwind::__rust_maybe_catch_panic () at /checkout/src/libpanic_unwind/lib.rs:99
#10 0x00005555555636fc in std::panicking::try<(),std::panic::AssertUnwindSafe<closure>> (f=...) at /checkout/src/libstd/panicking.rs:459
#11 0x00005555555636a4 in std::panic::catch_unwind<std::panic::AssertUnwindSafe<closure>,()> (f=...) at /checkout/src/libstd/panic.rs:361
#12 0x00005555555648c5 in std::thread::{{impl}}::spawn::{{closure}}<closure,()> () at /checkout/src/libstd/thread/mod.rs:399
#13 0x0000555555564c48 in alloc::boxed::{{impl}}::call_box<(),closure> (self=0x7ffff5427020, args=0) at /checkout/src/liballoc/boxed.rs:726
#14 0x0000555555583e9c in call_once<(),()> () at /checkout/src/liballoc/boxed.rs:736
#15 start_thread () at /checkout/src/libstd/sys_common/thread.rs:24
#16 std::sys::imp::thread::{{impl}}::new::thread_start () at /checkout/src/libstd/sys/unix/thread.rs:90
#17 0x00007ffff7592e25 in start_thread (arg=0x7ffff51fe700) at pthread_create.c:308
#18 0x00007ffff70aa34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Expected results: I expected the program to print the cluster UUID several times and exit.


Additional info:
I don't know if this is related, but valgrind is showing an invalid read:
valgrind --leak-check=full ./gfapi_bug
==32839== Memcheck, a memory error detector
==32839== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==32839== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==32839== Command: ./scratch
==32839== 
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
cluster vol id: Ok(Uuid("175ddabd-ea14-4dbf-b797-0f8801e4eb95"))
==32839== Invalid read of size 4
==32839==    at 0x62F359E: __gf_free (in /usr/lib64/libglusterfs.so.0.0.1)
==32839==    by 0x29A4DA52: ??? (in /usr/lib64/glusterfs/3.12.1/rpc-transport/socket.so)
==32839==    by 0x400FB79: _dl_fini (in /usr/lib64/ld-2.17.so)
==32839==    by 0x58D4A68: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==32839==    by 0x58D4AB4: exit (in /usr/lib64/libc-2.17.so)
==32839==    by 0x58BDC0B: (below main) (in /usr/lib64/libc-2.17.so)

Comment 1 Niels de Vos 2018-01-15 12:08:47 UTC
Thanks Chris. I can reproduce the crashes on Fedora 27 with the steps you provided.

  # dnf install rust cargo glusterfs-server

It does not crash consistently for me, but it certainly crashes at least 1 out of 30 times. The stack traces look a little different from yours; the one below is the most common I encountered after running the reproducer several times.

#0  0x00007f7251cdf69b in raise () from /lib64/libc.so.6
#1  0x00007f7251ce13b1 in abort () from /lib64/libc.so.6
#2  0x00007f7251d29a87 in __libc_message () from /lib64/libc.so.6
#3  0x00007f7251d30e8e in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f7251d32989 in _int_free () from /lib64/libc.so.6
#5  0x00007f7251d3b2ee in free () from /lib64/libc.so.6
#6  0x00007f72528d7611 in priv_glfs_free_from_ctx (fs=<optimized out>) at glfs.c:865
#7  0x00007f72528d7fa8 in pub_glfs_fini (fs=0x7f7244000bf0) at glfs.c:1295


I'm not sure when I'll have time to look into this further; maybe Poornima gets to it before me?

Comment 2 Chris Holcombe 2018-01-18 18:09:30 UTC
That's interesting. What version of gfapi comes with Fedora 27? I was using 3.12 on CentOS. I figured that with Rust I might be pushing gfapi harder than usual; it's super easy to spin up threads and punish the library :)

Comment 3 Shyamsundar 2018-10-23 14:54:26 UTC
Release 3.12 has reached end of life and this bug was still in the NEW state, hence the version is being moved to mainline so the bug can be triaged and appropriate action taken.

Comment 5 Mohit Agrawal 2020-02-25 04:20:16 UTC
Hi,

 I am not able to reproduce this on the latest master, so the issue seems to be no longer valid.
 Please reopen it if you face the issue again.

Thanks,
Mohit Agrawal