Bug 1475632

Summary: Brick Multiplexing: Brick process crashed at changetimerecorder(ctr) translator when restarting volumes
Product: [Community] GlusterFS Reporter: Mohit Agrawal <moagrawa>
Component: coreAssignee: Mohit Agrawal <moagrawa>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: mainlineCC: amukherj, bugs, nchilaka, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-3.13.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1472129 Environment:
Last Closed: 2017-12-08 17:35:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1472129    
Bug Blocks:    

Description Mohit Agrawal 2017-07-27 03:47:45 UTC
+++ This bug was initially created as a clone of Bug #1472129 +++

Description of problem:
====================
As part of verification of BZ#1468514 I was doing below
Had 50 volumes all 1x3 with brick mux enabled, stopped all of them in sequence and then started them again in sequence, did this for 200 times.
Seems to have crashed at around ~170th loop


#for j in {1..200};do echo "#########################################";date;for i in $(gluster v list);do gluster v stop $i --mode=script;done;ps -ef|grep glusterfsd;echo "############ end of loop $j ########################" ;for k in $(gluster v list);do gluster v start $k;done;done



also was doing IO to only one volume throughout, ie vol_9
#for i in {1..1000};do mkdir dir.$i;for j in {1..500000};do touch dir.$i/file.$i.$j ;done;done


As discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c22
raising a new bug as this was hit in ctr xlator code







#0  0x00007ff98a30c1f7 in raise () from /lib64/libc.so.6
#1  0x00007ff98a30d8e8 in abort () from /lib64/libc.so.6
#2  0x00007ff98a34bf47 in __libc_message () from /lib64/libc.so.6
#3  0x00007ff98a351b54 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ff98a3537aa in _int_free () from /lib64/libc.so.6
#5  0x00007ff97d95e311 in gf_sql_connection_fini (sql_connection=sql_connection@entry=0x7ff8e8496b50) at gfdb_sqlite3.c:42
#6  0x00007ff97d95e38a in gf_sqlite3_fini (db_conn=0x7ff92ca04470) at gfdb_sqlite3.c:507
#7  0x00007ff97d957156 in fini_db (_conn_node=0x7ff92ca04470) at gfdb_data_store.c:326
#8  0x00007ff97db78679 in notify (this=0x7ff92c5b3670, event=9, data=0x7ff92c5b5a00) at changetimerecorder.c:2178
#9  0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b3670, event=event@entry=9, data=data@entry=0x7ff92c5b5a00) at xlator.c:549
#10 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5b5a00, event=9, data=data@entry=0x7ff92c5b6d50) at defaults.c:3139
#11 0x00007ff97d48d0e3 in notify (this=0x7ff92c5b5a00, event=<optimized out>, data=0x7ff92c5b6d50) at changelog.c:2312
#12 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b5a00, event=event@entry=9, data=data@entry=0x7ff92c5b6d50) at xlator.c:549
#13 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5b6d50, event=9, data=<optimized out>) at defaults.c:3139
#14 0x00007ff97d059181 in notify (this=0x7ff92c5b6d50, event=<optimized out>, data=<optimized out>) at bit-rot-stub.c:243
#15 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b6d50, event=event@entry=9, data=data@entry=0x7ff92c5b7e00) at xlator.c:549
#16 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b7e00, event=9, data=<optimized out>) at defaults.c:3139
#17 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b7e00, event=event@entry=9, data=data@entry=0x7ff92c5b8c40) at xlator.c:549
#18 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b8c40, event=9, data=<optimized out>) at defaults.c:3139
#19 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b8c40, event=event@entry=9, data=data@entry=0x7ff92c5b9a00) at xlator.c:549
#20 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b9a00, event=9, data=<optimized out>) at defaults.c:3139
#21 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b9a00, event=event@entry=9, data=data@entry=0x7ff92c5bab00) at xlator.c:549
#22 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bab00, event=9, data=<optimized out>) at defaults.c:3139
#23 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bab00, event=event@entry=9, data=data@entry=0x7ff92c5bbb40) at xlator.c:549
#24 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bbb40, event=9, data=<optimized out>) at defaults.c:3139
#25 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bbb40, event=event@entry=9, data=data@entry=0x7ff92c5bcb70) at xlator.c:549
#26 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bcb70, event=9, data=<optimized out>) at defaults.c:3139
#27 0x00007ff97c3e9a32 in notify (this=0x7ff92c5bcb70, event=<optimized out>, data=0x7ff92c5bdba0) at upcall.c:2511
#28 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bcb70, event=event@entry=9, data=data@entry=0x7ff92c5bdba0) at xlator.c:549
#29 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5bdba0, event=9, data=<optimized out>) at defaults.c:3139
#30 0x00007ff97c1d19e3 in notify (this=0x7ff92c5bdba0, event=<optimized out>, data=<optimized out>) at io-threads.c:1053
#31 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bdba0, event=event@entry=9, data=data@entry=0x7ff92c5bea90) at xlator.c:549
#32 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bea90, event=9, data=<optimized out>) at defaults.c:3139
#33 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bea90, event=event@entry=9, data=data@entry=0x7ff92c5c06e0) at xlator.c:549
#34 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c06e0, event=event@entry=9, data=data@entry=0x7ff92c5c1890) at defaults.c:3139
#35 0x00007ff977bd84fb in notify (this=0x7ff92c5c06e0, event=9, data=0x7ff92c5c1890) at barrier.c:534
#36 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c06e0, event=event@entry=9, data=data@entry=0x7ff92c5c1890) at xlator.c:549
#37 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c1890, event=9, data=<optimized out>) at defaults.c:3139
#38 0x00007ff9779ccd98 in notify (this=0x7ff92c5c1890, event=<optimized out>, data=<optimized out>) at index.c:2530
#39 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c1890, event=event@entry=9, data=data@entry=0x7ff92c5c2dc0) at xlator.c:549
#40 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5c2dc0, event=9, data=<optimized out>) at defaults.c:3139
#41 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c2dc0, event=event@entry=9, data=data@entry=0x7ff92c5c4230) at xlator.c:549
#42 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c4230, event=9, data=data@entry=0x7ff92c5c5880) at defaults.c:3139
#43 0x00007ff97758f3da in notify (this=0x7ff92c5c4230, event=<optimized out>, data=0x7ff92c5c5880) at io-stats.c:4150
#44 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c4230, event=event@entry=9, data=data@entry=0x7ff92c5c5880) at xlator.c:549
#45 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c5880, event=event@entry=9, data=data@entry=0x7ff92c5c5880) at defaults.c:3139
#46 0x00007ff97711cd9f in notify (this=<optimized out>, event=<optimized out>, data=0x7ff92c5c5880) at server.c:1587
#47 0x000056273ffcf387 in glusterfs_handle_terminate (req=0x7ff8e8002930) at glusterfsd-mgmt.c:263
#48 0x00007ff98bcd9f02 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#49 0x00007ff98a31dd40 in ?? () from /lib64/libc.so.6






Thread 548 (Thread 0x7ff8f44f5700 (LWP 481)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97d052d4b in br_stub_worker (data=<optimized out>) at bit-rot-stub-helpers.c:369
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 547 (Thread 0x7ff872fb6700 (LWP 482)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97d495d73 in changelog_ev_connector (data=0x7ff92e267438) at changelog-ev-handle.c:202
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 546 (Thread 0x7ff870fb2700 (LWP 488)):
#0  0x00007ff98a3961ad in nanosleep () from /lib64/libc.so.6
#1  0x00007ff98a396044 in sleep () from /lib64/libc.so.6
#2  0x00007ff97e5d866d in posix_health_check_thread_proc (data=0x7ff92e21b0a0) at posix-helpers.c:1872
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 545 (Thread 0x7ff8770bd700 (LWP 461)):
#0  0x00007ff98a3c67a3 in select () from /lib64/libc.so.6
#1  0x00007ff97d495fba in changelog_ev_dispatch (data=0x7ff951d30248) at changelog-ev-handle.c:349
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 544 (Thread 0x7ff87b1c4700 (LWP 442)):
#0  0x00007ff98a3961ad in nanosleep () from /lib64/libc.so.6
#1  0x00007ff98a396044 in sleep () from /lib64/libc.so.6
#2  0x00007ff97e5d866d in posix_health_check_thread_proc (data=0x7ff92dc6c570) at posix-helpers.c:1872
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 543 (Thread 0x7ff8799c1700 (LWP 445)):
#0  0x00007ff98a3cf923 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ff98bcfdfd2 in event_dispatch_epoll_worker (data=0x7ff92e21a260) at event-epoll.c:638
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 542 (Thread 0x7ff87f2cb700 (LWP 423)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97e5d8b9b in posix_fsyncer_pick (this=this@entry=0x7ff9517353b0, head=head@entry=0x7ff87f2cae80) at posix-helpers.c:2006
#2  0x00007ff97e5d8e25 in posix_fsyncer (d=0x7ff9517353b0) at posix-helpers.c:2105
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 541 (Thread 0x7ff9142f5700 (LWP 434)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff9779c6ec5 in index_worker (data=<optimized out>) at index.c:217
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 540 (Thread 0x7ff9440f1700 (LWP 435)):
#0  0x00007ff98ab06cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97c1cf90c in iot_worker (data=0x7ff92dc9be80) at io-threads.c:180
---Type <return> to continue, or q <return> to quit---




Version-Release number of selected component (if applicable):
-----
3.8.4-34



Steps to Reproduce:
1.have 50 1x3 vols and brick mux enabled
2.mount one of the vol and keep doing IOs
3.restart all volumes for about 200 times

for j in {1..200};do echo "#########################################";date;for i in $(gluster v list);do gluster v stop $i --mode=script;done;ps -ef|grep glusterfsd;echo "############ end of loop $j ########################" ;for k in $(gluster v list);do gluster v start $k;done;done

--- Additional comment from nchilaka on 2017-07-18 02:52:22 EDT ---

Kindly have this code checked in all the xlators, as BZ#1468514 seems like similar bt except that this was hit in different xlator

--- Additional comment from RHEL Product and Program Management on 2017-07-18 02:52:30 EDT ---

This request has been proposed as a blocker, but a release flag has
not been requested. Please set a release flag to ? to ensure we may
track this bug against the appropriate upcoming release, and reset
the blocker flag to ?.

--- Additional comment from nchilaka on 2017-07-18 03:03:21 EDT ---

sosreports and core at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1472129/

--- Additional comment from Rejy M Cyriac on 2017-07-19 03:04:46 EDT ---

At the 'RHGS 3.3.0 - Release Blocker Bug Triage and Status Check' meeting on 19 July, it was decided to ACCEPT this BZ for fix at the RHGS 3.3.0 release

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-07-19 03:04:52 EDT ---

Since this bug has been approved for the RHGS 3.3.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.3.0+', and through the Internal Whiteboard entry of '3.3.0', the Target Release is being automatically set to 'RHGS 3.3.0'

Comment 1 Worker Ant 2017-07-27 04:00:56 UTC
REVIEW: https://review.gluster.org/17888 (changetimerecorder: Brick process crashed at ctr xlator in notify function) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 2 Worker Ant 2017-07-28 03:12:14 UTC
COMMIT: https://review.gluster.org/17888 committed in master by Jeff Darcy (jeff.us) 
------
commit fc0fce2c8f0f4c9ca373fdc754cab6359c9fcdcb
Author: Mohit Agrawal <moagrawa>
Date:   Thu Jul 27 09:25:32 2017 +0530

    changetimerecorder: Brick process crashed at ctr xlator in notify function
    
    Problem: Sometime brick process is getting crash in notify function at the
             time of cleanup db connection while brick mux is enabled.
    
    Solution: In changetimerrecorder (ctr) notify function after cleanup
              db connection set to db_conn to NULL to avoid reuse the same
              db connection again.
    
    Note: Below is the backtrace pattern showing by brick process
          #0  0x00007ff98a30c1f7 in raise () from /lib64/libc.so.6
          #1  0x00007ff98a30d8e8 in abort () from /lib64/libc.so.6
          #2  0x00007ff98a34bf47 in __libc_message () from /lib64/libc.so.6
          #3  0x00007ff98a351b54 in malloc_printerr () from /lib64/libc.so.6
          #4  0x00007ff98a3537aa in _int_free () from /lib64/libc.so.6
          #5  0x00007ff97d95e311 in gf_sql_connection_fini (sql_connection=sql_connection@entry=0x7ff8e8496b50) at gfdb_sqlite3.c:42
          #6  0x00007ff97d95e38a in gf_sqlite3_fini (db_conn=0x7ff92ca04470) at gfdb_sqlite3.c:507
          #7  0x00007ff97d957156 in fini_db (_conn_node=0x7ff92ca04470) at gfdb_data_store.c:326
          #8  0x00007ff97db78679 in notify (this=0x7ff92c5b3670, event=9, data=0x7ff92c5b5a00) at changetimerecorder.c:2178
          #9  0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b3670, event=event@entry=9, data=data@entry=0x7ff92c5b5a00) at xlator.c:549
          #10 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5b5a00, event=9, data=data@entry=0x7ff92c5b6d50) at defaults.c:3139
    
    BUG: 1475632
    Change-Id: Idd4bfdb4629c4799ac477ade81228065212683fb
    Signed-off-by: Mohit Agrawal <moagrawa>
    Reviewed-on: https://review.gluster.org/17888
    Reviewed-by: Jeff Darcy <jeff.us>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

Comment 3 Shyamsundar 2017-12-08 17:35:59 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/