Bug 1472129 - Brick Multiplexing: Brick process crashed at changetimerecorder(ctr) translator when restarting volumes
Summary: Brick Multiplexing: Brick process crashed at changetimerecorder(ctr) translat...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: RHGS 3.3.0
Assignee: Mohit Agrawal
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: brick-multiplexing
Depends On:
Blocks: 1417151 1475632 1476109
TreeView+ depends on / blocked
 
Reported: 2017-07-18 06:49 UTC by Nag Pavan Chilakam
Modified: 2017-09-21 05:02 UTC (History)
3 users (show)

Fixed In Version: glusterfs-3.8.4-37
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1475632 1476109 (view as bug list)
Environment:
Last Closed: 2017-09-21 05:02:13 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-07-18 06:49:06 UTC
Description of problem:
====================
As part of verification of BZ#1468514 I was doing below
Had 50 volumes all 1x3 with brick mux enabled, stopped all of them in sequence and then started them again in sequence, did this for 200 times.
Seems to have crashed at around ~170th loop


#for j in {1..200};do echo "#########################################";date;for i in $(gluster v list);do gluster v stop $i --mode=script;done;ps -ef|grep glusterfsd;echo "############ end of loop $j ########################" ;for k in $(gluster v list);do gluster v start $k;done;done



also was doing IO to only one volume throughout, ie vol_9
#for i in {1..1000};do mkdir dir.$i;for j in {1..500000};do touch dir.$i/file.$i.$j ;done;done


As discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1468514#c22
raising a new bug as this was hit in ctr xlator code







#0  0x00007ff98a30c1f7 in raise () from /lib64/libc.so.6
#1  0x00007ff98a30d8e8 in abort () from /lib64/libc.so.6
#2  0x00007ff98a34bf47 in __libc_message () from /lib64/libc.so.6
#3  0x00007ff98a351b54 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ff98a3537aa in _int_free () from /lib64/libc.so.6
#5  0x00007ff97d95e311 in gf_sql_connection_fini (sql_connection=sql_connection@entry=0x7ff8e8496b50) at gfdb_sqlite3.c:42
#6  0x00007ff97d95e38a in gf_sqlite3_fini (db_conn=0x7ff92ca04470) at gfdb_sqlite3.c:507
#7  0x00007ff97d957156 in fini_db (_conn_node=0x7ff92ca04470) at gfdb_data_store.c:326
#8  0x00007ff97db78679 in notify (this=0x7ff92c5b3670, event=9, data=0x7ff92c5b5a00) at changetimerecorder.c:2178
#9  0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b3670, event=event@entry=9, data=data@entry=0x7ff92c5b5a00) at xlator.c:549
#10 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5b5a00, event=9, data=data@entry=0x7ff92c5b6d50) at defaults.c:3139
#11 0x00007ff97d48d0e3 in notify (this=0x7ff92c5b5a00, event=<optimized out>, data=0x7ff92c5b6d50) at changelog.c:2312
#12 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b5a00, event=event@entry=9, data=data@entry=0x7ff92c5b6d50) at xlator.c:549
#13 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5b6d50, event=9, data=<optimized out>) at defaults.c:3139
#14 0x00007ff97d059181 in notify (this=0x7ff92c5b6d50, event=<optimized out>, data=<optimized out>) at bit-rot-stub.c:243
#15 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b6d50, event=event@entry=9, data=data@entry=0x7ff92c5b7e00) at xlator.c:549
#16 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b7e00, event=9, data=<optimized out>) at defaults.c:3139
#17 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b7e00, event=event@entry=9, data=data@entry=0x7ff92c5b8c40) at xlator.c:549
#18 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b8c40, event=9, data=<optimized out>) at defaults.c:3139
#19 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b8c40, event=event@entry=9, data=data@entry=0x7ff92c5b9a00) at xlator.c:549
#20 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5b9a00, event=9, data=<optimized out>) at defaults.c:3139
#21 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5b9a00, event=event@entry=9, data=data@entry=0x7ff92c5bab00) at xlator.c:549
#22 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bab00, event=9, data=<optimized out>) at defaults.c:3139
#23 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bab00, event=event@entry=9, data=data@entry=0x7ff92c5bbb40) at xlator.c:549
#24 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bbb40, event=9, data=<optimized out>) at defaults.c:3139
#25 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bbb40, event=event@entry=9, data=data@entry=0x7ff92c5bcb70) at xlator.c:549
#26 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bcb70, event=9, data=<optimized out>) at defaults.c:3139
#27 0x00007ff97c3e9a32 in notify (this=0x7ff92c5bcb70, event=<optimized out>, data=0x7ff92c5bdba0) at upcall.c:2511
#28 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bcb70, event=event@entry=9, data=data@entry=0x7ff92c5bdba0) at xlator.c:549
#29 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5bdba0, event=9, data=<optimized out>) at defaults.c:3139
#30 0x00007ff97c1d19e3 in notify (this=0x7ff92c5bdba0, event=<optimized out>, data=<optimized out>) at io-threads.c:1053
#31 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bdba0, event=event@entry=9, data=data@entry=0x7ff92c5bea90) at xlator.c:549
#32 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5bea90, event=9, data=<optimized out>) at defaults.c:3139
#33 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5bea90, event=event@entry=9, data=data@entry=0x7ff92c5c06e0) at xlator.c:549
#34 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c06e0, event=event@entry=9, data=data@entry=0x7ff92c5c1890) at defaults.c:3139
#35 0x00007ff977bd84fb in notify (this=0x7ff92c5c06e0, event=9, data=0x7ff92c5c1890) at barrier.c:534
#36 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c06e0, event=event@entry=9, data=data@entry=0x7ff92c5c1890) at xlator.c:549
#37 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c1890, event=9, data=<optimized out>) at defaults.c:3139
#38 0x00007ff9779ccd98 in notify (this=0x7ff92c5c1890, event=<optimized out>, data=<optimized out>) at index.c:2530
#39 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c1890, event=event@entry=9, data=data@entry=0x7ff92c5c2dc0) at xlator.c:549
#40 0x00007ff98bd3ac12 in default_notify (this=0x7ff92c5c2dc0, event=9, data=<optimized out>) at defaults.c:3139
#41 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c2dc0, event=event@entry=9, data=data@entry=0x7ff92c5c4230) at xlator.c:549
#42 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c4230, event=9, data=data@entry=0x7ff92c5c5880) at defaults.c:3139
#43 0x00007ff97758f3da in notify (this=0x7ff92c5c4230, event=<optimized out>, data=0x7ff92c5c5880) at io-stats.c:4150
#44 0x00007ff98bca0dc2 in xlator_notify (xl=0x7ff92c5c4230, event=event@entry=9, data=data@entry=0x7ff92c5c5880) at xlator.c:549
#45 0x00007ff98bd3ac12 in default_notify (this=this@entry=0x7ff92c5c5880, event=event@entry=9, data=data@entry=0x7ff92c5c5880) at defaults.c:3139
#46 0x00007ff97711cd9f in notify (this=<optimized out>, event=<optimized out>, data=0x7ff92c5c5880) at server.c:1587
#47 0x000056273ffcf387 in glusterfs_handle_terminate (req=0x7ff8e8002930) at glusterfsd-mgmt.c:263
#48 0x00007ff98bcd9f02 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#49 0x00007ff98a31dd40 in ?? () from /lib64/libc.so.6






Thread 548 (Thread 0x7ff8f44f5700 (LWP 481)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97d052d4b in br_stub_worker (data=<optimized out>) at bit-rot-stub-helpers.c:369
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 547 (Thread 0x7ff872fb6700 (LWP 482)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97d495d73 in changelog_ev_connector (data=0x7ff92e267438) at changelog-ev-handle.c:202
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 546 (Thread 0x7ff870fb2700 (LWP 488)):
#0  0x00007ff98a3961ad in nanosleep () from /lib64/libc.so.6
#1  0x00007ff98a396044 in sleep () from /lib64/libc.so.6
#2  0x00007ff97e5d866d in posix_health_check_thread_proc (data=0x7ff92e21b0a0) at posix-helpers.c:1872
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 545 (Thread 0x7ff8770bd700 (LWP 461)):
#0  0x00007ff98a3c67a3 in select () from /lib64/libc.so.6
#1  0x00007ff97d495fba in changelog_ev_dispatch (data=0x7ff951d30248) at changelog-ev-handle.c:349
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 544 (Thread 0x7ff87b1c4700 (LWP 442)):
#0  0x00007ff98a3961ad in nanosleep () from /lib64/libc.so.6
#1  0x00007ff98a396044 in sleep () from /lib64/libc.so.6
#2  0x00007ff97e5d866d in posix_health_check_thread_proc (data=0x7ff92dc6c570) at posix-helpers.c:1872
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 543 (Thread 0x7ff8799c1700 (LWP 445)):
#0  0x00007ff98a3cf923 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ff98bcfdfd2 in event_dispatch_epoll_worker (data=0x7ff92e21a260) at event-epoll.c:638
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 542 (Thread 0x7ff87f2cb700 (LWP 423)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97e5d8b9b in posix_fsyncer_pick (this=this@entry=0x7ff9517353b0, head=head@entry=0x7ff87f2cae80) at posix-helpers.c:2006
#2  0x00007ff97e5d8e25 in posix_fsyncer (d=0x7ff9517353b0) at posix-helpers.c:2105
#3  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 541 (Thread 0x7ff9142f5700 (LWP 434)):
#0  0x00007ff98ab06945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff9779c6ec5 in index_worker (data=<optimized out>) at index.c:217
#2  0x00007ff98ab02e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ff98a3cf34d in clone () from /lib64/libc.so.6

Thread 540 (Thread 0x7ff9440f1700 (LWP 435)):
#0  0x00007ff98ab06cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff97c1cf90c in iot_worker (data=0x7ff92dc9be80) at io-threads.c:180
---Type <return> to continue, or q <return> to quit---




Version-Release number of selected component (if applicable):
-----
3.8.4-34



Steps to Reproduce:
1.have 50 1x3 vols and brick mux enabled
2.mount one of the vol and keep doing IOs
3.restart all volumes for about 200 times

for j in {1..200};do echo "#########################################";date;for i in $(gluster v list);do gluster v stop $i --mode=script;done;ps -ef|grep glusterfsd;echo "############ end of loop $j ########################" ;for k in $(gluster v list);do gluster v start $k;done;done

Comment 3 Nag Pavan Chilakam 2017-07-18 07:03:21 UTC
sosreports and core at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1472129/

Comment 6 Mohit Agrawal 2017-07-27 04:02:55 UTC
Patch is posted  https://review.gluster.org/17888  on upstream.

Comment 7 Atin Mukherjee 2017-07-28 06:57:56 UTC
upstream 3.12 patch : https://review.gluster.org/#/c/17902/
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/113754/

Comment 9 Nag Pavan Chilakam 2017-08-07 05:56:05 UTC
tested on 3.8.4-37
Ran for 200 times the loop of start and stop of 50 volumes. I didn;t hit the crash,
hence moving to verified

Steps to Reproduce:
1.have 50 1x3 vols and brick mux enabled
2.mount one of the vol and keep doing IOs
3.restart all volumes for about 200 times

for j in {1..200};do echo "#########################################";date;for i in $(gluster v list);do gluster v stop $i --mode=script;done;ps -ef|grep glusterfsd;echo "############ end of loop $j ########################" ;for k in $(gluster v list);do gluster v start $k;done;done

Comment 11 errata-xmlrpc 2017-09-21 05:02:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


Note You need to log in before you can comment on or make changes to this bug.