Bug 1599220 - glusterd crashed and core generated at gd_mgmt_v3_unlock_timer_cbk after huge number of volumes were created
Summary: glusterd crashed and core generated at gd_mgmt_v3_unlock_timer_cbk after huge number of volumes were created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 3
Assignee: Sanju
QA Contact: Upasana
URL:
Whiteboard:
Depends On:
Blocks: 1630922
 
Reported: 2018-07-09 09:01 UTC by Nag Pavan Chilakam
Modified: 2019-02-04 07:41 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.12.2-33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1630922 (view as bug list)
Environment:
Last Closed: 2019-02-04 07:41:25 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0263 (last updated 2019-02-04 07:41:37 UTC)

Description Nag Pavan Chilakam 2018-07-09 09:01:40 UTC
Description of problem:
=======================

I was running a script to create and delete volumes over a period of 2 days, and noticed that glusterd crashed after about 2 days.

It looks like glusterd crashed because there were more than 1900 volumes at that point.


I have been creating new volumes (about 50) in every iteration, but not deleting them.
After about 34 iterations, with 1500+ volumes present, the issues below were seen:

1) Volume create failed in the 35th loop for one volume: "volume create: 35-indiaArb-10: failed: Locking failed on rhs-client38.lab.eng.blr.redhat.com. Please check log file for details."

2) One of the nodes, rhs-client38.lab.eng.blr.redhat.com, seems to have got disconnected and rebooted; after that, volume creates involving this node started failing with "failed: Host rhs-client38.lab.eng.blr.redhat.com not connected".
3) glusterd crashed, possibly due to the load on the node where the script was running.
4) A few of the nodes rebooted by themselves (possibly due to load).


Note: I/O was planned for a few volumes, but it did not succeed because of the heavy load on the clients and failed right at the start, so I/O does not come into the picture for this issue.



Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2cc839a44e in gd_mgmt_v3_unlock_timer_cbk () from /usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so
Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
(gdb) bt
#0  0x00007f2cc839a44e in gd_mgmt_v3_unlock_timer_cbk () from /usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so
#1  0x00007f2cd385a982 in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 13 (Thread 0x7f2cd3d34780 (LWP 18999)):
#0  0x00007f2cd26acf47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f2cd38aaaf8 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x000055bed2fd0247 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2550

Thread 12 (Thread 0x7f2cc3574700 (LWP 19008)):
#0  0x00007f2cd1f75113 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f2cd38aa392 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f2c0efb5700 (LWP 6240)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f2c0e7b4700 (LWP 6241)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f2c11fbb700 (LWP 3257)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f2cca64b700 (LWP 19001)):
#0  0x00007f2cd26b3411 in sigwait () from /lib64/libpthread.so.0
#1  0x000055bed2fd352b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2137
#2  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f2cc3d75700 (LWP 19007)):
#0  0x00007f2cd26af945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cc839614b in hooks_worker () from /usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so
#2  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f2c0f7b6700 (LWP 6239)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f2cc9e4a700 (LWP 19002)):
#0  0x00007f2cd1f3b4fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f2cd1f3b394 in sleep () from /lib64/libc.so.6
#2  0x00007f2cd387520d in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f2c117ba700 (LWP 3258)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f2c0ffb7700 (LWP 6238)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f2c107b8700 (LWP 6237)):
#0  0x00007f2cd26afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2cd3887e68 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f2cd3888d30 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2ccae4c700 (LWP 19000)):
#0  0x00007f2cc839a44e in gd_mgmt_v3_unlock_timer_cbk () from /usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so
#1  0x00007f2cd385a982 in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007f2cd26abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f2cd1f74b3d in clone () from /lib64/libc.so.6
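
For context: gf_timer_proc() is the timer thread in libglusterfs; when a registered timeout expires it calls the callback with the opaque data pointer that was registered along with the timer, and gd_mgmt_v3_unlock_timer_cbk() is the glusterd callback that releases an mgmt_v3 lock whose holder did not unlock it in time. The trace is consistent with that callback dereferencing a lock context that is NULL or already torn down by the time the timer fires. Below is a minimal, self-contained C sketch of that failure mode; the struct and helper names are hypothetical stand-ins, not glusterd source.

/* Minimal model of a timer callback crashing on a NULL/stale context.
 * All names below are hypothetical; this is not glusterd code. */
#include <stdio.h>

struct unlock_timer_ctx {
    char key[128];   /* e.g. "<volname>_vol", the lock being timed out */
    char owner[64];  /* peer that still holds the lock */
};

/* Timer callbacks receive the opaque pointer registered with the timer. */
typedef void (*timer_cbk_t)(void *data);

static void
unlock_timer_cbk(void *data)
{
    struct unlock_timer_ctx *ctx = data;

    /* If ctx was freed or never populated (e.g. cleaned up by a racing
     * code path before the timeout fired), this dereference faults --
     * the same class of SIGSEGV seen in gd_mgmt_v3_unlock_timer_cbk. */
    printf("releasing lock %s held by %s\n", ctx->key, ctx->owner);
}

static void
timer_thread_fire(timer_cbk_t cbk, void *data)
{
    cbk(data);   /* gf_timer_proc() essentially does this on expiry */
}

int
main(void)
{
    timer_thread_fire(unlock_timer_cbk, NULL);   /* segfaults, as in the core */
    return 0;
}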


Version-Release number of selected component (if applicable):


How reproducible:
--------------
hit it once

Steps to Reproduce:
1. Have a 6-node cluster.
2. Run a script to create and start volumes in batches, over many iterations (script will be attached).
3. The above crash was hit.

Comment 4 Nag Pavan Chilakam 2018-07-09 13:58:21 UTC
sosreports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599220/


Some sosreports may be missing because glusterd was taking a very long time to respond due to the number of volumes.

The core is on rhs-client18.

Comment 5 Nag Pavan Chilakam 2018-07-09 13:59:56 UTC
Also, each node has a test-logs directory with the top command output.

Comment 10 Atin Mukherjee 2018-10-12 06:12:22 UTC
Let us pull in https://review.gluster.org/21228 anyway, as a bullet-proof fix, to avoid this crash. As I mentioned earlier, the test done here is not within a supported configuration.
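
The review itself is not quoted in this bug, so purely as an illustration (and not the actual content of https://review.gluster.org/21228): a "bullet proof" fix for this class of crash usually means validating the registered context inside the timer callback before touching it, and returning early if it is gone. A short sketch along those lines, reusing the hypothetical unlock_timer_ctx from the description above:

/* Illustrative defensive guard only; not the patch from review 21228. */
#include <stdio.h>

struct unlock_timer_ctx {
    char key[128];
    char owner[64];
};

static void
unlock_timer_cbk(void *data)
{
    struct unlock_timer_ctx *ctx = data;

    if (!ctx || !ctx->key[0]) {
        /* Context already freed or never populated: log and bail out
         * instead of dereferencing it. */
        fprintf(stderr, "unlock timer fired without a valid context, ignoring\n");
        return;
    }

    fprintf(stderr, "releasing lock %s held by %s\n", ctx->key, ctx->owner);
}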

Comment 15 Nag Pavan Chilakam 2018-12-27 10:16:29 UTC
For verifying this bug, should we still be testing with creation of about 2000 volumes?
Is there anything else specific to be tested with the fix?

Comment 16 Sanju 2018-12-27 12:49:02 UTC
(In reply to nchilaka from comment #15)
> For verifying this bug, should we still be testing with creation of about
> 2000 volumes?

That should be sufficient.
> Is there anything else specific to be tested with the fix?

Nope.

Thanks,
Sanju

Comment 22 errata-xmlrpc 2019-02-04 07:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263

