Bug 1217589 - glusterd crashed while scheduler was creating snapshots when bit rot was enabled on the volumes
Summary: glusterd crashed while scheduler was creating snapshots when bit rot was enab...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: qe_tracker_everglades 1224189
 
Reported: 2015-04-30 18:16 UTC by senaik
Modified: 2016-06-22 05:17 UTC (History)
3 users

Fixed In Version:
Clone Of:
: 1224189 (view as bug list)
Environment:
Last Closed: 2016-06-22 05:17:47 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description senaik 2015-04-30 18:16:57 UTC
Description of problem:
======================
Enabled bit rot on the volumes and scheduled snapshots to be created every 5 minutes on each volume; the first snapshot creation failed because glusterd crashed.


Version-Release number of selected component (if applicable):
=============================================================
gluster --version
glusterfs 3.7.0alpha0 built on Apr 28 2015 01:55:23


How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create 3 volumes - an 8x2 volume with a replica 2 hot tier attached, a 12-brick disperse volume with redundancy 4, and a 3-brick distribute volume

2. Enable USS, quota and bit rot on all volumes

3. Fuse and NFS mount all volumes

4. Initialise the scheduler on all nodes and enable it

5. Add 3 jobs to create snapshots on the 3 volumes every 5 mins
[root@rhs-arch-srv4 ~]# snap_scheduler.py list
JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME      
--------------------------------------------------------------------
J1_vol0          */5 * * * *      Snapshot Create  vol0             
J1_vol1          */5 * * * *      Snapshot Create  vol1             
J1_vol2          */5 * * * *      Snapshot Create  vol2 

6. First snapshot create failed:

[2015-04-30 23:30:01,156 gcron.py:67 takeSnap] DEBUG Running command 'gluster snapshot create Scheduled-J1_vol1-vol1 vol1'
[2015-04-30 23:30:01,162 gcron.py:95 doJob] DEBUG /var/run/gluster/shared_storage/snaps/lock_files/J1_vol2 last modified at Thu Apr 30 23:25:08 2015
[2015-04-30 23:30:01,162 gcron.py:97 doJob] DEBUG Processing job Scheduled-J1_vol2-vol2
[2015-04-30 23:30:01,163 gcron.py:67 takeSnap] DEBUG Running command 'gluster snapshot create Scheduled-J1_vol2-vol2 vol2'
[2015-04-30 23:30:06,827 gcron.py:74 takeSnap] DEBUG Command 'gluster snapshot create Scheduled-J1_vol1-vol1 vol1' returned '1'
[2015-04-30 23:30:06,830 gcron.py:74 takeSnap] DEBUG Command 'gluster snapshot create Scheduled-J1_vol2-vol2 vol2' returned '1'
[2015-04-30 23:30:06,832 gcron.py:74 takeSnap] DEBUG Command 'gluster snapshot create Scheduled-J1_vol0-vol0 vol0' returned '1'
[2015-04-30 23:30:06,830 gcron.py:77 takeSnap] ERROR Snapshot of vol2 failed
[2015-04-30 23:30:06,828 gcron.py:77 takeSnap] ERROR Snapshot of vol1 failed
[2015-04-30 23:30:06,833 gcron.py:77 takeSnap] ERROR Snapshot of vol0 failed
[2015-04-30 23:30:06,838 gcron.py:78 takeSnap] ERROR Command output:
[2015-04-30 23:30:06,838 gcron.py:78 takeSnap] ERROR Command output:
[2015-04-30 23:30:06,838 gcron.py:78 takeSnap] ERROR Command output:
[2015-04-30 23:30:06,839 gcron.py:79 takeSnap] ERROR snapshot create: failed: quorum is not met

[2015-04-30 23:30:06,839 gcron.py:79 takeSnap] ERROR snapshot create: failed: One or more bricks may be down.

[2015-04-30 23:30:06,839 gcron.py:79 takeSnap] ERROR snapshot create: failed: quorum is not met

[2015-04-30 23:30:06,839 gcron.py:101 doJob] ERROR Job Scheduled-J1_vol2-vol2 failed
[2015-04-30 23:30:06,839 gcron.py:101 doJob] ERROR Job Scheduled-J1_vol1-vol1 failed
[2015-04-30 23:30:06,839 gcron.py:101 doJob] ERROR Job Scheduled-J1_vol0-vol0 failed
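For context, the takeSnap flow visible in the log above - run the gluster CLI, check the return code, and log the failure output - can be sketched roughly as follows. This is an illustrative approximation, not the actual gcron.py source; the `gluster_bin` parameter is added here only so the sketch can be exercised without a gluster installation.

```python
import logging
import subprocess

log = logging.getLogger("gcron-sketch")

def take_snap(volname, jobname, gluster_bin="gluster"):
    """Run 'gluster snapshot create' for a scheduled job and report
    success or failure, mirroring the takeSnap log lines above."""
    snapname = "Scheduled-%s-%s" % (jobname, volname)
    cmd = [gluster_bin, "snapshot", "create", snapname, volname]
    log.debug("Running command '%s'", " ".join(cmd))
    proc = subprocess.run(cmd, capture_output=True, text=True)
    log.debug("Command returned '%d'", proc.returncode)
    if proc.returncode != 0:
        # Matches the "Snapshot of <vol> failed" / "Command output:" errors
        # seen in the log when glusterd was down.
        log.error("Snapshot of %s failed", volname)
        log.error("Command output: %s", proc.stderr)
        return False
    return True
```

In the failure above all three jobs took this error path because glusterd itself had crashed, so the CLI returned '1' with quorum errors.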

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-04-30 17:45:22
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.0alpha0
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x32d3621dc6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x32d363dadf]
/lib64/libc.so.6[0x33f6c326a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7f07aa97ad16]
/usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_commit+0x1c2)[0x7f07aac78392]
/usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_initiate_snap_phases+0x748)[0x7f07aac7c1b8]
/usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so(glusterd_handle_snapshot_create+0x4c0)[0x7f07aac67b50]
/usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so(glusterd_handle_snapshot_fn+0x821)[0x7f07aac73901]
/usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7f07aabbee5f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x32d3661d12]
/lib64/libc.so.6[0x33f6c438f0]


10.70.34.50:
===========
core.11333

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/lib/glusterd'.
Program terminated with signal 7, Bus error.
#0  0x00007fbce0242c54 in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
Missing separate debuginfos, use: debuginfo-install glusterfs-3.7.0alpha0-0.17.gited96153.el6.x86_64
(gdb) bt
#0  0x00007fbce0242c54 in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
#1  0x00000032d3a09e64 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#2  0x00000032d3a0b7b8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#3  0x00007fbce14bb632 in ?? ()
   from /usr/lib64/glusterfs/3.7.0alpha0/rpc-transport/socket.so
#4  0x00000032d367d060 in ?? () from /usr/lib64/libglusterfs.so.0
#5  0x00000033f70079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000033f6ce89dd in clone () from /lib64/libc.so.6

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
core.23765 - tracked by BZ 1211640

(gdb) bt
#0  0x00007f07aa97ad16 in rcu_read_unlock_bp () from /usr/lib64/liburcu-bp.so.1
#1  0x00007f07aac78392 in glusterd_mgmt_v3_commit ()
   from /usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so
#2  0x00007f07aac7c1b8 in glusterd_mgmt_v3_initiate_snap_phases ()
   from /usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so
#3  0x00007f07aac67b50 in glusterd_handle_snapshot_create ()
   from /usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so
#4  0x00007f07aac73901 in glusterd_handle_snapshot_fn ()
   from /usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so
#5  0x00007f07aabbee5f in glusterd_big_locked_handler ()
   from /usr/lib64/glusterfs/3.7.0alpha0/xlator/mgmt/glusterd.so
#6  0x00000032d3661d12 in synctask_wrap () from /usr/lib64/libglusterfs.so.0
#7  0x00000033f6c438f0 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()


10.70.36.2 :
===========
core.19223

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/lib/glusterd'.
Program terminated with signal 7, Bus error.
#0  0x00007f87d54d3c54 in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
Missing separate debuginfos, use: debuginfo-install glusterfs-3.7.0alpha0-0.17.gited96153.el6.x86_64
(gdb) bt
#0  0x00007f87d54d3c54 in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
#1  0x0000003588e09e64 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#2  0x0000003588e0b7b8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#3  0x00007f87d674c632 in ?? ()
   from /usr/lib64/glusterfs/3.7.0alpha0/rpc-transport/socket.so
#4  0x0000003588a7d060 in ?? () from /usr/lib64/libglusterfs.so.0
#5  0x0000003a968079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003a964e89dd in clone () from /lib64/libc.so.6

10.70.36.4:
==========
core.24094

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/lib/glusterd'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
Missing separate debuginfos, use: debuginfo-install glusterfs-3.7.0alpha0-0.17.gited96153.el6.x86_64
(gdb) bt
#0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
#1  0x0000003efac3d7ed in gf_print_trace () from /usr/lib64/libglusterfs.so.0
#2  <signal handler called>
#3  0x00007f6b6400e820 in ?? ()
#4  0x00007f6b7cb5cc0a in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
#5  0x0000003efb408425 in rpcsvc_handle_disconnect ()
   from /usr/lib64/libgfrpc.so.0
#6  0x0000003efb409f60 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#7  0x0000003efb40b7b8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#8  0x00007f6b7ddd86a1 in ?? ()
   from /usr/lib64/glusterfs/3.7.0alpha0/rpc-transport/socket.so
#9  0x0000003efac7d060 in ?? () from /usr/lib64/libglusterfs.so.0
#10 0x00000035324079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000035320e89dd in clone () from /lib64/libc.so.6

Actual results:


Expected results:


Additional info:

Comment 2 senaik 2015-05-05 07:00:51 UTC
Proposing this bug as a blocker.

Comment 3 Atin Mukherjee 2015-05-05 12:19:38 UTC
The backtrace of this crash is the same as that of BZ 1211640, so marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1211640 ***

Comment 4 senaik 2015-05-05 13:46:55 UTC
Atin,

Reopening as I needed some clarification.

Core 19223 and 23765 are related/tracked by the bug you mentioned above and another bug (bug 1207146).

However, the core on 10.70.36.4:
==========
core.24094

seems to be different, and its backtrace differs from those of the other two cores. Can you please check and clarify?

Reposting the backtrace for clarity.

10.70.36.4:
==========
core.24094

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/lib/glusterd'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
Missing separate debuginfos, use: debuginfo-install glusterfs-3.7.0alpha0-0.17.gited96153.el6.x86_64
(gdb) bt
#0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
#1  0x0000003efac3d7ed in gf_print_trace () from /usr/lib64/libglusterfs.so.0
#2  <signal handler called>
#3  0x00007f6b6400e820 in ?? ()
#4  0x00007f6b7cb5cc0a in gf_changelog_reborp_rpcsvc_notify ()
   from /usr/lib64/libgfchangelog.so.0
#5  0x0000003efb408425 in rpcsvc_handle_disconnect ()
   from /usr/lib64/libgfrpc.so.0
#6  0x0000003efb409f60 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
#7  0x0000003efb40b7b8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#8  0x00007f6b7ddd86a1 in ?? ()
   from /usr/lib64/glusterfs/3.7.0alpha0/rpc-transport/socket.so
#9  0x0000003efac7d060 in ?? () from /usr/lib64/libglusterfs.so.0
#10 0x00000035324079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000035320e89dd in clone () from /lib64/libc.so.6
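The duplicate-or-not triage being discussed in these comments comes down to comparing the function names in the top frames of the gdb backtraces. A rough sketch of automating that comparison is below; this is a hypothetical helper for illustration, not part of any gluster tooling, and it only looks at frame symbol names, ignoring addresses and libraries.

```python
import re

def frame_functions(bt_text):
    """Extract function names from gdb 'bt' output lines such as
    '#4  0x00007f... in gf_changelog_reborp_rpcsvc_notify ()'."""
    funcs = []
    for line in bt_text.splitlines():
        # Optional address ('0x... in') handles both '#0 0x... in foo ()'
        # and '#1 foo ()' styles; '<signal handler called>' lines are skipped.
        m = re.match(r"#\d+\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s*\(", line)
        if m:
            funcs.append(m.group(1))
    return funcs

def same_crash(bt_a, bt_b, depth=4):
    """Treat two backtraces as the same crash if their top frames match."""
    return frame_functions(bt_a)[:depth] == frame_functions(bt_b)[:depth]
```

By this measure the two bus-error cores (core.11333 and core.19223) match frame for frame, while core.24094's top frames differ, which is why it was queried separately.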

Comment 5 Gaurav Kumar Garg 2015-05-15 06:20:16 UTC
Hi Seema,

core.24094 is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1207146, so it is a bitrot crash core, which is a known issue; it is not a glusterd crash core.

The glusterd crash is fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1211640, and the patch for that bug has already been merged.

Could you reproduce this bug again and let us know what is crashing - glusterd or bitrot? We need more information regarding this.

Comment 6 senaik 2015-05-15 06:42:37 UTC
Gaurav, 

As mentioned in comment 4, the backtrace of core.24094 and the one reported in BZ 1207146 look different.

I also hit both the glusterd and bitd crashes, which are tracked by BZ 1207146 and BZ 1211640, but core.24094 looks different from what is reported in either of those bugs.

Request you to please analyse core.24094.

Please find the sosreports below: 
================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/snapshots/1217589/

Comment 7 Atin Mukherjee 2015-05-15 18:18:07 UTC
(In reply to senaik from comment #4)
> Atin,
> 
> Reopening as I needed some clarification.
> 
> Core 19223 and 23765 are related/tracked by the bug you mentioned above and
> another bug (bug 1207146).
> 
> However, the core on 10.70.36.4:
> ==========
> core.24094
> 
> seems to be different and the backtrace is different than the other two
> cores. Can you please check that and clarify?
> 
> Reposting the backtrace for clarity.
> 
> 10.70.36.4:
> ==========
> core.24094
> 
> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
> gluster/bitd -p /var/lib/glusterd'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
> Missing separate debuginfos, use: debuginfo-install
> glusterfs-3.7.0alpha0-0.17.gited96153.el6.x86_64
> (gdb) bt
> #0  0x0000003efac21734 in gf_log_flush () from /usr/lib64/libglusterfs.so.0
> #1  0x0000003efac3d7ed in gf_print_trace () from /usr/lib64/libglusterfs.so.0
> #2  <signal handler called>
> #3  0x00007f6b6400e820 in ?? ()
> #4  0x00007f6b7cb5cc0a in gf_changelog_reborp_rpcsvc_notify ()
>    from /usr/lib64/libgfchangelog.so.0
> #5  0x0000003efb408425 in rpcsvc_handle_disconnect ()
>    from /usr/lib64/libgfrpc.so.0
> #6  0x0000003efb409f60 in rpcsvc_notify () from /usr/lib64/libgfrpc.so.0
> #7  0x0000003efb40b7b8 in rpc_transport_notify () from
> /usr/lib64/libgfrpc.so.0
> #8  0x00007f6b7ddd86a1 in ?? ()
>    from /usr/lib64/glusterfs/3.7.0alpha0/rpc-transport/socket.so
> #9  0x0000003efac7d060 in ?? () from /usr/lib64/libglusterfs.so.0
> #10 0x00000035324079d1 in start_thread () from /lib64/libpthread.so.0
> #11 0x00000035320e89dd in clone () from /lib64/libc.so.6


Seema,

I believe Gaurav has already clarified about it. Clearing the needinfo.

Thanks,
Atin

Comment 8 Atin Mukherjee 2015-05-15 18:20:18 UTC
Seema,

The backtrace of bug 1207146 looks pretty similar to the one you hit. 1207146 is in MODIFIED state, but I am unable to find any patch against it. Could you retest and see if you are still hitting the crash?

Thanks,
Atin

Comment 9 senaik 2015-05-18 06:48:31 UTC
Atin, 

I'd like you to post the patch details in the bug and move it to ON_QA if you are sure it is fixed. I'm in the middle of another run, and it might take some time before I can get back to this.

Comment 10 Atin Mukherjee 2015-05-18 06:51:42 UTC
Seema,

Unfortunately I don't have information on the patch which solved BZ 1207146; the bitrot team can comment on it.

