Description of problem:
=======================
Edited 2 jobs to create snapshots on 2 different volumes at the same time; the snapshot creation failed and glusterd crashed.

Version-Release number of selected component (if applicable):
=============================================================
gluster --version
glusterfs 3.7dev built on Apr 13 2015 07:14:27

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create 2 volumes (vol0 and vol1) and start them. Fuse and NFS mount the volumes and run some I/O.
2. Create another shared-storage volume and fuse mount it on all nodes.
3. Initialise the snapshot scheduler on all nodes using snap_scheduler.py init.
4. Enable the snap scheduler on one of the nodes using snap_scheduler.py enable.
5. Add a job that creates a snapshot of vol0 at 18:20, and another job that creates a snapshot of vol1 at 18:20:
   snap_scheduler.py add "J1_vol0" "20 18 * * * " "vol0"
   snap_scheduler.py add "J1_vol1" "20 18 * * * " "vol1"
   Snapshots were created successfully on both volumes:
   gluster snapshot list
   Scheduled-J1_vol0-vol0_GMT-2015.04.14-12.50.01
   Scheduled-J1_vol1-vol1_GMT-2015.04.14-12.50.01
6. Edit both jobs to create snapshots at 18:36:
   snap_scheduler.py edit "J1_vol1" "36 18 * * * " "vol1"
   snap_scheduler.py edit "J1_vol0" "36 18 * * * " "vol0"
   Snapshot creation failed.

-------------Part of /var/log/glusterfs/gcron.log----------
[2015-04-14 18:36:01,350 gcron.py:67 takeSnap] DEBUG Running command 'gluster snapshot create Scheduled-J1_vol0-vol0 vol0'
[2015-04-14 18:36:01,351 gcron.py:95 doJob] DEBUG /var/run/gluster/shared_storage/snaps/lock_files/J1_vol1 last modified at Tue Apr 14 18:20:26 2015
[2015-04-14 18:36:01,351 gcron.py:97 doJob] DEBUG Processing job Scheduled-J1_vol1-vol1
[2015-04-14 18:36:01,352 gcron.py:67 takeSnap] DEBUG Running command 'gluster snapshot create Scheduled-J1_vol1-vol1 vol1'
[2015-04-14 18:36:20,009 gcron.py:74 takeSnap] DEBUG Command 'gluster snapshot create Scheduled-J1_vol0-vol0 vol0' returned '1'
[2015-04-14 18:36:20,009 gcron.py:74 takeSnap] DEBUG Command 'gluster snapshot create Scheduled-J1_vol1-vol1 vol1' returned '1'
[2015-04-14 18:36:20,010 gcron.py:77 takeSnap] ERROR Snapshot of vol0 failed
[2015-04-14 18:36:20,014 gcron.py:78 takeSnap] ERROR Command output:
[2015-04-14 18:36:20,014 gcron.py:79 takeSnap] ERROR
[2015-04-14 18:36:20,014 gcron.py:101 doJob] ERROR Job Scheduled-J1_vol0-vol0 failed
[2015-04-14 18:36:20,010 gcron.py:77 takeSnap] ERROR Snapshot of vol1 failed
[2015-04-14 18:36:20,019 gcron.py:78 takeSnap] ERROR Command output:
[2015-04-14 18:36:20,020 gcron.py:79 takeSnap] ERROR
[2015-04-14 18:36:20,020 gcron.py:101 doJob] ERROR Job Scheduled-J1_vol1-vol1 failed
------------------------------------------------------------

Actual results:
===============
glusterd crashed.

Expected results:
=================
There should be no crash observed.

Additional info:
================
[2015-04-14 12:38:44.527287] W [socket.c:642:__socket_rwv] 0-quotad: readv on /var/run/gluster/7089eb2213ea459a8a12ba56023bd163.socket failed (No data available)
[2015-04-14 12:38:47.918396] W [socket.c:642:__socket_rwv] 0-quotad: readv on /var/run/gluster/7089eb2213ea459a8a12ba56023bd163.socket failed (No data available)
The message "I [MSGID: 106006] [glusterd-snapd-svc.c:379:glusterd_snapdsvc_rpc_notify] 0-management: snapd has disconnected from glusterd." repeated 2 times between [2015-04-14 12:38:34.733354] and [2015-04-14 12:38:40.636734]
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 3 times between [2015-04-14 12:37:56.254712] and [2015-04-14 12:38:40.641386]

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-04-14 13:06:18
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev

(gdb) bt
#0  0x00007fe862f85d16 in rcu_read_unlock_bp () from /usr/lib64/liburcu-bp.so.1
#1  0x00007fe863283252 in glusterd_mgmt_v3_commit (op=GD_OP_SNAP, op_ctx=0x7fe800000003, req_dict=0x7fe85c344c4c, op_errstr=0x7fe85879ace8, txn_generation=3) at glusterd-mgmt.c:1232
#2  0x00007fe863287078 in glusterd_mgmt_v3_initiate_snap_phases (req=0x1345108, op=GD_OP_SNAP, dict=0x7fe85c45d87c) at glusterd-mgmt.c:1998
#3  0x00007fe863272a10 in glusterd_handle_snapshot_create (req=0x1345108, op=GD_OP_SNAP, dict=0x7fe85c45d87c, err_str=<value optimized out>, len=140635893514400) at glusterd-snapshot.c:3763
#4  0x00007fe86327e7c1 in glusterd_handle_snapshot_fn (req=0x1345108) at glusterd-snapshot.c:8305
#5  0x00007fe8631c9d7f in glusterd_big_locked_handler (req=0x1345108, actor_fn=0x7fe86327dfa0 <glusterd_handle_snapshot_fn>) at glusterd-handler.c:83
#6  0x0000003abac61c72 in synctask_wrap (old_task=<value optimized out>) at syncop.c:375
#7  0x0000003a964438f0 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
The crash is seen in rcu_read_unlock_bp(), which is unrelated to the change of snapshot schedules. Moving it to the glusterd core team.
http://review.gluster.org/10147 introduced this, and we are working on finding a solution. In the worst case the mentioned patch will be reverted, which would solve this problem.
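For context, a minimal toy model of the failure mode (assumptions clearly stated: this is NOT liburcu or glusterd code, and names such as toy_read_lock/other_worker are made up for illustration): urcu-bp keeps its reader state per thread, so rcu_read_unlock() must run on the same thread that called rcu_read_lock(). glusterd's synctask framework can resume a yielded task on a different worker thread, splitting that pair across threads; the sketch below mimics the split with an explicit second pthread.

/* toy_rcu_affinity.c -- hypothetical model, not liburcu or glusterd code.
 * Demonstrates the per-thread bookkeeping that breaks when lock and
 * unlock run on different threads. Build with: cc -pthread toy_rcu_affinity.c */
#include <pthread.h>
#include <stdio.h>

static __thread int reader_depth;          /* per-thread reader state */

static void toy_read_lock(void)
{
    reader_depth++;
}

static void toy_read_unlock(void)
{
    if (reader_depth == 0) {
        /* This thread never called toy_read_lock(): the analogue of the
         * crash observed inside rcu_read_unlock_bp(). */
        fprintf(stderr, "unlock on a thread that never locked!\n");
        return;
    }
    reader_depth--;
}

static void *other_worker(void *arg)
{
    (void)arg;
    toy_read_unlock();                     /* runs on the wrong thread */
    return NULL;
}

int main(void)
{
    pthread_t t;

    toy_read_lock();                       /* opened on the main thread... */

    /* ...but a synctask swap (modelled here by a second pthread) means
     * the matching unlock happens on another thread. */
    pthread_create(&t, NULL, other_worker, NULL);
    pthread_join(t, NULL);
    return 0;
}

Running this prints the "unlock on a thread that never locked" diagnostic, which stands in for the SIGSEGV seen at rcu_read_unlock_bp() in the backtrace above.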
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid dead lock.) posted (#1) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid dead lock.) posted (#2) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid dead lock.) posted (#3) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid dead lock.) posted (#4) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid dead lock.) posted (#5) for review on master by Anand Nekkunti (anekkunt)
Hitting this issue multiple times while testing Snapshots. Proposing this bug as a blocker.
REVIEW: http://review.gluster.org/10285 (glusterd: Implementation of sync lock as recursive lock to avoid crash.) posted (#6) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#7) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#8) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#9) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#10) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#11) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#12) for review on master by Anand Nekkunti (anekkunt)
REVIEW: http://review.gluster.org/10285 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#13) for review on master by Anand Nekkunti (anekkunt)
COMMIT: http://review.gluster.org/10285 committed in master by Vijay Bellur (vbellur)
------
commit ada6b3a8800867934af57a57d5312f5a5d8374f0
Author: anand <anekkunt>
Date: Fri Apr 17 14:19:46 2015 +0530

libglusterfs: Implementation of sync lock as recursive lock to avoid crash.

Problem: In glusterd we use the big lock, implemented on top of the synctask framework, for thread synchronization, and RCU locks for data consistency. The synctask framework swaps tasks between threads when no worker pool thread is available, so rcu_read_lock() and rcu_read_unlock() could end up running in different threads (urcu-bp does not allow this), resulting in a glusterd crash.

Fix: To avoid releasing the sync lock (big lock) inside an RCU critical section, implement the sync lock as a recursive lock.

More details:
link: http://www.spinics.net/lists/gluster-devel/msg14632.html

Change-Id: I2b56c1caf3f0470f219b1adcaf62cce29cdc6b88
BUG: 1211640
Signed-off-by: anand <anekkunt>
Reviewed-on: http://review.gluster.org/10285
Reviewed-by: Atin Mukherjee <amukherj>
Tested-by: Gluster Build System <jenkins.com>
Tested-by: NetBSD Build System
Reviewed-by: Vijay Bellur <vbellur>
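For illustration, here is a minimal sketch of a recursive lock of the general shape the commit message describes. This is NOT the actual libglusterfs synclock_t implementation; the type rec_lock_t and its functions are hypothetical. The idea is that an owner record plus a nesting count lets the thread that already holds the big lock re-acquire it, instead of releasing it while an RCU read-side critical section is still open.

/* rec_lock.c -- hypothetical sketch of a recursive ("big") lock,
 * not the libglusterfs code. Build with: cc -pthread rec_lock.c */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t guard;   /* protects owner/depth */
    pthread_cond_t  cond;    /* waiters block here */
    pthread_t       owner;   /* thread currently holding the lock */
    int             depth;   /* nesting count; 0 means unlocked */
} rec_lock_t;

static void rec_lock_init(rec_lock_t *l)
{
    pthread_mutex_init(&l->guard, NULL);
    pthread_cond_init(&l->cond, NULL);
    l->depth = 0;
}

static void rec_lock(rec_lock_t *l)
{
    pthread_mutex_lock(&l->guard);
    if (l->depth > 0 && pthread_equal(l->owner, pthread_self())) {
        l->depth++;                        /* already ours: just nest */
    } else {
        while (l->depth > 0)
            pthread_cond_wait(&l->cond, &l->guard);
        l->owner = pthread_self();
        l->depth = 1;
    }
    pthread_mutex_unlock(&l->guard);
}

static void rec_unlock(rec_lock_t *l)
{
    pthread_mutex_lock(&l->guard);
    if (--l->depth == 0)
        pthread_cond_signal(&l->cond);     /* outermost unlock releases */
    pthread_mutex_unlock(&l->guard);
}

int main(void)
{
    rec_lock_t big_lock;

    rec_lock_init(&big_lock);
    rec_lock(&big_lock);     /* outer acquire, e.g. by a big-locked handler */
    rec_lock(&big_lock);     /* nested acquire: no deadlock, no release */
    rec_unlock(&big_lock);
    rec_unlock(&big_lock);   /* lock is actually released only here */
    printf("nested lock/unlock completed\n");
    return 0;
}

In this sketch only the outermost rec_unlock() releases the lock, so a nested acquire never forces a release (and the thread swap that would come with it) while the holder is inside a critical section.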
REVIEW: http://review.gluster.org/10432 (libglusterfs: Implementation of sync lock as recursive lock to avoid crash.) posted (#1) for review on release-3.7 by Anand Nekkunti (anekkunt)
*** Bug 1217589 has been marked as a duplicate of this bug. ***
The fix for this BZ is already present in a GlusterFS release. A clone of this BZ was fixed in a GlusterFS release and closed, hence this mainline BZ is being closed as well.
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user