Bug 1201633
Summary: | [epoll+Snapshot] : Snapd crashed while trying to list snaps under .snaps folder | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | senaik
Component: | snapshot | Assignee: | Poornima G <pgurusid>
Status: | CLOSED ERRATA | QA Contact: | senaik
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | rhgs-3.0 | CC: | annair, pgurusid, rcyriac, rhs-bugs, rjoseph, storage-qa-internal, vagarwal
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.0.4 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | USS | |
Fixed In Version: | glusterfs-3.6.0.53-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1202290 (view as bug list) | Environment: |
Last Closed: | 2015-03-26 06:37:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1182947, 1202290 | |
Description
senaik
2015-03-13 07:18:30 UTC
Some more details to add to the problem description: two snapshot activate operations had failed on machine rhs-arch-srv2.lab.eng.blr.redhat.com. All the other snapshots were activated successfully.

[2015-03-13 01:07:02.423336] : snapshot activate S254 : FAILED : Commit failed on 10.70.34.50. Please check log file for details.
[2015-03-13 01:07:06.692653] : snapshot activate S255 : FAILED : Commit failed on 10.70.34.50. Please check log file for details.

Following is the backtrace from the crash:

#0 __pthread_mutex_lock (mutex=0x320) at pthread_mutex_lock.c:50
#1 0x00000033e4425060 in gf_log_set_log_buf_size (buf_size=0) at logging.c:256
#2 0x00000033e44251ff in gf_log_disable_suppression_before_exit (ctx=0x22b3010) at logging.c:427
#3 0x00000033e443bac5 in gf_print_trace (signum=11, ctx=0x22b3010) at common-utils.c:493
#4 <signal handler called>
#5 0x00000033e444f731 in __gf_free (free_ptr=0x7f911ef33c50) at mem-pool.c:231
#6 0x00000033e443da02 in gf_timer_proc (ctx=0x7f911ef35630) at timer.c:207
#7 0x0000003f236079d1 in start_thread (arg=0x7f8eb197b700) at pthread_create.c:301
#8 0x0000003f232e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

In the test case, multiple snapshots were created and then activated, and after activation the snapshots were accessed using USS. The crash is seen while accessing these snapshots.

Code-wise, the crash happens during timer-thread destruction. The timer thread is destroyed as part of glfs_fini. Normally glfs_fini is called when snapshots are deactivated or deleted, but in this case no snapshots were deleted or deactivated; here glfs_fini is called because glfs_init failed. For some reason the snapshot brick is not in the started state, which leads to the failure in glfs_init. We could not determine the exact cause of this because the brick and snapshot logs were missing from the sos-report. In any case, when glfs_init fails we call glfs_fini to clean up the allocated resources. In the timer thread the current THIS is overwritten and never restored, leaving THIS with a wrong value and causing the segmentation fault in __gf_free. We will send a patch to address this problem.
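To make the failure mode above easier to follow, here is a minimal, self-contained C sketch of the save-and-restore discipline the analysis points at. This is not the actual GlusterFS patch: ctx_t, current_ctx, callback_that_switches_ctx, and timer_thread_step are hypothetical stand-ins for the real xlator/THIS machinery, and the example only illustrates why a thread-local context pointer must be restored after a callback that changes it.

```c
/* Illustrative sketch only (assumed names, not GlusterFS code): a thread-local
 * "current context" pointer, analogous to THIS, is switched by a callback.
 * If the caller does not restore it, later code in the same thread (such as a
 * free routine) operates on whatever context the callback left behind. */
#include <stdio.h>

typedef struct ctx { const char *name; } ctx_t;

/* Hypothetical thread-local playing the role of THIS. */
static __thread ctx_t *current_ctx;

static void callback_that_switches_ctx(ctx_t *other)
{
    /* A callback may legitimately switch the context for its own work. */
    current_ctx = other;
    printf("callback ran under context: %s\n", current_ctx->name);
}

static void timer_thread_step(ctx_t *other)
{
    /* The caller saves the previous context and restores it afterwards,
     * so code running after the callback still sees the right one. */
    ctx_t *saved = current_ctx;
    callback_that_switches_ctx(other);
    current_ctx = saved;   /* omitting this restore reproduces the bug pattern */
}

int main(void)
{
    ctx_t timer_ctx = { "timer" };
    ctx_t cbk_ctx   = { "callback" };

    current_ctx = &timer_ctx;
    timer_thread_step(&cbk_ctx);

    /* With the restore in place this prints "timer"; without it, code here
     * would act on the callback's context, mirroring the crash in __gf_free
     * described in this report. */
    printf("context after callback: %s\n", current_ctx->name);
    return 0;
}
```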
Version : glusterfs 3.6.0.53
========

Repeated the steps as mentioned in the Description, did not face any crash. Marking the bug as 'Verified'.

[root@inception ~]# gluster v i

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: ef518dd8-2416-4347-bcf7-ba042128e89c
Status: Started
Snap Volume: no
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: inception.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: inception.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick6: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick7: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick8: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick9: inception.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick10: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick11: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick12: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick3/b3
Options Reconfigured:
features.uss: enable
features.barrier: disable
client.event-threads: 4
server.event-threads: 5
server.allow-insecure: on
features.quota: on
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@inception ~]# gluster v status

Status of volume: vol0
Gluster process  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick inception.lab.eng.blr.redhat.com:/rhs/brick1/b1  49152  0  Y  22780
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick1/b1  49152  0  Y  25239
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick1/b1  49152  0  Y  19975
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick1/b1  49152  0  Y  18271
Brick inception.lab.eng.blr.redhat.com:/rhs/brick2/b2  49153  0  Y  22793
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick2/b2  49153  0  Y  25252
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick2/b2  49153  0  Y  19988
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick2/b2  49153  0  Y  18284
Brick inception.lab.eng.blr.redhat.com:/rhs/brick3/b3  49154  0  Y  22806
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick3/b3  49154  0  Y  25265
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick3/b3  49154  0  Y  20001
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick3/b3  49154  0  Y  18297
Snapshot Daemon on localhost  49923  0  Y  6328
NFS Server on localhost  2049  0  Y  6336
Self-heal Daemon on localhost  N/A  N/A  Y  22827
Quota Daemon on localhost  N/A  N/A  Y  22869
Snapshot Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com  49923  0  Y  20196
NFS Server on rhs-arch-srv3.lab.eng.blr.redhat.com  2049  0  Y  20205
Self-heal Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com  N/A  N/A  Y  20022
Quota Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com  N/A  N/A  Y  20043
Snapshot Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com  49923  0  Y  10590
NFS Server on rhs-arch-srv2.lab.eng.blr.redhat.com  2049  0  Y  10598
Self-heal Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com  N/A  N/A  Y  25286
Quota Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com  N/A  N/A  Y  25318
Snapshot Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com  49923  0  Y  11216
NFS Server on rhs-arch-srv4.lab.eng.blr.redhat.com  2049  0  Y  11225
Self-heal Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com  N/A  N/A  Y  18318
Quota Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com  N/A  N/A  Y  18338

Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0682.html