Bug 1134316
| Summary: | [SNAPSHOT]: snapshot create fails with error in log "Failed to open directory <xyz>, due to Too many open files" | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | snapshot | Assignee: | Vijaikumar Mallikarjuna <vmallika> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.0 | CC: | nsathyan, rhinduja, rhs-bugs, smohan, ssamanta, storage-qa-internal, vmallika | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.0.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | SNAPSHOT | |||
| Fixed In Version: | glusterfs-3.6.0.28-1 | Doc Type: | Bug Fix | |
| Doc Text: | The default 'open fd limit' of 1024 is not sufficient: with 2 socket connections per brick, only ~500 bricks can connect to glusterd. The limit is now raised to 65536, with which glusterd can connect up to 32768 bricks. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1136352 (view as bug list) | Environment: | ||
| Last Closed: | 2014-09-22 19:45:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1136352, 1145095 | |||
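The Doc Text above attributes the failure to glusterd hitting the default limit of 1024 open descriptors (two socket connections per brick, so roughly 500 bricks). Below is a minimal diagnostic sketch, not part of glusterfs, that reports a process's RLIMIT_NOFILE and counts its open descriptors via /proc/self/fd (the programmatic equivalent of the 'ulimit -n' and 'lsof' checks requested in the comments that follow):

```c
/*
 * Diagnostic sketch (not glusterfs code): print the process's
 * RLIMIT_NOFILE soft/hard limits and count its currently open
 * descriptors by listing /proc/self/fd.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit lim;

        if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("getrlimit");
                return 1;
        }
        printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
               (unsigned long long)lim.rlim_cur,
               (unsigned long long)lim.rlim_max);

        DIR *d = opendir("/proc/self/fd");
        if (d == NULL) {
                perror("opendir /proc/self/fd");
                return 1;
        }

        unsigned long nfds = 0;
        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
                if (de->d_name[0] != '.')   /* skip "." and ".." */
                        nfds++;
        }
        closedir(d);

        /* One of the counted fds belongs to the DIR stream itself. */
        printf("open fds: %lu\n", nfds ? nfds - 1 : 0);
        return 0;
}
```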
Can you please check the open-file limit on the machine with the command below? I think glusterd is running out of the open-files limit, since the large number of bricks running from 450+ snapshots may have opened many files.

'ulimit -n'

Can you also please provide the output of 'lsof'?

Tested on dev machines but not able to reproduce the problem. Requested QE to reproduce the problem again so that it can be investigated further.

This issue is not specific to snapshots. The default 'open FD limit' is 1024. As the number of volumes/bricks increases, the number of brick-to-glusterd socket FDs in glusterd grows with it and eventually exceeds the limit. The solution is to set the 'open FD' limit to a higher value in glusterd (see the sketch after these comments).

Patch submitted: https://code.engineering.redhat.com/gerrit/#/c/31894/

After discussing with the Engineering Leads, approving this for Denali (RHS 3.0).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
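As a rough illustration of the fix described in the comments above (a sketch only; the actual change is in the Gerrit patch linked above and may differ), a daemon can raise its soft RLIMIT_NOFILE toward 65536 at startup, capped by the hard limit:

```c
/*
 * Sketch (not the actual glusterd patch): raise the soft open-fd limit
 * toward 65536 at startup, without exceeding the hard limit.
 */
#include <stdio.h>
#include <sys/resource.h>

#define WANTED_NOFILE 65536

int main(void)
{
        struct rlimit lim;

        if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("getrlimit");
                return 1;
        }

        /* Stay within the hard limit; raising that needs CAP_SYS_RESOURCE. */
        lim.rlim_cur = (WANTED_NOFILE < lim.rlim_max)
                               ? WANTED_NOFILE
                               : lim.rlim_max;

        if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("setrlimit");
                return 1;
        }

        printf("open fd soft limit now %llu\n",
               (unsigned long long)lim.rlim_cur);
        return 0;
}
```

Alternatively, the limit could be raised outside the code, for example via the service's init configuration (systemd's LimitNOFILE= directive).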
Description of problem:
=======================
While creating snapshots of 4 volumes in parallel, with IO in progress, snapshot creation failed as follows:

snapshot create: failed: Commit failed on localhost. Please check log file for details.
Snapshot command failed

real    0m3.404s
user    0m0.073s
sys     0m0.018s

Log Snippet:
============
[2014-08-27 08:30:29.404215] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2014-08-27 08:30:29.404256] E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files
[2014-08-27 08:30:29.404270] D [store.c:418:gf_store_handle_new] 0-: Returning -1
[2014-08-27 08:30:29.404281] E [store.c:53:gf_store_handle_create_on_absence] 0-: Unable to create store handle for path: /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks/rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404294] D [glusterd-store.c:427:glusterd_store_brickinfo] 0-management: Returning with -1
[2014-08-27 08:30:29.404305] D [glusterd-store.c:1147:glusterd_store_brickinfos] 0-management: Returning -1
[2014-08-27 08:30:29.404334] D [glusterd-store.c:1348:glusterd_store_perform_volume_store] 0-management: Returning -1
[2014-08-27 08:30:29.404471] E [store.c:161:gf_store_unlink_tmppath] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volinfo+0x34) [0x7fcac2f33514] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volume_cleanup_tmp+0x11) [0x7fcac2f2c851] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_bricks_cleanup_tmp+0x3c) [0x7fcac2f2c7ec]))) 0-store: invalid argument: shandle
[2014-08-27 08:30:29.404492] D [glusterd-store.c:1552:glusterd_store_volinfo] 0-management: Returning -1
[2014-08-27 08:30:29.404508] E [glusterd-snapshot.c:4384:glusterd_do_snap_vol] 0-management: Failed to store snapshot volinfo (f24479f48660489e9471a5d87e0d4d50) for snap a138
[2014-08-27 08:30:29.404521] D [glusterd-utils.c:7897:glusterd_brick_stop] 0-management: About to stop glusterfs for brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1
[2014-08-27 08:30:29.404535] D [glusterd-utils.c:7907:glusterd_brick_stop] 0-management: returning 0
[2014-08-27 08:30:29.404560] D [glusterd-snapshot.c:2268:glusterd_lvm_snapshot_remove] 0-management: Brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1 already deleted.
[2014-08-27 08:30:29.404575] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick2/b1 belongs to a different node
[2014-08-27 08:30:29.404589] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick3/b1 belongs to a different node
[2014-08-27 08:30:29.404601] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick4/b1 belongs to a different node
[2014-08-27 08:30:29.404712] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv2.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick2-b1
[2014-08-27 08:30:29.404743] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed inception.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick1-b1
[2014-08-27 08:30:29.404771] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv3.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick3-b1
[2014-08-27 08:30:29.404798] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404833] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed bricks
[2014-08-27 08:30:29.404862] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed node_state.info
[2014-08-27 08:30:29.404887] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rbstate
[2014-08-27 08:30:29.404911] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed info
[2014-08-27 08:30:29.404949] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed f24479f4-8660-489e-9471-a5d87e0d4d50.deleted

Version-Release number of selected component (if applicable):
==============================================================
1/1

Steps Carried:
==============
1. Create a 4-node cluster
2. Create 4 volumes
3. Mount the volumes on clients (FUSE and NFS)
4. Start copying and removing data from each client, using a loop like the following on each mount:
   for i in {1..20}; do cp -rvf /etc etc.$i ; sleep 5; rm -rvf etc.$i;done
5. While IO is in progress, start creating snapshots of each volume in a loop of 1..256 from the cluster nodes:
   Node1: for i in {1..256}; do time gluster snapshot create a$i vol0; done
   Node2: for i in {1..256}; do time gluster snapshot create b$i vol1; done
   Node3: for i in {1..256}; do time gluster snapshot create b$i vol2; done
   Node4: for i in {1..256}; do time gluster snapshot create d$i vol3; done

Actual results:
===============
After around 450+ snapshots, snapshot creation failed on all nodes with the error below:

E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files

Expected results:
=================
Snapshot creation should succeed.
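For context on the failure mode: once a process has exhausted its open-fd limit, any further attempt to open a file or directory fails with EMFILE ("Too many open files"), which is the error the store code reports when it tries to open the snapshot's bricks directory. A standalone sketch (illustrative only, not glusterfs source) reproducing the symptom in miniature by shrinking the limit and then attempting to open a directory:

```c
/*
 * Sketch: exhaust the per-process fd limit, then show that opening a
 * directory fails with EMFILE ("Too many open files").
 */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
        /* Shrink the soft/hard fd limit so the condition is easy to hit. */
        struct rlimit lim = { .rlim_cur = 16, .rlim_max = 16 };
        if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("setrlimit");
                return 1;
        }

        /* Consume descriptors until open() starts failing. */
        while (open("/dev/null", O_RDONLY) >= 0)
                ;

        /* Any later attempt to open a directory now fails the same way. */
        DIR *dir = opendir(".");
        if (dir == NULL)
                printf("opendir failed: %s\n", strerror(errno));
        else
                closedir(dir);
        return 0;
}
```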