Description of problem:
=======================
While creating snapshots of 4 volumes in parallel, with IO in progress, snapshot creation failed as follows:

snapshot create: failed: Commit failed on localhost. Please check log file for details.
Snapshot command failed

real    0m3.404s
user    0m0.073s
sys     0m0.018s

Log Snippet:
============
[2014-08-27 08:30:29.404215] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2014-08-27 08:30:29.404256] E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files
[2014-08-27 08:30:29.404270] D [store.c:418:gf_store_handle_new] 0-: Returning -1
[2014-08-27 08:30:29.404281] E [store.c:53:gf_store_handle_create_on_absence] 0-: Unable to create store handle for path: /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks/rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404294] D [glusterd-store.c:427:glusterd_store_brickinfo] 0-management: Returning with -1
[2014-08-27 08:30:29.404305] D [glusterd-store.c:1147:glusterd_store_brickinfos] 0-management: Returning -1
[2014-08-27 08:30:29.404334] D [glusterd-store.c:1348:glusterd_store_perform_volume_store] 0-management: Returning -1
[2014-08-27 08:30:29.404471] E [store.c:161:gf_store_unlink_tmppath] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volinfo+0x34) [0x7fcac2f33514] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volume_cleanup_tmp+0x11) [0x7fcac2f2c851] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_bricks_cleanup_tmp+0x3c) [0x7fcac2f2c7ec]))) 0-store: invalid argument: shandle
[2014-08-27 08:30:29.404492] D [glusterd-store.c:1552:glusterd_store_volinfo] 0-management: Returning -1
[2014-08-27 08:30:29.404508] E [glusterd-snapshot.c:4384:glusterd_do_snap_vol] 0-management: Failed to store snapshot volinfo (f24479f48660489e9471a5d87e0d4d50) for snap a138
[2014-08-27 08:30:29.404521] D [glusterd-utils.c:7897:glusterd_brick_stop] 0-management: About to stop glusterfs for brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1
[2014-08-27 08:30:29.404535] D [glusterd-utils.c:7907:glusterd_brick_stop] 0-management: returning 0
[2014-08-27 08:30:29.404560] D [glusterd-snapshot.c:2268:glusterd_lvm_snapshot_remove] 0-management: Brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1 already deleted.
[2014-08-27 08:30:29.404575] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick2/b1 belongs to a different node
[2014-08-27 08:30:29.404589] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick3/b1 belongs to a different node
[2014-08-27 08:30:29.404601] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick4/b1 belongs to a different node
[2014-08-27 08:30:29.404712] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv2.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick2-b1
[2014-08-27 08:30:29.404743] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed inception.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick1-b1
[2014-08-27 08:30:29.404771] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv3.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick3-b1
[2014-08-27 08:30:29.404798] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404833] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed bricks
[2014-08-27 08:30:29.404862] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed node_state.info
[2014-08-27 08:30:29.404887] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rbstate
[2014-08-27 08:30:29.404911] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed info
[2014-08-27 08:30:29.404949] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed f24479f4-8660-489e-9471-a5d87e0d4d50.deleted

Version-Release number of selected component (if applicable):
==============================================================
1/1

Steps Carried:
==============
1. Create a 4 node cluster
2. Create 4 volumes
3. Mount the volumes on clients (FUSE and NFS)
4. Start copying and removing data from each client, e.g. from each mount:
   for i in {1..20}; do cp -rvf /etc etc.$i ; sleep 5; rm -rvf etc.$i; done
5. While IO is in progress, start creating snapshots of each volume in a loop of 1..256 from the cluster nodes:
   Node1: for i in {1..256}; do time gluster snapshot create a$i vol0; done
   Node2: for i in {1..256}; do time gluster snapshot create b$i vol1; done
   Node3: for i in {1..256}; do time gluster snapshot create c$i vol2; done
   Node4: for i in {1..256}; do time gluster snapshot create d$i vol3; done

Actual results:
===============
After around 450+ snapshots had been created, snapshot creation failed on all nodes with the error below:

E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files

Expected results:
=================
Snapshot creation should succeed.
Can you please check the open file limit on the machine with the command below? I suspect it is running into the 'open files' limit, since the many bricks running from 450+ snapshots may have opened a large number of files.

ulimit -n

Also, can you please provide the output of 'lsof'?
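For reference, a rough way to capture this for the glusterd process itself (the shell's ulimit may differ from the daemon's; these are standard Linux commands, and the output path is just an example):

GD_PID=$(pidof glusterd)                      # PID of the running management daemon
ulimit -n                                     # limit of the current shell, for comparison
grep 'Max open files' /proc/$GD_PID/limits    # limit glusterd is actually running with
ls /proc/$GD_PID/fd | wc -l                   # number of FDs currently held by glusterd
lsof -p $GD_PID > /tmp/glusterd-lsof.out      # full listing to attach to the bug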
Tested on dev machines but could not reproduce the problem. Requested QE to reproduce it again so that it can be investigated further.
This issue is not specific to snapshots. The default 'open FD' limit is 1024. As the number of volumes/bricks increases, the number of brick-to-glusterd socket FDs held by glusterd also increases and eventually exceeds the limit. The solution is to raise the open FD limit to a higher value in glusterd.
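Until a build with the fix is available, one possible stopgap on an affected node is to restart glusterd from a shell whose soft limit has been raised, so the daemon inherits it. This is only a workaround sketch, not the patch itself; 65536 is an arbitrary example value, it assumes the hard limit allows it, and it assumes glusterd is managed by the SysV init script on these nodes:

ulimit -n 65536               # raise the soft open-files limit for this shell (root can also raise the hard limit)
/etc/init.d/glusterd restart  # restart glusterd as a child of this shell so it inherits the raised limit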
Patch submitted: https://code.engineering.redhat.com/gerrit/#/c/31894/
After discussing with the Engineering Leads, approving this for Denali (RHS 3.0).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html