Bug 1134316
| Summary: | [SNAPSHOT]: snapshot create fails with error in log "Failed to open directory <xyz>, due to Too many open files" | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | snapshot | Assignee: | Vijaikumar Mallikarjuna <vmallika> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.0 | CC: | nsathyan, rhinduja, rhs-bugs, smohan, ssamanta, storage-qa-internal, vmallika | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.0.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | SNAPSHOT | |||
| Fixed In Version: | glusterfs-3.6.0.28-1 | Doc Type: | Bug Fix | |
| Doc Text: | The default 'open fd limit' of 1024 is not sufficient: with 2 socket connections per brick, only ~500 bricks can connect to glusterd. The limit is now raised to 65536, with which glusterd can connect up to 32768 bricks. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1136352 (view as bug list) | Environment: | ||
| Last Closed: | 2014-09-22 19:45:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1136352, 1145095 | |||
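The Doc Text above attributes the failure to glusterd hitting the default limit of 1024 open descriptors (two socket connections per brick, so roughly 500 bricks). Below is a minimal diagnostic sketch, not part of glusterfs, that reports a process's RLIMIT_NOFILE and counts its open descriptors via /proc/self/fd (the programmatic equivalent of the 'ulimit -n' and 'lsof' checks requested in the comments that follow):

```c
/*
 * Diagnostic sketch (not glusterfs code): print the process's
 * RLIMIT_NOFILE soft/hard limits and count its currently open
 * descriptors by listing /proc/self/fd.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit lim;

        if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("getrlimit");
                return 1;
        }
        printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
               (unsigned long long)lim.rlim_cur,
               (unsigned long long)lim.rlim_max);

        DIR *d = opendir("/proc/self/fd");
        if (d == NULL) {
                perror("opendir /proc/self/fd");
                return 1;
        }

        unsigned long nfds = 0;
        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
                if (de->d_name[0] != '.')   /* skip "." and ".." */
                        nfds++;
        }
        closedir(d);

        /* One of the counted fds belongs to the DIR stream itself. */
        printf("open fds: %lu\n", nfds ? nfds - 1 : 0);
        return 0;
}
```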
Can you please check the open-file limit on the machine with the command below? I think glusterd is running out of the open-files limit, since the large number of bricks running from 450+ snapshots may have opened many files.

'ulimit -n'

Can you also please provide the output of 'lsof'?

Tested on dev machines but not able to reproduce the problem. Requested QE to reproduce the problem again so that it can be investigated further.

This issue is not specific to snapshots. The default 'open FD limit' is 1024. As the number of volumes/bricks increases, the number of brick-to-glusterd socket FDs in glusterd grows with it and eventually exceeds the limit. The solution is to set the 'open FD' limit to a higher value in glusterd (see the sketch after these comments).

Patch submitted: https://code.engineering.redhat.com/gerrit/#/c/31894/

After discussing with the Engineering Leads, approving this for Denali (RHS 3.0).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
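As a rough illustration of the fix described in the comments above (a sketch only; the actual change is in the Gerrit patch linked above and may differ), a daemon can raise its soft RLIMIT_NOFILE toward 65536 at startup, capped by the hard limit:

```c
/*
 * Sketch (not the actual glusterd patch): raise the soft open-fd limit
 * toward 65536 at startup, without exceeding the hard limit.
 */
#include <stdio.h>
#include <sys/resource.h>

#define WANTED_NOFILE 65536

int main(void)
{
        struct rlimit lim;

        if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("getrlimit");
                return 1;
        }

        /* Stay within the hard limit; raising that needs CAP_SYS_RESOURCE. */
        lim.rlim_cur = (WANTED_NOFILE < lim.rlim_max)
                               ? WANTED_NOFILE
                               : lim.rlim_max;

        if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("setrlimit");
                return 1;
        }

        printf("open fd soft limit now %llu\n",
               (unsigned long long)lim.rlim_cur);
        return 0;
}
```

Alternatively, the limit could be raised outside the code, for example via the service's init configuration (systemd's LimitNOFILE= directive).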
Description of problem:
=======================
While creating snapshots of 4 volumes in parallel, with IO in progress, snapshot creation failed as follows:

snapshot create: failed: Commit failed on localhost. Please check log file for details.
Snapshot command failed

real    0m3.404s
user    0m0.073s
sys     0m0.018s

Log Snippet:
============
[2014-08-27 08:30:29.404215] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2014-08-27 08:30:29.404256] E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files
[2014-08-27 08:30:29.404270] D [store.c:418:gf_store_handle_new] 0-: Returning -1
[2014-08-27 08:30:29.404281] E [store.c:53:gf_store_handle_create_on_absence] 0-: Unable to create store handle for path: /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks/rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404294] D [glusterd-store.c:427:glusterd_store_brickinfo] 0-management: Returning with -1
[2014-08-27 08:30:29.404305] D [glusterd-store.c:1147:glusterd_store_brickinfos] 0-management: Returning -1
[2014-08-27 08:30:29.404334] D [glusterd-store.c:1348:glusterd_store_perform_volume_store] 0-management: Returning -1
[2014-08-27 08:30:29.404471] E [store.c:161:gf_store_unlink_tmppath] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volinfo+0x34) [0x7fcac2f33514] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_volume_cleanup_tmp+0x11) [0x7fcac2f2c851] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_store_bricks_cleanup_tmp+0x3c) [0x7fcac2f2c7ec]))) 0-store: invalid argument: shandle
[2014-08-27 08:30:29.404492] D [glusterd-store.c:1552:glusterd_store_volinfo] 0-management: Returning -1
[2014-08-27 08:30:29.404508] E [glusterd-snapshot.c:4384:glusterd_do_snap_vol] 0-management: Failed to store snapshot volinfo (f24479f48660489e9471a5d87e0d4d50) for snap a138
[2014-08-27 08:30:29.404521] D [glusterd-utils.c:7897:glusterd_brick_stop] 0-management: About to stop glusterfs for brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1
[2014-08-27 08:30:29.404535] D [glusterd-utils.c:7907:glusterd_brick_stop] 0-management: returning 0
[2014-08-27 08:30:29.404560] D [glusterd-snapshot.c:2268:glusterd_lvm_snapshot_remove] 0-management: Brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick1/b1 already deleted.
[2014-08-27 08:30:29.404575] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick2/b1 belongs to a different node
[2014-08-27 08:30:29.404589] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick3/b1 belongs to a different node
[2014-08-27 08:30:29.404601] D [glusterd-snapshot.c:2260:glusterd_lvm_snapshot_remove] 0-management: rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/f24479f48660489e9471a5d87e0d4d50/brick4/b1 belongs to a different node
[2014-08-27 08:30:29.404712] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv2.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick2-b1
[2014-08-27 08:30:29.404743] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed inception.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick1-b1
[2014-08-27 08:30:29.404771] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv3.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick3-b1
[2014-08-27 08:30:29.404798] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rhs-arch-srv4.lab.eng.blr.redhat.com:-var-run-gluster-snaps-f24479f48660489e9471a5d87e0d4d50-brick4-b1
[2014-08-27 08:30:29.404833] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed bricks
[2014-08-27 08:30:29.404862] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed node_state.info
[2014-08-27 08:30:29.404887] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed rbstate
[2014-08-27 08:30:29.404911] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed info
[2014-08-27 08:30:29.404949] D [glusterd-utils.c:12803:glusterd_recursive_rmdir] 0-management: Removed f24479f4-8660-489e-9471-a5d87e0d4d50.deleted

Version-Release number of selected component (if applicable):
==============================================================
1/1

Steps Carried:
==============
1. Create a 4-node cluster
2. Create 4 volumes
3. Mount the volumes on clients (FUSE and NFS)
4. Start copying and removing data from each client, using a loop like the following on each mount:
   for i in {1..20}; do cp -rvf /etc etc.$i ; sleep 5; rm -rvf etc.$i;done
5. While IO is in progress, start creating snapshots of each volume in a loop of 1..256 from the cluster nodes:
   Node1: for i in {1..256}; do time gluster snapshot create a$i vol0; done
   Node2: for i in {1..256}; do time gluster snapshot create b$i vol1; done
   Node3: for i in {1..256}; do time gluster snapshot create b$i vol2; done
   Node4: for i in {1..256}; do time gluster snapshot create d$i vol3; done

Actual results:
===============
After around 450+ snapshots, snapshot creation failed on all nodes with the error below:

E [store.c:96:gf_store_sync_direntry] 0-management: Failed to open directory /var/lib/glusterd/snaps/a138/f24479f48660489e9471a5d87e0d4d50/bricks, due to Too many open files

Expected results:
=================
Snapshot creation should succeed.
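For context on the failure mode: once a process has exhausted its open-fd limit, any further attempt to open a file or directory fails with EMFILE ("Too many open files"), which is the error the store code reports when it tries to open the snapshot's bricks directory. A standalone sketch (illustrative only, not glusterfs source) reproducing the symptom in miniature by shrinking the limit and then attempting to open a directory:

```c
/*
 * Sketch: exhaust the per-process fd limit, then show that opening a
 * directory fails with EMFILE ("Too many open files").
 */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
        /* Shrink the soft/hard fd limit so the condition is easy to hit. */
        struct rlimit lim = { .rlim_cur = 16, .rlim_max = 16 };
        if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
                perror("setrlimit");
                return 1;
        }

        /* Consume descriptors until open() starts failing. */
        while (open("/dev/null", O_RDONLY) >= 0)
                ;

        /* Any later attempt to open a directory now fails the same way. */
        DIR *dir = opendir(".");
        if (dir == NULL)
                printf("opendir failed: %s\n", strerror(errno));
        else
                closedir(dir);
        return 0;
}
```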