Bug 1464150

Summary: [GSS] Unable to delete snapshot because it's in use
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Simon Reber <sreber>
Component: snapshot
Assignee: Sunny Kumar <sunkumar>
Status: CLOSED ERRATA
QA Contact: Vinayak Papnoi <vpapnoi>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: amukherj, atumball, bkunal, nchilaka, pdhange, rabhat, rhinduja, rhs-bugs, rkavunga, sheggodu, sreber, srmukher, storage-qa-internal, sunkumar
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-2
Doc Type: Enhancement
Doc Text:
GlusterFS used to mount deactivated snapshot(s) under /run/gluster/snaps by default, while the snapshot status command still needs to show relevant information for deactivated snapshot(s). Because such a mount exists, a process may access it and cause problems when the volume is unmounted during snapshot deletion. With this enhancement, GlusterFS does not mount deactivated snapshot(s), and the snapshot status command displays 'N/A (Deactivated Snapshot)' in the Volume Group field for them.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:32:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1482023, 1506098    
Bug Blocks: 1408949, 1472361, 1503135    
Attachments:
sosreport from one of the gluster node (flags: none)

Description Simon Reber 2017-06-22 14:12:19 UTC
Created attachment 1290740 [details]
sosreport from one of the gluster node

Description of problem:

The customer is running Red Hat Gluster Storage and makes heavy use of snapshots. They create and delete snapshots with a self-maintained script to meet their policies.

From time to time, they notice that snapshots are not properly deleted on all nodes. This means they have to do a manual clean-up, removing the volumes and everything else under /var/lib/gluster on their own.

While investigating, we found that one part of the problem is that the snapshots are automatically mounted under /run/gluster/snaps/<snapshot>. This causes issues, as processes (monitoring tools, for example) access these mounts and therefore keep an open file handle.

Since removing a snapshot also means unmounting /run/gluster/snaps/<snapshot>, it becomes clear that the operation will fail if some process is accessing the mount at that moment.
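
To narrow down which process is keeping the mount busy at deletion time, something along these lines can be run on the affected node (a rough sketch; the snapshot path below is taken from the logs and is purely illustrative):

# Show processes with open files on the snapshot brick mount
fuser -vm /run/gluster/snaps/489a7157c2d54d269e6644856202e779/brick1

# Alternatively, list the offending file descriptors
lsof +D /run/gluster/snaps/489a7157c2d54d269e6644856202e779/brick1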

Messages found in the gluster logs are as follows:

[2017-06-18 09:00:01.762895] I [MSGID: 106091] [glusterd-snapshot.c:6263:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol_fast_registry_GMT-2017.06.15-10.03.01 for decommission.
[2017-06-18 09:00:01.763358] W [MSGID: 106073] [glusterd-snapshot.c:2780:glusterd_do_lvm_snapshot_remove] 0-management: Getting the root of the brick for volume 489a7157c2d54d269e6644856202e779 (snap vol_fast_registry_GMT-2017.06.15-10.03.01) failed. Removing lv (/dev/vg_fast_registry/489a7157c2d54d269e6644856202e779_0).
[2017-06-18 09:00:01.783014] E [MSGID: 106044] [glusterd-snapshot.c:2834:glusterd_do_lvm_snapshot_remove] 0-management: removing snapshot of the brick (glusternode03a:/run/gluster/snaps/489a7157c2d54d269e6644856202e779/brick1/registry) of device /dev/vg_fast_registry/489a7157c2d54d269e6644856202e779_0 failed
[2017-06-18 09:00:01.783044] E [MSGID: 106044] [glusterd-snapshot.c:2962:glusterd_lvm_snapshot_remove] 0-management: Failed to remove the snapshot /run/gluster/snaps/489a7157c2d54d269e6644856202e779/brick1/registry (/dev/vg_fast_registry/489a7157c2d54d269e6644856202e779_0)
[2017-06-18 09:00:01.783112] W [MSGID: 106033] [glusterd-snapshot.c:3008:glusterd_lvm_snapshot_remove] 0-management: Failed to rmdir: /run/gluster/snaps/489a7157c2d54d269e6644856202e779/, err: Directory not empty. More than one glusterd running on this node. [Directory not empty]
[2017-06-18 09:00:01.783124] W [MSGID: 106044] [glusterd-snapshot.c:3079:glusterd_snap_volume_remove] 0-management: Failed to remove lvm snapshot volume 489a7157c2d54d269e6644856202e779
[2017-06-18 09:00:01.783133] W [MSGID: 106044] [glusterd-snapshot.c:3154:glusterd_snap_remove] 0-management: Failed to remove volinfo 489a7157c2d54d269e6644856202e779 for snap vol_fast_registry_GMT-2017.06.15-10.03.01
[2017-06-18 09:00:01.783151] E [MSGID: 106044] [glusterd-snapshot.c:6298:glusterd_snapshot_remove_commit] 0-management: Failed to remove snap vol_fast_registry_GMT-2017.06.15-10.03.01
[2017-06-18 09:00:01.783159] E [MSGID: 106044] [glusterd-snapshot.c:8308:glusterd_snapshot] 0-management: Failed to delete snapshot
[2017-06-18 09:00:01.783169] W [MSGID: 106123] [glusterd-mgmt.c:272:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2017-06-18 09:00:01.783175] E [MSGID: 106123] [glusterd-mgmt.c:1414:glusterd_mgmt_v3_commit] 0-management: Commit failed for operation Snapshot on local node


And on the other node:


[2017-06-18 09:00:02.965240] E [MSGID: 106095] [glusterd-snapshot-utils.c:3365:glusterd_umount] 0-management: umounting /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/brick2 failed (Bad file descriptor) [Bad file descriptor]
[2017-06-18 09:00:05.972554] E [MSGID: 106038] [glusterd-snapshot.c:2818:glusterd_do_lvm_snapshot_remove] 0-management: umount failed for path /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/brick2 (brick: /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/brick2/registry): Bad file descriptor.
[2017-06-18 09:00:05.972602] E [MSGID: 106044] [glusterd-snapshot.c:2962:glusterd_lvm_snapshot_remove] 0-management: Failed to remove the snapshot /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/brick2/registry (/dev/vg_fast_registry/482f7bb2668440e5908b1cf7e32247e8_0)
[2017-06-18 09:00:10.780390] W [MSGID: 106033] [glusterd-snapshot.c:3008:glusterd_lvm_snapshot_remove] 0-management: Failed to rmdir: /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/, err: Directory not empty. More than one glusterd running on this node. [Directory not empty]
[2017-06-18 09:00:10.780424] W [MSGID: 106044] [glusterd-snapshot.c:3079:glusterd_snap_volume_remove] 0-management: Failed to remove lvm snapshot volume 482f7bb2668440e5908b1cf7e32247e8
[2017-06-18 09:00:10.780436] W [MSGID: 106044] [glusterd-snapshot.c:3154:glusterd_snap_remove] 0-management: Failed to remove volinfo 482f7bb2668440e5908b1cf7e32247e8 for snap vol_fast_registry_GMT-2017.06.17-10.03.01
[...]
The message "E [MSGID: 106095] [glusterd-snapshot-utils.c:3365:glusterd_umount] 0-management: umounting /run/gluster/snaps/482f7bb2668440e5908b1cf7e32247e8/brick2 failed (Bad file descriptor) [Bad file descriptor]" repeated 2 times between [2017-06-18 09:00:02.965240] and [2017-06-18 09:00:04.972387]


The question is: is there a way to prevent this from happening? Not mounting the snapshots in the first place would certainly be a good option, but so far I was unable to find a configuration option that allows that.

Version-Release number of selected component (if applicable):

 - glusterfs-3.7.9-12.el7rhgs.x86_64

How reproducible:

randomly

Steps to Reproduce:
1. Access /run/gluster/snaps/<snapshot> while the snapshot is scheduled to be deleted (see the sketch after this list)
2. There might be other ways, but so far I was not able to reproduce it
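
A rough way to provoke the failure by hand might look like this (a sketch only; the snapshot name is illustrative):

# Keep the snapshot brick mount busy: a shell whose cwd is inside the
# mount (or any open fd on it) is enough to make the umount fail
cd /run/gluster/snaps/<snapshot-uuid>/brick1

# From another terminal, try to delete the snapshot while the mount is busy
gluster snapshot delete <snapshot-name>
# Before the fix, the umount of the brick fails and stale snapshot
# data is left behind on some of the nodes.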

Actual results:

The snapshot is not properly removed from all Red Hat Gluster Storage nodes.

Expected results:

The snapshot is removed correctly, even if that means snapshots are no longer mounted automatically.

Additional info:

Comment 3 Mohammed Rafi KC 2017-06-23 06:44:33 UTC
If I understand the problem correctly, some external application has opened an fd in the snapshot mount path when a snapshot delete is scheduled. Because of the open fd, glusterd couldn't unmount the path, and hence the snapshot delete failed. Please correct me if I'm wrong.

We need to have the snapshot brick mounted when we activate a snapshot. During snapshot create, we do all the prerequisites for creating a volume (a snapshot can be considered a read-only gluster volume), such as creating the bricks and setting the required xattrs, just like for a normal volume. But this can be revisited.

But I assume the problem here is that an external application (in this case, a monitoring tool) accessed the gluster mount point, which prevented the unmount operation.

@rbhat,

Do you think we should defer brick mounting until the snapshot is activated?
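
In CLI terms, the proposal would roughly mean the brick mount only exists between activation and deactivation (a sketch of the intended flow; the snapshot name snap1 and volume vol0 are illustrative):

gluster snapshot create snap1 vol0 no-timestamp   # created in deactivated state by default
# proposal: nothing mounted under /run/gluster/snaps at this point

gluster snapshot activate snap1     # brick mount appears now
gluster snapshot deactivate snap1   # brick mount goes away again
gluster snapshot delete snap1       # no mount left to unmount, so delete is safe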

Comment 11 Mohammed Rafi KC 2017-09-21 07:03:52 UTC
Upstream patch : https://review.gluster.org/18047

Comment 12 Simon Reber 2017-10-06 07:32:07 UTC
*** Bug 1490699 has been marked as a duplicate of this bug. ***

Comment 13 Sunny Kumar 2017-10-23 10:35:50 UTC
Upstream patch : https://review.gluster.org/#/c/18047/

Comment 16 Vinayak Papnoi 2018-03-01 09:58:52 UTC
Build : glusterfs-3.12.2-4.el7rhgs.x86_64

Newly created snapshots are not mounted unless they are activated.
Deletion of snapshots succeeds even while /var/run/gluster/snaps is being accessed during the deletion.
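
A quick way to check this on a node (a sketch; the snapshot name snap1 and volume vol0 are illustrative, and the exact status output may differ slightly):

gluster snapshot create snap1 vol0 no-timestamp
mount | grep /run/gluster/snaps     # expect no mount for the new, deactivated snapshot
gluster snapshot status snap1       # Volume Group field shows "N/A (Deactivated Snapshot)"

cd /var/run/gluster/snaps           # keep the directory busy
gluster snapshot delete snap1       # delete still succeeds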

Hence, moving bug to verified.

Comment 17 Prashant Dhange 2018-08-27 03:40:33 UTC
*** Bug 1616151 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2018-09-04 06:32:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607