Bug 1578153 - GlusterFS snapshots cannot be deleted
Summary: GlusterFS snapshots cannot be deleted
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Raghavendra Bhat
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-14 22:45 UTC by matts
Modified: 2020-03-12 13:03 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 13:03:53 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
This is the full gzipped glusterd.log file (4.25 MB, application/x-gzip)
2018-05-23 18:00 UTC, matts

Description matts 2018-05-14 22:45:32 UTC
Description of problem:

from node1:

sudo gluster --mode=script snapshot delete backup
snapshot delete: failed: Pre Validation failed on node2. Please check log file for details.
Snapshot command failed

[2018-03-07 19:18:23.578278] E [MSGID: 106057] [glusterd-snapshot.c:5996:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-03-07 19:18:23.578505] W [MSGID: 106044] [glusterd-snapshot.c:8785:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-03-07 19:18:23.578685] W [MSGID: 106122] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-03-07 19:18:23.578750] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot

On node1:

sudo gluster snapshot info
output:

Snapshot                  : backup
Snap UUID                 : 3c018235-814b-4ffb-9213-2ed85a19a87e
Created                   : 2018-02-19 20:00:01
Snap Volumes:
        Snap Volume Name          : 4f34660a81e74d9b9bb37a253b79b7b3
        Origin Volume name        : storage
        Snaps taken for storage      : 1
        Snaps available for storage  : 255
        Status                    : Stopped


Version-Release number of selected component (if applicable):
glusterfs 3.10.12

On node2:

sudo gluster snapshot info
output:

No snapshots present

How reproducible:
Easily

Steps to Reproduce:
1. Create a snapshot
2. Delete the snapshot (a minimal command sketch follows below)
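
A minimal command sketch of those two steps, assuming the volume is named "storage" and the snapshot is named "backup" as in the snapshot info output above:

# create a snapshot of the "storage" volume named "backup"
# (depending on the release, a timestamp suffix may be appended to the
#  snapshot name unless the no-timestamp option is used)
sudo gluster --mode=script snapshot create backup storage

# attempt to delete it; on the affected cluster this fails with
# "Pre Validation failed on node2"
sudo gluster --mode=script snapshot delete backup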

Actual results:
snapshot delete: failed: Pre Validation failed on node2. Please check log file for details.

Expected results:
Snapshot deleted

Additional info:

Comment 2 Sunny Kumar 2018-05-16 09:22:56 UTC
Hi,

I tried to reproduce this bug but was not able to see this behavior.
Can you please share an sos report, the glusterd log, or the output of the gluster peer status command?

-Sunny

Comment 3 matts 2018-05-16 17:09:08 UTC
Hi Sunny,

Thanks for your response

From node #1

root@textmining-infrastructure-storage-server-1:~# gluster peer status 
Number of Peers: 1

Hostname: gluster2
Uuid: f560a214-edd6-4242-b408-8a629b0f70e1
State: Peer in Cluster (Connected)

-------------------------------------------

From node #2

root@textmining-infrastructure-storage-server-2:~# gluster peer status 
Number of Peers: 1

Hostname: textmining-infrastructure-storage-server-1
Uuid: 767c7364-e3a9-40d2-a2e2-71ae16f122c8
State: Peer in Cluster (Connected)
Other names:
textmining-infrastructure-storage-server-1.c.textmining-144321.internal


-----------------------------------


Glusterd log output 

The message "I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage" repeated 7 times between [2018-05-16 17:03:19.748439] and [2018-05-16 17:05:04.438398]
[2018-05-16 17:05:19.241140] I [MSGID: 106488] [glusterd-handler.c:1538:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-05-16 17:05:19.329356] I [MSGID: 106487] [glusterd-handler.c:1475:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-05-16 17:05:19.484001] I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-05-16 17:06:23.061268] E [MSGID: 106057] [glusterd-snapshot.c:5996:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-05-16 17:06:23.061310] W [MSGID: 106044] [glusterd-snapshot.c:8785:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-05-16 17:06:23.061321] W [MSGID: 106122] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-05-16 17:06:23.061331] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
The message "I [MSGID: 106488] [glusterd-handler.c:1538:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 15 times between [2018-05-16 17:05:19.241140] and [2018-05-16 17:07:04.226327]
The message "I [MSGID: 106487] [glusterd-handler.c:1475:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req" repeated 7 times between [2018-05-16 17:05:19.329356] and [2018-05-16 17:07:04.316285]
The message "I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage" repeated 7 times between [2018-05-16 17:05:19.484001] and [2018-05-16 17:07:04.442139]

Comment 5 matts 2018-05-21 18:34:37 UTC
Hi Sunny,

Have you been able to find the reason for this issue, based on the information recently provided?

Comment 6 Sunny Kumar 2018-05-23 04:34:47 UTC
Hi Matt,

The log you shared is not sufficient; please share the complete glusterd log.

-Sunny

Comment 7 matts 2018-05-23 18:00:36 UTC
Created attachment 1440716 [details]
This is the full gzipped glusterd.log file

The full glusterd log is attached, as requested.

Comment 8 matts 2018-05-30 23:00:41 UTC
Hi Sunny, Has the extended log provided you with any more insight into the problem?

Comment 9 matts 2018-07-24 21:13:47 UTC
Hi,

There has not been an update on this issue for over a month.  Can you please confirm that you are still investigating?

Many thanks,
Matt

Comment 10 Sunny Kumar 2018-07-25 11:37:25 UTC
Hi Matt,

Apologies for the late response; I was busy with other work. I have now had some time to go through the attached log. I observed that you tried to replace a brick, made a lot of unsuccessful snapshot creation attempts, and tried to attach bricks that were already part of the volume. Can you please describe, step by step, exactly what you did so that your setup ended up in this state? This is necessary because I am not able to reproduce this behavior.

- Sunny

Comment 11 Alexander 2018-07-26 10:54:58 UTC
Hi Sunny,

There are no exact steps to reproduce; the problem appears randomly. It may work for five days without failing, or it may fail twice in one day.

So instead I can describe our setup:
1. Two GlusterFS 4.1 server nodes (we tried 3.10 and 3.12 as well, with the same result). We also tried three nodes, and that only seemed to increase the failure rate.
2. Four GlusterFS 4.1 client nodes.
3. A lot of small files, about 10-40 KB each, totalling about 1 TB of data.
4. Ubuntu 16.04 as the host OS. We recently switched to CentOS 7, but we don't have diagnostic information for it yet, because we are recovering from backup and that takes about 5 days.
5. A cron job that makes snapshots on the first GlusterFS server instance every 2 hours. It first creates a GlusterFS snapshot, then takes a Google Cloud disk snapshot of the instance, then removes the GlusterFS snapshot. We use this procedure for full backups (a rough sketch follows below).

As I said before, it fails after some unpredictable number of iterations, but we have never had it run for more than a week without failing.

We also used glusterfs_exporter for Prometheus metrics, and running it greatly increases the failure rate. With it in place, the job is usually only able to complete 2-3 iterations before failing.
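
For clarity, a rough sketch of what that backup job in point 5 does; the actual script was not posted, so the volume name ("storage"), snapshot name ("backup"), and the Google Cloud disk and zone names used here are placeholder assumptions:

#!/bin/bash
# Hypothetical sketch of the 2-hourly backup job described in point 5.
# Volume, snapshot, disk and zone names are placeholders.
set -e

# 1. create a GlusterFS snapshot of the volume
gluster --mode=script snapshot create backup storage no-timestamp

# 2. take a Google Cloud disk snapshot of the instance's data disk
gcloud compute disks snapshot gluster-data-disk --zone=us-central1-a

# 3. remove the GlusterFS snapshot again
gluster --mode=script snapshot delete backup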

So I can suggest the following steps to reproduce:
1. Create a volume with a large number of small files, about 1 TB in total.
2. Set up a two-server GlusterFS configuration on Ubuntu 16.04.
3. Create a cron job for a backup script that works as described above (take a snapshot, pause 5 seconds, drop the snapshot). You can decrease the interval from 2 hours to 10 minutes; I believe that will increase the failure rate.
4. Run the GlusterFS exporter and poll it every 0.5 seconds with a script (a sketch of steps 3 and 4 follows below).

I think that should be enough to trigger the failure fairly quickly.
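
A possible sketch of steps 3 and 4 as a quick stress loop; the snapshot/volume names, the 10-minute interval, and the exporter port are assumptions:

#!/bin/bash
# Hypothetical stress loop for steps 3 and 4 above; names, interval and port are assumptions.

# poll the Prometheus exporter every 0.5 s in the background
while true; do
    curl -s http://localhost:9189/metrics > /dev/null
    sleep 0.5
done &

# repeatedly take a snapshot, wait 5 seconds, then drop it
while true; do
    gluster --mode=script snapshot create backup storage no-timestamp
    sleep 5
    gluster --mode=script snapshot delete backup
    sleep 600   # shortened 10-minute interval, as suggested in step 3
done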

Comment 12 matts 2018-07-31 23:44:52 UTC
Hi Sunny,

Has the info that Alexander posted provided any assistance in replicating our reported issue?

Matt

Comment 13 matts 2018-08-08 00:33:19 UTC
Hi Sunny,

Have you been able to reproduce our reported issue?

Regards
Matt

Comment 15 matts 2018-08-21 17:40:06 UTC
Hi Sunny,

Do you have any updates?

Regards
Matt

Comment 16 Atin Mukherjee 2018-11-09 04:50:35 UTC
This is an upstream bug and the product has been incorrectly chosen. As the 3.10 version is EOLed, I'm moving this to mainline.

Comment 17 Worker Ant 2020-03-12 13:03:53 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/993 and will be tracked there from now on. Visit the GitHub issue URL for further details.

