Description of problem:

From node1:

sudo gluster --mode=script snapshot delete backup
snapshot delete: failed: Pre Validation failed on node2. Please check log file for details.
Snapshot command failed

[2018-03-07 19:18:23.578278] E [MSGID: 106057] [glusterd-snapshot.c:5996:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-03-07 19:18:23.578505] W [MSGID: 106044] [glusterd-snapshot.c:8785:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-03-07 19:18:23.578685] W [MSGID: 106122] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-03-07 19:18:23.578750] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot

On node1, sudo gluster snapshot info output:

Snapshot                  : backup
Snap UUID                 : 3c018235-814b-4ffb-9213-2ed85a19a87e
Created                   : 2018-02-19 20:00:01
Snap Volumes:
        Snap Volume Name          : 4f34660a81e74d9b9bb37a253b79b7b3
        Origin Volume name        : storage
        Snaps taken for storage   : 1
        Snaps available for storage : 255
        Status                    : Stopped

Version-Release number of selected component (if applicable):
glusterfs 3.10.12

On node2, sudo gluster snapshot info output:

No snapshots present

How reproducible:
Easily

Steps to Reproduce:
1. Create snapshot
2. Delete snapshot

Actual results:
snapshot delete: failed: Pre Validation failed on node2. Please check log file for details.

Expected results:
Snapshot deleted

Additional info:
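The two reproduction steps above can be sketched as a minimal shell reproducer. The snapshot and volume names (backup, storage) are taken from the report above; the no-timestamp option is an assumption, used here only to keep the snapshot named exactly "backup":

```shell
#!/usr/bin/env bash
# Minimal reproducer sketch, wrapped in a function so it can be run
# against a live two-node cluster from node1. The no-timestamp option
# is an assumption to keep the snapshot name stable.
reproduce() {
    sudo gluster --mode=script snapshot create backup storage no-timestamp
    sudo gluster --mode=script snapshot delete backup
}
```

On an affected setup, the delete step intermittently fails with "Pre Validation failed on node2" even though node1 still lists the snapshot.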
Hi,

I tried to reproduce this bug but was not able to see this behavior. Can you please share an sos report, the glusterd log, or the output of the gluster peer status command?

-Sunny
Hi Sunny,

Thanks for your response.

From node #1:

root@textmining-infrastructure-storage-server-1:~# gluster peer status
Number of Peers: 1

Hostname: gluster2
Uuid: f560a214-edd6-4242-b408-8a629b0f70e1
State: Peer in Cluster (Connected)

From node #2:

root@textmining-infrastructure-storage-server-2:~# gluster peer status
Number of Peers: 1

Hostname: textmining-infrastructure-storage-server-1
Uuid: 767c7364-e3a9-40d2-a2e2-71ae16f122c8
State: Peer in Cluster (Connected)
Other names:
textmining-infrastructure-storage-server-1.c.textmining-144321.internal

Glusterd log output:

The message "I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage" repeated 7 times between [2018-05-16 17:03:19.748439] and [2018-05-16 17:05:04.438398]
[2018-05-16 17:05:19.241140] I [MSGID: 106488] [glusterd-handler.c:1538:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-05-16 17:05:19.329356] I [MSGID: 106487] [glusterd-handler.c:1475:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-05-16 17:05:19.484001] I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-05-16 17:06:23.061268] E [MSGID: 106057] [glusterd-snapshot.c:5996:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-05-16 17:06:23.061310] W [MSGID: 106044] [glusterd-snapshot.c:8785:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-05-16 17:06:23.061321] W [MSGID: 106122] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-05-16 17:06:23.061331] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
The message "I [MSGID: 106488] [glusterd-handler.c:1538:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 15 times between [2018-05-16 17:05:19.241140] and [2018-05-16 17:07:04.226327]
The message "I [MSGID: 106487] [glusterd-handler.c:1475:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req" repeated 7 times between [2018-05-16 17:05:19.329356] and [2018-05-16 17:07:04.316285]
The message "I [MSGID: 106499] [glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage" repeated 7 times between [2018-05-16 17:05:19.484001] and [2018-05-16 17:07:04.442139]
Hi Sunny, Have you been able to find the reason for this issue, based on the information recently provided?
Hi Matt,

The log you shared is not sufficient; please share the complete glusterd log.

-Sunny
Created attachment 1440716 [details]
This is the full gzipped glusterd.log file

The full glusterd log is attached, as requested.
Hi Sunny, Has the extended log provided you with any more insight into the problem?
Hi, There has not been an update on this issue for over a month. Can you please confirm that you are still investigating? Many thanks, Matt
Hi Matt,

Apologies for the late response; I was busy with other work. I have now found some time and went through the attached log. I observed that you tried to replace a brick, made many unsuccessful snapshot creation attempts, and tried to attach bricks that were already part of the volume. Can you please describe step by step exactly what you did, so I can understand how your setup ended up in this state? This is necessary because I am not able to reproduce this behavior.

- Sunny
Hi Sunny,

There are no exact steps to reproduce; the problem appears randomly. It may work for 5 days without failing; other times it may fail twice in a day. Instead, I can describe our setup:

1. Two GlusterFS 4.1 server nodes (we tried 3.10 and 3.12 as well, with the same result). We also tried 3 nodes, which only seemed to increase the failure rate.
2. Four GlusterFS 4.1 client nodes.
3. A lot of small files, about 10-40 KB each, totalling about 1 TB of data.
4. Ubuntu 16.04 as the host OS. We recently switched to CentOS 7 but don't have diagnostic information for it yet, because we are recovering from backup, which takes about 5 days.
5. A cron job that makes snapshots on the 1st GlusterFS server instance every 2 hours. It first creates a GlusterFS snapshot, then makes a Google Cloud disk snapshot of the instance, then removes the GlusterFS snapshot. We use this algorithm for full backups.

As I said before, after some unpredictable number of iterations it fails; we never had it working for more than a week. We also used glusterfs_exporter for Prometheus metrics, and using it greatly increases the failure rate: with it, the job is only able to complete 2-3 iterations before failing.

So I can suggest the following steps to reproduce:

1. Create a volume with a bunch of small files, about 1 TB in total.
2. Set up a two-server GlusterFS configuration on Ubuntu 16.04.
3. Create a cron job running a backup script that works as described above (take snapshot, 5-second pause, drop snapshot). You can decrease the interval from 2 hours to 10 minutes; I believe that will increase the failure rate.
4. Run the glusterfs exporter and pull information from it every 0.5 seconds with a script.

I think that should be enough to hit the failure quickly.
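The backup cycle described in point 5 above can be sketched as a shell function. The volume and snapshot names are taken from the report; the disk name, zone, and the no-timestamp option are hypothetical placeholders, not values confirmed by the report:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 2-hourly backup cron job described above.
# DISK_NAME and ZONE are placeholder assumptions, not values from the
# report; adjust them for a real Google Cloud setup.
set -euo pipefail

VOLUME="storage"          # origin volume, per the report
SNAP_NAME="backup"        # snapshot name, per the report
DISK_NAME="gluster-data"  # hypothetical GCE disk name
ZONE="us-central1-a"      # hypothetical GCE zone

backup_cycle() {
    # 1. Take a GlusterFS snapshot (no-timestamp keeps the name stable)
    gluster --mode=script snapshot create "$SNAP_NAME" "$VOLUME" no-timestamp
    sleep 5
    # 2. Snapshot the underlying Google Cloud disk for the full backup
    gcloud compute disks snapshot "$DISK_NAME" --zone "$ZONE"
    # 3. Remove the GlusterFS snapshot again
    gluster --mode=script snapshot delete "$SNAP_NAME"
}
```

Per the description above, it is step 3 of this cycle that intermittently fails prevalidation on the second node.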
Hi Sunny, Has the info that Alexander posted provided any assistance in replicating our reported issue? Matt
Hi Sunny, Have you been able to reproduce our reported issue? Regards Matt
Hi Sunny, Do you have any updates? Regards Matt
This is an upstream bug and the product was incorrectly chosen. As version 3.10 is EOL, I'm moving this to mainline.
This bug has been moved to https://github.com/gluster/glusterfs/issues/993 and will be tracked there from now on. Visit the GitHub issue URL for further details.