Description of problem:
We have a GlusterFS cluster with two server nodes (replica 2) and a cron job that takes snapshots every 2 hours (a snapshot takes about 30 seconds to complete, so runs never overlap). Cron runs on the first node (gluster1). It may work a few times without problems, but then it randomly stops working with the following error:

snapshot create: failed: Commit failed on gluster2. Please check log file for details.
Snapshot command failed

When I check the logs on the second node (you may notice a failed remove as well, because the script tries to clean up the "backup" snapshot in case of error):

[2018-06-25 13:09:34.180240] I [MSGID: 106488] [glusterd-handler.c:1549:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-06-25 13:09:34.228196] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-06-25 13:09:34.278582] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.875267] E [MSGID: 106029] [glusterd-snapshot.c:4750:glusterd_take_lvm_snapshot] 0-management: taking snapshot of the brick (/data) of device /dev/mapper/vg.glusterfs-data failed
[2018-06-25 13:09:50.875382] E [MSGID: 106029] [glusterd-snapshot.c:5179:glusterd_take_brick_snapshot] 0-management: Failed to take snapshot of brick gluster2:/data
[2018-06-25 13:09:50.875404] E [MSGID: 106029] [glusterd-snapshot.c:6542:glusterd_take_brick_snapshot_task] 0-management: Failed to take backend snapshot for brick gluster2:/run/gluster/snaps/0ccbbaedca7b455ebcf410fbd0ada884/brick1 volume(0ccbbaedca7b455ebcf410fbd0ada884)
[2018-06-25 13:09:50.875605] E [MSGID: 106029] [glusterd-snapshot.c:6686:glusterd_schedule_brick_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.875630] E [MSGID: 106029] [glusterd-snapshot.c:7002:glusterd_snapshot_create_commit] 0-management: Failed to take backend snapshot backup
[2018-06-25 13:09:50.877362] E [MSGID: 106029] [glusterd-snapshot.c:8449:glusterd_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.877395] W [MSGID: 106122] [glusterd-mgmt.c:299:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2018-06-25 13:09:50.877412] E [MSGID: 106122] [glusterd-mgmt-handler.c:603:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2018-06-25 13:09:50.882749] W [glusterd-locks.c:622:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd724c) [0x7f8be9f9524c] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd6f9d) [0x7f8be9f94f9d] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xdd4ad) [0x7f8be9f9b4ad] ) 0-management: Lock for storage held by 8997f40a-c987-406d-8cca-cf2d340993f9
[2018-06-25 13:09:50.882124] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.882783] E [MSGID: 106118] [glusterd-syncop.c:1919:gd_sync_task_begin] 0-management: Unable to acquire lock for storage
[2018-06-25 13:09:51.000769] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:09:51.000916] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-06-25 13:09:51.000933] W [MSGID: 106121] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-06-25 13:09:51.000954] E [MSGID: 106121] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2018-06-25 13:10:01.952888] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:10:01.952924] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed

When I run the backup script the next time:

snapshot create: failed: Snapshot backup already exists
Snapshot command failed
ERROR: Failed to create GlusterFS snapshot

But trying to remove the snapshot fails as well, because the servers get into an inconsistent state: the first node says the snapshot already exists, while the second node says it doesn't have the snapshot. So you can neither create nor delete it. If you restart glusterd on the first node, the snapshot "disappears" and you can try to take a new snapshot, but it fails with the same error as the initial one.

The only way I found to resolve this issue (for the next few backups, if I was lucky):
- restart GlusterFS on the first node;
- stop GlusterFS on the second node;
- killall gluster* on the second node;
- start GlusterFS on the second node.

Version-Release number of selected component (if applicable):
GlusterFS 4.1 (also tried 3.10 and 3.12 with the same result).

How reproducible:
I have an Ansible playbook and a Terraform file which I can adapt if you need them, but in general it is a usual setup with two server nodes and two bricks:

$ sudo gluster volume info storage

Volume Name: storage
Type: Replicate
Volume ID: 2d11f59f-d9cb-4cd3-a7c5-4556075bed88
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2:/data
Brick2: gluster1:/data
Options Reconfigured:
features.barrier: disable
cluster.granular-entry-heal: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Steps to Reproduce:
1.
2.
3.

Actual results:
Snapshot fails and the cluster ends up in an inconsistent state.

Expected results:
Snapshot is taken successfully.

Additional info:
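For reference, the cron job described above can be sketched roughly as follows. This is an assumed reconstruction, not the reporter's actual script: the `snapshot_backup` helper name is hypothetical, and the snapshot name `backup`, the volume name `storage`, and the `no-timestamp` flag are inferred from the error messages and logs.

```shell
#!/bin/sh
# Hypothetical sketch of the 2-hourly snapshot cron job described in this
# report. Assumes the gluster CLI is in PATH and the bricks sit on
# thinly provisioned LVM (a prerequisite for gluster snapshots).

snapshot_backup() {
    vol=$1
    snap=$2
    if ! gluster snapshot create "$snap" "$vol" no-timestamp; then
        echo "ERROR: Failed to create GlusterFS snapshot" >&2
        # Best-effort clean-up of a half-created snapshot; on the node that
        # never committed it, this produces the "Snapshot (backup) does not
        # exist" prevalidation errors seen in the log above.
        gluster snapshot delete "$snap" >/dev/null 2>&1 || true
        return 1
    fi
    return 0
}

# Example invocation (crontab entry would run this every 2 hours):
# snapshot_backup storage backup
```

The clean-up branch explains why a failed create is followed by failed remove attempts in the second node's log: the snapshot was committed on one node but not the other, so every subsequent create ("already exists") and delete ("does not exist") hits the inconsistency.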
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=1594819 ?
For some reason it was created twice. Yes, it is a dup. Please close this.
*** This bug has been marked as a duplicate of bug 1594819 ***