Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1594818

Summary: Failed to take snapshot (occasionally)
Product: [Community] GlusterFS
Reporter: Alexander <alexander.i.bondarev>
Component: snapshot
Assignee: bugs <bugs>
Status: CLOSED DUPLICATE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.12
CC: alexander.i.bondarev, amukherj, bugs, rhs-bugs, sankarshan, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-02 08:01:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alexander 2018-06-25 13:35:25 UTC
Description of problem:

We have a GlusterFS setup with two server nodes (replica 2) and a cron job that takes a snapshot every 2 hours (a snapshot takes about 30 seconds to complete, so there is no overlap). Cron runs on the first node (gluster1). It may work a few times without problems, but then it randomly stops working with the following error:

snapshot create: failed: Commit failed on gluster2. Please check log file for details.
Snapshot command failed
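
For reference, the cron job boils down to something like the sketch below (schedule, script path and cleanup logic are illustrative, not the exact script; the snapshot name "backup" and volume name "storage" are the ones that appear in the logs):

# /etc/cron.d entry on gluster1 (illustrative)
0 */2 * * * root /usr/local/bin/gluster-backup.sh

# gluster-backup.sh (sketch)
#!/bin/sh
if ! gluster snapshot create backup storage no-timestamp; then
    echo "ERROR: Failed to create GlusterFS snapshot" >&2
    # clean up a possibly half-created snapshot; --mode=script suppresses the confirmation prompt
    gluster --mode=script snapshot delete backup
    exit 1
fi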

When I check the logs on the second node I see the following (you may also notice a failed remove, because the script tries to clean up the "backup" snapshot in case of an error):

[2018-06-25 13:09:34.180240] I [MSGID: 106488] [glusterd-handler.c:1549:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-06-25 13:09:34.228196] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-06-25 13:09:34.278582] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.875267] E [MSGID: 106029] [glusterd-snapshot.c:4750:glusterd_take_lvm_snapshot] 0-management: taking snapshot of the brick (/data) of device /dev/mapper/vg.glusterfs-data failed
[2018-06-25 13:09:50.875382] E [MSGID: 106029] [glusterd-snapshot.c:5179:glusterd_take_brick_snapshot] 0-management: Failed to take snapshot of brick gluster2:/data
[2018-06-25 13:09:50.875404] E [MSGID: 106029] [glusterd-snapshot.c:6542:glusterd_take_brick_snapshot_task] 0-management: Failed to take backend snapshot for brick gluster2:/run/gluster/snaps/0ccbbaedca7b455ebcf410fbd0ada884/brick1 volume(0ccbbaedca7b455ebcf410fbd0ada884)
[2018-06-25 13:09:50.875605] E [MSGID: 106029] [glusterd-snapshot.c:6686:glusterd_schedule_brick_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.875630] E [MSGID: 106029] [glusterd-snapshot.c:7002:glusterd_snapshot_create_commit] 0-management: Failed to take backend snapshot backup
[2018-06-25 13:09:50.877362] E [MSGID: 106029] [glusterd-snapshot.c:8449:glusterd_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.877395] W [MSGID: 106122] [glusterd-mgmt.c:299:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2018-06-25 13:09:50.877412] E [MSGID: 106122] [glusterd-mgmt-handler.c:603:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2018-06-25 13:09:50.882749] W [glusterd-locks.c:622:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd724c) [0x7f8be9f9524c] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd6f9d) [0x7f8be9f94f9d] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xdd4ad) [0x7f8be9f9b4ad] ) 0-management: Lock for storage held by 8997f40a-c987-406d-8cca-cf2d340993f9
[2018-06-25 13:09:50.882124] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.882783] E [MSGID: 106118] [glusterd-syncop.c:1919:gd_sync_task_begin] 0-management: Unable to acquire lock for storage
[2018-06-25 13:09:51.000769] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:09:51.000916] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-06-25 13:09:51.000933] W [MSGID: 106121] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-06-25 13:09:51.000954] E [MSGID: 106121] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2018-06-25 13:10:01.952888] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:10:01.952924] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed

When I run the backup script the next time:
snapshot create: failed: Snapshot backup already exists
Snapshot command failed
ERROR: Failed to create GlusterFS snapshot

But trying to remove the snapshot also fails, because the servers end up in an inconsistent state: the first node says the snapshot already exists, while the second node says it does not have the snapshot. So the snapshot can be neither created nor deleted.
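
For reference, these are the standard commands one would use to inspect the state and remove the snapshot; in the inconsistent state described above the delete is rejected by the second node (see the "Snapshot (backup) does not exist" prevalidation errors in the log), while the create is rejected by the first:

$ sudo gluster snapshot list storage
$ sudo gluster snapshot info backup
$ sudo gluster snapshot delete backup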

If you restart glusterd on the first node, the snapshot "disappears" so you can try to take a new snapshot, but it fails with the same error as before. The only way I found to resolve this issue (at least for the next few backups, if I was lucky) is the following (commands sketched after the list):
- restart GlusterFS on first node;
- stop GlusterFS on second node;
- killall gluster* on second node;
- start GlusterFS on second node.
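
In command form this is roughly the following (assuming the glusterd systemd unit; on some distributions the service is named glusterfs-server, and the exact gluster* processes left to kill may vary):

# on gluster1
$ sudo systemctl restart glusterd

# on gluster2
$ sudo systemctl stop glusterd
$ sudo killall glusterfs glusterfsd glusterd   # brick and auxiliary gluster* processes keep running after the service stops
$ sudo systemctl start glusterd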

Version-Release number of selected component (if applicable):
GlusterFS 4.1 (also tried 3.10 and 3.12 with the same result).

How reproducible:
I have an Ansible playbook and a Terraform file which I can adapt if you need them, but in general it is a standard setup with two server nodes and two bricks:

$ sudo gluster volume info storage

Volume Name: storage
Type: Replicate
Volume ID: 2d11f59f-d9cb-4cd3-a7c5-4556075bed88
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2:/data
Brick2: gluster1:/data
Options Reconfigured:
features.barrier: disable
cluster.granular-entry-heal: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
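
For completeness, a manual equivalent of the provisioning would look roughly like the sketch below (pool/volume sizes, the filesystem and the force flag are illustrative; the volume group and logical volume names follow the /dev/mapper/vg.glusterfs-data device from the log, and gluster snapshots require the bricks to sit on thinly provisioned LVM):

# on both nodes: thin-provisioned LVM brick
$ sudo lvcreate --size 20G --thinpool pool vg.glusterfs
$ sudo lvcreate --virtualsize 15G --thin vg.glusterfs/pool --name data
$ sudo mkfs.xfs /dev/mapper/vg.glusterfs-data
$ sudo mount /dev/mapper/vg.glusterfs-data /data

# on gluster1
$ sudo gluster peer probe gluster2
$ sudo gluster volume create storage replica 2 gluster2:/data gluster1:/data force
$ sudo gluster volume start storage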

Steps to Reproduce:
1. Set up a two-node replica 2 volume with LVM-backed bricks (see volume info above).
2. Run a cron job on the first node that takes a snapshot of the volume every 2 hours.
3. After a few successful runs, snapshot creation randomly fails with "Commit failed on gluster2".

Actual results:
Snapshot creation fails and the cluster ends up in an inconsistent state

Expected results:
Snapshot is taken successfully

Additional info:

Comment 2 Yaniv Kaul 2018-07-02 07:49:15 UTC
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=1594819 ?

Comment 3 Alexander 2018-07-02 07:55:09 UTC
For some reason it was created twice. Yes, it is a dup. Please close this.

Comment 4 Yaniv Kaul 2018-07-02 08:01:27 UTC

*** This bug has been marked as a duplicate of bug 1594819 ***