Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1594818

Summary: Failed to take snapshot (occasionally)
Product: [Community] GlusterFS
Reporter: Alexander <alexander.i.bondarev>
Component: snapshot
Assignee: bugs <bugs>
Status: CLOSED DUPLICATE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.12
CC: alexander.i.bondarev, amukherj, bugs, rhs-bugs, sankarshan, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-02 08:01:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alexander 2018-06-25 13:35:25 UTC
Description of problem:

We have a GlusterFS setup with two server nodes (replica 2) and a cron job that takes a snapshot every 2 hours (a snapshot takes about 30 seconds to complete, so there is no overlap). Cron runs on the first node (gluster1). It may work a few times without problems, but then it randomly stops working with the following error:

snapshot create: failed: Commit failed on gluster2. Please check log file for details.
Snapshot command failed
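
For reference, the cron job boils down to something like the sketch below (schedule, script path and cleanup logic are illustrative, not the exact script; the snapshot name "backup" and volume name "storage" are the ones that appear in the logs):

# /etc/cron.d entry on gluster1 (illustrative)
0 */2 * * * root /usr/local/bin/gluster-backup.sh

# gluster-backup.sh (sketch)
#!/bin/sh
if ! gluster snapshot create backup storage no-timestamp; then
    echo "ERROR: Failed to create GlusterFS snapshot" >&2
    # clean up a possibly half-created snapshot; --mode=script suppresses the confirmation prompt
    gluster --mode=script snapshot delete backup
    exit 1
fi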

When I check the logs on the second node I see the following (you may also notice a failed remove, because the script tries to clean up the "backup" snapshot in case of an error):

[2018-06-25 13:09:34.180240] I [MSGID: 106488] [glusterd-handler.c:1549:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-06-25 13:09:34.228196] I [MSGID: 106487] [glusterd-handler.c:1486:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-06-25 13:09:34.278582] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.875267] E [MSGID: 106029] [glusterd-snapshot.c:4750:glusterd_take_lvm_snapshot] 0-management: taking snapshot of the brick (/data) of device /dev/mapper/vg.glusterfs-data failed
[2018-06-25 13:09:50.875382] E [MSGID: 106029] [glusterd-snapshot.c:5179:glusterd_take_brick_snapshot] 0-management: Failed to take snapshot of brick gluster2:/data
[2018-06-25 13:09:50.875404] E [MSGID: 106029] [glusterd-snapshot.c:6542:glusterd_take_brick_snapshot_task] 0-management: Failed to take backend snapshot for brick gluster2:/run/gluster/snaps/0ccbbaedca7b455ebcf410fbd0ada884/brick1 volume(0ccbbaedca7b455ebcf410fbd0ada884)
[2018-06-25 13:09:50.875605] E [MSGID: 106029] [glusterd-snapshot.c:6686:glusterd_schedule_brick_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.875630] E [MSGID: 106029] [glusterd-snapshot.c:7002:glusterd_snapshot_create_commit] 0-management: Failed to take backend snapshot backup
[2018-06-25 13:09:50.877362] E [MSGID: 106029] [glusterd-snapshot.c:8449:glusterd_snapshot] 0-management: Failed to create snapshot
[2018-06-25 13:09:50.877395] W [MSGID: 106122] [glusterd-mgmt.c:299:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2018-06-25 13:09:50.877412] E [MSGID: 106122] [glusterd-mgmt-handler.c:603:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2018-06-25 13:09:50.882749] W [glusterd-locks.c:622:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd724c) [0x7f8be9f9524c] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xd6f9d) [0x7f8be9f94f9d] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.0/xlator/mgmt/glusterd.so(+0xdd4ad) [0x7f8be9f9b4ad] ) 0-management: Lock for storage held by 8997f40a-c987-406d-8cca-cf2d340993f9
[2018-06-25 13:09:50.882124] I [MSGID: 106499] [glusterd-handler.c:4314:__glusterd_handle_status_volume] 0-management: Received status volume req for volume storage
[2018-06-25 13:09:50.882783] E [MSGID: 106118] [glusterd-syncop.c:1919:gd_sync_task_begin] 0-management: Unable to acquire lock for storage
[2018-06-25 13:09:51.000769] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:09:51.000916] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2018-06-25 13:09:51.000933] W [MSGID: 106121] [glusterd-mgmt.c:156:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2018-06-25 13:09:51.000954] E [MSGID: 106121] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2018-06-25 13:10:01.952888] E [MSGID: 106056] [glusterd-snapshot.c:5947:glusterd_snapshot_remove_prevalidate] 0-management: Snapshot (backup) does not exist [Invalid argument]
[2018-06-25 13:10:01.952924] W [MSGID: 106043] [glusterd-snapshot.c:8776:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed

When I run the backup script the next time:
snapshot create: failed: Snapshot backup already exists
Snapshot command failed
ERROR: Failed to create GlusterFS snapshot

But trying to remove the snapshot also fails, because the servers end up in an inconsistent state: the first node says the snapshot already exists, while the second node says it does not have the snapshot. So the snapshot can be neither created nor deleted.
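
For reference, these are the standard commands one would use to inspect the state and remove the snapshot; in the inconsistent state described above the delete is rejected by the second node (see the "Snapshot (backup) does not exist" prevalidation errors in the log), while the create is rejected by the first:

$ sudo gluster snapshot list storage
$ sudo gluster snapshot info backup
$ sudo gluster snapshot delete backup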

If you restart glusterd on the first node, the snapshot "disappears" so you can try to take a new snapshot, but it fails with the same error as before. The only way I found to resolve this issue (at least for the next few backups, if I was lucky) is the following (commands sketched after the list):
- restart GlusterFS on first node;
- stop GlusterFS on second node;
- killall gluster* on second node;
- start GlusterFS on second node.
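
In command form this is roughly the following (assuming the glusterd systemd unit; on some distributions the service is named glusterfs-server, and the exact gluster* processes left to kill may vary):

# on gluster1
$ sudo systemctl restart glusterd

# on gluster2
$ sudo systemctl stop glusterd
$ sudo killall glusterfs glusterfsd glusterd   # brick and auxiliary gluster* processes keep running after the service stops
$ sudo systemctl start glusterd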

Version-Release number of selected component (if applicable):
GlusterFS 4.1 (also tried 3.10 and 3.12 with the same result).

How reproducible:
I have an Ansible playbook and a Terraform file which I can adapt if you need them, but in general it is a standard setup with two server nodes and two bricks:

$ sudo gluster volume info storage

Volume Name: storage
Type: Replicate
Volume ID: 2d11f59f-d9cb-4cd3-a7c5-4556075bed88
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster2:/data
Brick2: gluster1:/data
Options Reconfigured:
features.barrier: disable
cluster.granular-entry-heal: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
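
For completeness, a manual equivalent of the provisioning would look roughly like the sketch below (pool/volume sizes, the filesystem and the force flag are illustrative; the volume group and logical volume names follow the /dev/mapper/vg.glusterfs-data device from the log, and gluster snapshots require the bricks to sit on thinly provisioned LVM):

# on both nodes: thin-provisioned LVM brick
$ sudo lvcreate --size 20G --thinpool pool vg.glusterfs
$ sudo lvcreate --virtualsize 15G --thin vg.glusterfs/pool --name data
$ sudo mkfs.xfs /dev/mapper/vg.glusterfs-data
$ sudo mount /dev/mapper/vg.glusterfs-data /data

# on gluster1
$ sudo gluster peer probe gluster2
$ sudo gluster volume create storage replica 2 gluster2:/data gluster1:/data force
$ sudo gluster volume start storage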

Steps to Reproduce:
1. Set up a two-node replica 2 volume with LVM-backed bricks (see volume info above).
2. Run a cron job on the first node that takes a snapshot of the volume every 2 hours.
3. After a few successful runs, snapshot creation randomly fails with "Commit failed on gluster2".

Actual results:
Snapshot creation fails and the cluster ends up in an inconsistent state

Expected results:
Snapshot is taken successfully

Additional info:

Comment 2 Yaniv Kaul 2018-07-02 07:49:15 UTC
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=1594819 ?

Comment 3 Alexander 2018-07-02 07:55:09 UTC
For some reason it was created twice. Yes, it is a dup. Please close this.

Comment 4 Yaniv Kaul 2018-07-02 08:01:27 UTC

*** This bug has been marked as a duplicate of bug 1594819 ***