Bug 1115748 - Bricks are out of sync after recovery even if heal says everything is fine
Summary: Bricks are out of sync after recovery even if heal says everything is fine
Keywords:
Status: CLOSED DUPLICATE of bug 1113894
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-07-03 03:36 UTC by Miloš Kozák
Modified: 2014-07-22 05:14 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-07-22 05:14:56 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Logs (1.11 MB, application/gzip), 2014-07-03 10:54 UTC, Miloš Kozák
another test (3.17 MB, application/gzip), 2014-07-04 02:34 UTC, Miloš Kozák

Description Miloš Kozák 2014-07-03 03:36:08 UTC
Description of problem:
After the connection recovers, a file that I was writing to during the connection outage does not get synced to the other server.

Version-Release number of selected component (if applicable):
3.5.1

How reproducible:
I used two virtual servers connected via a bridge, which makes it easy to simulate a disconnection. I mount the volume on both servers, then create a file on each to check that the replica 2 setup works. Then I start writing to a file with dd from one server. In the middle of the write I remove one VM from the bridge, so Gluster loses the connection; the write carries on. I wait until dd finishes, and after a while I restore the connection. Gluster detects that healing is necessary and everything appears to be healed, but the data on the bricks does not match.

Steps to Reproduce:
1. Start writing to a file on the mounted volume
2. Disconnect the nodes
3. Reconnect and wait until gluster announces that everything is healed (a rough command sketch follows below)
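
Roughly, the commands involved look like this; the mount point, bridge and interface names are placeholders, not the exact ones from my setup:

mount -t glusterfs node1:/vg0 /mnt/vg0
dd if=/dev/zero of=/mnt/vg0/break bs=1M count=500
# while dd is running, detach the other VM's interface from the bridge, e.g.:
brctl delif br0 vnet1
# after dd finishes, re-attach it and watch the heal:
brctl addif br0 vnet1
gluster volume heal vg0 info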

Actual results:
Data on the bricks does not match. One brick contains only the data written up to the moment of disconnection.

Expected results:
Bricks are in sync

Additional info:
I am attaching logs. The test2 archive also contains strace output.

Comment 1 Ravishankar N 2014-07-03 04:15:18 UTC
Thanks for the bug report, Milos. I think the attachments were missed out; could you please upload them?

Also some questions if you don't mind:
1."gluster annonces that everything is healed"-->So 'gluster volume heal volname info' now shows zero entries (as opposed to the 'possibly undergoing heal' you previously noticed in the mails)?

2. Could you also check the extended attributes of the file on the bricks? Please paste the output of the command below from both bricks.

getfattr -d -m . -e hex /dist1/brick/fs/<dd_file_name>
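
For reference, and assuming the usual AFR changelog layout, each trusted.afr.<volume>-client-N value in that output is three 4-byte counters:

0x 00000000 00000000 00000000
   data     metadata entry

i.e. the pending data, metadata and entry operations recorded against the other brick; non-zero counts mean a heal is still pending.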

Comment 2 Miloš Kozák 2014-07-03 10:54:50 UTC
Created attachment 914444 [details]
Logs

Comment 3 Miloš Kozák 2014-07-03 10:59:16 UTC
Hi,
1. You are right: after a while it no longer indicates 'possibly undergoing healing'. Right after the VMs are reconnected it does indicate that. In other words, it actually seems that everything is OK, but the data on the disks is inconsistent.

This situation surprised me yesterday when I ran these tests repeatedly. The previous day it behaved as described in my email. The only change I can see is that I cleaned everything up and rebooted the systems.


2. 

[root@node1 ~]# getfattr -d -m . -e hex /dist1/brick/fs/break 
getfattr: Removing leading '/' from absolute path names
# file: dist1/brick/fs/break
trusted.afr.vg0-client-0=0x000000000000000000000000
trusted.afr.vg0-client-1=0x000000000000000000000000
trusted.gfid=0x051ca514aeb54942ae15ffa05de59a82



[root@node2 ~]# getfattr -d -m . -e hex /dist1/brick/fs/break
getfattr: Removing leading '/' from absolute path names
# file: dist1/brick/fs/break
trusted.afr.vg0-client-0=0x000000000000000000000000
trusted.afr.vg0-client-1=0x000000000000000000000000
trusted.gfid=0x051ca514aeb54942ae15ffa05de59a82

Comment 4 Pranith Kumar K 2014-07-03 11:02:54 UTC
hi Milos,
     How are you confirming that the data on the disks is inconsistent?

Pranith

Comment 5 Miloš Kozák 2014-07-03 11:10:35 UTC
Hi,
I first noticed it with the df command, which made me look a bit deeper.

ls -l shows the proper length, but du does not.

Following up on your question, I calculated md5 sums:

[root@node1 ~]# md5sum /dist1/brick/fs/*
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/break
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/node1
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/node2
[root@node1 ~]# ls -l /dist1/brick/fs/*
-rw-r--r-- 2 root root 524288000 Jul  3 05:16 /dist1/brick/fs/break
-rw-r--r-- 2 root root 524288000 Jul  3 05:10 /dist1/brick/fs/node1
-rw-r--r-- 2 root root 524288000 Jul  3 05:10 /dist1/brick/fs/node2
[root@node1 ~]# du /dist1/brick/fs/*
116352	/dist1/brick/fs/break
512000	/dist1/brick/fs/node1
512000	/dist1/brick/fs/node2



[root@node2 ~]# md5sum /dist1/brick/fs/
break       .glusterfs/ node1       node2       
[root@node2 ~]# md5sum /dist1/brick/fs/*
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/break
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/node1
d8b61b2c0025919d5321461045c8226f  /dist1/brick/fs/node2
[root@node2 ~]# ls -l /dist1/brick/fs/*
-rw-r--r-- 2 root root 524288000 Jul  3 05:16 /dist1/brick/fs/break
-rw-r--r-- 2 root root 524288000 Jul  3 05:10 /dist1/brick/fs/node1
-rw-r--r-- 2 root root 524288000 Jul  3 05:10 /dist1/brick/fs/node2
[root@node2 ~]# du /dist1/brick/fs/*
512000	/dist1/brick/fs/break
512000	/dist1/brick/fs/node1
512000	/dist1/brick/fs/node2


Oh snap... how can the md5 be the same when `du` does not match? Do you think this is only a du-related (synthetic test) problem?

Comment 6 Pranith Kumar K 2014-07-03 11:21:01 UTC
Now you are talking. I think this is a sparse file, i.e. a file with holes. So basically, when a file with holes is healed, the holes are not being healed properly.
Detailed steps to re-create the issue, mainly the kind of images (qcow2?) you used for creating them, would help.
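
For illustration, a standalone sparse-file example (not taken from these bricks) showing why ls -l and md5sum can match while du differs:

truncate -s 500M sparsefile        # file with a hole, almost no blocks allocated
dd if=/dev/zero of=fullfile bs=1M count=500
ls -l sparsefile fullfile          # both report 524288000 bytes
du sparsefile fullfile             # sparsefile uses ~0 blocks, fullfile ~512000
md5sum sparsefile fullfile         # identical, because holes read back as zeros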

Comment 7 Miloš Kozák 2014-07-03 11:40:56 UTC
Should I test it with fallocate, or do you have another suggestion for how to allocate a non-sparse file?

Comment 8 Pranith Kumar K 2014-07-03 12:08:00 UTC
There is no harm to data consistency. The only effect is that extra disk space is used, because self-heal is not retaining the holes. All I wanted to know is: how did you create the VM file?

Comment 9 Miloš Kozák 2014-07-03 12:15:01 UTC
The VM files were created using the truncate -s XXG command. The files backing the data mounted as /dist1/brick (/dev/sdb) were created the same way.

Comment 10 Miloš Kozák 2014-07-04 02:34:46 UTC
Created attachment 914626 [details]
another test

A test with regular files, not sparse ones.

Comment 11 Miloš Kozák 2014-07-04 02:38:22 UTC
I am not sure I understood your last comment properly, but how can the VM's container format influence the healing process?

Anyway, I could reproduce this error with an ssh copy. The procedure is basically the same as originally described, but the difference is that the healing gets stuck, as mentioned on the mailing list.

I have enclosed logs and dumps as well.

Comment 12 Ravishankar N 2014-07-14 17:18:35 UTC
This doesn't look like a sparse file issue: during my tests, a sparse file created on node 1 (from the mount point) while node 2 was down was successfully healed to the latter.

There is not much info in the test1 and test2 logs. The glustershd log in test2:node1 actually indicates a successful heal.

Some observations from test3 logs:
1. prints.txt shows pending counts for AFR extended attributes of both files.
The file named 'node2' with gfid=0x04f688ebda444739bc2e932000dab387 is the one 
which shows 'possibly under heal'

2. Both nodes' glustershd logs show this repeatedly:
I [afr-self-heald.c:1687:afr_dir_exclusive_crawl] 0-vg0-replicate-0: Another crawl is in progress for vg0-client-0

3. The brick log contains multiple warnings:
[2014-07-03 23:30:19.816435] W [inodelk.c:392:pl_inodelk_log_cleanup] 0-vg0-server: releasing lock on 04f688eb-da44-4739-bc2e-932000dab387 held by {client=0xd00180, pid=-1 lk-owner=08449eb0ec7f0000}

Comment 13 Miloš Kozák 2014-07-14 17:25:55 UTC
Hi, thank you for the insight. However, I am not sure what I can deduce from it. Does it mean this is a different bug? What makes me wonder is that you write that, according to the logs, the file was healed, but gluster volume heal indicates otherwise.

Do you want me to run more tests?

Comment 14 Ravishankar N 2014-07-14 17:37:49 UTC
The bug seems to be what you described in the mail, i.e. the heal is not able to proceed. Only the test2 logs show the heal completed; the test3 logs don't.

Can you generate statedumps while the heal is hung? I see from the command history logs that you have already run the statedump command, but here's a quick howto anyway:

1. mkdir -p /var/run/gluster (if the dir does not exist)
2. gluster volume statedump <volume name> (gives a statedump of the bricks)
3. `kill -USR1 <pid of the self-heal daemon process>`

Do steps 2 and 3 twice, with an interval of, say, 1 minute in between. Do this when heal info shows "possibly undergoing...." and attach the dumps that are created (a rough sequence is sketched below).
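
Roughly, the sequence would look like this (the volume name is just an example; on most setups the self-heal daemon pid is shown by 'gluster volume status'):

mkdir -p /var/run/gluster
gluster volume statedump vg0
kill -USR1 <pid of the self-heal daemon>
sleep 60
gluster volume statedump vg0
kill -USR1 <pid of the self-heal daemon>
# the dump files are written under /var/run/gluster/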

Comment 15 Ravishankar N 2014-07-14 17:43:09 UTC
Also, an strace of the self-heal daemon would be helpful (you had provided an strace of the brick process earlier), as that is the process that performs the self-heal.

Comment 16 Ravishankar N 2014-07-15 08:58:19 UTC
Milos, never mind the statedump. I reproduced the issue on the 3.5.1 release, and it happens that the fix http://review.gluster.org/#/c/8187/ (BZ# 1113894) has recently been merged into the 3.5 branch. It should be available in the 3.5.2 release. With this fix, the heal does not hang anymore.

Test procedure: same as in the bug description, except I used iptables rules to block traffic between the 2 nodes. After I unblocked traffic, heal info showed "possibly undergoing heal".

If you can test with the fix and confirm, I'll close the bug. Thanks!

Comment 17 Ravishankar N 2014-07-22 05:14:56 UTC
Closing the bug as it is a manifestation of the same issue described in BZ 1113894. 

Note: How I tested:
1. Create a 1x2 replica volume on 2 nodes, FUSE-mount it on node 1.
2. Create a file using dd from the FUSE mount on node 1.
3. On node 1, while dd is going on, block all traffic from/to node 2:
iptables -A INPUT -p all -s <node 2's IP> -j REJECT
iptables -A OUTPUT -p all -d <node 2's IP> -j REJECT
4. Stop the dd process.
5. Unblock traffic: iptables -F
6. Run `gluster v heal <volname> info`
It should not report 'possibly undergoing heal'. (A consolidated sketch follows below.)
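
As a rough consolidated script of the above (names and sizes are placeholders):

mount -t glusterfs node1:/vg0 /mnt/vg0
dd if=/dev/zero of=/mnt/vg0/testfile bs=1M count=500 &
# while dd is still running:
iptables -A INPUT -p all -s <node 2's IP> -j REJECT
iptables -A OUTPUT -p all -d <node 2's IP> -j REJECT
kill %1            # stop dd
iptables -F        # unblock traffic
gluster v heal vg0 info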

*** This bug has been marked as a duplicate of bug 1113894 ***

