Bug 1476205 - [EC]: md5sum mismatches every time for a file from the fuse client on EC volume
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Assigned To: Ashish Pandey
Keywords: Regression
Depends On: 1472764
Blocks: 1476868 1476870
Reported: 2017-07-28 05:39 EDT by Ashish Pandey
Modified: 2017-12-08 12:35 EST
CC List: 9 users

Fixed In Version: glusterfs-3.13.0
Clone Of: 1472764
Last Closed: 2017-08-10 07:13:51 EDT
Type: Bug


Comment 1 Ashish Pandey 2017-07-28 05:42:15 EDT
Description of problem:
-----------------------

Some files return different md5sums when checked from different mounts of the same volume.
Example:
:/master on /mnt/glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
:/master on /mnt/master type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
:/slave on /mnt/slave type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@fan ~]# md5sum /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
9c0de9e06e88ddba88248e9c8e502f82  /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# ls -l /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
-rwxrwxr--. 3 42023 13883 13526 Jul 18 05:29 /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# ls -l /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
-rwxrwxr--. 3 42023 13883 13526 Jul 18 05:29 /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# md5sum /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
16f92bf8f96094a214b1d0293c1accda  /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

Further analysis by Ashish shows the following:

The files appear to be corrupted on the backend. To confirm, the volume was mounted on 4 mount points and each returned a different md5sum. This could be because each mount point reads the file from a different set of 4 bricks. To verify this, 2 bricks from each subvolume were killed and the md5sum recalculated; this time all the md5sums matched.
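
A minimal sketch of that verification (host names, volume name, and brick PIDs are illustrative, not from this setup):

# mount the same volume on two mount points and compare checksums
mount -t glusterfs server1:/master /mnt/m1
mount -t glusterfs server1:/master /mnt/m2
md5sum /mnt/m1/<path>/<file> /mnt/m2/<path>/<file>   # mismatch

# kill 2 bricks in each 4+2 subvolume so every mount is forced to
# read from the same surviving set of 4 bricks
gluster volume status master          # lists the brick PIDs
kill <pid-of-brick-5> <pid-of-brick-6>
md5sum /mnt/m1/<path>/<file> /mnt/m2/<path>/<file>   # now identical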



[root@fan tmp]# md5sum /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/temp_master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/temp_master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/temp-2/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/temp-2/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T


Steps carried:
==============

The steps were carried out in a geo-replication setup; however, the checksum mismatches on the master itself. Only the steps needed to reproduce the problem are listed here to avoid confusion (a condensed command sketch follows the list).

1. Create 6 node Master Cluster
2. Create EC volume on Master (2x(4+2))
3. Mount the volume (Fuse)
4. Using crefi, create the data set. The fops carried out, in order, were:

create, chmod, chown, chgrp, hardlink, symlink, truncate

5. After every fop, calculate the checksum from the mount (at least twice)
6. After truncate, the checksums never matched
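
A condensed, hedged sketch of these steps (node names, brick paths, and the file name are illustrative; the crefi flags and the truncate command follow the invocations quoted later in this bug):

# 2 x (4+2) disperse volume across 6 nodes, 2 bricks per node
gluster volume create master disperse 6 redundancy 2 \
    node{1..6}:/bricks/b1 node{1..6}:/bricks/b2 force
gluster volume start master
mount -t glusterfs node1:/master /mnt/m1
mount -t glusterfs node1:/master /mnt/m2

# create the data set through one mount
./crefi.py --multi -n 1 -b 1 -d 1 --size 14336 -t text --fop=create /mnt/m1

# truncate through the mount, then compare checksums from both mounts;
# after truncate they never match
truncate -s 13526 /mnt/m1/level00/<file>
md5sum /mnt/m1/level00/<file> /mnt/m2/level00/<file>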

Actual results:
===============

The checksum for a file mismatches (possibly data corruption) every time it is calculated from a different mount of the same volume.


Expected results:
=================

The checksum should match every time, from all the different mounts.
Comment 2 Ashish Pandey 2017-07-28 05:43:54 EDT
RCA by Pranith -

Some notes, didn't find the RCA yet.

Only the last fragment of the file is corrupted. On closer inspection, the file on the 5th brick turns out to be the bad one. It is as if the write didn't happen on the 5th brick but only on the rest of the bricks. So the next step is to find out why the write was skipped.
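
A quick way to see this on the servers is to checksum the raw fragments on the bricks directly (brick paths as in the transcripts below); a hedged sketch:

# one fragment per brick; with 4+2 EC, a single divergent fragment
# changes the data a mount reconstructs whenever that brick is in
# the set of 4 bricks used for the read
md5sum /home/gfs/ec_?/596c62e1%%HPJLBVY86T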

Rahul,
    For more debugging we need /etc/services file from the machine where you ran crefi. Could you attach it to the bz?

Pranith

--- Additional comment from Pranith Kumar K on 2017-07-27 21:51:51 EDT ---

(In reply to Pranith Kumar K from comment #10)
> Some notes, didn't find the RCA yet.
> 
> Only the last fragment of the file is corrupted. On closer inspection, it is
> found that the file on 5th brick is the bad-one. It is as if the write
> didn't happen on the 5th brick but only happened on the rest of the bricks.
> So the next step is to find why was the write got skipped.
> 
> Rahul,
>     For more debugging we need /etc/services file from the machine where you
> ran crefi. Could you attach it to the bz?
> 
> Pranith

I found it on one of the RHGS machines, and its md5sum matches what we have in the bz. I will use that.

--- Additional comment from Pranith Kumar K on 2017-07-27 22:12:20 EDT ---

Some notes, still no RCA
07:24:01 :) ⚡ md5sum /home/gfs/ec_?/5*
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/596c62e1%%HPJLBVY86T
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/596c62e1%%HPJLBVY86T
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/596c62e1%%HPJLBVY86T
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/596c62e1%%HPJLBVY86T
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/596c62e1%%HPJLBVY86T <<<---- this is the bad fragment
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/596c62e1%%HPJLBVY86T

I did the following to find out which file size produces the fragment matching the bad md5sum present in the file we got.

07:24:53 :) ⚡ for i in {12288..14336}; do Crefi/crefi.py --multi -n 1 -b 1 -d 1 --size $i -t text --fop=create /mnt/ec2; done

root@dhcp35-190 - ~ 
07:26:49 :) ⚡ md5sum /home/gfs/ec_4/level00/* | grep 6921a33bd9c1083aaec582cde9973b61
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/level00/597a99d1%%4SKACGWCDK

root@dhcp35-190 - ~ 
07:27:07 :) ⚡ ls -l /mnt/ec2/level00/597a99d1%%4SKACGWCDK
-rw-r--r--. 1 root root 14336 Jul 28 07:26 /mnt/ec2/level00/597a99d1%%4SKACGWCDK

07:39:48 :) ⚡ truncate -s 13526 /mnt/ec2/level00/597a99d1%%4SKACGWCDK

root@dhcp35-190 - ~ 
07:40:10 :) ⚡ md5sum /home/gfs/ec_?/level00/597a99d1%%4SKACGWCDK
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/level00/597a99d1%%4SKACGWCDK
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/level00/597a99d1%%4SKACGWCDK
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/level00/597a99d1%%4SKACGWCDK
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/level00/597a99d1%%4SKACGWCDK
f039f8c213c0b31b2d16da87de8da6b7  /home/gfs/ec_4/level00/597a99d1%%4SKACGWCDK<------ This fragment is correct.
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/level00/597a99d1%%4SKACGWCDK

So this means that either the write was never wound to the brick when this bug happened, or it failed silently.

This is the next thing to look into.

--- Additional comment from Pranith Kumar K on 2017-07-28 01:27:26 EDT ---

Ashish,
     I didn't find any evidence of how the write failed, but there is a bug in EC: when the writev issued as part of truncate fails, the brick is not marked as bad. I verified this in gdb. We may need more help from Rahul to find out how exactly the write failed. Whether the quota errors contributed to any failures is also something we need to check.

I put one of the brick processes under gdb and made it return an error without performing any write (by setting frame to NULL at the top of server3_3_writev, which forces the error path):
Thread 7 "glusterfsd" hit Breakpoint 3, server3_3_writev (req=0x7fbf100ca140) at server-rpc-fops.c:4100
4100	        server_state_t      *state  = NULL;
(gdb) n
...
(gdb) 
4119	        frame = get_frame_from_request (req);
(gdb) 
4120	        if (!frame) {
(gdb) p frame=0
$3 = (call_frame_t *) 0x0
(gdb) n
4122	                SERVER_REQ_SET_ERROR (req, ret);
(gdb) 
4123	                goto out;
(gdb) 
4172	        free (args.xdata.xdata_val);
(gdb) 
4174	        if (op_errno)
(gdb) c

I got the same scenario mentioned in this bug:
root@dhcp35-190 - ~ 
10:34:11 :) ⚡ getfattr -d -m. -e hex /home/gfs/ec_?/services
getfattr: Removing leading '/' from absolute path names
# file: home/gfs/ec_0/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_1/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_2/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_3/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_4/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_5/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312


root@dhcp35-190 - ~ 
10:34:13 :) ⚡ md5sum  /home/gfs/ec_?/services
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/services
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/services
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/services
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/services
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/services <<<<------ same md5sum as the bad file in Rahul's setup and all xattrs match as if everything is fine when it is not.
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/services

Ashish,
    Can you send out a patch? In both ec_truncate_open_cbk and the cbk of ec_truncate_writev, we need to update lock->good_mask correctly. I can't find any other way for the file to end up in the state it did on Rahul's setup; we just need to confirm this is indeed the case. The only other pathological case would be the brick returning success without performing the write.

Rahul,
    Will it be possible to get a tcpdump of a run where this happens? That would confirm how this bug happened. The bug we found based on the information from your setup, explained above, is a day-1 bug.

Pranith
Comment 3 Worker Ant 2017-07-28 06:08:03 EDT
REVIEW: https://review.gluster.org/17906 (ec/cluster: Update failure of fop on a brick properly) posted (#1) for review on master by Ashish Pandey (aspandey@redhat.com)
Comment 4 Worker Ant 2017-07-31 02:25:15 EDT
REVIEW: https://review.gluster.org/17906 (ec/cluster: Update failure of fop on a brick properly) posted (#2) for review on master by Ashish Pandey (aspandey@redhat.com)
Comment 5 Worker Ant 2017-07-31 07:55:31 EDT
COMMIT: https://review.gluster.org/17906 committed in master by Pranith Kumar Karampuri (pkarampu@redhat.com) 
------
commit 12906467f0f8df2edc07ea8412c0c6be6fae11b3
Author: Ashish Pandey <aspandey@redhat.com>
Date:   Fri Jul 28 15:18:29 2017 +0530

    ec/cluster: Update failure of fop on a brick properly
    
    Problem:
    In case of truncate, if writev or open fails on a brick,
    in some cases it does not mark the failure on lock->good_mask.
    This causes the update of size and version on all the bricks
    even if the fop has failed on one of them. That ultimately
    causes data corruption.
    
    Solution:
    In callback of such writev and open calls, mark fop->good
    for parent too.
    
    Thanks Pranith Kumar K <pkarampu@redhat.com> for finding the
    root cause.
    
    Change-Id: I8a1da2888bff53b91a0d362b8c44fcdf658e7466
    BUG: 1476205
    Signed-off-by: Ashish Pandey <aspandey@redhat.com>
    Reviewed-on: https://review.gluster.org/17906
    Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
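
A hedged sketch of how the fix can be sanity-checked (volume name, mount points, and brick paths are illustrative, following the transcripts above): with the failure recorded on lock->good_mask, a failed truncate-writev should leave the bad brick's EC xattrs diverging from the good ones, so the stale fragment is excluded from reads and repaired by self-heal.

# the failed brick should now show a diverging trusted.ec.version or
# dirty xattr instead of matching the good bricks
getfattr -d -m. -e hex /home/gfs/ec_?/services

# checksums from any two mounts should agree again
md5sum /mnt/m1/services /mnt/m2/services

# trigger and inspect self-heal of the stale fragment
gluster volume heal master
gluster volume heal master info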
Comment 6 Shyamsundar 2017-12-08 12:35:59 EST
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
