Bug 1472764

Summary: [EC]: md5sum mismatches every time for a file from the fuse client on EC volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: disperse
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: RHGS 3.3.0
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Keywords: Regression
Whiteboard:
Reporter: Rahul Hinduja <rhinduja>
Assignee: Ashish Pandey <aspandey>
QA Contact: Rahul Hinduja <rhinduja>
Docs Contact:
CC: amukherj, nchilaka, pkarampu, rcyriac, rhinduja, rhs-bugs, sheggodu, storage-qa-internal
Fixed In Version: glusterfs-3.8.4-37
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1476205 (view as bug list)
Environment:
Last Closed: 2017-09-21 05:04:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151, 1476205

Description Rahul Hinduja 2017-07-19 11:32:47 UTC
Description of problem:
-----------------------

Some files give a different md5sum when it is calculated from different mounts of the same volume.
Example:
10.70.43.4:/master on /mnt/glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
10.70.43.4:/master on /mnt/master type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
10.70.43.27:/slave on /mnt/slave type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@fan ~]# md5sum /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
9c0de9e06e88ddba88248e9c8e502f82  /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# ls -l /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
-rwxrwxr--. 3 42023 13883 13526 Jul 18 05:29 /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# ls -l /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
-rwxrwxr--. 3 42023 13883 13526 Jul 18 05:29 /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
[root@fan ~]# md5sum /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
16f92bf8f96094a214b1d0293c1accda  /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

Further analysis by Ashish shows the following:

The files look corrupted at the backend. To confirm, the volume was mounted on 4 mount points and each gave a different md5sum. This could be because the mount points are reading the file from different sets of 4 bricks. To confirm this, 2 bricks were killed from each subvolume and the md5sum was calculated again; this time all the md5sums were the same.



[root@fan tmp]# md5sum /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/glusterfs/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/temp_master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/temp_master/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T

[root@fan tmp]# md5sum /mnt/temp-2/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
0bc19eac0693355a2e954ba3d66d4ef1  /mnt/temp-2/thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
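
For reference, a minimal sketch of how this brick-kill verification can be done (the kill targets are placeholders; the volume name and file path are taken from this setup):

F=thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
gluster volume status master        # note the PID of each brick process
kill <PID1> <PID2>                  # stop 2 bricks in each 4+2 subvolume
# with only 4 bricks left per subvolume, every mount has to decode the file
# from the same set of fragments, so all checksums should now agree
md5sum /mnt/glusterfs/$F /mnt/master/$F /mnt/temp_master/$F /mnt/temp-2/$F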


Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.8.4-18.5.el7rhgs.x86_64


How reproducible:
=================
Will retry


Steps carried:
==============

The steps were carried out in a geo-rep setup; however, the checksum mismatches on the master. Providing only the steps that are needed, to avoid confusion (a rough command sketch follows the steps).

1. Create 6 node Master Cluster
2. Create EC volume on Master (2x(4+2))
3. Mount the volume (Fuse)
4. Using Crefi, create the data set. The fops carried out, in order, were:

create, chmod, chown, chgrp, hardlink, symlink, truncate

5. After every fop, calculate the checksum of the mount (at least twice)
6. After truncate, the checksums never matched
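
A rough, hedged sketch of the reproduction flow (host names, brick paths, file counts/sizes and the truncate size are placeholders; the Crefi flags are modelled on the crefi.py invocation shown later in comment 12):

# 2x(4+2) dispersed volume across 6 nodes (12 bricks; paths are placeholders)
gluster volume create master disperse 6 redundancy 2 \
    node{1..6}:/bricks/b1/master node{1..6}:/bricks/b2/master force
gluster volume start master

# two FUSE mounts of the same volume
mount -t glusterfs node1:/master /mnt/glusterfs
mount -t glusterfs node1:/master /mnt/master

# data set via Crefi, followed by the other fops, ending with truncate
./crefi.py --multi -n 10 -b 5 -d 5 --size 5120 -t text --fop=create   /mnt/glusterfs
./crefi.py --multi -n 10 -b 5 -d 5 --size 5120 -t text --fop=truncate /mnt/glusterfs

# compare the same file from both mounts
md5sum /mnt/glusterfs/<path-to-file> /mnt/master/<path-to-file>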

Actual results:
===============

The checksum mismatches for a file (possibly data corruption) every time it is calculated from a different mount of the same volume.


Expected results:
=================

Checksum should match all the time from all the different mounts.

Comment 3 Ashish Pandey 2017-07-19 16:45:52 UTC
(In reply to Rahul Hinduja from comment #0)

Just to clarify - 

- The fops order does not mention when you (the script) wrote data to the file.
- Is it possible that you truncated the file while IO was still going on?
- Can we somehow find out which truncate command actually truncated the file in
  question? I mean the "truncate -s <??>" value.
- Also, what was the original size of the file, if the truncate happened after the write completed?
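
(For reference, a hedged way to at least check the file's current size on the mount and the size EC has recorded on the bricks; the brick path is a placeholder:)

F=thread0/level01/level11/level21/level31/level41/level51/level61/level71/596c62e1%%HPJLBVY86T
stat -c '%s' /mnt/master/$F                              # size seen by the client after the truncate
getfattr -n trusted.ec.size -e hex /bricks/*/master/$F   # size EC recorded on each brick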





Comment 4 Rahul Hinduja 2017-07-20 06:25:03 UTC
> 
> Just to clarify - 
> 
> - Fops order does not mention when did you (script) write data on the file
> - Is it possible that while IO is going on, you truncated the file?

The next fop starts only once the earlier fop is completed. 

> - Can we anyhow find out what truncate command actually truncated the file
> in   
>   question? I mean the "truncate -s <??>" 

AFAIK, Crefi does not provide a way to find this out, and its output does not help here. I will keep a watch when I try next time.

> - Also, what was the original size of the file if it happened after write
> completed?

It is a random workload; each file has a different size, between min=1K and max=5K. After the truncate, I cannot provide the original size at the moment.

Comment 12 Pranith Kumar K 2017-07-28 02:12:20 UTC
Some notes; no RC (root cause) yet.
07:24:01 :) ⚡ md5sum /home/gfs/ec_?/5*
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/596c62e1%%HPJLBVY86T
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/596c62e1%%HPJLBVY86T
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/596c62e1%%HPJLBVY86T
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/596c62e1%%HPJLBVY86T
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/596c62e1%%HPJLBVY86T <<<---- this is the bad fragment
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/596c62e1%%HPJLBVY86T

I did the following to see which file size produces the fragment that corresponds to the bad md5sum present in the file we got.

07:24:53 :) ⚡ for i in {12288..14336}; do Crefi/crefi.py --multi -n 1 -b 1 -d 1 --size $i -t text --fop=create /mnt/ec2; done

root@dhcp35-190 - ~ 
07:26:49 :) ⚡ md5sum /home/gfs/ec_4/level00/* | grep 6921a33bd9c1083aaec582cde9973b61
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/level00/597a99d1%%4SKACGWCDK

root@dhcp35-190 - ~ 
07:27:07 :) ⚡ ls -l /mnt/ec2/level00/597a99d1%%4SKACGWCDK
-rw-r--r--. 1 root root 14336 Jul 28 07:26 /mnt/ec2/level00/597a99d1%%4SKACGWCDK

07:39:48 :) ⚡ truncate -s 13526 /mnt/ec2/level00/597a99d1%%4SKACGWCDK

root@dhcp35-190 - ~ 
07:40:10 :) ⚡ md5sum /home/gfs/ec_?/level00/597a99d1%%4SKACGWCDK
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/level00/597a99d1%%4SKACGWCDK
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/level00/597a99d1%%4SKACGWCDK
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/level00/597a99d1%%4SKACGWCDK
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/level00/597a99d1%%4SKACGWCDK
f039f8c213c0b31b2d16da87de8da6b7  /home/gfs/ec_4/level00/597a99d1%%4SKACGWCDK<------ This fragment is correct.
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/level00/597a99d1%%4SKACGWCDK

So this means that either the write was never wound on the file when this bug happened, or it failed silently.

This will be the next thing to look into.
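
One way to look for a silently failed write (an assumption; /var/log/glusterfs/bricks/ is the standard brick log location) would be to check the brick logs on each server around the time of the truncate:

# warnings/errors mentioning writev or truncate in the brick logs
grep -E ' (E|W) \[' /var/log/glusterfs/bricks/*.log | grep -iE 'writev|truncate'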

Comment 13 Pranith Kumar K 2017-07-28 05:27:26 UTC
Ashish,
     Didn't find any evidence of how it failed, but there is a bug in EC: when the writev issued as part of truncate fails, EC does not mark that brick as bad. Verified it in gdb. We may need more help from Rahul to find out how exactly the write failed. Whether the quota errors are contributing to any failures is also something we may need to check.

I put one of the brick processes under gdb and made it return an error without performing any write:
Thread 7 "glusterfsd" hit Breakpoint 3, server3_3_writev (req=0x7fbf100ca140) at server-rpc-fops.c:4100
4100	        server_state_t      *state  = NULL;
(gdb) n
...
(gdb) 
4119	        frame = get_frame_from_request (req);
(gdb) 
4120	        if (!frame) {
(gdb) p frame=0
$3 = (call_frame_t *) 0x0
(gdb) n
4122	                SERVER_REQ_SET_ERROR (req, ret);
(gdb) 
4123	                goto out;
(gdb) 
4172	        free (args.xdata.xdata_val);
(gdb) 
4174	        if (op_errno)
(gdb) c
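
(For reference, the attach itself might look like the following; the pgrep pattern for picking the ec_4 brick process is a guess:)

gdb -p "$(pgrep -f 'glusterfsd.*ec_4' | head -1)"
(gdb) break server3_3_writev
(gdb) continue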

I got the same scenario mentioned in this bug:
root@dhcp35-190 - ~ 
10:34:11 :) ⚡ getfattr -d -m. -e hex /home/gfs/ec_?/services
getfattr: Removing leading '/' from absolute path names
# file: home/gfs/ec_0/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_1/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_2/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_3/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_4/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312

# file: home/gfs/ec_5/services
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a686f6d655f726f6f745f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000000034d6
trusted.ec.version=0x00000000000000270000000000000027
trusted.gfid=0xeacb572971f5457a9a5706020255f312


root@dhcp35-190 - ~ 
10:34:13 :) ⚡ md5sum  /home/gfs/ec_?/services
ecb9f6f29ebbb72e3e612b0aec775b36  /home/gfs/ec_0/services
6fe5b21f27a5c9dc72ae3422102327b2  /home/gfs/ec_1/services
6f82c16ed99b901929737ddf11dd920b  /home/gfs/ec_2/services
bcac001cf1cac8860e4ae233a35cd7ef  /home/gfs/ec_3/services
6921a33bd9c1083aaec582cde9973b61  /home/gfs/ec_4/services <<<<------ same md5sum as the bad file in Rahul's setup and all xattrs match as if everything is fine when it is not.
b9d70081b5b41c9f980e20bf40e1fd13  /home/gfs/ec_5/services

Ashish,
    Can you send out a patch? In both ec_truncate_open_cbk and ec_truncate_writev's cbk, we need to update lock->good_mask correctly. I don't see any other way for the file to end up in the state it did on Rahul's setup. We just need to confirm this is indeed the case. The only other pathological case would be if the brick sends back success without performing the write.

Rahul,
    Will it be possible to get a tcpdump of a run where this happens? That will confirm how this bug happened. The bug we found based on the information from your setup, which I explained above, is a day-1 bug.
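
For example, a capture along these lines might work (a hedged sketch; 24007 is the glusterd management port, and brick processes normally listen on ports from 49152 upwards, so adjust the range to the actual brick ports on the setup):

tcpdump -i any -s 0 -w /var/tmp/ec-truncate.pcap port 24007 or portrange 49152-49664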

Pranith

Comment 16 Atin Mukherjee 2017-08-01 01:33:03 UTC
release-3.12 patch : https://review.gluster.org/17932
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/113931/

Comment 18 Rahul Hinduja 2017-08-09 10:41:24 UTC
Have tried the use case mentioned in the description about 7 times and haven't seen this issue with builds glusterfs-3.8.4-37 and glusterfs-3.8.4-38. It was an intermittent issue to begin with. Truncate was used extensively in the testing.

Moving this bug to verified state for now. If we hit this issue again, will reopen.

Comment 19 Nag Pavan Chilakam 2017-08-09 14:30:51 UTC
I have tried truncates of files, and the md5sum is consistent across brick-down and all-bricks-up scenarios when compared with the md5sum of the same files on the client's local filesystem.

1) Created a file of 1.5 GB on the local FS and copied it to the mount ==> checksums consistent
2) Truncated the file to 1 GB on both the local FS and the glusterfs mount ==> checksums match again
3) Unmounted the client, remounted, and checked the checksum ==> still consistent
4) Brought down bricks in cyclic order ==> checksums still consistent

Test build: 3.8.4-38 on a 1x(4+2) EC volume (a rough sketch of the commands for this verification is below).
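
A minimal sketch of this verification flow (the mount point, volume name and file name are placeholders):

# create a ~1.5 GB file locally, copy it to the EC mount, compare
dd if=/dev/urandom of=/tmp/big.file bs=1M count=1536
cp /tmp/big.file /mnt/ec/big.file
md5sum /tmp/big.file /mnt/ec/big.file

# truncate both copies to 1 GB and compare again
truncate -s 1G /tmp/big.file /mnt/ec/big.file
md5sum /tmp/big.file /mnt/ec/big.file

# remount the client and re-check
umount /mnt/ec && mount -t glusterfs node1:/ecvol /mnt/ec
md5sum /mnt/ec/big.file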

Comment 21 errata-xmlrpc 2017-09-21 05:04:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774